Word2003 poser: when do I run out of words?

ChrisGreaves · Post by **ChrisGreaves** » 21 Apr 2021, 14:47

Dim doc As Document
Set doc = ThisDocument
Dim wd As Range
Dim strCr() As String
ReDim strCr(0)
For Each wd In doc.Words
    If UBound(strCr) > 100 Then Stop
    wd.Select
    Debug.Print wd
    strCr(UBound(strCr)) = wd
    ReDim Preserve strCr(UBound(strCr) + 1)
Next wd

The attached document contains a stripped-down version of a procedure that was working fine until 11:53 this morning. I Process a Document by examining every word in the document. Hence "For Each wd In doc.Words".

The attached document contains a table, and only that table.
The document contains no text beyond that table, and when I Ctrl-End and tap Enter, I am given an extra row to the table.

I tried Table > Split Table, added a regular paragraph of text, and then deleted the second (split-off) part of the table.

I have processed 620 documents to date, and I believe that many of them hold tables.

I can fudge a solution by checking that the selection point or the wd.Range.Start has not changed, but am puzzled as to why MSWord2003 will suddenly go into an endless loop on such a basic piece of code.
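A minimal sketch of the "fudge" described above. It assumes that the runaway loop keeps handing back a word whose Start is no further along than the previous one; that has not been verified against the problem document. The procedure name CollectWordsWithGuard and the variable lngLastStart are hypothetical.

```vba
Sub CollectWordsWithGuard()
    Dim doc As Document
    Set doc = ActiveDocument
    Dim wd As Range
    Dim strCr() As String
    Dim lngCount As Long
    Dim lngLastStart As Long

    ReDim strCr(0 To 255)
    lngLastStart = -1                    ' before the first character

    For Each wd In doc.Words
        ' Assumption: an endless loop shows up as no forward progress.
        If wd.Start <= lngLastStart Then Exit For
        lngLastStart = wd.Start

        ' Grow the array by doubling rather than by one element per word.
        If lngCount > UBound(strCr) Then
            ReDim Preserve strCr(0 To 2 * UBound(strCr) + 1)
        End If
        strCr(lngCount) = wd.Text
        lngCount = lngCount + 1
    Next wd

    ' Trim the array to the words actually collected.
    If lngCount > 0 Then ReDim Preserve strCr(0 To lngCount - 1)
    Debug.Print lngCount & " words collected"
End Sub
```

Doubling the array size is a design choice of this sketch, not part of the original code; it avoids one ReDim Preserve per word.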

I suspect that the answers will lie in the neighbourhood of the mandatory empty paragraph at the end of each document.

Thanks for any pointers.
Chris

Post by **HansV** » 21 Apr 2021, 15:24

When I open your document in Word 2019 and run the macro, it finishes normally.
By the way, why do you select each word?

PJ_in_FL · Post by **PJ_in_FL** » 21 Apr 2021, 16:01

Why don't you limit your loop to

doc.Words.Count

ChrisGreaves · Post by **ChrisGreaves** » 21 Apr 2021, 18:34

HansV wrote: ↑
21 Apr 2021, 15:24
When I open your document in Word 2019 and run the macro, it finishes normally.

No, Hans. I'm not moving on until I have mastered Word2003

By the way, why do you select each word?

Good question.
The original code has no "If UBound(strCr) > 100 Then Stop", " wd.Select" or "Debug.Print wd".
I pushed them in there while I was trying to track down where Word2003 thought it was, and why it thought it was doing what it was doing.

If the code runs for you in Word2019 that suggests that this might be a bug with the unique distinction of having been fixed!

I confess to being puzzled that I haven't stumbled across this before in a quarter-century of processing strings in Word6/2003.

Thanks
Chris

ChrisGreaves · Post by **ChrisGreaves** » 21 Apr 2021, 18:41

PJ_in_FL wrote: ↑
21 Apr 2021, 16:01
Why don't you limit your loop to doc.Words.Count

    Dim lng As Long
    For lng = 1 To doc.Words.Count
''''        If UBound(strCr) > 100 Then Stop
''''        wd.Select
''''        Debug.Print wd
        Set wd = doc.Words(lng)
        strCr(UBound(strCr)) = wd
        ReDim Preserve strCr(UBound(strCr) + 1)
    Next lng

Hi PJ.
I can easily answer your question right now: It's because I am not as smart as I think I am.
HTH

Chris

ChrisGreaves · Post by **ChrisGreaves** » 21 Apr 2021, 20:20

PJ_in_FL wrote: ↑
21 Apr 2021, 16:01
Why don't you limit your loop to doc.Words.Count

Hi PJ.
I am now back at my data drive root folder, so a long run.
I have an impression that the loop using Words.Count is taking about five times as long as the Each loop, but that is just an impression, and may be indicative of my stomach telling me that we are at supper time.
Time is not critical on this sort of job - if it takes a week, it can run overnight all week.
Once I have waded through the data drive I might go back and do a couple of timing runs.
Cheers
Chris

ChrisGreaves · Post by **ChrisGreaves** » 24 Apr 2021, 11:27

PJ_in_FL wrote: ↑
21 Apr 2021, 16:01
Why don't you limit your loop to doc.Words.Count

On The Other Hand, there are certain coding protocols that one should adopt, especially with documents containing many (e.g. 20,000) words.

        Dim lngWordsCount As Long
        lngWordsCount = doc.Words.Count
        Dim strCr() As String
        ReDim strCr(lngWordsCount)
        Dim lng As Long
        For lng = 1 To lngWordsCount
            Application.Caption = lng & "/" & lngWordsCount
            DoEvents
            strCr(lng - 1) = doc.Words(lng)
        Next lng

Things went a LOT faster once I decided to make a call to doc.Words.Count just once, instead of twice each time through a loop of 20,000 words.
(It'll dawn on you about ten seconds after you read this!)
Thanks to lngWordsCount , I now have my life back.

(One hour later): STILL too slow, now bogged down on "doc.Words(lng)"
I am yet to do timing runs, and back when I was using "For Each wd In doc.Words" I was not into such a lengthy document.
I now suspect that "doc.Words" is the thing that soaks up more time than it is worth.
Good thing that I run this thing overnight!

(One half-hour later): More thoughts. I used to grab the document contents as a text string, and then parse it, word by word, with my strSplitAt() function, written before the days of the Split() function (which loads words to an array). That will be something else to test in timing runs.
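The "grab the text, then split it" idea can be sketched with the intrinsic Split() and Replace() functions. This is only an illustration under stated assumptions: the function name varSplitWords is hypothetical, and the list of delimiters (paragraph mark, line feed, tab, the Chr(7) cell marker, manual line break) is a guess at what matters in these documents.

```vba
' Hypothetical helper: splits a block of text into space-delimited words.
Function varSplitWords(ByVal strText As String) As Variant
    Dim varDelim As Variant

    ' Turn the assumed non-space delimiters into spaces.
    For Each varDelim In Array(vbCr, vbLf, vbTab, Chr$(7), Chr$(11))
        strText = Replace(strText, varDelim, " ")
    Next varDelim

    ' Collapse runs of spaces so that Split() returns no empty items.
    Do While InStr(strText, "  ") > 0
        strText = Replace(strText, "  ", " ")
    Loop

    varSplitWords = Split(Trim$(strText), " ")
End Function

Sub DemoSplitDocument()
    Dim varWords As Variant
    ' One call to the object model, then pure string work.
    varWords = varSplitWords(ActiveDocument.Content.Text)
    Debug.Print (UBound(varWords) + 1) & " items"
End Sub
```

Punctuation stays attached to the items in this sketch, so "table," and "table" would count as different strings unless filtered later.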

Cheers
Chris

LisaGreen · Post by **LisaGreen** » 25 Apr 2021, 05:36

Chris,

I wonder if counting something else indicative of word boundaries would be faster.

My experience is that the shorter the item being examined the quicker the code, and also if the item is the same.

How about counting spaces?
Or.... a standard method of counting the occurrences of a substring within a string is to compare the length of the string before and after removing the substring.
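The length-difference method might look like this in VBA. A minimal sketch; the function name lngCountOccurrences is hypothetical, and the count is of non-overlapping occurrences because that is how Replace() works.

```vba
' Hypothetical helper: counts non-overlapping occurrences of strFind in strText.
Function lngCountOccurrences(ByVal strText As String, ByVal strFind As String) As Long
    ' An empty search string would divide by zero, so report 0 instead.
    If Len(strFind) = 0 Then Exit Function

    ' Characters removed, divided by the length of one occurrence.
    lngCountOccurrences = (Len(strText) - Len(Replace(strText, strFind, ""))) \ Len(strFind)
End Function
```

For example, lngCountOccurrences("the cat sat", " ") counts the two spaces in that string.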

Lisa

ChrisGreaves · Post by **ChrisGreaves** » 25 Apr 2021, 12:46

LisaGreen wrote: ↑
25 Apr 2021, 05:36
I wonder if counting something else indicative of word boundaries would be faster ...

Hi Lisa, I agree with you.
Indeed for years a mantra "there is always a better way" has been my guide.
I have attached a document, (LATER: just realised that the document had not been attached!) a Work-In-Progress because it is ten o'clock, the sun is shining, we are destined to reach 12c and I must prepare the potato and artichoke beds!

(6) Splits the text string by looking for a lower-case-letter following a non-lower-case-letter, and obtaining all the lower-case letters until the next non lower-case-letter, but bypass arrays. Instead, for each found string:- if NOT INSTR then spell-check

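' Assumption: strcLowerAlpha is not defined in the post; presumably it is a
' module-level constant along these lines.
Private Const strcLowerAlpha As String = "abcdefghijklmnopqrstuvwxyz"

' Returns the next run of lower-case letters found in strInput.
' strInput is passed ByRef (the VBA default), so the scanned characters are
' consumed and a repeated call carries on where the previous call stopped.
Function strNextLowerCase(strInput As String) As String
    ' Discard leading characters that are not lower-case letters.
    While (Len(strInput) > 0) And (InStr(1, strcLowerAlpha, Left(strInput, 1)) = 0) ' still no LC
        strInput = Right(strInput, Len(strInput) - 1)
    Wend
    ' Collect lower-case letters until the next non-lower-case character.
    While (Len(strInput) > 0) And (InStr(1, strcLowerAlpha, Left(strInput, 1)) > 0) ' still in LC
        strNextLowerCase = strNextLowerCase & Left(strInput, 1)
        strInput = Right(strInput, Len(strInput) - 1)
    Wend
'Sub TESTstrNextLowerCase()
'    Debug.Assert "" = strNextLowerCase("")
'    Debug.Assert "" = strNextLowerCase("7")
'    Debug.Assert "a" = strNextLowerCase("a")
'    Debug.Assert "a" = strNextLowerCase("a7")
'    Debug.Assert "a" = strNextLowerCase("7a")
'    Debug.Assert "alpha" = strNextLowerCase("alpha")
'    Debug.Assert "alpha" = strNextLowerCase("-alpha")
'End Sub
End Function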
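Method (6), "bypass arrays, and only process a string if InStr has not seen it before", could be driven by a loop like the one below. This is a sketch, not the code from the attachment. It uses a position-based variant of the scanner (hypothetical name strNextLowerCaseAt) that walks an index through the text instead of re-copying the string with Right(), and a comma-delimited lexicon string (hypothetical function strHarvestLexicon). The spell-check step is left as a comment.

```vba
' Hypothetical variant: returns the next lower-case run at or after lngPos
' and advances lngPos past it. Assumes Option Compare Binary (the default).
Function strNextLowerCaseAt(ByRef strInput As String, ByRef lngPos As Long) As String
    Dim lngLen As Long
    Dim lngStart As Long
    lngLen = Len(strInput)
    If lngPos < 1 Then lngPos = 1

    ' Skip characters that are not lower-case letters.
    Do While lngPos <= lngLen
        If Mid$(strInput, lngPos, 1) Like "[a-z]" Then Exit Do
        lngPos = lngPos + 1
    Loop

    ' Walk to the end of the lower-case run.
    lngStart = lngPos
    Do While lngPos <= lngLen
        If Not (Mid$(strInput, lngPos, 1) Like "[a-z]") Then Exit Do
        lngPos = lngPos + 1
    Loop

    strNextLowerCaseAt = Mid$(strInput, lngStart, lngPos - lngStart)
End Function

' Hypothetical driver: builds a comma-delimited lexicon of distinct strings.
Function strHarvestLexicon(ByVal strText As String) As String
    Dim strLexicon As String
    Dim strWord As String
    Dim lngPos As Long

    strLexicon = ","
    lngPos = 1
    Do While lngPos <= Len(strText)
        strWord = strNextLowerCaseAt(strText, lngPos)
        If Len(strWord) > 0 Then
            ' Only strings not yet in the lexicon need any further work.
            If InStr(1, strLexicon, "," & strWord & ",", vbBinaryCompare) = 0 Then
                ' A spell-check test would go here before accepting the string.
                strLexicon = strLexicon & strWord & ","
            End If
        End If
    Loop
    strHarvestLexicon = strLexicon
End Function
```

Like the function as posted, this scanner returns the lower-case tail of a capitalised word ("Cat" yields "at"), so a later spell-check or an extra rule is needed to weed those out.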

I have inserted these snippets just to let you know that Great Minds continue to Think Alike!

The code is prepared around the concept of MY need for a word: "A WORD is a string that consists of only lower-case alphabetic characters that passes a spell-check test."
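That definition of a WORD can be written down as a pair of small tests. A minimal sketch: the function names blnIsLowerAlpha and blnIsWord are hypothetical, the Like pattern assumes Option Compare Binary (the default), and Application.CheckSpelling is called with its default dictionaries, which may include custom dictionaries.

```vba
' Hypothetical helper: True when strWord is non-empty and contains only a-z.
Function blnIsLowerAlpha(ByVal strWord As String) As Boolean
    blnIsLowerAlpha = (Len(strWord) > 0) And Not (strWord Like "*[!a-z]*")
End Function

' Hypothetical helper: the stated definition of a WORD.
Function blnIsWord(ByVal strWord As String) As Boolean
    ' Only lower-case strings are handed to the slower spell-check.
    If blnIsLowerAlpha(strWord) Then
        blnIsWord = Application.CheckSpelling(strWord)
    End If
End Function
```

Testing the cheap lower-case condition first keeps the number of spell-check calls down.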

Cheers

Chris

BobH · Post by **BobH** » 25 Apr 2021, 17:48

In another lifetime, a WORD was 8 bytes!

ChrisGreaves · Post by **ChrisGreaves** » 25 Apr 2021, 19:23

BobH wrote: ↑
25 Apr 2021, 17:48
In another lifetime, a WORD was 8 bytes!

... and in a world before that, a word was a string of six-bit bytes delimited by word-marks set by an instruction whose mnemonic operation-code was "SW", and whose object code was displayed as ",", which was, I think, a 0-3-8 punched code.

Back then, programmers had control of what programs did, and the operating system was called "Allan", and "Roy", and soon after that, "Frankie" who had a gorgeous smile!

I am, of course, too young to recall memory as a tube of mercury holding thirty-two bits. Now THAT was memory!
Cheers
Chris

macropod · Post by **macropod** » 25 Apr 2021, 21:36

Perhaps you could explain what you are trying to achieve?

ChrisGreaves · Post by **ChrisGreaves** » 26 Apr 2021, 09:00

macropod wrote: ↑
25 Apr 2021, 21:36
Perhaps you could explain what you are trying to achieve?

Hi MacroPod.
"I Process a Document by examining every word in the document" for every DOCument on a hard drive.
Documents have been typed in by me (hence my vocabulary) or have been harvested from other people (typically text copied from a web page or from a PDF file).
I have just realized that although I'd said "The Attached Document", I had failed to attach the document. Corrected now. This document is a Work-In-Progress and will be worked on today, once the rain starts again.

My latest project, based on Kevin Stroud’s The History of English Podcast, is to develop an algorithm to analyze the written word in Modern English, and determine whether or not it is a Loan Word or whether the word has survived from Old English. Kevin Stroud’s work is an audio-book, so he relies on both the spoken and written forms of a word. I rely only on the written form. Therein lies the challenge.

I define a "word" as a string of lower-case letters that passes a machine spell-check.
Lower-case because I will ignore proper nouns and gamble that capitalized leading words of a sentence will turn up mid-sentence elsewhere.

I must be fair to myself in testing my rules for determining if a "word" is Old English or a Loan Word (from French, Latin, Greek etc). Some of my documents contain Australian Aborigine names (Mukinbudin, Warralakin, Warrachupin, Boodarockin) or native American names (Cheektowaga, Tonawanda, Lackawanna) and so on, and while these pass spell-check on my machine, they do so because they are in Custom Dictionaries. I must disable, then enable custom dictionaries correctly. This poses a little problem.
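One possible way to disable and then re-enable the custom dictionaries is to remember their file names, unload them with CustomDictionaries.ClearAll, and add them back afterwards. This is a sketch under assumptions, not a tested recipe for Word 2003: the procedure name RunWithoutCustomDictionaries is hypothetical, and it does not restore which dictionary was the active custom dictionary.

```vba
Sub RunWithoutCustomDictionaries()
    Dim colPaths As New Collection
    Dim dic As Word.Dictionary
    Dim varPath As Variant

    ' Remember the full file name of every loaded custom dictionary.
    For Each dic In Application.CustomDictionaries
        colPaths.Add dic.Path & Application.PathSeparator & dic.Name
    Next dic

    ' Unload them; the dictionary files themselves are not deleted.
    Application.CustomDictionaries.ClearAll

    On Error GoTo Restore
    ' ... run the spell-check pass here, against the main dictionary only ...

Restore:
    ' Reload the dictionaries even if the spell-check pass raised an error.
    For Each varPath In colPaths
        Application.CustomDictionaries.Add CStr(varPath)
    Next varPath
End Sub
```

Reloading in the error handler is deliberate: a crash halfway through should not leave the machine without its custom dictionaries.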

I have found that at least one document on my machine sends <For Each wd in doc.Words> into an infinite loop.
I have found that <doc.Words> is a bit of a time-hog when used repeatedly on larger documents (more than 20,000 words).
I anticipate more hurdles.

If I have correctly analyzed Kevin's ~150 one-hour episodes to date, I will have a machine that classifies English-language words on any machine, and that is the ultimate test.
My first objective towards this goal is to achieve a 90% correct rate for all the English language words on my machine!

Cheers
Chris

macropod · Post by **macropod** » 26 Apr 2021, 21:49

It seems to me, then, that what you're really after is a simple word list, which could be created from VBA code like that in the attached document. The code adds all content from documents in the chosen folder to the active document, then processes the resulting content to remove all capitalised words, numbers, etc. It then deletes all common words and processes the remainder to generate an output word list which is spell-checked and any spelling errors removed.

ChrisGreaves · Post by **ChrisGreaves** » 28 Apr 2021, 10:32

PJ_in_FL wrote: ↑
21 Apr 2021, 16:01
Why don't you limit your loop to doc.Words.Count

| Method | Word source | Step 1 | Step 2 | Output | Share of run time |
|---|---|---|---|---|---|
| (01) | wd in doc.Words | Store in Array | Test array | To Lexicon | 7% |
| (02) | wd in doc.Words | Test word | | To Lexicon | 7% |
| (03) | lng to doc.Words.Count | Store in Array | Test array | To Lexicon | 28% |
| (04) | lng to doc.Words.Count | Test word | | To Lexicon | 28% |
| (05) | strSplitAt | Store in Array | Test array | To Lexicon | 6% |
| (06) | strSplitAt | Test word | | To Lexicon | 6% |
| (07) | strNextLowerCase | Store in Array | Test array | To Lexicon | 4% |
| (08) | strNextLowerCase | Test word | | To Lexicon | 8% |
| (09) | Split() to array | | Test array | To Lexicon | 7% |

Hello again PJ.
I am not done yet, but I have some data.

I set up nine methods: four pairs of methods, plus a ninth using the intrinsic "Split()" function.
Within each pair, one method (odd-numbered) stores each word string in an array as it is found and then performs the analysis on the array, whereas the other (even-numbered) updates the lexicon on-the-fly as each word is found. Obviously the "array" member of the pair needs more processing time, right? It must pass words INTO an array, and then process the words FROM the array towards the lexicon. (But see below, "7/8".)

The ninth method "Split()" does not provide the option for on-the-fly processing because Split() always produces an array. Nonetheless, Split() is super-fast because it is an intrinsic function.

Both doc.Words.Count methods (on-the-fly and via array) use about four times as much time as the others, but (a) as you have shown, the Words.Count does get me over that weird document that sent me into an endless loop when I used "For each wd in doc.Words" and (b) My test this morning was run on a test document of only 2,500 words (Words04.doc too large to attach so available at www.chrisgreaves.com/Downloads/20210428_0759.zip ).

Tonight I shall rerun the timing test with a 50,000 word document. I should as well dig up that endless-loop document and see how my nine methods cope with that.

I think that methods 7/8 differ in timing because the procedure "strNextLowerCase()" is designed to collect ONLY lower-case-only strings, and this reduces the number/quantity of collected strings to be passed into and from arrays.

Given that I want the harvester to work for a population of unknown users, that is collections of documents other than my own, a foolproof method is essential, so for now doc.Words.Count is still a candidate!

More later
Chris

ChrisGreaves · Post by **ChrisGreaves** » 28 Apr 2021, 10:49

LisaGreen wrote: ↑
25 Apr 2021, 05:36
I wonder if counting something else indicative of word boundaries would be faster.

Hello again Lisa. Please see my response to PJ above, but especially the upgraded function "strNextLowerCase" in the zip file.
Cheers
Chris

ChrisGreaves · Post by **ChrisGreaves** » 29 Apr 2021, 11:14

ChrisGreaves wrote: ↑
28 Apr 2021, 10:32

| # | Word source | Step 1 | Step 2 | Output | 28 Apr share | Method | Record | Days | Hours | Minutes | Share |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | wd in doc.Words | Store in Array | Test array | To Lexicon | 7% | Method01 | 00000.000556 | 0.000556 | 0.013344 | 0.80064 | 1.410% |
| 2 | wd in doc.Words | Test word | | To Lexicon | 7% | Method02 | 00000.000567 | 0.000567 | 0.013608 | 0.81648 | 1.438% |
| 3 | lng to doc.Words.Count | Store in Array | Test array | To Lexicon | 28% | Method03 | 00000.017905 | 0.017905 | 0.42972 | 25.7832 | 45.419% |
| 4 | lng to doc.Words.Count | Test word | | To Lexicon | 28% | Method04 | 00000.018067 | 0.018067 | 0.433608 | 26.01648 | 45.830% |
| 5 | strSplitAt | Store in Array | Test array | To Lexicon | 6% | Method05 | 00000.000498 | 0.000498 | 0.011952 | 0.71712 | 1.263% |
| 6 | strSplitAt | Test word | | To Lexicon | 6% | Method06 | 00000.000463 | 0.000463 | 0.011112 | 0.66672 | 1.174% |
| 7 | strNextLowerCase | Store in Array | Test array | To Lexicon | 4% | Method07 | 00000.000336 | 0.000336 | 0.008064 | 0.48384 | 0.852% |
| 8 | strNextLowerCase | Test word | | To Lexicon | 8% | Method08 | 00000.000544 | 0.000544 | 0.013056 | 0.78336 | 1.380% |
| 9 | Split() to array | | Test array | To Lexicon | 7% | Method09 | 00000.000486 | 0.000486 | 0.011664 | 0.69984 | 1.233% |
| Total | | | | | | | | | | 56.76768 | |

Here is last night's run on a 26,00 word document. I had to switch to NOW() from Timer because Timer runs only from Midnight etc etc.
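Timing with Now() rather than Timer might be sketched as follows. A Date value counts days, so the difference between two readings is in days, and multiplying by 24 and then by 60 gives the hours and minutes columns of the table. The names dblElapsedMinutes and TimeOneMethod are hypothetical, and Now() only resolves to whole seconds, so this suits long runs rather than quick ones.

```vba
' Hypothetical helper: elapsed minutes between two Now() readings.
Function dblElapsedMinutes(ByVal dtStart As Date, ByVal dtEnd As Date) As Double
    ' Days -> hours (x 24) -> minutes (x 60); safe across midnight.
    dblElapsedMinutes = (CDbl(dtEnd) - CDbl(dtStart)) * 24# * 60#
End Function

Sub TimeOneMethod()
    Dim dtStart As Date
    dtStart = Now
    ' ... call the method being timed here ...
    Debug.Print "Elapsed minutes: "; dblElapsedMinutes(dtStart, Now)
End Sub
```

As a check against the table, 0.000556 days works out to 0.80064 minutes, the figure shown for Method01.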

You will note that methods (3) and (4) are expensive, BUT THEY WORK, whereas all the others are cheap, BUT (1) and (2) have been shown to fail (endless loop) in at least one document on my hard drive.

My conclusion: doc.Words.Count (45% each of the two tests) is prohibitively expensive; I can't afford to use it.
The meanness arises because this time last week, doc.Words was the only method to hand that could cope with every document presented to date.

I will now start processing the entire drive (14,000 documents) again to see which of the remaining seven methods can process every document.

(minutes later): The seven remaining methods run flawlessly (that is, do not go into an endless loop), so based on the timings above I shall rerun on drive T: (14,000 documents) tonight using method 7 "strNextLowerCase" and see what crops up next.

Cheers
Chris

Eileen's Lounge
