Word2003 poser: when do I run out of words?

ChrisGreaves
PlutoniumLounger
Posts: 15498
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Post by ChrisGreaves »

Code: Select all

Dim doc As Document
Set doc = ThisDocument
Dim wd As Range
Dim strCr() As String
ReDim strCr(0)
For Each wd In doc.Words
    If UBound(strCr) > 100 Then Stop
    wd.Select
    Debug.Print wd
    strCr(UBound(strCr)) = wd
    ReDim Preserve strCr(UBound(strCr) + 1)
Next wd
The attached document contains a stripped-down version of a procedure that was working fine until 11:53 this morning. I Process a Document by examining every word in the document. Hence "For Each wd In doc.Words".

The attached document contains a table, and only that table.
The document contains no text beyond that table, and when I Ctrl-End and tap Enter, I am given an extra row to the table.

I tried Table > Split Table, added a regular paragraph of text, and then deleted the second (split-off) part of the table.

I have processed 620 documents to date, and I believe that many of them hold tables.

I can fudge a solution by checking whether the selection point or wd.Start has not changed, but am puzzled as to why MSWord2003 will suddenly go into an endless loop on such a basic piece of code.
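
The "fudge" described above might look something like the following. This is only a sketch: the names blnHasAdvanced and WalkWordsWithGuard are made up for illustration, and it assumes that the stuck loop keeps handing back a range whose Start does not move forward, which the post suggests but does not confirm.

```vba
' Sketch only: names are illustrative, not from the original procedure.
' True when the current word starts after the previous one.
Function blnHasAdvanced(ByVal lngStart As Long, ByVal lngLastStart As Long) As Boolean
    blnHasAdvanced = (lngStart > lngLastStart)
End Function

Sub WalkWordsWithGuard()
    Dim wd As Range
    Dim lngLastStart As Long
    lngLastStart = -1 ' the first word starts at position 0
    For Each wd In ActiveDocument.Words
        If Not blnHasAdvanced(wd.Start, lngLastStart) Then
            ' Assumption: no forward movement means the loop is stuck.
            Debug.Print "No progress at position " & wd.Start
            Exit For
        End If
        lngLastStart = wd.Start
        ' ... process wd.Text here ...
    Next wd
End Sub
```

Note that the guard only stops the loop; any words after the sticking point would still go unprocessed.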

I suspect that the answers will lie in the neighbourhood of the mandatory empty paragraph at the end of each document.

Thanks for any pointers.
Chris
An expensive day out: Wallet and Grimace

HansV
Administrator
Posts: 78236
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: Word2003 poser: when do I run out of words?

Post by HansV »

When I open your document in Word 2019 and run the macro, it finishes normally.
By the way, why do you select each word?
Best wishes,
Hans

PJ_in_FL
5StarLounger
Posts: 1090
Joined: 21 Jan 2011, 16:51
Location: Florida

Re: Word2003 poser: when do I run out of words?

Post by PJ_in_FL »

Why don't you limit your loop to

Code: Select all

doc.Words.Count
PJ in (usually sunny) FL

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

HansV wrote:
21 Apr 2021, 15:24
When I open your document in Word 2019 and run the macro, it finishes normally.
No, Hans. I'm not moving on until I have mastered Word2003 :evilgrin:
By the way, why do you select each word?
Good question.
The original code has no "If UBound(strCr) > 100 Then Stop", " wd.Select" or "Debug.Print wd".
I pushed them in there while I was trying to track down where Word2003 thought it was, and why it thought it was doing what it was doing.

If the code runs for you in Word2019 that suggests that this might be a bug with the unique distinction of having been fixed!

I confess to being puzzled that I haven't stumbled across this before in a quarter-century of processing strings in Word6/2003.

Thanks
Chris

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

PJ_in_FL wrote:
21 Apr 2021, 16:01
Why don't you limit your loop to

Code: Select all

doc.Words.Count

Code: Select all

    Dim lng As Long
    For lng = 1 To doc.Words.Count
''''        If UBound(strCr) > 100 Then Stop
''''        wd.Select
''''        Debug.Print wd
        Set wd = doc.Words(lng)
        strCr(UBound(strCr)) = wd
        ReDim Preserve strCr(UBound(strCr) + 1)
    Next lng
Hi PJ.
I can easily answer your question right now: It's because I am not as smart as I think I am.
HTH :cheers: :chocciebar: :clapping: :thankyou:
Chris

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

PJ_in_FL wrote:
21 Apr 2021, 16:01
Why don't you limit your loop to

Code: Select all

doc.Words.Count
Hi PJ.
I am now back at my data drive root folder, so a long run.
I have an impression that the loop using Words.Count is taking about five times as long as the Each loop, but that is just an impression, and may be indicative of my stomach telling me that we are at supper time.
Time is not critical on this sort of job - if it takes a week, it can run overnight all week.
Once I have waded through the data drive I might go back and do a couple of timing runs.
Cheers
Chris

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

PJ_in_FL wrote:
21 Apr 2021, 16:01
Why don't you limit your loop to

Code: Select all

doc.Words.Count
On The Other Hand, there are certain coding protocols that one should adopt, especially with documents containing many (e.g. 20,000) words.

Code: Select all

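        Dim lngWordsCount As Long
        lngWordsCount = doc.Words.Count ' read the count once, before the loop
        Dim strCr() As String
        ReDim strCr(lngWordsCount) ' size the array once instead of ReDim Preserve per word
        Dim lng As Long
        For lng = 1 To lngWordsCount
            Application.Caption = lng & "/" & lngWordsCount ' show progress in the title bar
            DoEvents ' let Word repaint and respond
            strCr(lng - 1) = doc.Words(lng) ' array is zero-based, Words is one-based
        Next lng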
Things went a LOT faster once I decided to make a call to doc.Words.Count just once, instead of twice each time through a loop of 20,000 words.
(It'll dawn on you about ten seconds after you read this!)
Thanks to lngWordsCount , I now have my life back.

(One hour later): STILL too slow, now bogged down on "doc.Words(lng)".
I have yet to do timing runs, and back when I was using "For Each wd In doc.Words" I was not working on such a lengthy document.
I now suspect that "doc.Words" is the thing that soaks up more time than it is worth.
Good thing that I run this thing overnight!

(One half-hour later): More thoughts. I used to grab the document contents as a text string, and then parse it, word by word, with my strSplitAt() function, written before the days of the Split() function (which loads words to an array). That will be something else to test in timing runs.
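
For comparison, here is a rough sketch of the "grab the text, then Split()" idea mentioned above. It is not the strSplitAt() function from the post, which is not shown; varWordsFromText is a made-up name, and the list of separator characters (paragraph marks, tabs, the Chr(7) end-of-cell marker and so on) is an assumption about what should count as a word boundary.

```vba
' Sketch only: varWordsFromText is an illustrative name, not from the post.
' Turns a block of document text into a zero-based array of space-separated strings.
Function varWordsFromText(ByVal strText As String) As Variant
    Dim varSep As Variant
    ' Assumed separators: paragraph mark, line feed, tab, cell marker,
    ' manual line break, page break. All become plain spaces.
    For Each varSep In Array(vbCr, vbLf, vbTab, Chr(7), Chr(11), Chr(12))
        strText = Replace(strText, varSep, " ")
    Next varSep
    ' Collapse runs of spaces so that Split() returns no empty items.
    Do While InStr(strText, "  ") > 0
        strText = Replace(strText, "  ", " ")
    Loop
    varWordsFromText = Split(Trim$(strText), " ")
End Function
```

It could be fed with something like varWordsFromText(ActiveDocument.Content.Text), which reads the document text in a single call rather than once per word.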


Cheers
Chris

LisaGreen
5StarLounger
Posts: 964
Joined: 08 Nov 2012, 17:54

Re: Word2003 poser: when do I run out of words?

Post by LisaGreen »

Chris,

I wonder if counting something else indicative of word boundaries would be faster.

My experience is that the shorter the item being examined the quicker the code, and also if the item is the same.

How about counting spaces?
Or... a standard method of counting the occurrences of a string within a string is to compare the length of the string before and after removing the substring.
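
A minimal sketch of that length-comparison trick, using the intrinsic Replace() function (lngCountOccurrences is a made-up name for illustration):

```vba
' Sketch only: lngCountOccurrences is an illustrative name.
' Counts non-overlapping occurrences of strFind in strText by measuring
' how much shorter the text becomes once every occurrence is removed.
Function lngCountOccurrences(ByVal strText As String, ByVal strFind As String) As Long
    If Len(strFind) = 0 Then Exit Function ' nothing to look for: return 0
    lngCountOccurrences = (Len(strText) - Len(Replace(strText, strFind, ""))) \ Len(strFind)
End Function
```

Counting spaces this way gives a quick estimate of the number of words, for example lngCountOccurrences("the cat sat on the mat", " ") + 1.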

Lisa

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

LisaGreen wrote:
25 Apr 2021, 05:36
I wonder if counting something else indicative of word boundaries would be faster ...
Hi Lisa, I agree with you.
Indeed for years a mantra "there is always a better way" has been my guide.
I have attached a document, (LATER: just realised that the document had not been attached!) a Work-In-Progress because it is ten o'clock, the sun is shining, we are destined to reach 12c and I must prepare the potato and artichoke beds!

(6) Splits the text string by looking for a lower-case letter that follows a non-lower-case letter, then collecting all the lower-case letters up to the next non-lower-case letter, but bypasses arrays. Instead, for each string found: if InStr() does not find it, spell-check it.

Code: Select all

' Assumed definition: the post uses strcLowerAlpha but does not show it.
Const strcLowerAlpha As String = "abcdefghijklmnopqrstuvwxyz"

' Returns the next run of lower-case letters found in strInput.
' strInput is passed ByRef (the VBA default), so the scanned text is consumed.
Function strNextLowerCase(strInput As String) As String
    While (Len(strInput) > 0) And (InStr(1, strcLowerAlpha, Left(strInput, 1)) = 0) ' still no LC
        strInput = Right(strInput, Len(strInput) - 1)
    Wend
    While (Len(strInput) > 0) And (InStr(1, strcLowerAlpha, Left(strInput, 1)) > 0) ' still in LC
        strNextLowerCase = strNextLowerCase & Left(strInput, 1)
        strInput = Right(strInput, Len(strInput) - 1)
    Wend
'Sub TESTstrNextLowerCase()
'    Debug.Assert "" = strNextLowerCase("")
'    Debug.Assert "" = strNextLowerCase("7")
'    Debug.Assert "a" = strNextLowerCase("a")
'    Debug.Assert "a" = strNextLowerCase("a7")
'    Debug.Assert "a" = strNextLowerCase("7a")
'    Debug.Assert "alpha" = strNextLowerCase("alpha")
'    Debug.Assert "alpha" = strNextLowerCase("-alpha")
'End Sub
End Function
I have inserted these snippets just to let you know that Great Minds continue to Think Alike!

The code is prepared around the concept of MY need for a word: "A WORD is a string that consists of only lower-case alphabetic characters that passes a spell-check test."
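
That definition of a WORD could be sketched as a pair of small functions. The names blnIsLowerAlpha and blnIsWord are made up for illustration; the spell-check step uses Word's Application.CheckSpelling with its default dictionaries, which is an assumption about how the test would be run.

```vba
' Sketch only: function names are illustrative, not from the post.
' True when strWord is non-empty and holds only the letters a to z.
Function blnIsLowerAlpha(ByVal strWord As String) As Boolean
    Dim lng As Long
    If Len(strWord) = 0 Then Exit Function ' empty string: return False
    For lng = 1 To Len(strWord)
        Select Case AscW(Mid$(strWord, lng, 1))
            Case 97 To 122 ' "a" to "z": keep going
            Case Else
                Exit Function ' anything else: return False
        End Select
    Next lng
    blnIsLowerAlpha = True
End Function

' True when strWord is all lower-case letters AND passes Word's spell-check.
Function blnIsWord(ByVal strWord As String) As Boolean
    If blnIsLowerAlpha(strWord) Then
        blnIsWord = Application.CheckSpelling(strWord)
    End If
End Function
```

Testing the cheap character check first means the spell-checker is only called for strings that could qualify.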

Cheers :thankyou:
Chris
Last edited by ChrisGreaves on 26 Apr 2021, 08:46, edited 2 times in total.

BobH
UraniumLounger
Posts: 9215
Joined: 13 Feb 2010, 01:27
Location: Deep in the Heart of Texas

Re: Word2003 poser: when do I run out of words?

Post by BobH »

In another lifetime, a WORD was 8 bytes! :flee:
Bob's yer Uncle
(1/2)(1+√5)
Intel Core i5, 3570K, 3.40 GHz, 16 GB RAM, ECS Z77 H2-A3 Mobo, Windows 10 >HPE 64-bit, MS Office 2016

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

BobH wrote:
25 Apr 2021, 17:48
In another lifetime, a WORD was 8 bytes! :flee:
... and in a world before that, a word was a string of six-bit bytes delimited by word-marks set by an instruction whose mnemonic operation-code was "SW", and whose object code was displayed as ",", which was, I think, a 0-3-8 punched code.

Back then, programmers had control of what programs did, and the operating system was called "Allan", and "Roy", and soon after that, "Frankie" who had a gorgeous smile!

I am, of course, too young to recall memory as a tube of mercury holding thirty-two bits. Now THAT was memory!
Cheers
Chris

macropod
4StarLounger
Posts: 508
Joined: 17 Dec 2010, 03:14

Re: Word2003 poser: when do I run out of words?

Post by macropod »

Perhaps you could explain what you are trying to achieve?
Paul Edstein
[Fmr MS MVP - Word]

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

macropod wrote:
25 Apr 2021, 21:36
Perhaps you could explain what you are trying to achieve?
Hi MacroPod.
"I Process a Document by examining every word in the document" for every DOCument on a hard drive.
Documents have been typed in by me (hence my vocabulary) or have been harvested from other people (typically text copied from a web page or from a PDF file).
I have just realized that although I'd said "The Attached Document", I had failed to attach the document. Corrected now. This document is a Work-In-Progress and will be worked on today, once the rain starts again. :sad:

My latest project, based on Kevin Stroud’s The History of English Podcast, is to develop an algorithm to analyze the written word in Modern English, and determine whether it is a Loan Word or has survived from Old English. Kevin Stroud’s work is an audio-book, so he relies on both the spoken and written forms of a word. I rely only on the written form. Therein lies the challenge.

I define a "word" as a string of lower-case letters that passes a machine spell-check.
Lower-case because I will ignore proper nouns and gamble that capitalized leading words of a sentence will turn up mid-sentence elsewhere.

I must be fair to myself in testing my rules for determining if a "word" is Old English or a Loan Word (from French, Latin, Greek etc). Some of my documents contain Australian Aborigine names (Mukinbudin, Warralakin, Warrachupin, Boodarockin) or native American names (Cheektowaga, Tonawonda, Lackawanna) and so on, and while these pass spell-check on my machine, they do so because they are in Custom Dictionaries. I must disable, then enable custom dictionaries correctly. This poses a little problem.

I have found that at least one document on my machine sends <For Each wd in doc.Words> into an infinite loop.
I have found that <doc.Words> is a bit of a time-hog when used repeatedly on larger documents (more than 20,000 words).
I anticipate more hurdles.

If I have correctly analyzed Kevin's ~150 one-hour episodes to date, I will have a machine that classifies English-language words on any machine, and that is the ultimate test.
My first objective towards this goal is to achieve a 90% correct rate for all the English language words on my machine!

Cheers
Chris

Re: Word2003 poser: when do I run out of words?

Post by macropod »

It seems to me, then, that what you're really after is a simple word list, which could be created from VBA code like that in the attached document. The code adds all content from documents in the chosen folder to the active document, then processes the resulting content to remove all capitalised words, numbers, etc. It then deletes all common words and processes the remainder to generate an output word list which is spell-checked and any spelling errors removed.
Paul Edstein
[Fmr MS MVP - Word]

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

PJ_in_FL wrote:
21 Apr 2021, 16:01
Why don't you limit your loop to doc.Words.Count
Method  Word source             Storage         Test        Destination  Share of time
(01)    wd in doc.Words         Store in Array  Test array  To Lexicon   7%
(02)    wd in doc.Words         -               Test word   To Lexicon   7%
(03)    lng to doc.Words.Count  Store in Array  Test array  To Lexicon   28%
(04)    lng to doc.Words.Count  -               Test word   To Lexicon   28%
(05)    strSplitAt              Store in Array  Test array  To Lexicon   6%
(06)    strSplitAt              -               Test word   To Lexicon   6%
(07)    strNextLowerCase        Store in Array  Test array  To Lexicon   4%
(08)    strNextLowerCase        -               Test word   To Lexicon   8%
(09)    Split() to array        -               Test array  To Lexicon   7%
Hello again PJ.
I am not done yet, but I have some data.

I set up nine methods, four pairs of methods plus the ninth using the intrinsic "Split()" function.
Each pair of tests builds a lexicon from the word strings as they are found. One member of the pair updates the lexicon on the fly, as each word is found; the other stores each word in an array first and then performs the analysis on the array of terms. Obviously the "array" member of the pair needs more processing time, right? It must pass words INTO an array, and then process the words FROM the array towards the lexicon (but see "7/8" below).
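
To make the two members of a pair concrete, here is a rough sketch. Everything in it is hypothetical: the lexicon is modelled as a "|"-delimited string checked with InStr(), words are taken to be space-separated, and the names AddToLexicon, strLexiconOnTheFly and strLexiconViaArray are made up. The real harvester's lexicon and word tests are not shown in the thread.

```vba
' Sketch only. Hypothetical lexicon: a string such as "|alpha|beta|".
Sub AddToLexicon(ByRef strLexicon As String, ByVal strWord As String)
    If Len(strLexicon) = 0 Then strLexicon = "|"
    If InStr(1, strLexicon, "|" & strWord & "|") = 0 Then ' not seen before
        strLexicon = strLexicon & strWord & "|"
    End If
End Sub

' On-the-fly member: each word goes to the lexicon as soon as it is found.
Function strLexiconOnTheFly(ByVal strText As String) As String
    Dim strLex As String, lngPos As Long, lngNext As Long
    lngPos = 1
    Do While lngPos <= Len(strText)
        lngNext = InStr(lngPos, strText, " ")
        If lngNext = 0 Then lngNext = Len(strText) + 1 ' last word
        If lngNext > lngPos Then
            AddToLexicon strLex, Mid$(strText, lngPos, lngNext - lngPos)
        End If
        lngPos = lngNext + 1
    Loop
    strLexiconOnTheFly = strLex
End Function

' Array member: store every word first, then process the array.
Function strLexiconViaArray(ByVal strText As String) As String
    Dim strLex As String, varWords As Variant, lng As Long
    varWords = Split(strText, " ")
    For lng = LBound(varWords) To UBound(varWords)
        If Len(varWords(lng)) > 0 Then AddToLexicon strLex, CStr(varWords(lng))
    Next lng
    strLexiconViaArray = strLex
End Function
```

Both routes end with the same lexicon; the difference is only whether the words make a stop in an array on the way.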

The ninth method "Split()" does not provide the option for on-the-fly processing because Split() always produces an array. Nonetheless, Split() is super-fast because it is an intrinsic function.

Both doc.Words.Count methods (on-the-fly and via array) use about four times as much time as the others, but (a) as you have shown, the Words.Count does get me over that weird document that sent me into an endless loop when I used "For each wd in doc.Words" and (b) My test this morning was run on a test document of only 2,500 words (Words04.doc too large to attach so available at www.chrisgreaves.com/Downloads/20210428_0759.zip ).

Tonight I shall rerun the timing test with a 50,000 word document. I should as well dig up that endless-loop document and see how my nine methods cope with that.

I think that methods 7/8 differ in timing because the procedure "strNextLowerCase()" is designed to collect ONLY lower-case-only strings, and this reduces the number/quantity of collected strings to be passed into and from arrays.

Given that I want the harvester to work for a population of unknown users, that is collections of documents other than my own, a foolproof method is essential, so for now doc.Words.Count is still a candidate!

More later
Chris

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

LisaGreen wrote:
25 Apr 2021, 05:36
I wonder if counting something else indicative of word boundaries would be faster.
Hello again Lisa. Please see my response to PJ above, but especially the upgraded function "strNextLowerCase" in the zip file.
Cheers
Chris

Re: Word2003 poser: when do I run out of words?

Post by ChrisGreaves »

ChrisGreaves wrote:
28 Apr 2021, 10:32
#  Word source             Steps                                   28 Apr  Label     (blank)  record    days      hours     minutes   % of total
1  wd in doc.Words         Store in Array, Test array, To Lexicon  7%      Method01  0000     0.000556  0.000556  0.013344  0.80064   1.410%
2  wd in doc.Words         Test word, To Lexicon                   7%      Method02  0000     0.000567  0.000567  0.013608  0.81648   1.438%
3  lng to doc.Words.Count  Store in Array, Test array, To Lexicon  28%     Method03  0000     0.017905  0.017905  0.42972   25.7832   45.419%
4  lng to doc.Words.Count  Test word, To Lexicon                   28%     Method04  0000     0.018067  0.018067  0.433608  26.01648  45.830%
5  strSplitAt              Store in Array, Test array, To Lexicon  6%      Method05  0000     0.000498  0.000498  0.011952  0.71712   1.263%
6  strSplitAt              Test word, To Lexicon                   6%      Method06  0000     0.000463  0.000463  0.011112  0.66672   1.174%
7  strNextLowerCase        Store in Array, Test array, To Lexicon  4%      Method07  0000     0.000336  0.000336  0.008064  0.48384   0.852%
8  strNextLowerCase        Test word, To Lexicon                   8%      Method08  0000     0.000544  0.000544  0.013056  0.78336   1.380%
9  Split() to array        Test array, To Lexicon                  7%      Method09  0000     0.000486  0.000486  0.011664  0.69984   1.233%
Total minutes: 56.76768
Here is last night's run on a 26,00 word document. I had to switch to NOW() from Timer because Timer runs only from Midnight etc etc.
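
For anyone repeating the timing runs, here is a small sketch of the arithmetic involved. The function names are made up. A difference between two Now() values is a fraction of a day, so it is scaled by 24 for hours and by 24 x 60 for minutes; Timer counts seconds since midnight, so a run that crosses midnight shows a negative difference unless a day's worth of seconds is added back (this assumes the run lasts less than 24 hours).

```vba
' Sketch only: function names are illustrative.
' Converts a Now() difference (in days) to minutes.
Function dblDaysToMinutes(ByVal dblDays As Double) As Double
    dblDaysToMinutes = dblDays * 24# * 60#
End Function

' Elapsed seconds between two Timer readings, allowing for one midnight rollover.
Function dblTimerElapsed(ByVal sngStart As Single, ByVal sngEnd As Single) As Double
    dblTimerElapsed = CDbl(sngEnd) - CDbl(sngStart)
    If dblTimerElapsed < 0 Then dblTimerElapsed = dblTimerElapsed + 86400# ' seconds per day
End Function
```

A run would be bracketed with something like dtStart = Now before the method and Debug.Print dblDaysToMinutes(CDbl(Now - dtStart)) after it.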

You will note that methods (3) and (4) are expensive, BUT THEY WORK, whereas all the others are cheap, BUT (1) and (2) have been shown to fail (endless loop) in at least one document on my hard drive.

My conclusion: doc.Words.Count (45% each of the two tests) is prohibitively expensive; I can't afford to use it.
The meanness arises because this time last week, doc.Words was the only method to hand that could cope with every document presented to date.

I will now start processing the entire drive (14,000 documents) again to see which of the remaining seven methods can process every document.

(minutes later): The seven remaining methods run flawlessly (that is, do not go into an endless loop), so based on the timings above I shall rerun on drive T: (14,000 documents) tonight using method 7 "strNextLowerCase" and see what crops up next.

Cheers
Chris