macropod wrote: ↑25 Apr 2021, 21:36
Perhaps you could explain what you are trying to achieve?
Hi MacroPod.
"I Process a Document by examining every word in the document" for every DOCument on a hard drive.
Documents have been typed in by me (hence my vocabulary) or have been harvested from other people (typically text copied from a web page or from a PDF file).
I have just realized that although I'd said "The Attached Document", I had failed to attach the document.
Corrected now. This document is a work in progress and will be worked on today, once the rain starts again.
My latest project, based on Kevin Stroud’s The History of English Podcast, is to develop an algorithm that analyzes written words in Modern English and determines whether each one is a loan word or has survived from Old English. Kevin Stroud’s work is audio, so he relies on both the spoken and written forms of a word. I rely only on the written form. Therein lies the challenge.
I define a "word" as a string of lower-case letters that passes a machine spell-check.
Lower-case because I will ignore proper nouns and gamble that a word capitalized at the start of a sentence will also turn up mid-sentence elsewhere.
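A minimal sketch of that definition in VBA, to make it concrete. The function name IsCandidateWord is my own invention for illustration; the only Word call it relies on is Application.CheckSpelling. It assumes the module uses binary comparison, so that [a-z] in a Like pattern excludes capitals.

```vba
Option Explicit
Option Compare Binary   ' case-sensitive Like, so [a-z] does not match A-Z

' Hypothetical helper: True when s consists only of the letters a-z
' and Word's spell-checker accepts it.
Function IsCandidateWord(ByVal s As String) As Boolean
    ' An empty string is not a word
    If Len(s) = 0 Then Exit Function

    ' Reject anything containing a character outside a-z
    ' (capitals, digits, apostrophes, hyphens, accented letters)
    If s Like "*[!a-z]*" Then Exit Function

    ' CheckSpelling returns True when no spelling error is found
    IsCandidateWord = Application.CheckSpelling(Word:=s)
End Function
```

With this test, "cheese" would be a candidate but "Cheese", "don't" and "co-operate" would be skipped. Whether to skip or split hyphenated forms is a choice I have not settled on.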
I must be fair to myself in testing my rules for determining if a "word" is Old English or a loan word (from French, Latin, Greek etc.). Some of my documents contain Australian Aboriginal names (Mukinbudin, Warralakin, Warrachupin, Boodarockin) or Native American names (Cheektowaga, Tonawanda, Lackawanna) and so on, and while these pass spell-check on my machine, they do so only because they are in custom dictionaries. I must disable the custom dictionaries while testing, then re-enable them correctly afterwards. This poses a little problem.
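What I have in mind for the custom dictionaries is roughly the following. This is a sketch only: ClassifyWords is a placeholder for my own processing, and I am assuming that unloading the custom dictionaries is enough to stop their entries passing spell-check, which I still need to confirm on my version of Word. It also does not restore which dictionary was the active custom dictionary.

```vba
Option Explicit

' Placeholder for the real word-classification pass (hypothetical).
Sub ClassifyWords()
End Sub

' Unload all custom dictionaries, run the pass, then reload them.
Sub RunWithoutCustomDictionaries()
    Dim saved As New Collection
    Dim d As Word.Dictionary
    Dim p As Variant
    Dim errNum As Long, errText As String

    ' Remember the full file name of every loaded custom dictionary
    For Each d In Application.CustomDictionaries
        saved.Add d.Path & Application.PathSeparator & d.Name
    Next d

    ' Unload them all, so that only the main dictionary is consulted
    Application.CustomDictionaries.ClearAll

    ' Run the pass, trapping any error so the reload below always happens
    On Error Resume Next
    ClassifyWords
    errNum = Err.Number
    errText = Err.Description
    On Error GoTo 0

    ' Reload the dictionaries that were unloaded
    For Each p In saved
        Application.CustomDictionaries.Add FileName:=CStr(p)
    Next p

    ' Re-raise whatever went wrong during the pass
    If errNum <> 0 Then Err.Raise errNum, , errText
End Sub
```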
I have found that at least one document on my machine sends <For Each wd in doc.Words> into an infinite loop.
I have found that <doc.Words> is a bit of a time-hog when used repeatedly on larger documents (more than 20,000 words).
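One workaround I am considering for both problems is to read the document text once into a string and pick the words out myself, instead of walking doc.Words. A rough sketch follows; the function name LowerCaseTokenCounts is made up for illustration, and it treats anything other than A-Z and a-z as a separator, so an accented word would be split. That is a simplification, not a decision.

```vba
Option Explicit

' Hypothetical helper: returns a Scripting.Dictionary mapping each
' all-lower-case token in the main text of doc to its count.
Function LowerCaseTokenCounts(ByVal doc As Document) As Object
    Dim seen As Object
    Dim txt As String, tok As String
    Dim i As Long, code As Long
    Dim hasUpper As Boolean

    Set seen = CreateObject("Scripting.Dictionary")

    ' One call into Word; the trailing space flushes the last token
    txt = doc.Content.Text & " "

    For i = 1 To Len(txt)
        code = AscW(Mid$(txt, i, 1))
        Select Case code
            Case 97 To 122              ' a-z
                tok = tok & ChrW$(code)
            Case 65 To 90               ' A-Z: remember, so the token is dropped
                tok = tok & ChrW$(code)
                hasUpper = True
            Case Else                   ' separator: close the current token
                If Len(tok) > 0 And Not hasUpper Then
                    ' Reading a missing key yields Empty, and Empty + 1 = 1
                    seen(tok) = seen(tok) + 1
                End If
                tok = vbNullString
                hasUpper = False
        End Select
    Next i

    Set LowerCaseTokenCounts = seen
End Function
```

The loop is bounded by the length of the string, so it has to finish, and each distinct word then needs to be spell-checked only once per document however often it occurs. I have not yet measured how much time this saves on the 20,000-word documents.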
I anticipate more hurdles.
If I have correctly analyzed Kevin's ~150 one-hour episodes to date, I will have a program that classifies English-language words on any machine, and that is the ultimate test.
My first objective towards this goal is to achieve a 90% correct rate for all the English-language words on my machine!
Cheers
Chris