What is a word? Project Gutenberg Dictionary
Is it anything defined in a dictionary? Which dictionary? What about slang? Or technical terms? The answer, as always, is “it depends”. Why do you care whether a certain string is a word? The answer to this question dictates how a word should be defined. I asked myself this question while playing an anagram-finding word game – I found it frustrating that some strings I thought were “words” weren’t accepted and other strings I didn’t recognize were correct. I came up with a few ideas for answering this question:

- A word is any string defined in a dictionary.
- A word is any string that appears in Wikipedia.
- A word is any string that appears in a book.
The first idea is rather flawed. Dictionaries define all sorts of strings that no one ever uses, and these strings should not be answers to anagram puzzles. I postulated that Wikipedia might be a more representative sample of words that people actually use to communicate, but it turns out that Wikipedia has tons of proper nouns (names, places, acronyms) that would be hard to filter out. Of these problems, the acronyms are the most egregious (you really don’t want strings like ALSA, Advanced Linux Sound Architecture, to be words in an anagram game), so I proposed the “book” definition. This definition still has problems, but it probably won’t admit many acronyms.
This wouldn’t be much of a project if I just philosophized for a couple of paragraphs about what strings are words. After proposing the “book” definition I walked to the library and started indexing …. OK, maybe that isn’t practical. After proposing the “book” definition I looked up Project Gutenberg and found that it is easy to download a large number of its books in text form. In particular, I downloaded the August 2010 collection of roughly 30,000 books (~8 gigabytes compressed). I formalized my definition of a word as “a string that appears in more than k unique books” (for some value of k to be determined) and started processing.
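The definition itself is simple enough to state as code. Here’s a minimal sketch (the function name and data layout are my own for illustration, not part of the actual pipeline, which works on files):

```python
def is_word(s, book_word_sets, k):
    """Return True if s appears in more than k of the per-book word sets.

    book_word_sets: an iterable of sets, one set of words per book.
    """
    return sum(s in words for words in book_word_sets) > k


# Toy example: "the" appears in 2 of 3 books, "cat" in only 1.
books = [{"the", "cat"}, {"the", "dog"}, {"dog"}]
```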
The books in the archive are compressed, so the first step is to uncompress them:
# Unzip each archive in place, skipping the *_h.zip files (HTML editions).
for i in $(find . -name "*.zip" -type f | grep -v "_h.zip")
do
    pushd "$(dirname "$i")" > /dev/null
    unzip -qn "$(basename "$i")"
    popd > /dev/null
done
Then a little more bash magic to create a wordlist for each book, wrapped up in a Makefile because I anticipate rerunning this script and caring about dependencies:
TXT_FILES = $(shell find . -type f -name '*.txt')
WORDLIST_FILES = $(patsubst %.txt,%.wordlist,$(TXT_FILES))

.PHONY: clean all

all: $(WORDLIST_FILES)

%.wordlist: %.txt
	cat $< | tr -c '[:lower:]' '\n' | sort | uniq > $@

clean:
	find . -type f -name "*.wordlist" -delete
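A quick sanity check on what this tr tokenization actually emits (output here is from GNU tr):

```shell
# Every non-lowercase character becomes a newline, so the apostrophe
# splits "isn't" into two pieces.
printf "isn't" | tr -c '[:lower:]' '\n'
# prints "isn" and "t" on separate lines
```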
Of course, the first try at anything doesn’t work quite right. It quickly became apparent that I should only be using lower case (| tr [:upper:] [:lower:] |), and that it isn’t quite correct to treat all non-alphabetic characters as separators (“isn’t” isn’t the two words “isn” and “t”). Also, not all of the books were English. And books have lots of words (who knew!), so running | sort | uniq on entire books turns out not to be a fast operation. Actually, I did wait for it to run on all of the books, but the final step of amalgamating the word lists required sorting 200 million words, which probably never would have finished. Quick bash scripts are nice for “small data”, but they aren’t designed for large-scale text processing. Here’s my final version of the Makefile:
TXT_FILES = $(shell find . -type f -name '*.txt')
WORDLIST_FILES = $(patsubst %.txt,%.wordlist,$(TXT_FILES))

.PHONY: clean all

all: $(WORDLIST_FILES) wordCountByBook

wordCountByBook: $(WORDLIST_FILES) Makefile
	find . -name "*.wordlist" -exec cat {} \; | ./sortUniq.py > $@

%.wordlist: %.txt
	if grep -q 'Language: English' $<; then cat $< | tr '[:upper:]' '[:lower:]' | tr -c '[:lower:]' '\n' | ./sortUniq.py | awk '{print $$2}' > $@; else touch $@; fi

clean:
	find . -type f -name "*.wordlist" -delete
	rm -f wordCountByBook
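The contraction problem could plausibly be patched by treating apostrophes as word characters. A sketch (this is not part of my pipeline; it also uses tr’s -s flag to squeeze the runs of newlines that consecutive separators would otherwise produce):

```shell
# Keep apostrophes inside words so "isn't" survives intact; -s squeezes
# repeated newlines from runs of separator characters.
printf "isn't half-bad" | tr -cs "[:lower:]'" '\n'
# prints "isn't", "half", and "bad" on separate lines
```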
Notably, I’ve introduced a conditional to check whether the book is English before enumerating its words, and I’ve replaced invocations of | sort | uniq with a custom Python script, sortUniq.py, that counts duplicates in a single linear pass (only the much smaller list of distinct words gets sorted):
#!/usr/bin/python3
import sys

# Count how many times each input line (one word per line) appears.
wordMap = {}
for word in sys.stdin:
    word = word.strip()
    count = wordMap.get(word, 0)
    wordMap[word] = count + 1

# Print "count word" pairs, least frequent first.
wordList = list(wordMap)
wordList.sort(key=wordMap.get)
for word in wordList:
    print(wordMap[word], word)
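For what it’s worth, the standard library’s collections.Counter does this same counting. A sketch of an equivalent script (assuming the same “count word” output format the awk step expects):

```python
#!/usr/bin/python3
import sys
from collections import Counter


def count_lines(lines):
    # One pass to count; only the distinct words get sorted, by count.
    counts = Counter(line.strip() for line in lines)
    return sorted(counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    for word, count in count_lines(sys.stdin):
        print(count, word)
```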
Here is the final dictionary, with words listed by the number of English books they appear in. There is no perfect cutoff, but somewhere around >= 10 books the words are mostly real English words, and anything that appears in hundreds of books is almost certainly a real word. There are a couple of exceptions. As mentioned above, I don’t handle contractions very well. I was also surprised to see Olde English words, plus some words from other languages (even though I filtered for English books). All in all, it seems like a better dictionary than the one used in the anagram game.
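Applying the cutoff to wordCountByBook is a one-liner with awk (the dictionary.txt name is mine, and the >= 10 threshold is the judgment call discussed above; the printf just stands in for the real wordCountByBook here):

```shell
# Toy stand-in for the real wordCountByBook ("<count> <word>" per line).
printf '5 zzyx\n12 hello\n300 the\n' > wordCountByBook
# Keep only words that appear in at least 10 English books.
awk '$1 >= 10 { print $2 }' wordCountByBook > dictionary.txt
```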