Andrei's blog

MetaLearn: Word lists, WordNet, and NLP

If you are not a (Python) programmer, you won't find anything interesting in here.

Index:

  1. Pre-Story
  2. CSV Word Lists
  3. WordNet
  4. NLP
  5. Conclusion

Pre-Story

I recently decided to try something new: Nim.

It is a programming language similar to Python (in terms of syntax) but statically typed and faster.

My project was to write an "assistant" for the game "Contact".

In short, Contact is a word-guessing game, so the assistant I made was about sorting and presenting a list of words with their definitions. Little did I know that my biggest takeaway wouldn't be the new programming language, but WordNet.

I created a Jupyter Notebook (a notebook environment popular for data processing) and went on Google to look for dictionaries and word lists. I needed a list of nouns with definitions, sorted by frequency (I now realize that I should have used frequency per definition rather than per word). After looking through GitHub and sorting past "1000 most common English words to learn today" repositories (I needed more than 10,000 words, specifically nouns), I found out about WordNet. (Although I still used frequency lists from GitHub, because I was too lazy to figure out how to compile them from WordNet.)
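
For the curious, the notebook boiled down to loading such a list and sorting it by frequency. Here is a minimal sketch of that step in Python; the file name and the word/frequency/definition column layout are made up for illustration, since every list on GitHub has its own format.

    import csv

    # Hypothetical file: "nouns_by_frequency.csv" with columns
    # word, frequency, definition (real lists vary in layout).
    def load_word_list(path):
        words = []
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                words.append({
                    "word": row["word"],
                    "frequency": int(row["frequency"]),
                    "definition": row["definition"],
                })
        # Most frequent words first
        words.sort(key=lambda w: w["frequency"], reverse=True)
        return words

    nouns = load_word_list("nouns_by_frequency.csv")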

CSV Word Lists

There are quite a few CSV (or JSON) word lists out on the internet, sometimes sorted by frequency, or cut off at some frequency threshold but sorted alphabetically.

One use is to generate "passphrases." Here is one very big repository that I used: gwordlist. According to its creator, he needed it for his password generator. There are also word lists for the inverse purpose: password crackers, e.g. John the Ripper.
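
To give a rough idea of what those passphrase generators do (this is not gwordlist's actual code), here is a toy version in Python; the word list file name is an assumption.

    import secrets

    # Assumes "wordlist.txt" has one word per line, e.g. a file from gwordlist.
    def passphrase(path="wordlist.txt", n_words=5, sep="-"):
        with open(path, encoding="utf-8") as f:
            words = [line.strip() for line in f if line.strip()]
        # secrets.choice draws from a cryptographically secure random source
        return sep.join(secrets.choice(words) for _ in range(n_words))

    print(passphrase())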

Another use is vocabulary for language learning, e.g. this list has only 25,000 words.

However, many word lists have been scraped from large datasets and (often) left unfiltered, such as the Unix dictionary. The Unix word list is actually one of the smaller and tidier lists out there; others contain billions of "words", most of which are numbers, abbreviations, or strings that most dictionaries have never even heard of. Ironically, those lists are often sorted alphabetically, so what you get is "dirty data."
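
If you end up with one of those dirty lists, a quick clean-up pass is easy to write. This is just a sketch with made-up rules (drop numbers, one-letter strings, and duplicates), not a universal recipe.

    # Hypothetical input: "scraped_words.txt", one entry per line.
    def clean_word_list(path="scraped_words.txt"):
        seen = set()
        cleaned = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                w = line.strip().lower()
                # Keep only multi-letter, purely alphabetic, unseen entries
                if len(w) > 1 and w.isalpha() and w not in seen:
                    seen.add(w)
                    cleaned.append(w)
        return cleaned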

A unique example of word lists I found is Monkeytype, an open-source typing test website (you should check it out). For its purposes, Monkeytype has made lists of the most common words for an endless number of languages (including programming languages), but most of them are sorted alphabetically.

It amazes me that many "common words" word lists are sorted alphabetically, especially when there is no frequency data attached to them. Finding frequency-sorted word lists is even more annoying for languages other than English.

Different word lists are designed for different purposes and have corresponding sources. The most common English sources are Wikipedia, Project Gutenberg, and collections of scientific papers.

WordNet

If you want a simple spellchecker, a frequency-sorted word list is enough (you can use the Enchant libraries). But there are also applications where you need more than word lists, and definitions, synonyms, and parts of speech become desirable. E.g., when building an index for a search engine, you might want to put all synonyms and their plurals into the same basket.
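
In Python, that simple spellchecker is a few lines with PyEnchant. Treat this as a sketch: it needs the Enchant C library and an "en_US" dictionary installed on your system.

    import enchant  # pip install pyenchant

    d = enchant.Dict("en_US")
    print(d.check("language"))    # True
    print(d.check("lanugage"))    # False
    print(d.suggest("lanugage"))  # correction candidates, "language" among them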

The approach some programmers take is to scrape Merriam-Webster or Dictionary.com with Beautiful Soup.

It seems they have never heard of WordNet, and that's unfortunate, because it would have saved them a lot of trouble. The Wikipedia page for WordNet makes it seem complicated; to put it simply:

WordNet is a free, downloadable database-dictionary.

It is a free and open-source dictionary that can be used by (programming) libraries and programmers for different purposes. In fact, the Wikipedia page lists dictionary websites that are simply frontends for this database.

WordNet contains definitions, parts of speech, and links to related words and their variations. You can try Snappywords, a dictionary based on WordNet, to see WordNet's full abilities.

The original English WordNet was created at Princeton University, but other universities have created wordnets for their own languages. There are wordnets for most languages that you can use in your projects without spamming requests to dictionary websites or endlessly digging through GitHub to gather a dataset piece by piece.

There are also many (programming) libraries that can use wordnets. Here is one for Python: wn. It has documentation on how to do most things.
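
To give you a taste, here is roughly how a lookup goes with wn. The dataset name ("oewn:2021", the Open English WordNet) comes from its documentation; double-check the docs in case the calls have changed.

    import wn  # pip install wn

    wn.download("oewn:2021")  # one-time download of the English wordnet

    for synset in wn.synsets("contact", pos="n"):
        print(synset.definition())
        print("  lemmas:", synset.lemmas())
        print("  hypernyms:", [h.lemmas() for h in synset.hypernyms()])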

I think you should keep wordnets in the back of your mind; they are an awesome resource that can make almost any project involving language far easier to build.

NLP

NLP stands for "Natural Language Processing." It includes everything from AI text-generation to speech recognition.

A lot of NLP tools use WordNet. NLP pipelines often need to split sentences into parts (tokenization) to make them easier for machines to understand.
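
For example, NLTK (mentioned below) ships both a tokenizer and a WordNet-backed lemmatizer. A rough sketch, assuming the relevant NLTK data packages download correctly on your setup:

    import nltk

    nltk.download("punkt")    # tokenizer models
    nltk.download("wordnet")  # WordNet data for the lemmatizer

    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    tokens = word_tokenize("The cats were guessing words.")
    lemmatizer = WordNetLemmatizer()
    # With pos="n", plural nouns get singularized: "cats" -> "cat", "words" -> "word"
    print([lemmatizer.lemmatize(t.lower(), pos="n") for t in tokens])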

I am not a machine learning engineer, so what I am presenting here is five minutes' worth of googling.

If you are interested in NLP, you should check out NLTK, the Natural Language Toolkit; it looks like a good starting point.

Conclusion

These were the results of me trying to cheat at a game.

I hope you will bookmark some of the resources I linked and come back to them when you need them.

- very qualified word processing expert