Using word2vec with NLTK

word2vec is an algorithm for constructing vector representations of words, also known as word embeddings. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector represenations. Once you map words into vector space, you can then use vector math to find words that have similar semantics.

gensim provides a nice Python implementation of Word2Vec that works perfectly with NLTK corpora. The model takes a list of sentences, and each sentence is expected to be a list of words. This is exactly what is returned by the sents() method of NLTK corpus readers. So let’s compare the semantics of a couple words in a few different NLTK corpora:

>>> from gensim.models import Word2Vec
>>> from nltk.corpus import brown, movie_reviews, treebank
>>> b = Word2Vec(brown.sents())
>>> mr = Word2Vec(movie_reviews.sents())
>>> t = Word2Vec(treebank.sents())

>>> b.most_similar('money', topn=5)
[('pay', 0.6832243204116821), ('ready', 0.6152011156082153), ('try', 0.5845392942428589), ('care', 0.5826011896133423), ('move', 0.5752171277999878)]
>>> mr.most_similar('money', topn=5)
[('unstoppable', 0.6900672316551208), ('pain', 0.6289106607437134), ('obtain', 0.62665855884552), ('jail', 0.6140228509902954), ('patients', 0.6089504957199097)]
>>> t.most_similar('money', topn=5)
[('short-term', 0.9459682106971741), ('-LCB-', 0.9449775218963623), ('rights', 0.9442864656448364), ('interested', 0.9430986642837524), ('national', 0.9396077990531921)]

>>> b.most_similar('great', topn=5)
[('new', 0.6999611854553223), ('experience', 0.6718623042106628), ('social', 0.6702290177345276), ('group', 0.6684836149215698), ('life', 0.6667487025260925)]
>>> mr.most_similar('great', topn=5)
[('wonderful', 0.7548679113388062), ('good', 0.6538234949111938), ('strong', 0.6523671746253967), ('phenomenal', 0.6296845078468323), ('fine', 0.5932096242904663)]
>>> t.most_similar('great', topn=5)
[('won', 0.9452997446060181), ('set', 0.9445616006851196), ('target', 0.9342271089553833), ('received', 0.9333916306495667), ('long', 0.9224691390991211)]

>>> b.most_similar('company', topn=5)
[('industry', 0.6164317727088928), ('technical', 0.6059585809707642), ('orthodontist', 0.5982754826545715), ('foamed', 0.5929019451141357), ('trail', 0.5763031840324402)]
>>> mr.most_similar('company', topn=5)
[('colony', 0.6689200401306152), ('temple', 0.6546304225921631), ('arrival', 0.6497283577919006), ('army', 0.6339291334152222), ('planet', 0.6184555292129517)]
>>> t.most_similar('company', topn=5)
[('panel', 0.7949466705322266), ('Herald', 0.7674347162246704), ('Analysts', 0.7463694214820862), ('amendment', 0.7282689809799194), ('Treasury', 0.719698429107666)]

I hope it’s pretty clear from the above examples that the semantic similarity of words can vary greatly depending on the textual context. In this case, we’re comparing a wide selection of text from the brown corpus with movie reviews and financial news from the treebank corpus.

Note that if you call most_similar() with a word that was not present in the sentences, you will get a KeyError exception. This can be a common occurrence with smaller corpora like treebank.

For more details, see this tutorial on using Word2Vec. And Word Embeddings for Fashion is a great introduction to these concepts, although it uses a different word embedding algorithm called glove.