- Safety checks Python packages against a vulnerability database
- PyUp offers a paid service for Safety
- Automatic Python Vulnerability Checking shows how to use GitHub actions to automate Safety checks
- GitHub Dependabot can check Python requirements files and other language dependencies against GitHub’s vulnerability database
word2vec is an algorithm for constructing vector representations of words, also known as word embeddings. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector represenations. Once you map words into vector space, you can then use vector math to find words that have similar semantics.
gensim provides a nice Python implementation of Word2Vec that works perfectly with NLTK corpora. The model takes a list of sentences, and each sentence is expected to be a list of words. This is exactly what is returned by the
sents() method of NLTK corpus readers. So let’s compare the semantics of a couple words in a few different NLTK corpora:
>>> from gensim.models import Word2Vec >>> from nltk.corpus import brown, movie_reviews, treebank >>> b = Word2Vec(brown.sents()) >>> mr = Word2Vec(movie_reviews.sents()) >>> t = Word2Vec(treebank.sents()) >>> b.most_similar('money', topn=5) [('pay', 0.6832243204116821), ('ready', 0.6152011156082153), ('try', 0.5845392942428589), ('care', 0.5826011896133423), ('move', 0.5752171277999878)] >>> mr.most_similar('money', topn=5) [('unstoppable', 0.6900672316551208), ('pain', 0.6289106607437134), ('obtain', 0.62665855884552), ('jail', 0.6140228509902954), ('patients', 0.6089504957199097)] >>> t.most_similar('money', topn=5) [('short-term', 0.9459682106971741), ('-LCB-', 0.9449775218963623), ('rights', 0.9442864656448364), ('interested', 0.9430986642837524), ('national', 0.9396077990531921)] >>> b.most_similar('great', topn=5) [('new', 0.6999611854553223), ('experience', 0.6718623042106628), ('social', 0.6702290177345276), ('group', 0.6684836149215698), ('life', 0.6667487025260925)] >>> mr.most_similar('great', topn=5) [('wonderful', 0.7548679113388062), ('good', 0.6538234949111938), ('strong', 0.6523671746253967), ('phenomenal', 0.6296845078468323), ('fine', 0.5932096242904663)] >>> t.most_similar('great', topn=5) [('won', 0.9452997446060181), ('set', 0.9445616006851196), ('target', 0.9342271089553833), ('received', 0.9333916306495667), ('long', 0.9224691390991211)] >>> b.most_similar('company', topn=5) [('industry', 0.6164317727088928), ('technical', 0.6059585809707642), ('orthodontist', 0.5982754826545715), ('foamed', 0.5929019451141357), ('trail', 0.5763031840324402)] >>> mr.most_similar('company', topn=5) [('colony', 0.6689200401306152), ('temple', 0.6546304225921631), ('arrival', 0.6497283577919006), ('army', 0.6339291334152222), ('planet', 0.6184555292129517)] >>> t.most_similar('company', topn=5) [('panel', 0.7949466705322266), ('Herald', 0.7674347162246704), ('Analysts', 0.7463694214820862), ('amendment', 0.7282689809799194), ('Treasury', 0.719698429107666)]
I hope it’s pretty clear from the above examples that the semantic similarity of words can vary greatly depending on the textual context. In this case, we’re comparing a wide selection of text from the brown corpus with movie reviews and financial news from the treebank corpus.
Note that if you call
most_similar() with a word that was not present in the sentences, you will get a
KeyError exception. This can be a common occurrence with smaller corpora like
NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.
The FreqDist api has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of
fd.inc(tag), you now need to do
fd[tag] += 1.
fd.samples() doesn’t exist anymore. Instead, you can use
fd.most_common(), which is a method of collections.Counter that returns a list that looks like
NLTK 3 has changed many wordnet Synset attributes to methods:
Same goes for the Lemma class. For example,
lemma.antonyms() is now a method.
batch_tag() method is now
tag_sents(). The brill tagger API has changed significantly:
brill.FastBrillTaggerTrainer is now
brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface with
brill.Word as implementations of the interface.
Simplified tags have been replaced with the universal tagset. So
tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the
tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in
treebank tagset is called
en-ptb (PennTreeBank) and the
brown tagset is called
en-brown. These files are simply 2 column, tab separated mappings of source tag to universal tag. The function
nltk.tag.mapping.map_tag(source, target, source tag) is used to perform the mapping.
Chunking & Parse Trees
The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk
chunk_label argument instead of
chunk_node, while in the Tree class, the
node attribute has been replaced with the
The SVM classifiers and scipy based
MaxentClassifier algorithms (like
CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because dictionary ordering is random in Python 3. This is a deliberate decision to improve security, but you can control it with the
PYTHONHASHSEED environment variable. Just run
$ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.
Python 3 has also removed the separate
unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like
b"raw string", so you may need convert these to normal strings before doing any further string processing.
Here’s a few other Python 3 changes I ran into:
dict.iteritems()doesn’t exist, use
dict.keys()does not produce a list (it returns a view). If you want a list, use
Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.
After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here’s some details on the changes & updates in the 2nd edition:
First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.
In Chapter 1, Tokenizing Text and WordNet Basics, I added a recipe for training a sentence tokenizer using the PunktSentenceTokenizer. This is surprisingly easy, and you can find the code in chapter1.py.
Chapter 2, Replacing and Correcting Words, shows the additional languages supported by the SnowballStemmer. An unfortunate removal from this chapter is
babelizer, which was a fun library to use, but is no longer supported by Yahoo.
In Chapter 4, Part-of-Speech Tagging, the last recipe shows how to use train_tagger.py from NLTK-Trainer to replicate most of the tagger training recipes detailed earlier in the chapter. NLTK-Trainer was largely inspired by my experience writing Python Text Processing with NLTK 2.0 Cookbook, after realizing that many aspects of training part-of-speech taggers could be encapsulated in a command line script.
Chapter 5, Extracing Chunks, adds examples for using train_chunker.py to train phrase chunkers.
Chapter 7, Text Classification, adds coverage of train_classifier.py, along with examples of using the SklearnClassifier, which provides access to many of the scikit-learn classification algorithms. The scikit-learn classifiers tend to be at least as accurate as NLTK’s classifiers, are often faster to train, and have much smaller memory & disk footprints. And since NLTK 3 removed support for scipy based
MaxentClassifier algorithms and SVM classifiers, the choice of which classifers to use has become very easy: when in doubt, choose SklearnClassifier (code examples can be found in chapter7.py).
There are a few library changes in Chapter 9, Parsing Specific Data Types:
timexand SimpleParse recipes have been removed due to lack of Python 3 compatibility
- uses beautifulsoup4 with examples of UnicodeDammit
- chardet was replaced with charade, which is compatible with both Python 2 & 3. But since publication, charade was merged back into chardet and is no longer maintained. I recommend installing chardet and replacing all instances of the
charademodule name with
So if you want to learn the latest & greatest NLTK 3, pickup your copy of Python 3 Text Processing with NLTK 3 Cookbook, and checkout the code at nltk3-cookbook. If you like the book, please review it at Amazon or goodreads.
This is a review of the book Instant Pygame for Python Game Development How-to, by Ivan Idris. Packt asked me to review the book, and I agreed because like many developers, I’ve thought about writing my own game, and I’ve been curious about the capabilities of pygame. It’s a short book, ~120 pages, so this is a short review.
The book covers pygame basics like drawing images, rendering text, playing sounds, creating animations, and altering the mouse cursor. The author has helpfully posted some video demos of some of the exercises, which are linked from the book. I think this is a great way to show what’s possible, while also giving the reader a clear idea of what they are creating & what should happen. After the basic intro exercises, I think the best content was how to manipulate pixel arrays with numpy (the author has also written two books on numpy: NumPy Beginner’s Guide & NumPy Cookbook), how to create & use sprites, and how to make your own version of the game of life.
There were 3 chapters whose content puzzled me. When you’ve got such a short book on a specific topic, why bring up matplotlib, profiling, and debugging? These chapters seemed off-topic and just thrown in there randomly. The organization of the book could have been much better too, leading the reader from the basics all the way to a full-fledged game, with each chapter adding to the previous chapters. Instead, the chapters sometimes felt like unrelated low-level examples.
Overall, the book was a quick & easy read, that rapidly introduces you to basic pygame functionality, and leads you on to more complex activities. My main takeaway is that pygame provides an easy to use & low-level framework for building simple games, and can be used to create more complex games (but probably not FPS or similar graphically intensive games). The ideal games would probably be puzzle based and/or dialogue heavy, and only require simple interactions from the user. So if you’re interested in building such a game in Python, you should definitely get a copy of Instant Pygame for Python Game Development How-to.
Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?
Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the
feats column for each algorithm.
bow means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not.
int means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with
bow it would still get a 1). And
tfidf means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
All numbers were determined using nltk-trainer, specifically,
python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75. For
int features, the option
--value-type int was used, and for
tfidf features, the options
--value-type float --tfidf were used. This was with NLTK 2.0.3 and sklearn 0.12.1.
|algorithm||feats||accuracy||neg f-measure||pos f-measure|
As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly,
tfidf features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.
Below is a table showing the accuracy of the top 5 algorithms using just
unigrams (the default, a.k.a single words), and using unigrams +
bigrams (pairs of words) with the option
--ngrams 1 2.
MultinomialNB got a modest boost in accuracy, putting them on-par with the rest of the algorithms. But we can do better than this using feature scoring.
As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option
--min_score SCORE (and keeping the
--ngrams 1 2 option to get bigram features).
|algorithm||score 1||score 2||score 3|
NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the
MultinomialNB algorithms, which drop down significantly at
--min_score 1, but then skyrocket up to 97% with
--min_score 2. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The Scikits classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.
NLTK 2.0.1, a.k.a NLTK 2, was recently released, and what follows is my favorite changes, new features, and highlights from the ChangeLog.
The SVMClassifier adds support vector machine classification thru SVMLight with PySVMLight. This is a much needed addition to the set of supported classification algorithms. But even more interesting…
The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.
NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it’s become the de-facto “social coding” site.
Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and restructured text (which I used to write this post), I’m not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you’re looking for, I worry that new/interested users will have a harder time getting started.
Since the 0.9.9 release, a number of new corpora and corpus readers have been added:
And here’s a few final highlights:
- The HunposTagger, which wraps hunpos.
- The StanfordTagger plus 2 subclasses for NER and POS tagging with the Stanford POS Tagger.
- The SnowballStemmer, which supports 13 different languages. You can try it out at my online stemming demo.
I think NLTK’s ideal role is be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.
My PyCon tutorial, Introduction to NLTK, now has over 40 people registered. This is about twice as many people as I was expecting, but I’m glad so many people want to learn NLTK 🙂 Because of the large class size, it’d really helpful to have a couple assistants with at least some NLTK experience, including, but not limited to:
* installing NLTK
* installing & using NLTK on Windows
* installing & using nltk-trainer
* creating custom corpora
* using WordNet
If you’re interested in helping out, please read Tutorial Assistants and contact me, japerk — at — gmail. Thanks!
At the end of February and the beginning of March, I’ll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order…
How Weotta uses MongoDB
Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about
How Weotta uses MongoDB. We’ll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we’ll finish with a preview of Weotta’s upcoming MongoDB powered local search APIs.
NLTK Jam Session at NICAR 2012
On Thursday, February 23, in St Louis, MO, I’ll be demonstrating how to use NLTK as part of the NewsCamp workshop at NICAR 2012. This will be a version of my PyCon NLTK Tutorial with a focus on news text and corpora like treebank.
Corpus Bootstrapping with NLTK at Strata 2012
As part of the Strata 2012 Deep Data program, I’ll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there’s plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn’t exist. So how can you quickly create a quality corpus at minimal cost? I’ll cover specific real-world examples to answer this question.
NLTK Tutorial at PyCon 2012
Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You’ll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you’ll walk out with at least one new NLP superpower that you can put to use immediately.