NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.
The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1. fd.samples() doesn’t exist anymore; instead, you can use fd.most_common(), a method of collections.Counter that returns a list of (sample, count) pairs, sorted by decreasing count.
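Since FreqDist now inherits from collections.Counter, the new idioms can be shown with Counter itself; a minimal sketch (the tag values are made up for illustration):

```python
from collections import Counter

# In NLTK 3, FreqDist is a subclass of collections.Counter,
# so the Counter idioms below apply to FreqDist directly.
fd = Counter()
for tag in ['NN', 'VB', 'NN', 'DT', 'NN']:
    fd[tag] += 1          # NLTK 2 equivalent: fd.inc(tag)

# most_common() replaces samples(): (sample, count) pairs, most frequent first
print(fd.most_common())   # [('NN', 3), ('VB', 1), ('DT', 1)]
```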
NLTK 3 has changed many wordnet Synset attributes to methods: for example, synset.name is now synset.name(), and synset.definition is now synset.definition(). The same goes for the Lemma class; for example, lemma.antonyms() is now a method.
Tagging

The batch_tag() method is now tag_sents(). The brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface, with brill.Pos and brill.Word as implementations of the interface.
Universal Tagset

Simplified tags have been replaced with the universal tagset. So instead of tagged_corpus.tagged_sents(simplify_tags=True), you now do tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, a TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in the universal_tagset data package: the treebank tagset is called en-ptb (Penn Treebank) and the brown tagset is called en-brown. These files are simply 2-column, tab-separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) is used to perform the mapping.
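Because the mapping files are plain two-column, tab-separated text, the format can be sketched in a few lines of Python. The file contents below are an inline stand-in, not a real en-ptb file, though NN, VB, and DT do map to NOUN, VERB, and DET in the universal tagset:

```python
# Hypothetical contents of a 2-column, tab-separated tagset mapping file:
# each line maps a source tag to its universal tag.
mapping_text = "NN\tNOUN\nVB\tVERB\nDT\tDET\n"

mapping = {}
for line in mapping_text.splitlines():
    source, universal = line.split('\t')
    mapping[source] = universal

print(mapping['NN'])  # NOUN
```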
Chunking & Parse Trees
The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.
Classification

The SVM classifiers and scipy-based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because hash randomization makes dictionary ordering unpredictable in Python 3. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.
Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
Here are a few other Python 3 changes I ran into:

dict.iteritems() doesn’t exist anymore; use dict.items() instead.

dict.keys() does not produce a list (it returns a view). If you want a list, use list(dict.keys()).
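Both changes can be seen on a small dictionary (the keys here are made up for illustration):

```python
d = {'a': 1, 'b': 2}

# Python 2's dict.iteritems() is gone; dict.items() now returns a lazy view
for key, value in d.items():
    pass

# dict.keys() is also a view, not a list; wrap it when list behavior is needed
keys = list(d.keys())
print(keys)  # ['a', 'b'] (insertion order in Python 3.7+)
```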
Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.