NLTK 3 Changes

NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.

Probability Classes

The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1.

fd.samples() doesn’t exist anymore. Instead, you can use fd.most_common(), which is a method of collections.Counter that returns a list that looks like [(word, count)].
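
Here is a minimal sketch of the Counter-style usage described above. The sample tags are made up, and the import falls back to collections.Counter (the base class mentioned above) when NLTK isn't installed; that fallback is my own assumption for portability, not something NLTK needs.

```python
try:
    from nltk.probability import FreqDist  # NLTK 3: a Counter subclass
except ImportError:
    # Stand-in so the sketch still runs without NLTK installed.
    from collections import Counter as FreqDist

tags = ["NN", "VB", "NN", "DT", "NN", "VB"]  # made-up sample data

fd = FreqDist()
for tag in tags:
    fd[tag] += 1  # NLTK 2 equivalent: fd.inc(tag)

# most_common() returns (sample, count) pairs, highest count first.
top_two = fd.most_common(2)

# Roughly what fd.samples() used to give you: the samples without counts.
samples = [sample for sample, count in fd.most_common()]
```

If you only need the samples and don't care about order, iterating over the FreqDist directly works too, since it behaves like a dictionary.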

ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.
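
To show what the defaultdict behavior buys you, here is a small sketch. The (word, tag) pairs are invented, and when NLTK isn't available the code falls back to a defaultdict of Counters, which is my stand-in with the same shape and not the NLTK class itself.

```python
from collections import Counter, defaultdict

try:
    from nltk.probability import ConditionalFreqDist
    cfd = ConditionalFreqDist()
except ImportError:
    # Stand-in with the same shape: a defaultdict of counters.
    cfd = defaultdict(Counter)

# Made-up (word, tag) pairs: the word is the condition, the tag the sample.
tagged_words = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("dog", "NN")]

for word, tag in tagged_words:
    # A missing condition is created automatically on first access.
    cfd[word][tag] += 1

run_tags = cfd["run"].most_common()
conditions = sorted(cfd.keys())
```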

WordNet API

NLTK 3 has changed many wordnet Synset attributes to methods:

  • syn.definition -> syn.definition()
  • syn.examples -> syn.examples()
  • syn.lemmas -> syn.lemmas()
  • syn.pos -> syn.pos()

Same goes for the Lemma class. For example, lemma.antonyms() is now a method.
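
If you have code that needs to run on both versions during a port, one option is a tiny helper that accepts either style. This is only a sketch: value_of is a hypothetical name, and the two classes below are stand-ins that mimic the old and new API shapes so the example runs without WordNet data. They are not NLTK classes.

```python
def value_of(obj, name):
    """Return obj.<name>, calling it first if it turns out to be a method."""
    value = getattr(obj, name)
    return value() if callable(value) else value


class OldStyleSynset:
    """Mimics NLTK 2, where definition was a plain attribute."""
    definition = "a domesticated mammal"


class NewStyleSynset:
    """Mimics NLTK 3, where definition() is a method."""
    def definition(self):
        return "a domesticated mammal"


old_value = value_of(OldStyleSynset(), "definition")
new_value = value_of(NewStyleSynset(), "definition")
```

The helper only makes sense for attributes whose values are not themselves callable, which holds for strings and lists like the ones above.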

Tagging

The batch_tag() method is now tag_sents(). The brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface with brill.Pos or brill.Word as implementations of the interface.
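
The batch_tag() rename is easy to paper over while you migrate. The sketch below uses a hypothetical tag_sentences wrapper and a stand-in tagger class (not an NLTK class) so it runs on its own; with a real NLTK tagger you would pass that in instead.

```python
def tag_sentences(tagger, sentences):
    """Tag a list of tokenized sentences using whichever method exists."""
    if hasattr(tagger, "tag_sents"):  # NLTK 3 name
        return tagger.tag_sents(sentences)
    return tagger.batch_tag(sentences)  # NLTK 2 name


class EverythingIsANoun:
    """Stand-in tagger exposing only the NLTK 3 method name."""

    def tag_sents(self, sentences):
        return [[(word, "NN") for word in sent] for sent in sentences]


tagged = tag_sentences(EverythingIsANoun(), [["hello", "world"], ["bye"]])
```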

Universal Tagset

Simplified tags have been replaced with the universal tagset. So tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (PennTreeBank) and the brown tagset is called en-brown. These files are simply two-column, tab-separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) is used to perform the mapping.
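
To make the file format concrete, here is a sketch that parses a few lines in that two-column layout and applies them to a tagged sentence. The sample text is a short illustrative excerpt rather than the full en-ptb file, load_mapping and to_universal are hypothetical helper names, and the "X" fallback for unmapped tags is my own choice. In real code the lookup goes through map_tag as described above.

```python
# A few lines in the two-column, tab-separated format described above.
# Illustrative excerpt only, not the full en-ptb mapping file.
SAMPLE_MAPPING = "NN\tNOUN\nVB\tVERB\nVBZ\tVERB\nJJ\tADJ\nDT\tDET\n"


def load_mapping(text):
    """Parse 'source<TAB>universal' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source_tag, universal_tag = line.split("\t")
        mapping[source_tag] = universal_tag
    return mapping


def to_universal(tagged_sent, mapping, unknown="X"):
    """Replace each source tag with its universal tag.

    The fallback for unmapped tags is an assumption of this sketch.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]


ptb_to_universal = load_mapping(SAMPLE_MAPPING)
sent = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")]
universal_sent = to_universal(sent, ptb_to_universal)
```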

Chunking & Parse Trees

The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.

Classification

The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.

Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics from run to run. This happens because string hashing is randomized by default in Python 3 (since 3.3), so the iteration order of hash-based collections, such as sets and the dictionaries of older Python 3 releases, can change between runs. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent ordering & accuracy metrics.
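
You can check the effect of PYTHONHASHSEED yourself. This sketch starts two fresh interpreters with the seed pinned to 0 and compares the hash they compute for the same string. The helper name and the choice of 'nltk' as the test string are arbitrary.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(seed):
    """Return hash('nltk') as computed by a new Python process."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('nltk'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With the seed pinned, both processes should agree on the hash.
first = string_hash_in_fresh_interpreter(0)
second = string_hash_in_fresh_interpreter(0)
```

Without the environment variable, the two values will usually differ, which is where the run-to-run variation comes from.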

Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
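
The conversion itself is a single decode() call. Here is a small sketch with a hypothetical to_text helper that you could apply to anything a corpus function returns. The utf-8 default is an assumption; use whatever encoding your corpus files actually have.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings; pass ordinary strings through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"raw string"
text = to_text(raw)                 # bytes are decoded to str
already_text = to_text("caf\u00e9")  # str input is returned as-is
```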

Here are a few other Python 3 changes I ran into:

  • itertools.izip -> zip
  • dict.iteritems() doesn’t exist, use dict.items() instead
  • dict.keys() does not produce a list (it returns a view). If you want a list, use list(dict.keys())
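
Here is a quick sketch of the Python 3 spellings for the items in this list, using made-up data:

```python
words = ["the", "quick", "fox"]
tags = ["DT", "JJ", "NN"]

# itertools.izip is gone; the built-in zip is already lazy in Python 3.
pairs = list(zip(words, tags))

counts = {"the": 2, "fox": 1}

# dict.iteritems() is gone; items() returns a view you can iterate over.
total = sum(count for word, count in counts.items())

# keys() returns a view, so wrap it in list() when you need a real list
# (for example, to index into it or sort it in place).
key_list = list(counts.keys())
key_list.sort()
```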

Upgrading

Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.