NLTK 3 Changes

NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.

Probability Classes

The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now do fd[tag] += 1.

fd.samples() doesn’t exist anymore. Instead, use fd.most_common(), a collections.Counter method that returns a list of (sample, count) pairs, ordered from most to least common.
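
Here’s a minimal sketch of both changes (the tag list is made up for illustration):

    from nltk.probability import FreqDist

    fd = FreqDist()
    for tag in ['NN', 'VB', 'NN']:
        fd[tag] += 1              # NLTK 2: fd.inc(tag)

    print(fd['NN'])               # 2
    print(fd.most_common())       # [('NN', 2), ('VB', 1)] -- replaces fd.samples()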

ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.
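For example, conditions are created on demand, just like defaultdict keys (the tag/word pairs below are made up):

    from nltk.probability import ConditionalFreqDist

    cfd = ConditionalFreqDist()
    for tag, word in [('NN', 'book'), ('NN', 'table'), ('VB', 'book')]:
        cfd[tag][word] += 1       # each condition is itself a FreqDist

    print(cfd['NN'].most_common())   # e.g. [('book', 1), ('table', 1)]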

WordNet API

NLTK 3 has changed many wordnet Synset attributes to methods:

  • syn.definition -> syn.definition()
  • syn.examples -> syn.examples()
  • syn.lemmas -> syn.lemmas()
  • syn.name -> syn.name()
  • syn.pos -> syn.pos()

Same goes for the Lemma class. For example, lemma.antonyms() is now a method.
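
A quick sketch, assuming the wordnet corpus has been downloaded via nltk.download('wordnet'):

    from nltk.corpus import wordnet

    syn = wordnet.synsets('book')[0]
    print(syn.name())         # NLTK 2: syn.name
    print(syn.definition())   # NLTK 2: syn.definition
    print(syn.pos())          # NLTK 2: syn.pos

    lemma = syn.lemmas()[0]   # NLTK 2: syn.lemmas
    print(lemma.antonyms())   # methods on Lemma too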

Tagging

The batch_tag() method is now tag_sents(). The Brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface, with brill.Pos and brill.Word as implementations of that interface.
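
Here’s a rough sketch of the new training flow, assuming the treebank corpus has been downloaded (the unigram baseline and max_rules value are arbitrary choices for illustration):

    from nltk.corpus import treebank
    from nltk.tag import UnigramTagger, brill, brill_trainer

    train_sents = treebank.tagged_sents()[:3000]
    baseline = UnigramTagger(train_sents)

    # brill.nltkdemo18() builds 18 templates from the brill.Pos and
    # brill.Word implementations of tbl.feature.Feature
    templates = brill.nltkdemo18()

    # NLTK 2: brill.FastBrillTaggerTrainer(baseline, templates)
    trainer = brill_trainer.BrillTaggerTrainer(baseline, templates)
    tagger = trainer.train(train_sents, max_rules=100)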

Universal Tagset

Simplified tags have been replaced with the universal tagset, so tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). To make this work, a TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (Penn Treebank) and the brown tagset is called en-brown. These files are simply two-column, tab-separated mappings from source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) performs the mapping.
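
For example, with the brown corpus and the universal_tagset data downloaded (the 'NNS' tag below is just an illustration):

    from nltk.corpus import brown
    from nltk.tag.mapping import map_tag

    # NLTK 2: brown.tagged_sents(simplify_tags=True)
    print(brown.tagged_sents(tagset='universal')[0])

    # map a single Penn Treebank tag to its universal equivalent
    print(map_tag('en-ptb', 'universal', 'NNS'))   # NOUN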

Chunking & Parse Trees

The main change in chunkers & parsers is that the term node has been replaced by label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.
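
A minimal example of the renamed accessor:

    from nltk.tree import Tree

    tree = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])
    print(tree.label())      # 'S' -- NLTK 2: tree.node
    print(tree[0].label())   # 'NP'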

Classification

The SVM classifiers and scipy-based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for them. This classifier lets you use most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
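
A toy sketch of the wrapper (the feature dicts and labels are made up, and scikit-learn must be installed):

    from nltk.classify.scikitlearn import SklearnClassifier
    from sklearn.linear_model import LogisticRegression

    # NLTK feature sets are plain dicts of feature name -> value
    train_feats = [
        ({'contains(excellent)': True}, 'pos'),
        ({'contains(terrible)': True}, 'neg'),
    ]

    classifier = SklearnClassifier(LogisticRegression()).train(train_feats)
    print(classifier.classify({'contains(excellent)': True}))   # 'pos'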

Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, you may be puzzled to find that training the same model on the same data can produce slightly different accuracy metrics. This happens because Python 3 randomizes string hashing by default, so dictionary ordering varies between runs. It’s a deliberate security measure, but you can control it with the PYTHONHASHSEED environment variable: run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.

Python 3 has also removed the separate unicode string object, so all strings are now unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
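
For example, decoding a byte string before further processing (assuming UTF-8 encoded text):

    raw = b'raw string'
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')   # now a normal (unicode) str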

Here are a few other Python 3 changes I ran into:

  • itertools.izip -> zip
  • dict.iteritems() doesn’t exist, use dict.items() instead
  • dict.keys() does not produce a list (it returns a view). If you want a list, use list(dict.keys())
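
For example:

    d = {'NN': 2, 'VB': 1}

    for tag, count in d.items():   # Python 2: d.iteritems()
        print(tag, count)

    tags = list(d.keys())          # keys() returns a view, so wrap it in list()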

Upgrading

Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.

  • Ventzi Zhechev

    Sadly, I have to move to NLTK 3. I want to serialise the trained POS tagger models to disk so that I don’t have to retrain each time I restart a web app we have—and there’s a bug in NLTK 2 that makes that impossible…
    I found the lack of documentation on porting NLTK 2 Brill tagger code to NLTK 3 appalling. Do you happen to have an insight into what needs to change here exactly?

  • Hi Ventzi,

    I covered the new BrillTagger in Chapter 4 of my book, which you can find at https://www.packtpub.com/application-development/python-3-text-processing-nltk-3-cookbook

    The chapter content isn’t online anywhere, but here’s code from the book for training a BrillTagger: https://github.com/japerk/nltk3-cookbook/blob/master/tag_util.py#L25

  • Ventzi Zhechev

    Thanks for that!
    I see that you’re using the 18 rules from the NLTK demo. I dug around in the NLTK source code (why did I have to do that?!?) and found that these are pre-defined and ready to use.
    In the end, I’m using brill.brill24() as the rule set and haven’t had time to do any evaluation, nor to compare Brill’s rules to fntbl.

  • Guest

    “The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed”

    But maxent.py is still present inside nltk/classify.

    I downloaded it from here https://pypi.python.org/pypi/nltk

  • Thanks for the note. I’ve found that the scikit-learn LogisticRegression classifier tends to be at least as accurate, more memory efficient, and faster to train.

  • Selva Saravanakumar

    Hi Jacob,

    Of all the classifiers (NaiveBayes, maxent, etc.) available in NLTK, which one would you recommend for better classification results?

  • I recommend trying all of them. NaiveBayes can be better for smaller datasets, Maxent tends to be more balanced, but you never really know what will work best without experimenting.