NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.
Probability Classes
The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of <span class="pre">fd.inc(tag)</span>, you now write <span class="pre">fd[tag] += 1</span>.
<span class="pre">fd.samples()</span> no longer exists. Instead, use <span class="pre">fd.most_common()</span>, a method of collections.Counter that returns a list of <span class="pre">(word, count)</span> tuples, sorted by descending count.
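Since FreqDist now inherits from collections.Counter, the new idioms work on a plain Counter too. A minimal sketch of the migration, using Counter directly so no NLTK install is assumed:

```python
from collections import Counter

# FreqDist in NLTK 3 behaves like collections.Counter:
fd = Counter()
for tag in ["NN", "VB", "NN", "DT", "NN"]:
    fd[tag] += 1  # replaces the old fd.inc(tag)

# most_common() replaces fd.samples(), returning (item, count)
# pairs sorted by descending count.
print(fd.most_common())   # [('NN', 3), ('VB', 1), ('DT', 1)]
print(fd.most_common(1))  # [('NN', 3)]
```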
ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.
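You can get a feel for the defaultdict behavior without NLTK at all; here's a rough stand-in for ConditionalFreqDist built from defaultdict and Counter (illustrative only, not NLTK's actual class):

```python
from collections import Counter, defaultdict

# ConditionalFreqDist in NLTK 3 behaves like a defaultdict whose
# values are frequency distributions; sketched here with
# defaultdict(Counter) so no NLTK install is needed.
cfd = defaultdict(Counter)
for condition, word in [("news", "the"), ("news", "the"), ("fiction", "dragon")]:
    cfd[condition][word] += 1  # missing conditions are created on demand

print(cfd["news"]["the"])     # 2
print(cfd["sports"]["goal"])  # 0 -- no KeyError, thanks to defaultdict semantics
```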
WordNet API
NLTK 3 has changed many wordnet Synset attributes to methods:
<span class="pre">syn.definition</span> -> <span class="pre">syn.definition()</span>
<span class="pre">syn.examples</span> -> <span class="pre">syn.examples()</span>
<span class="pre">syn.lemmas</span> -> <span class="pre">syn.lemmas()</span>
<span class="pre">syn.name</span> -> <span class="pre">syn.name()</span>
<span class="pre">syn.pos</span> -> <span class="pre">syn.pos()</span>
The same goes for the Lemma class. For example, <span class="pre">lemma.antonyms</span> is now the method <span class="pre">lemma.antonyms()</span>.
Tagging
The <span class="pre">batch_tag()</span> method is now <span class="pre">tag_sents()</span>. The brill tagger API has changed significantly: <span class="pre">brill.FastBrillTaggerTrainer</span> is now <span class="pre">brill_trainer.BrillTaggerTrainer</span>, and the brill templates have been replaced by the <span class="pre">tbl.feature.Feature</span> interface, with <span class="pre">brill.Pos</span> and <span class="pre">brill.Word</span> as implementations of the interface.
Universal Tagset
Simplified tags have been replaced with the universal tagset, so <span class="pre">tagged_corpus.tagged_sents(simplify_tags=True)</span> becomes <span class="pre">tagged_corpus.tagged_sents(tagset='universal')</span>. In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the <span class="pre">tagset</span> kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in <span class="pre">nltk_data/taggers/universal_tagset</span>; the <span class="pre">treebank</span> tagset is called <span class="pre">en-ptb</span> (Penn TreeBank) and the <span class="pre">brown</span> tagset is called <span class="pre">en-brown</span>. These files are simply two-column, tab-separated mappings from source tag to universal tag. The function <span class="pre">nltk.tag.mapping.map_tag(source, target, source_tag)</span> is used to perform the mapping.
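The two-column file format is simple enough to reproduce by hand. A minimal sketch of how such a mapping works (the tag pairs and the map_tag helper below are illustrative, not copied from NLTK's actual en-ptb file or API):

```python
# A two-column, tab-separated tag mapping, in the style of the files
# under nltk_data/taggers/universal_tagset (illustrative entries).
mapping_file = "NN\tNOUN\nVBD\tVERB\nJJ\tADJ\n"

# Build a dict from source tag to universal tag.
tag_map = dict(line.split("\t") for line in mapping_file.splitlines())

def map_tag(source_tag):
    """Map a source tag to the universal tagset; unknown tags fall
    back to 'X', the universal 'other' category."""
    return tag_map.get(source_tag, "X")

print(map_tag("NN"))    # NOUN
print(map_tag("VBD"))   # VERB
print(map_tag("FOO"))   # X
```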
Chunking & Parse Trees
The main change in chunkers & parsers is that the term node has been replaced with label. RegexpChunkParser now takes a <span class="pre">chunk_label</span> argument instead of <span class="pre">chunk_node</span>, while in the Tree class, the <span class="pre">node</span> attribute has been replaced with the <span class="pre">label()</span> method.
Classification
The SVM classifiers and the scipy-based <span class="pre">MaxentClassifier</span> algorithms (like <span class="pre">CG</span>) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
Python 3
NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you'll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because Python 3 randomizes string hashing by default, which makes dictionary iteration order vary between runs. This is a deliberate decision to improve security, but you can control it with the <span class="pre">PYTHONHASHSEED</span> environment variable. Just run <span class="pre">$ PYTHONHASHSEED=0 python</span> to get consistent dictionary ordering & accuracy metrics.
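You can verify the effect yourself. Because PYTHONHASHSEED must be set before the interpreter starts, this sketch spawns fresh interpreters via subprocess (the hash_of helper is just for illustration):

```python
import os
import subprocess
import sys

def hash_of(word, seed):
    """Return hash(word) as computed by a fresh interpreter started
    with the given PYTHONHASHSEED value."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({word!r}))"], env=env
    )
    return out.strip()

# With a fixed seed, two separate interpreter runs agree, so any
# code that depends on string hashing behaves reproducibly.
assert hash_of("nltk", "0") == hash_of("nltk", "0")
print("hashes are reproducible with PYTHONHASHSEED=0")
```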
Python 3 has also removed the separate <span class="pre">unicode</span> string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like <span class="pre">b"raw string"</span>, so you may need to decode these to normal strings before doing any further string processing.
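The conversion is a one-liner; a quick sketch, assuming the bytes are UTF-8 encoded (the usual case for NLTK corpora):

```python
raw = b"raw string"          # a byte string, as returned by some corpus functions
text = raw.decode("utf-8")   # decode to a normal (unicode) str

print(type(raw).__name__)    # bytes
print(type(text).__name__)   # str
print(text)                  # raw string
```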
Here are a few other Python 3 changes I ran into:
<span class="pre">itertools.izip</span> -> <span class="pre">zip</span>
<span class="pre">dict.iteritems()</span> doesn't exist; use <span class="pre">dict.items()</span> instead
<span class="pre">dict.keys()</span> does not produce a list (it returns a view); if you want a list, use <span class="pre">list(dict.keys())</span>
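All three changes in one quick sketch:

```python
d = {"a": 1, "b": 2}

# dict.iteritems() is gone; items() now returns a lazy view.
for key, value in d.items():
    pass

# dict.keys() returns a view, not a list.
keys = d.keys()
print(type(keys).__name__)  # dict_keys
print(list(keys))           # ['a', 'b'] -- wrap in list() when a list is needed

# itertools.izip is gone; the builtin zip is now lazy, like izip was.
pairs = zip(d.keys(), d.values())
print(list(pairs))          # [('a', 1), ('b', 2)]
```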
Upgrading
Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.