In previous installments on part-of-speech tagging, we saw that a Brill tagger provides significant accuracy improvements over n-gram taggers combined with regex and affix tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK includes a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag function. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier-based tagger compares to the Brill tagger.
NLTK Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
import itertools
from nltk.corpus import brown, conll2000, treebank

# brown: combine three categories, splitting each 2/3 train vs 1/3 test
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 / 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 / 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 / 3

brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
    brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
    brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

# conll2000: use the standard train/test files
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

# treebank: 2/3 train vs 1/3 test
treebank_cutoff = len(treebank.tagged_sents()) * 2 / 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
Naive Bayes Classifier Taggers
There are three new taggers referenced below:

- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents).
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger, by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt).
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.

The raubt tagger is the same as in part 2, and braubt is from part 3. postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method on each trained tagger with the test set. Here are the results:
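In essence, evaluate strips the gold tags from each test sentence, re-tags the words, and computes token-level accuracy. Here is a minimal pure-Python sketch of that computation (the evaluate function, noun_tagger, and toy gold data below are illustrative, not NLTK's actual code):

```python
def evaluate(tag_words, gold_sents):
    """Token-level accuracy of a tagging function against gold-tagged sentences."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, tag in sent]
        predicted = tag_words(words)  # list of (word, tag) pairs
        for (word, gold_tag), (_, pred_tag) in zip(sent, predicted):
            total += 1
            if pred_tag == gold_tag:
                correct += 1
    return float(correct) / total

# toy example: a tagger that calls everything a noun gets 1 of 3 tags right
gold = [[('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]]
noun_tagger = lambda words: [(w, 'NN') for w in words]
print(evaluate(noun_tagger, gold))
```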

Conclusions
The above results are quite interesting, and lead to a few conclusions:

- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time getting near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as craubt accuracy across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the braubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below), so it may not be worth the tradeoff in efficiency.
- Using a Brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same and re-evaluated:

cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)

The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the Naive Bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK developers will publish the training method so we can all know for sure.
Classification Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economical. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger and got the following results:
Tagger | secs/pass
-------|----------
raubt  | 0.00005
braubt | 0.00009
cpos   | 0.02219
bcpos  | 0.02259
postag | 0.01241
This was run with Python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3 GB of RAM, but the important thing is the relative times: braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66,666 words/sec, while cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier-based tagger if speed is an issue.
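The words/sec figures follow directly from the timings: each timeit pass tags the same 6-word test sentence, so throughput is simply 6 divided by secs/pass. The dict below just restates the table above:

```python
# secs/pass from the table above; each pass tags a 6-word sentence
timings = {'raubt': 0.00005, 'braubt': 0.00009,
           'cpos': 0.02219, 'bcpos': 0.02259, 'postag': 0.01241}
words_per_pass = 6  # 'And now for something completely different'

for name, secs in sorted(timings.items()):
    print('%s: %d words/sec' % (name, words_per_pass / secs))
```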
Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with an nltk.data-accessible path ending in .pickle for the load method.
import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)
File Size
There’s also a significant difference in the file size of the pickled taggers (trained on treebank):
Tagger | Size
-------|-----
raubt  | 272K
braubt | 273K
cpos   | 3.8M
bcpos  | 3.8M
postag | 8.2M
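To measure this for your own trained tagger, pickle it to disk and check the file size. A minimal sketch (the dict here is a stand-in for a trained tagger object, since training one requires the corpora; pass your tagger instance instead):

```python
import os
import pickle
import tempfile

def pickled_size(obj, path):
    """Pickle obj to path and return the resulting file size in bytes."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return os.path.getsize(path)

# stand-in for a trained tagger object
stand_in = {'the': 'DT', 'cat': 'NN', 'sat': 'VBD'}
path = os.path.join(tempfile.mkdtemp(), 'tagger.pickle')
print('%d bytes' % pickled_size(stand_in, path))
```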
Fin
I think there’s a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don’t even bother. In that case, stick with a simpler tagger that’s nearly as accurate and orders of magnitude faster.