NLTK Default Tagger Treebank Tag Coverage
For some research I'm doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I've added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger's results against the actual tags.
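The counting step described above can be sketched in a few lines. This is only an illustration of the idea, not the code in analyze_tagger_coverage.py: `count_tags` and `toy_tagger` are hypothetical names, and the toy tagger stands in for a real trained tagger.

```python
from collections import Counter

def count_tags(tagger, sents):
    """Tag every sentence and count how often each tag is produced."""
    counts = Counter()
    for sent in sents:
        for _word, tag in tagger(sent):
            counts[tag] += 1
    return counts

# Hypothetical stand-in for a real tagger: any callable that maps a list of
# words to a list of (word, tag) pairs will do.
def toy_tagger(words):
    return [(w, 'CD' if w.isdigit() else 'NN') for w in words]

sents = [['call', '911'], ['two', 'dogs']]
tag_counts = count_tags(toy_tagger, sents)
for tag in sorted(tag_counts):
    print('%s %d' % (tag, tag_counts[tag]))
```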
NLTK Default Tagger Performance on Treebank
Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I've explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn't have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).
| Tag | Found | Actual | Precision | Recall |
|---|---|---|---|---|
| # | 16 | 16 | 1 | 1 |
| $ | 724 | 724 | 1 | 1 |
| '' | 694 | 694 | 1 | 1 |
| , | 4887 | 4886 | 1 | 1 |
| -LRB- | 120 | 120 | 1 | 1 |
| -NONE- | 6591 | 6592 | 1 | 1 |
| -RRB- | 126 | 126 | 1 | 1 |
| . | 3874 | 3874 | 1 | 1 |
| : | 563 | 563 | 1 | 1 |
| CC | 2271 | 2265 | 1 | 1 |
| CD | 3547 | 3546 | 0.999 | 0.999 |
| DT | 8170 | 8165 | 1 | 1 |
| EX | 88 | 88 | 1 | 1 |
| FW | 4 | 4 | 1 | 1 |
| IN | 9880 | 9857 | 0.9913 | 0.958 |
| JJ | 5803 | 5834 | 0.9913 | 0.9789 |
| JJR | 386 | 381 | 1 | 0.9149 |
| JJS | 185 | 182 | 0.9667 | 1 |
| LS | 12 | 13 | 1 | 0.8571 |
| MD | 927 | 927 | 1 | 1 |
| NN | 13166 | 13166 | 0.9917 | 0.9879 |
| NNP | 9427 | 9410 | 0.9948 | 0.994 |
| NNPS | 246 | 244 | 0.9903 | 0.9533 |
| NNS | 6055 | 6047 | 0.9952 | 0.9972 |
| PDT | 21 | 27 | 1 | 0.6667 |
| POS | 824 | 824 | 1 | 1 |
| PRP | 1716 | 1716 | 1 | 1 |
| PRP$ | 766 | 766 | 1 | 1 |
| RB | 2800 | 2822 | 0.9931 | 0.975 |
| RBR | 130 | 136 | 1 | 0.875 |
| RBS | 33 | 35 | 1 | 0.5 |
| RP | 213 | 216 | 1 | 1 |
| SYM | 1 | 1 | 1 | 1 |
| TO | 2180 | 2179 | 1 | 1 |
| UH | 3 | 3 | 1 | 1 |
| VB | 2562 | 2554 | 0.9914 | 1 |
| VBD | 3035 | 3043 | 0.9902 | 0.9807 |
| VBG | 1458 | 1460 | 0.9965 | 0.9982 |
| VBN | 2145 | 2134 | 0.9885 | 0.9957 |
| VBP | 1318 | 1321 | 0.9931 | 0.9828 |
| VBZ | 2124 | 2125 | 0.9937 | 0.9906 |
| WDT | 440 | 445 | 1 | 0.8333 |
| WP | 241 | 241 | 1 | 1 |
| WP$ | 14 | 14 | 1 | 1 |
| WRB | 178 | 178 | 1 | 1 |
| `` | 712 | 712 | 1 | 1 |
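To make the column definitions concrete, here is a minimal sketch of how accuracy and per-tag Precision and Recall can be computed from two parallel tag lists. The function name `tag_metrics` and the toy data are assumptions for illustration; the script itself may compute these differently.

```python
from collections import Counter

def tag_metrics(gold_tags, found_tags):
    """Return overall accuracy plus per-tag (precision, recall)."""
    found = Counter(found_tags)    # the "Found" column
    actual = Counter(gold_tags)    # the "Actual" column
    correct = Counter(g for g, f in zip(gold_tags, found_tags) if g == f)
    accuracy = sum(correct.values()) / float(len(gold_tags))
    per_tag = {}
    for tag in set(found) | set(actual):
        # Precision drops below 1 on false positives, recall on false negatives.
        precision = correct[tag] / float(found[tag]) if found[tag] else None
        recall = correct[tag] / float(actual[tag]) if actual[tag] else None
        per_tag[tag] = (precision, recall)
    return accuracy, per_tag

# Toy data: the tagger mislabels the single JJ as NN.
gold = ['DT', 'NN', 'VBZ', 'JJ', 'NN']
found = ['DT', 'NN', 'VBZ', 'NN', 'NN']
accuracy, per_tag = tag_metrics(gold, found)
print('accuracy: %.2f' % accuracy)
```

In the toy data, NN picks up a false positive (precision below 1) while JJ suffers a false negative (recall of 0).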
Unknown Words in Treebank
Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it's not that bad, since there are only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar-looking tokens.
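Counts like these come from filtering tagged words by tag and comparing the total against the number of distinct words. A small sketch, using toy data in place of the real corpus (`words_with_tag` is a hypothetical helper):

```python
def words_with_tag(tagged_words, tag):
    """Return (total occurrences, set of unique words) for one tag."""
    words = [w for w, t in tagged_words if t == tag]
    return len(words), set(words)

# Toy data shaped like the (word, tag) pairs a tagged corpus reader yields.
tagged = [('*T*-91', '-NONE-'), ('said', 'VBD'), ('*-106', '-NONE-'),
          ('*T*-91', '-NONE-')]
total, unique = words_with_tag(tagged, '-NONE-')
print('%d total, %d unique' % (total, len(unique)))
```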
Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers
In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the Brill tagger.
NLTK Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
import itertools
from nltk.corpus import brown, conll2000, treebank

# Floor division (//) keeps each cutoff an int, so it works as a slice index.
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 // 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 // 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 // 3

# Train on the first 2/3 of each brown category, test on the remaining 1/3.
brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
    brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
    brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

# conll2000 ships with a standard train/test split.
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

# Same 2/3 vs 1/3 split for treebank.
treebank_cutoff = len(treebank.tagged_sents()) * 2 // 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
Naive Bayes Classifier Taggers
There are 3 new taggers referenced below:
- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents).
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt).
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.
The raubt tagger is the same as from part 2, and braubt is from part 3.
postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
Conclusions
The above results are quite interesting, and lead to a few conclusions:
- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time it can get near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the braubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
- Using a Brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.
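One plausible reading of the backoff observation (an assumption on my part, not something the accuracy numbers prove) is that a classifier always picks some tag, so a backoff tagger is never consulted. The toy classes below are not NLTK's; they only illustrate the general backoff pattern, where the next tagger is asked only when the current one returns None.

```python
class ToyTagger(object):
    """Toy backoff-style tagger: try choose_tag(), else defer to the backoff."""
    def __init__(self, choose_tag, backoff=None):
        self.choose_tag = choose_tag
        self.backoff = backoff
        self.backoff_calls = 0

    def tag_word(self, word):
        tag = self.choose_tag(word)
        if tag is None and self.backoff is not None:
            self.backoff_calls += 1
            return self.backoff.tag_word(word)
        return tag

default = ToyTagger(lambda w: 'NN')
# A lookup tagger returns None for unknown words, so it needs its backoff.
lookup = ToyTagger({'the': 'DT'}.get, backoff=default)
# A tagger that always answers (like a classifier) never reaches its backoff.
always = ToyTagger(lambda w: 'DT' if w == 'the' else 'NN', backoff=default)

for w in ['the', 'dog']:
    lookup.tag_word(w)
    always.tag_word(w)
print('lookup used backoff %d time(s), always used it %d time(s)'
      % (lookup.backoff_calls, always.backoff_calls))
```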
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the naive bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.
Classification Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economic. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:
| Tagger | secs/pass |
|---|---|
| raubt | 0.00005 |
| braubt | 0.00009 |
| cpos | 0.02219 |
| bcpos | 0.02259 |
| postag | 0.01241 |
This was run with Python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3 GB of RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
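The words/sec figures follow from the secs/pass table, assuming each pass tags the 6 tokens of the sentence 'And now for something completely different' (the 6-token count is my assumption; it is the value that reproduces the quoted rates). A quick check of the arithmetic:

```python
# secs/pass values copied from the timing table.
secs_per_pass = {'raubt': 0.00005, 'braubt': 0.00009, 'cpos': 0.02219,
                 'bcpos': 0.02259, 'postag': 0.01241}
words_per_pass = 6  # assumed token count of the timed sentence

words_per_sec = dict((name, words_per_pass / secs)
                     for name, secs in secs_per_pass.items())
speedup = secs_per_pass['cpos'] / secs_per_pass['braubt']

for name in sorted(words_per_sec):
    print('%-7s %8d words/sec' % (name, words_per_sec[name]))
print('braubt is %.1fx faster than cpos' % speedup)
```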
Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER in the load call with an nltk.data-accessible path that has a .pickle suffix.
import nltk, timeit

text = nltk.word_tokenize('And now for something completely different')
# The setup string runs once, so loading the pickle is not part of the timing.
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
# %s embeds the repr of the token list directly in the timed statement.
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print('timing postag 1000 times')
spent = t.timeit(number=1000)
print('took %.5f secs/pass' % (spent / 1000))
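If the tagger is already loaded in memory rather than pickled, the same measurement can be sketched by passing a callable to timeit instead of a statement string. `secs_per_pass` and `toy_tag` are hypothetical names; swap in a real tagger's tag method to get meaningful numbers.

```python
import timeit

def secs_per_pass(tag_fn, tokens, passes=1000):
    """Average seconds for one tag_fn(tokens) call over the given passes."""
    timer = timeit.Timer(lambda: tag_fn(tokens))
    return timer.timeit(number=passes) / passes

# Hypothetical stand-in tagger; pass a real tagger's bound tag method instead.
def toy_tag(tokens):
    return [(tok, 'NN') for tok in tokens]

tokens = 'And now for something completely different'.split()
elapsed = secs_per_pass(toy_tag, tokens)
print('took %.5f secs/pass' % elapsed)
```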
File Size
There's also a significant difference in the file size of the pickled taggers (trained on treebank):
| Tagger | Size |
|---|---|
| raubt | 272K |
| braubt | 273K |
| cpos | 3.8M |
| bcpos | 3.8M |
| postag | 8.2M |
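A rough way to compare sizes without writing files is to measure the length of the pickled bytes. The sketch below uses toy dictionaries as stand-ins for trained taggers, so only the technique carries over, not the numbers.

```python
import pickle

def pickled_size(obj):
    """Number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))

# Toy stand-ins: a small lookup model versus a much larger one.
small_model = {'the': 'DT', 'dog': 'NN'}
large_model = dict(('word%d' % i, 'NN') for i in range(10000))

small_size = pickled_size(small_model)
large_size = pickled_size(large_model)
print('%d bytes vs %d bytes' % (small_size, large_size))
```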
Fin
I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.




