Tag Archives: metrics

NLTK Default Tagger CoNLL2000 Tag Coverage

Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.

NLTK Default Tagger Performance on CoNLL2000

The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.

Tag Found Actual Precision Recall
# 46 47 1 1
$ 2122 2134 1 0.6
1811 1809 1 1
( 0 351 None 0
) 0 358 None 0
, 13160 13160 1 1
-LRB- 351 0 0 None
-NONE- 59 0 0 None
-RRB- 358 0 0 None
. 10800 10802 1 1
: 1288 1285 0.7143 1
CC 6589 6586 0.6875 0.7333
CD 10325 10233 0.972 0.9919
DT 22301 22355 0.7826 1
EX 229 254 1 1
FW 1 42 1 0.0455
IN 27798 27835 0.7315 0.7899
JJ 15370 16049 0.7372 0.7303
JJR 1114 1055 0.5412 0.575
JJS 611 451 0.6912 0.7966
LS 13 0 0 None
MD 2616 2637 0.7143 0.75
NN 38023 36789 0.7345 0.8441
NNP 24967 24690 0.8752 0.9421
NNPS 589 550 0.4553 0.3684
NNS 17068 16653 0.8572 0.9527
PDT 24 65 0.6667 1
POS 2224 2203 0.6667 1
PRP 4620 4634 0.8438 0.7941
PRP$ 2292 2302 0.6364 1
RB 7681 7961 0.8076 0.8582
RBR 288 392 0.5 0.3684
RBS 90 240 0.5 0.1667
RP 634 95 0.1176 1
SYM 0 6 None 0
TO 6257 6259 1 0.75
UH 2 17 1 0.1111
VB 6681 7286 0.9042 0.8313
VBD 8501 8424 0.7521 0.8605
VBG 3730 4000 0.8493 0.8603
VBN 5763 5867 0.8164 0.8721
VBP 3232 3407 0.6754 0.6638
VBZ 5224 5561 0.7273 0.6906
WDT 1156 1157 0.6 0.5
WP 637 639 1 1
WP$ 38 39 1 1
WRB 566 571 0.9 0.75
1855 1854 0.6667 1

Unknown Words in CoNLL2000

The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words combined with a “-“. You might think this can solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.

Missing Symbols and Rare Tags

The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the more rare tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportion. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.

NLTK Default Tagger Treebank Tag Coverage

For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.

NLTK Default Tagger Performance on Treebank

Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn’t have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).

Tag Found Actual Precision Recall
# 16 16 1 1
$ 724 724 1 1
694 694 1 1
, 4887 4886 1 1
-LRB- 120 120 1 1
-NONE- 6591 6592 1 1
-RRB- 126 126 1 1
. 3874 3874 1 1
: 563 563 1 1
CC 2271 2265 1 1
CD 3547 3546 0.999 0.999
DT 8170 8165 1 1
EX 88 88 1 1
FW 4 4 1 1
IN 9880 9857 0.9913 0.958
JJ 5803 5834 0.9913 0.9789
JJR 386 381 1 0.9149
JJS 185 182 0.9667 1
LS 12 13 1 0.8571
MD 927 927 1 1
NN 13166 13166 0.9917 0.9879
NNP 9427 9410 0.9948 0.994
NNPS 246 244 0.9903 0.9533
NNS 6055 6047 0.9952 0.9972
PDT 21 27 1 0.6667
POS 824 824 1 1
PRP 1716 1716 1 1
PRP$ 766 766 1 1
RB 2800 2822 0.9931 0.975
RBR 130 136 1 0.875
RBS 33 35 1 0.5
RP 213 216 1 1
SYM 1 1 1 1
TO 2180 2179 1 1
UH 3 3 1 1
VB 2562 2554 0.9914 1
VBD 3035 3043 0.9902 0.9807
VBG 1458 1460 0.9965 0.9982
VBN 2145 2134 0.9885 0.9957
VBP 1318 1321 0.9931 0.9828
VBZ 2124 2125 0.9937 0.9906
WDT 440 445 1 0.8333
WP 241 241 1 1
WP$ 14 14 1 1
WRB 178 178 1 1
712 712 1 1

Unknown Words in Treebank

Suprisingly, the treebank corpus contains 6592 words tags with -NONE-. But it’s not that bad, since it’s only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.