For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.
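To give a feel for what the coverage counting involves, here’s a minimal sketch (not the actual nltk-trainer implementation) that tags every treebank sentence with NLTK’s built-in pos_tag and tallies the tags it produces:

```python
from collections import Counter

import nltk
from nltk.corpus import treebank

# Tag every sentence with the default tagger and count each tag produced.
tag_counts = Counter()
for sent in treebank.sents():
    for word, tag in nltk.pos_tag(sent):
        tag_counts[tag] += 1

for tag, count in sorted(tag_counts.items()):
    print('%s: %d' % (tag, count))
```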
NLTK Default Tagger Performance on Treebank
Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and the table shows exactly which tags it fails on. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If Precision is less than 1, the tagger gave the tag to a word it shouldn’t have (a false positive). If Recall is less than 1, the tagger did not give the tag to a word it should have (a false negative).
Tag | Found | Actual | Precision | Recall |
--- | --- | --- | --- | --- |
# | 16 | 16 | 1 | 1 |
$ | 724 | 724 | 1 | 1 |
'' | 694 | 694 | 1 | 1 |
, | 4887 | 4886 | 1 | 1 |
-LRB- | 120 | 120 | 1 | 1 |
-NONE- | 6591 | 6592 | 1 | 1 |
-RRB- | 126 | 126 | 1 | 1 |
. | 3874 | 3874 | 1 | 1 |
: | 563 | 563 | 1 | 1 |
CC | 2271 | 2265 | 1 | 1 |
CD | 3547 | 3546 | 0.999 | 0.999 |
DT | 8170 | 8165 | 1 | 1 |
EX | 88 | 88 | 1 | 1 |
FW | 4 | 4 | 1 | 1 |
IN | 9880 | 9857 | 0.9913 | 0.958 |
JJ | 5803 | 5834 | 0.9913 | 0.9789 |
JJR | 386 | 381 | 1 | 0.9149 |
JJS | 185 | 182 | 0.9667 | 1 |
LS | 12 | 13 | 1 | 0.8571 |
MD | 927 | 927 | 1 | 1 |
NN | 13166 | 13166 | 0.9917 | 0.9879 |
NNP | 9427 | 9410 | 0.9948 | 0.994 |
NNPS | 246 | 244 | 0.9903 | 0.9533 |
NNS | 6055 | 6047 | 0.9952 | 0.9972 |
PDT | 21 | 27 | 1 | 0.6667 |
POS | 824 | 824 | 1 | 1 |
PRP | 1716 | 1716 | 1 | 1 |
PRP$ | 766 | 766 | 1 | 1 |
RB | 2800 | 2822 | 0.9931 | 0.975 |
RBR | 130 | 136 | 1 | 0.875 |
RBS | 33 | 35 | 1 | 0.5 |
RP | 213 | 216 | 1 | 1 |
SYM | 1 | 1 | 1 | 1 |
TO | 2180 | 2179 | 1 | 1 |
UH | 3 | 3 | 1 | 1 |
VB | 2562 | 2554 | 0.9914 | 1 |
VBD | 3035 | 3043 | 0.9902 | 0.9807 |
VBG | 1458 | 1460 | 0.9965 | 0.9982 |
VBN | 2145 | 2134 | 0.9885 | 0.9957 |
VBP | 1318 | 1321 | 0.9931 | 0.9828 |
VBZ | 2124 | 2125 | 0.9937 | 0.9906 |
WDT | 440 | 445 | 1 | 0.8333 |
WP | 241 | 241 | 1 | 1 |
WP$ | 14 | 14 | 1 | 1 |
WRB | 178 | 178 | 1 | 1 |
`` | 712 | 712 | 1 | 1 |
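To make the per-tag Precision and Recall above concrete, here’s a rough sketch of how they can be computed with nltk.metrics.precision and nltk.metrics.recall: for each tag, compare the set of token positions where the tagger produced the tag against the set of positions where the tag actually occurs. This is a simplified stand-in for the script’s real code.

```python
import collections

import nltk
from nltk.corpus import treebank
from nltk.metrics import precision, recall

# For each tag, record the token positions where it appears in the
# reference (actual) tags and in the default tagger's output.
ref_index = collections.defaultdict(set)
test_index = collections.defaultdict(set)

pos = 0
for tagged_sent in treebank.tagged_sents():
    words = [word for word, tag in tagged_sent]
    for (word, actual_tag), (word_, guessed_tag) in zip(tagged_sent, nltk.pos_tag(words)):
        ref_index[actual_tag].add(pos)
        test_index[guessed_tag].add(pos)
        pos += 1

# Per tag: precision = correct / found, recall = correct / actual.
for tag in sorted(ref_index):
    print('%s: precision=%s recall=%s' % (
        tag,
        precision(ref_index[tag], test_index[tag]),
        recall(ref_index[tag], test_index[tag])))
```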
Unknown Words in Treebank
Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it’s not that bad, since they amount to only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar-looking tokens.
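You can verify these counts yourself with a few lines; this sketch just filters treebank’s tagged words:

```python
from nltk.corpus import treebank

# Collect every word tagged -NONE- in the treebank corpus.
none_words = [word for word, tag in treebank.tagged_words() if tag == '-NONE-']

print(len(none_words))              # total occurrences: 6592
print(len(set(none_words)))         # unique words: 440
print(sorted(set(none_words))[:5])  # a sample of the odd-looking tokens
```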