NLTK Trainer includes two analysis scripts: one analyzes a tagged corpus, and the other analyzes the coverage of a part-of-speech tagger.
Analyze a Tagged Corpus
You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here are the tag counts for the treebank corpus:
$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags
Tag Count
======= =========
# 16
$ 724
'' 694
, 4886
-LRB- 120
-NONE- 6592
-RRB- 126
. 3874
: 563
CC 2265
CD 3546
DT 8165
EX 88
FW 4
IN 9857
JJ 5834
JJR 381
JJS 182
LS 13
MD 927
NN 13166
NNP 9410
NNPS 244
NNS 6047
PDT 27
POS 824
PRP 1716
PRP$ 766
RB 2822
RBR 136
RBS 35
RP 216
SYM 1
TO 2179
UH 3
VB 2554
VBD 3043
VBG 1460
VBN 2134
VBP 1321
VBZ 2125
WDT 445
WP 241
WP$ 14
WRB 178
`` 712
======= =========
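If you'd rather compute these counts directly in Python, here's a minimal sketch of the same idea using plain NLTK. This is not the script's actual implementation, and it assumes the treebank corpus data has already been downloaded (nltk.download('treebank')):

import nltk
from nltk.corpus import treebank

# Count how often each part-of-speech tag appears in the tagged corpus.
tagged_words = treebank.tagged_words()
tag_counts = nltk.FreqDist(tag for word, tag in tagged_words)

print(len(tagged_words), 'total words')
print(len(tag_counts), 'tags')

for tag, count in sorted(tag_counts.items()):
    print(tag, count)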
By default, analyze_tagged_corpus.py sorts by tags, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:
$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags
Tag Count
======= =========
7416
# 16
$ 724
'' 694
( 120
) 126
, 4886
. 3874
: 563
ADJ 6397
ADV 2993
CNJ 2265
DET 8192
EX 88
FW 4
L 13
MOD 927
N 19213
NP 9654
NUM 3546
P 9857
PRO 2698
S 1
TO 2179
UH 3
V 6000
VD 3043
VG 1460
VN 2134
WH 878
`` 712
======= =========
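The simplified tags above come from the simplified tagset used in older NLTK releases. In recent NLTK versions you can get a similar coarse-grained view by mapping Treebank tags onto the universal tagset. Here's a rough sketch (the tag names differ from the output above, and it assumes nltk.download('universal_tagset') has been run):

import nltk
from nltk.corpus import treebank

# Map each Treebank tag to the coarser universal tagset, then count.
simplified = nltk.FreqDist(
    tag for word, tag in treebank.tagged_words(tagset='universal'))

# most_common() gives the same ordering as --sort count --reverse.
for tag, count in simplified.most_common():
    print(tag, count)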
Analyze Tagger Coverage
You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here are the results for the treebank corpus using NLTK's default part-of-speech tagger:
$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Tag Found
======= =========
# 16
$ 724
'' 694
, 4887
-LRB- 120
-NONE- 6591
-RRB- 126
. 3874
: 563
CC 2271
CD 3547
DT 8170
EX 88
FW 4
IN 9880
JJ 5803
JJR 386
JJS 185
LS 12
MD 927
NN 13166
NNP 9427
NNPS 246
NNS 6055
PDT 21
POS 824
PRP 1716
PRP$ 766
RB 2800
RBR 130
RBS 33
RP 213
SYM 1
TO 2180
UH 3
VB 2562
VBD 3035
VBG 1458
VBN 2145
VBP 1318
VBZ 2124
WDT 440
WP 241
WP$ 14
WRB 178
`` 712
======= =========
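Here's a rough sketch of what the coverage analysis boils down to, using whatever tagger nltk.pos_tag_sents provides by default (which may not be the same pickled MaxEnt tagger shown above). It simply re-tags the corpus and counts the tags the tagger produces:

import nltk
from nltk.corpus import treebank

# Strip the gold tags, then re-tag the sentences with NLTK's default tagger.
sents = [[word for word, tag in sent] for sent in treebank.tagged_sents()]
tagged_sents = nltk.pos_tag_sents(sents)

# To check a pickled tagger of your own instead (hypothetical path):
#   import pickle
#   with open('PATH/TO/TAGGER.pickle', 'rb') as f:
#       tagger = pickle.load(f)
#   tagged_sents = tagger.tag_sents(sents)

found = nltk.FreqDist(tag for sent in tagged_sents for word, tag in sent)
for tag, count in sorted(found.items()):
    print(tag, count)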
If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag, by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:
$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Accuracy: 0.995689
Unknown words: 440
Tag Found Actual Precision Recall
======= ========= ========== ============= ==========
# 16 16 1.0 1.0
$ 724 724 1.0 1.0
'' 694 694 1.0 1.0
, 4887 4886 1.0 1.0
-LRB- 120 120 1.0 1.0
-NONE- 6591 6592 1.0 1.0
-RRB- 126 126 1.0 1.0
. 3874 3874 1.0 1.0
: 563 563 1.0 1.0
CC 2271 2265 1.0 1.0
CD 3547 3546 0.99895833333 0.99895833333
DT 8170 8165 1.0 1.0
EX 88 88 1.0 1.0
FW 4 4 1.0 1.0
IN 9880 9857 0.99130434782 0.95798319327
JJ 5803 5834 0.99134948096 0.97892938496
JJR 386 381 1.0 0.91489361702
JJS 185 182 0.96666666666 1.0
LS 12 13 1.0 0.85714285714
MD 927 927 1.0 1.0
NN 13166 13166 0.99166034874 0.98791540785
NNP 9427 9410 0.99477911646 0.99398073836
NNPS 246 244 0.99029126213 0.95327102803
NNS 6055 6047 0.99515235457 0.99722414989
PDT 21 27 1.0 0.66666666666
POS 824 824 1.0 1.0
PRP 1716 1716 1.0 1.0
PRP$ 766 766 1.0 1.0
RB 2800 2822 0.99305555555 0.975
RBR 130 136 1.0 0.875
RBS 33 35 1.0 0.5
RP 213 216 1.0 1.0
SYM 1 1 1.0 1.0
TO 2180 2179 1.0 1.0
UH 3 3 1.0 1.0
VB 2562 2554 0.99142857142 1.0
VBD 3035 3043 0.990234375 0.98065764023
VBG 1458 1460 0.99650349650 0.99824868651
VBN 2145 2134 0.98852223816 0.99566473988
VBP 1318 1321 0.99305555555 0.98281786941
VBZ 2124 2125 0.99373040752 0.990625
WDT 440 445 1.0 0.83333333333
WP 241 241 1.0 1.0
WP$ 14 14 1.0 1.0
WRB 178 178 1.0 1.0
`` 712 712 1.0 1.0
======= ========= ========== ============= ==========
These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question "for each word that was given this tag, was it correct?", while Recall answers the question "for all words that should have gotten this tag, did they get it?". If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.
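For reference, here's a sketch of how per-tag precision and recall can be computed with NLTK's metrics functions. It follows the definitions above, comparing the set of token positions the tagger gave a tag to the set of positions that actually have it; it is not necessarily the script's exact implementation:

from collections import defaultdict

import nltk
from nltk.corpus import treebank
from nltk.metrics import precision, recall

gold_sents = treebank.tagged_sents()
test_sents = nltk.pos_tag_sents(
    [[word for word, tag in sent] for sent in gold_sents])

# For each tag, collect the token positions that should have it (reference)
# and the positions the tagger actually gave it (test).
reference = defaultdict(set)
test = defaultdict(set)
for i, (gold, guess) in enumerate(zip(gold_sents, test_sents)):
    for j, ((word, gold_tag), (word2, guess_tag)) in enumerate(zip(gold, guess)):
        reference[gold_tag].add((i, j))
        test[guess_tag].add((i, j))

for tag in sorted(reference):
    print(tag, precision(reference[tag], test[tag]), recall(reference[tag], test[tag]))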