Analyzing Tagged Corpora and NLTK Part of Speech Taggers
NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.
Analyze a Tagged Corpus
You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here are the tag counts for the treebank corpus:
$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags
Tag Count
======= =========
# 16
$ 724
'' 694
, 4886
-LRB- 120
-NONE- 6592
-RRB- 126
. 3874
: 563
CC 2265
CD 3546
DT 8165
EX 88
FW 4
IN 9857
JJ 5834
JJR 381
JJS 182
LS 13
MD 927
NN 13166
NNP 9410
NNPS 244
NNS 6047
PDT 27
POS 824
PRP 1716
PRP$ 766
RB 2822
RBR 136
RBS 35
RP 216
SYM 1
TO 2179
UH 3
VB 2554
VBD 3043
VBG 1460
VBN 2134
VBP 1321
VBZ 2125
WDT 445
WP 241
WP$ 14
WRB 178
`` 712
======= =========
By default, analyze_tagged_corpus.py sorts by tags, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:
$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags
Tag Count
======= =========
7416
# 16
$ 724
'' 694
( 120
) 126
, 4886
. 3874
: 563
ADJ 6397
ADV 2993
CNJ 2265
DET 8192
EX 88
FW 4
L 13
MOD 927
N 19213
NP 9654
NUM 3546
P 9857
PRO 2698
S 1
TO 2179
UH 3
V 6000
VD 3043
VG 1460
VN 2134
WH 878
`` 712
======= =========
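Conceptually, these counts come from tallying the tag of every (word, tag) pair in the corpus. Here's a minimal sketch of that idea, using a tiny hand-made sample instead of a real corpus. The `tagged_words` list and the `tag_counts` helper are made up for illustration; this is not the actual implementation of analyze_tagged_corpus.py.

```python
from collections import Counter

# Hypothetical stand-in for a corpus reader's tagged words: (word, tag) pairs.
tagged_words = [
    ('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('.', '.'),
    ('The', 'DT'), ('cat', 'NN'), ('ran', 'VBD'), ('away', 'RB'), ('.', '.'),
]

def tag_counts(pairs):
    """Count how many times each tag occurs."""
    return Counter(tag for _, tag in pairs)

counts = tag_counts(tagged_words)
total_words = len(tagged_words)
unique_words = len(set(word for word, _ in tagged_words))

# Default ordering: sorted by tag.
by_tag = sorted(counts.items())
# Rough equivalent of --sort count --reverse: highest count first.
by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for tag, count in by_tag:
    print(tag, count)
```

Sorting the same counter two different ways mirrors the choice between the default tag ordering and the count ordering.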
Analyze Tagger Coverage
You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here are the results for the treebank corpus using NLTK's default part-of-speech tagger:
$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Tag Found
======= =========
# 16
$ 724
'' 694
, 4887
-LRB- 120
-NONE- 6591
-RRB- 126
. 3874
: 563
CC 2271
CD 3547
DT 8170
EX 88
FW 4
IN 9880
JJ 5803
JJR 386
JJS 185
LS 12
MD 927
NN 13166
NNP 9427
NNPS 246
NNS 6055
PDT 21
POS 824
PRP 1716
PRP$ 766
RB 2800
RBR 130
RBS 33
RP 213
SYM 1
TO 2180
UH 3
VB 2562
VBD 3035
VBG 1458
VBN 2145
VBP 1318
VBZ 2124
WDT 440
WP 241
WP$ 14
WRB 178
`` 712
======= =========
If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:
$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Accuracy: 0.995689
Unknown words: 440
Tag Found Actual Precision Recall
======= ========= ========== ============= ==========
# 16 16 1.0 1.0
$ 724 724 1.0 1.0
'' 694 694 1.0 1.0
, 4887 4886 1.0 1.0
-LRB- 120 120 1.0 1.0
-NONE- 6591 6592 1.0 1.0
-RRB- 126 126 1.0 1.0
. 3874 3874 1.0 1.0
: 563 563 1.0 1.0
CC 2271 2265 1.0 1.0
CD 3547 3546 0.99895833333 0.99895833333
DT 8170 8165 1.0 1.0
EX 88 88 1.0 1.0
FW 4 4 1.0 1.0
IN 9880 9857 0.99130434782 0.95798319327
JJ 5803 5834 0.99134948096 0.97892938496
JJR 386 381 1.0 0.91489361702
JJS 185 182 0.96666666666 1.0
LS 12 13 1.0 0.85714285714
MD 927 927 1.0 1.0
NN 13166 13166 0.99166034874 0.98791540785
NNP 9427 9410 0.99477911646 0.99398073836
NNPS 246 244 0.99029126213 0.95327102803
NNS 6055 6047 0.99515235457 0.99722414989
PDT 21 27 1.0 0.66666666666
POS 824 824 1.0 1.0
PRP 1716 1716 1.0 1.0
PRP$ 766 766 1.0 1.0
RB 2800 2822 0.99305555555 0.975
RBR 130 136 1.0 0.875
RBS 33 35 1.0 0.5
RP 213 216 1.0 1.0
SYM 1 1 1.0 1.0
TO 2180 2179 1.0 1.0
UH 3 3 1.0 1.0
VB 2562 2554 0.99142857142 1.0
VBD 3035 3043 0.990234375 0.98065764023
VBG 1458 1460 0.99650349650 0.99824868651
VBN 2145 2134 0.98852223816 0.99566473988
VBP 1318 1321 0.99305555555 0.98281786941
VBZ 2124 2125 0.99373040752 0.990625
WDT 440 445 1.0 0.83333333333
WP 241 241 1.0 1.0
WP$ 14 14 1.0 1.0
WRB 178 178 1.0 1.0
`` 712 712 1.0 1.0
======= ========= ========== ============= ==========
These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question "for each word that was given this tag, was it correct?", while Recall answers the question "for all words that should have gotten this tag, did they get it?". If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.
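To make the two questions concrete, here's a small sketch that computes Found, Actual, Precision and Recall per tag from two aligned tag lists, using the textbook per-token definitions. The `tag_metrics` helper and the sample data are invented for illustration, and the script may compute its numbers differently, so don't expect this to reproduce the output above digit for digit.

```python
from collections import Counter

def tag_metrics(actual_tags, found_tags):
    """Return {tag: (found, actual, precision, recall)} for aligned tag lists."""
    found = Counter(found_tags)
    actual = Counter(actual_tags)
    # A tag is correct at a position when the tagger's tag equals the actual tag.
    correct = Counter(f for a, f in zip(actual_tags, found_tags) if a == f)
    metrics = {}
    for tag in sorted(set(found) | set(actual)):
        # Precision: of the words given this tag, what fraction was right?
        precision = correct[tag] / found[tag] if found[tag] else None
        # Recall: of the words that should have this tag, what fraction got it?
        recall = correct[tag] / actual[tag] if actual[tag] else None
        metrics[tag] = (found[tag], actual[tag], precision, recall)
    return metrics

# Made-up example: one JJS word is mis-tagged as JJ.
actual = ['DT', 'JJS', 'NN', 'DT', 'JJS', 'NN']
found = ['DT', 'JJS', 'NN', 'DT', 'JJ', 'NN']
metrics = tag_metrics(actual, found)

for tag, row in metrics.items():
    print(tag, row)
```

In this toy data JJS keeps perfect Precision but loses Recall, while JJ picks up a false positive, which is the same kind of pattern described for PDT and JJS.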
Training Part of Speech Taggers with NLTK Trainer
NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.
Training Sequential Backoff Taggers
The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:
- a: AffixTagger
- u: UnigramTagger
- b: BigramTagger
- t: TrigramTagger
For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 - Ngram Taggers, you could do the following:
python train_tagger.py treebank --sequential ubt
You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
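If the idea of a backoff chain is new to you, here's a toy sketch of it in plain Python: each tagger answers when it can, and otherwise hands the word to the next tagger in the chain. `ToyUnigramTagger` and `ToyDefaultTagger` are hypothetical stand-ins written for this example. They are not NLTK's classes, and they ignore the context that bigram and trigram taggers use.

```python
from collections import Counter, defaultdict

class ToyUnigramTagger:
    """Toy unigram tagger: most frequent tag per known word, else back off."""

    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        # Remember only the most frequent tag seen for each word.
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.table:
            return self.table[word]
        # Unknown word: defer to the next tagger in the chain, if any.
        return self.backoff.tag_word(word) if self.backoff else None

class ToyDefaultTagger:
    """End of the chain: always answers with the same tag."""

    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag

train = [[('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')]]
chain = ToyUnigramTagger(train, backoff=ToyDefaultTagger('NN'))
tags = [chain.tag_word(w) for w in ['the', 'cat', 'runs']]
print(tags)
```

The order of the letters in --sequential decides which tagger gets the first chance to answer and which ones only see the leftovers.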
Training Affix Taggers
The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes, and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3 character suffix, and another that uses a 2 character prefix, specify the --affix argument twice:
python train_tagger.py treebank --sequential aubt --affix -3 --affix 2
The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
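The sign convention for N is easy to picture with string slicing. This is only a sketch of the convention described above; the `affix` helper is hypothetical and is not part of NLTK or nltk-trainer.

```python
def affix(word, n):
    """Return the affix selected by n: a prefix if n > 0, a suffix if n < 0."""
    if n == 0:
        raise ValueError('affix length must be non-zero')
    return word[:n] if n > 0 else word[n:]

prefix = affix('running', 2)   # what --affix 2 would look at
suffix = affix('running', -3)  # what --affix -3 would look at
print(prefix, suffix)
```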
Training Brill Taggers
To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 - Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:
python train_tagger.py treebank --sequential aubt --brill
The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.
Training Classifier Based Taggers
Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don't want this tagger to backoff to a sequential backoff tagger, be sure to specify --sequential ''. Here's an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 - Classifier Taggers:
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
If you do want to backoff to a sequential tagger, be sure to specify a cutoff probability, like so:
python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.
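Roughly speaking, the cutoff probability decides when the classifier's answer is trusted and when the word is passed on to the backoff tagger. The sketch below shows that decision with a plain dict of tag probabilities. The `choose_tag` function is a hypothetical illustration of the idea, not code from NLTK or nltk-trainer.

```python
def choose_tag(tag_probs, cutoff_prob=None):
    """Pick the most probable tag, or None to signal "use the backoff tagger"."""
    best_tag = max(tag_probs, key=tag_probs.get)
    if cutoff_prob is not None and tag_probs[best_tag] < cutoff_prob:
        # Not confident enough: let the backoff tagger handle this word.
        return None
    return best_tag

confident = choose_tag({'NN': 0.7, 'VB': 0.3}, cutoff_prob=0.4)
unsure = choose_tag({'NN': 0.35, 'VB': 0.33, 'JJ': 0.32}, cutoff_prob=0.4)
print(confident, unsure)
```

A higher cutoff sends more words to the sequential tagger, while a lower one keeps more decisions with the classifier.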
Phonetic Feature Options
You can also include phonetic algorithm features using the following arguments:
- --metaphone: Use metaphone feature
- --double-metaphone: Use double metaphone feature
- --soundex: Use soundex feature
- --nysiis: Use NYSIIS feature
- --caverphone: Use caverphone feature
These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).
I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.
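To show how a phonetic code can collapse spelling variations, here's a compact sketch of the classic American Soundex rules. It is written from scratch for this example and is not the advas code that nltk-trainer uses, so edge cases may be handled differently.

```python
# Consonant groups used by classic Soundex; vowels, h, w and y have no code.
SOUNDEX_CODES = {}
for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    """Return the 4-character Soundex code for word ('' if it has no letters)."""
    word = ''.join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ''
    digits = []
    prev = SOUNDEX_CODES.get(word[0], '')
    for ch in word[1:]:
        if ch in 'hw':
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, '')
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so repeated codes across vowels count
    # Keep the first letter, then pad or truncate to three digits.
    return (word[0].upper() + ''.join(digits) + '000')[:4]

for variant in ['color', 'colour', 'Robert', 'Rupert']:
    print(variant, soundex(variant))
```

Both spellings of "color" end up with the same code, which is the effect these features are meant to give the tagger.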
NLTK Default Tagger CoNLL2000 Tag Coverage
Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.
NLTK Default Tagger Performance on CoNLL2000
The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.
| Tag | Found | Actual | Precision | Recall |
|---|---|---|---|---|
| # | 46 | 47 | 1 | 1 |
| $ | 2122 | 2134 | 1 | 0.6 |
| '' | 1811 | 1809 | 1 | 1 |
| ( | 0 | 351 | None | 0 |
| ) | 0 | 358 | None | 0 |
| , | 13160 | 13160 | 1 | 1 |
| -LRB- | 351 | 0 | 0 | None |
| -NONE- | 59 | 0 | 0 | None |
| -RRB- | 358 | 0 | 0 | None |
| . | 10800 | 10802 | 1 | 1 |
| : | 1288 | 1285 | 0.7143 | 1 |
| CC | 6589 | 6586 | 0.6875 | 0.7333 |
| CD | 10325 | 10233 | 0.972 | 0.9919 |
| DT | 22301 | 22355 | 0.7826 | 1 |
| EX | 229 | 254 | 1 | 1 |
| FW | 1 | 42 | 1 | 0.0455 |
| IN | 27798 | 27835 | 0.7315 | 0.7899 |
| JJ | 15370 | 16049 | 0.7372 | 0.7303 |
| JJR | 1114 | 1055 | 0.5412 | 0.575 |
| JJS | 611 | 451 | 0.6912 | 0.7966 |
| LS | 13 | 0 | 0 | None |
| MD | 2616 | 2637 | 0.7143 | 0.75 |
| NN | 38023 | 36789 | 0.7345 | 0.8441 |
| NNP | 24967 | 24690 | 0.8752 | 0.9421 |
| NNPS | 589 | 550 | 0.4553 | 0.3684 |
| NNS | 17068 | 16653 | 0.8572 | 0.9527 |
| PDT | 24 | 65 | 0.6667 | 1 |
| POS | 2224 | 2203 | 0.6667 | 1 |
| PRP | 4620 | 4634 | 0.8438 | 0.7941 |
| PRP$ | 2292 | 2302 | 0.6364 | 1 |
| RB | 7681 | 7961 | 0.8076 | 0.8582 |
| RBR | 288 | 392 | 0.5 | 0.3684 |
| RBS | 90 | 240 | 0.5 | 0.1667 |
| RP | 634 | 95 | 0.1176 | 1 |
| SYM | 0 | 6 | None | 0 |
| TO | 6257 | 6259 | 1 | 0.75 |
| UH | 2 | 17 | 1 | 0.1111 |
| VB | 6681 | 7286 | 0.9042 | 0.8313 |
| VBD | 8501 | 8424 | 0.7521 | 0.8605 |
| VBG | 3730 | 4000 | 0.8493 | 0.8603 |
| VBN | 5763 | 5867 | 0.8164 | 0.8721 |
| VBP | 3232 | 3407 | 0.6754 | 0.6638 |
| VBZ | 5224 | 5561 | 0.7273 | 0.6906 |
| WDT | 1156 | 1157 | 0.6 | 0.5 |
| WP | 637 | 639 | 1 | 1 |
| WP$ | 38 | 39 | 1 | 1 |
| WRB | 566 | 571 | 0.9 | 0.75 |
| `` | 1855 | 1854 | 0.6667 | 1 |
Unknown Words in CoNLL2000
The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here's a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words combined with a "-". You might think this can be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the "-" would be incorrect.
Missing Symbols and Rare Tags
The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the more rare tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportion. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.
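One quick way to spot this kind of tagset mismatch is to look for tags that only ever show up on one side. The sketch below does that with a handful of (Found, Actual) pairs copied from the conll2000 table above; the `coverage` dict and variable names are just for illustration.

```python
# (found, actual) counts copied from a few rows of the conll2000 table.
coverage = {
    '(': (0, 351),
    ')': (0, 358),
    '-LRB-': (351, 0),
    '-RRB-': (358, 0),
    '-NONE-': (59, 0),
    'LS': (13, 0),
    'SYM': (0, 6),
    'NN': (38023, 36789),
}

# Tags the corpus uses but the tagger never produced.
never_found = sorted(tag for tag, (found, actual) in coverage.items()
                     if found == 0 and actual > 0)
# Tags the tagger produced that the corpus never uses.
never_actual = sorted(tag for tag, (found, actual) in coverage.items()
                      if actual == 0 and found > 0)

print('in corpus, never found:', never_found)
print('found, never in corpus:', never_actual)
```

Notice that the Found counts for -LRB- and -RRB- match the Actual counts for ( and ), which suggests the tagger is recognizing the parentheses but labeling them with treebank's names.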
NLTK Default Tagger Treebank Tag Coverage
For some research I'm doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I've added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger's results against the actual tags.
NLTK Default Tagger Performance on Treebank
Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I've explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn't have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).
| Tag | Found | Actual | Precision | Recall |
|---|---|---|---|---|
| # | 16 | 16 | 1 | 1 |
| $ | 724 | 724 | 1 | 1 |
| '' | 694 | 694 | 1 | 1 |
| , | 4887 | 4886 | 1 | 1 |
| -LRB- | 120 | 120 | 1 | 1 |
| -NONE- | 6591 | 6592 | 1 | 1 |
| -RRB- | 126 | 126 | 1 | 1 |
| . | 3874 | 3874 | 1 | 1 |
| : | 563 | 563 | 1 | 1 |
| CC | 2271 | 2265 | 1 | 1 |
| CD | 3547 | 3546 | 0.999 | 0.999 |
| DT | 8170 | 8165 | 1 | 1 |
| EX | 88 | 88 | 1 | 1 |
| FW | 4 | 4 | 1 | 1 |
| IN | 9880 | 9857 | 0.9913 | 0.958 |
| JJ | 5803 | 5834 | 0.9913 | 0.9789 |
| JJR | 386 | 381 | 1 | 0.9149 |
| JJS | 185 | 182 | 0.9667 | 1 |
| LS | 12 | 13 | 1 | 0.8571 |
| MD | 927 | 927 | 1 | 1 |
| NN | 13166 | 13166 | 0.9917 | 0.9879 |
| NNP | 9427 | 9410 | 0.9948 | 0.994 |
| NNPS | 246 | 244 | 0.9903 | 0.9533 |
| NNS | 6055 | 6047 | 0.9952 | 0.9972 |
| PDT | 21 | 27 | 1 | 0.6667 |
| POS | 824 | 824 | 1 | 1 |
| PRP | 1716 | 1716 | 1 | 1 |
| PRP$ | 766 | 766 | 1 | 1 |
| RB | 2800 | 2822 | 0.9931 | 0.975 |
| RBR | 130 | 136 | 1 | 0.875 |
| RBS | 33 | 35 | 1 | 0.5 |
| RP | 213 | 216 | 1 | 1 |
| SYM | 1 | 1 | 1 | 1 |
| TO | 2180 | 2179 | 1 | 1 |
| UH | 3 | 3 | 1 | 1 |
| VB | 2562 | 2554 | 0.9914 | 1 |
| VBD | 3035 | 3043 | 0.9902 | 0.9807 |
| VBG | 1458 | 1460 | 0.9965 | 0.9982 |
| VBN | 2145 | 2134 | 0.9885 | 0.9957 |
| VBP | 1318 | 1321 | 0.9931 | 0.9828 |
| VBZ | 2124 | 2125 | 0.9937 | 0.9906 |
| WDT | 440 | 445 | 1 | 0.8333 |
| WP | 241 | 241 | 1 | 1 |
| WP$ | 14 | 14 | 1 | 1 |
| WRB | 178 | 178 | 1 | 1 |
| `` | 712 | 712 | 1 | 1 |
Unknown Words in Treebank
Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it's not that bad, since it's only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.
Announcing Text Processing APIs
If you liked the NLTK demos, then you'll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you'd like to do more, please fill out this survey to let me know what your needs are.
Announcing Python NLTK Demos
If you want to see what NLTK can do, but don't want to go thru the effort of installation and learning how to use it, then check out my Python NLTK demos.
It currently demonstrates the following functionality:
- part-of-speech tagging with the default NLTK pos tagger
- chunking and named entity recognition with the default NLTK chunker
- sentiment analysis with a combination of a naive bayes classifier and a maximum entropy classifier, both trained on the movie reviews corpus
If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.
Other Natural Language Processing Demos
Here's a list of similar resources on the web:
- A demo of the Stanford Parser with a javascript API: Natural-language Parsing For The Web
- A demo of the FreeLing language analysis suite: FreeLing Demo
- Emotional identification from text: EmoLib
Linguistic and Natural Language Processing Links
A number of links related to natural language processing and linguistics:
- What’s the Difference Between Stemming and Lemmatization? - Ask Dr. Search
- A List of Social Tagging Datasets Made Available for Research
- Social Signaling and Language Use
- Lexical Growth in the Blogosphere
- Spelling correction using the Python Natural Language Toolkit (nltk)
- OPUS - an open source parallel corpus
- Evaluating POS Taggers: The Contenders
- Text Analytics Wiki
Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers
In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.
NLTK Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
import itertools
from nltk.corpus import brown, conll2000, treebank
# use // so each cutoff is an integer slice index (plain / gives a float on Python 3)
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 // 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 // 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 // 3
# first 2/3 of each category for training, remaining 1/3 for testing
brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
    brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
    brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))
# conll2000 ships with a standard train/test split
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')
treebank_cutoff = len(treebank.tagged_sents()) * 2 // 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
Naive Bayes Classifier Taggers
There are 3 new taggers referenced below:
- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents)
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.
The raubt tagger is the same as from part 2, and braubt is from part 3.
postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
Conclusions
The above results are quite interesting, and lead to a few conclusions:
- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the braubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
- Using a brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the naive bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.
Classification Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economical. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:
| Tagger | secs/pass |
|---|---|
| raubt | 0.00005 |
| braubt | 0.00009 |
| cpos | 0.02219 |
| bcpos | 0.02259 |
| postag | 0.01241 |
This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
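If you want to check the arithmetic, the words/sec figures fall out of dividing the number of words tagged per pass by the secs/pass values in the timing table. The sketch below assumes each pass tags the 6-word sentence 'And now for something completely different'; that word count is an assumption inferred from the numbers, and the variable names are only for illustration.

```python
# secs/pass values copied from the timing table.
secs_per_pass = {
    'raubt': 0.00005,
    'braubt': 0.00009,
    'cpos': 0.02219,
    'bcpos': 0.02259,
    'postag': 0.01241,
}
# Assumption: every pass tags the same 6-word sentence.
WORDS_PER_PASS = 6

words_per_sec = {name: WORDS_PER_PASS / secs
                 for name, secs in secs_per_pass.items()}
speedup = secs_per_pass['cpos'] / secs_per_pass['braubt']

for name, rate in sorted(words_per_sec.items()):
    print('%-7s %9.0f words/sec' % (name, rate))
print('braubt is about %.1f times faster than cpos' % speedup)
```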
Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with a nltk.data accessible path with a .pickle suffix for the load method.
import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)
File Size
There's also a significant difference in the file size of the pickled taggers (trained on treebank):
| Tagger | Size |
|---|---|
| raubt | 272K |
| braubt | 273K |
| cpos | 3.8M |
| bcpos | 3.8M |
| postag | 8.2M |
Fin
I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.
Execnet vs Disco for Distributed NLTK
There are a number of options for distributed processing and mapreduce in python. Before execnet surfaced, I'd been using Disco to do distributed NLTK. Now that I've happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.
Disco Overhead
Disco is a mapreduce framework for python, with an erlang core. This is very cool, but unfortunately introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, your mapreduce task can take orders of magnitude longer to complete.
As an example, let's say you need to do 6000 map calls, at 1 second of pure computation each. That's 100 minutes, not counting overhead. Now add in the 2s fixed overhead on each call, and you're at 300 minutes. What should be just over 1.6 hours of computation has jumped to 5 hours.
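Here's the same back-of-the-envelope calculation as a few lines of Python, in case you want to plug in your own numbers. The constant and function names are made up for this sketch.

```python
MAP_CALLS = 6000
COMPUTE_SECS = 1.0   # pure computation per map call
OVERHEAD_SECS = 2.0  # fixed cost of unpickling the tagger on every call

def total_minutes(calls, compute_secs, overhead_secs=0.0):
    """Total time in minutes when every call pays compute plus overhead."""
    return calls * (compute_secs + overhead_secs) / 60.0

without_overhead = total_minutes(MAP_CALLS, COMPUTE_SECS)
with_overhead = total_minutes(MAP_CALLS, COMPUTE_SECS, OVERHEAD_SECS)

print('no overhead:   %.0f minutes' % without_overhead)
print('with overhead: %.0f minutes' % with_overhead)
```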
Execnet FTW
execnet provides a very different computational model: start some gateways and communicate thru message channels. In my case, all the fixed overhead can be done up-front, loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to change my old Disco based code to work with execnet, but I actually ended up with less code that's easier to understand.
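The difference between the two models comes down to how many times the expensive load happens. This toy sketch counts the loads for each style without using Disco or execnet at all. `CountingLoader` and the two functions are hypothetical stand-ins, and the "tagger" just labels everything NN.

```python
class CountingLoader:
    """Stand-in for an expensive unpickle; counts how often it runs."""

    def __init__(self):
        self.loads = 0

    def load(self):
        self.loads += 1
        # Dummy tagger: tags every word as NN.
        return lambda words: [(w, 'NN') for w in words]

def map_with_inline_load(loader, docs):
    # mapreduce style: every map call has to load the tagger itself
    return [loader.load()(doc) for doc in docs]

def worker_with_upfront_load(loader, docs):
    # gateway style: load once, then handle every message that arrives
    tagger = loader.load()
    return [tagger(doc) for doc in docs]

docs = [['hello', 'world'], ['foo'], ['bar', 'baz']]

inline = CountingLoader()
inline_results = map_with_inline_load(inline, docs)

upfront = CountingLoader()
upfront_results = worker_with_upfront_load(upfront, docs)

print('inline loads: %d, upfront loads: %d' % (inline.loads, upfront.loads))
```

Both styles produce the same results; only the number of loads changes, and that count is what gets multiplied by the fixed overhead.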
Conclusion
If you're just doing pure mapreduce computations, then consider using Disco. After the one time setup (which can be non-trivial), writing the functions will be relatively easy, and you'll get a nice web UI for configuration and monitoring. But if you're doing any dirty operations that need expensive initialization procedures, or can't quite fit what you need into a pure mapreduce framework, then execnet is for you.
Distributed NLTK with execnet
(for a Belorussian translation of this article, go here)
Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don't know how to distribute NLTK across many cores?
Well, here's how you can use execnet to do distributed part of speech tagging with NLTK.
execnet
execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in python. With it, you can start python shells over ssh, send code and/or data, then receive results. Below are 2 scripts that will test the accuracy of NLTK's recommended part of speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.
Runner
The runner does the following:
- Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
- Loads and pickles the default NLTK part of speech tagger.
- Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
- Sends the pickled tagger and the name of a corpus (brown) thru the channel.
- Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
- Finally, it creates a receive queue and prints the accuracy response from each channel.
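The "alternating channels" step is plain round-robin distribution. Here's a small standalone sketch of that idea, with list indexes standing in for channels. The `round_robin` helper and the short fileid list are made up for illustration and are not part of the runner script.

```python
from itertools import cycle

def round_robin(items, n_channels):
    """Assign items to channel indexes in turn: 0, 1, ..., n-1, 0, 1, ..."""
    assignments = [[] for _ in range(n_channels)]
    for index, item in zip(cycle(range(n_channels)), items):
        assignments[index].append(item)
    return assignments

# A few sample fileids; the real list comes from the corpus reader.
fileids = ['ca01', 'ca02', 'ca03', 'ca04', 'ca05']
per_channel = round_robin(fileids, 2)

for chan, assigned in enumerate(per_channel):
    print('channel %d gets %s' % (chan, assigned))
```

Round-robin keeps the number of files per channel within one of each other, though it doesn't account for some files being bigger than others.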
run_tag_files.py
import execnet
import nltk.tag, nltk.data
import cPickle as pickle
import tag_files
HOSTS = {
    'localhost': 2
}
NICE = 20
channels = []
tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))
for host, count in HOSTS.items():
    print 'opening %d gateways at %s' % (count, host)
    for i in range(count):
        gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
        channel = gw.remote_exec(tag_files)
        channels.append(channel)
        channel.send(tagger)
        channel.send('brown')
count = 0
chan = 0
for fileid in nltk.corpus.brown.fileids():
    print 'sending %s to channel %d' % (fileid, chan)
    channels[chan].send(fileid)
    count += 1
    # alternate channels
    chan += 1
    if chan >= len(channels): chan = 0
multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()
for i in range(count):
    channel, response = queue.get()
    print response
Remote Module
The remote module is much simpler.
- Receives and unpickles the tagger.
- Receives the corpus name and loads it.
- For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.
tag_files.py
import nltk.corpus
import cPickle as pickle
if __name__ == '__channelexec__':
    tagger = pickle.loads(channel.receive())
    corpus_name = channel.receive()
    corpus = getattr(nltk.corpus, corpus_name)
    for fileid in channel:
        accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
        channel.send('%s: %f' % (fileid, accuracy))
Putting it all together
Make sure you have NLTK and the corpus data installed on every host. You must also have passwordless ssh access to each host from the master host (the machine you run run_tag_files.py on).
run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways followed by a bunch of send messages. Then, just wait and watch the accuracy responses to see how accurate the built in part of speech tagger is on the brown corpus.
If you'd like to test the accuracy on a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus.
If you want to test your own tagger, pickle it to a file, then load and send it instead of NLTK's tagger. Or you can train it on the master first, then send it once training is complete.
Distributed File Processing
In practice, it's often a PITA to make sure every host has every file you want to process, and you'll want to process files outside of NLTK's builtin corpora. My recommendation is to set up a GlusterFS storage cluster so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.
