Tag Archives: nltk

PyCon NLTK Tutorial Suggestions

PyCon 2012 just released a CFP, and NLTK shows up 3 times in the suggested topics. While I’ve never done this before, I know stuff about Text Processing with NLTK so I’m going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3 hour class on NLTK, what knowledge & skills would you like to come away with? Here are a few specific topics I could cover:

  • part-of-speech tagging & chunking
  • text classification
  • creating a custom corpus and corpus reader
  • training custom models (manually and/or with nltk-trainer)
  • bootstrapping a custom corpus for text classification

Or I could do a high-level survey of many NLTK modules and corpora. Please let me know what you think in the comments, if you plan on going to PyCon 2012, and if you’d want to attend a tutorial on NLTK. You can also contact me directly if you prefer.

Co-Hosting

If you’ve done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you’re a computational linguist or NLP’er and/or have used NLTK professionally), I’d be happy to work with a co-host. Contact me if you’re interested, or leave a note in the comments.

Programming Collective Intelligence Review

Programming Collective Intelligence is a great conceptual introduction to many common machine learning algorithms and techniques. It covers classification algorithms such as Naive Bayes and Neural Networks, and algorithmic optimization approaches like Genetic Programming. The book also manages to pick interesting example applications, such as stock price prediction and topic identification.

There are two chapters in particular that stand out to me. First is Chapter 6, which covers Naive Bayes classification. What stood out was that the algorithm presented is an online learner, which means it can be updated as data comes in, unlike the NLTK NaiveBayesClassifier, which can be trained only once. Another thing that caught my attention was Fisher’s method, which is not implemented in NLTK, but could be with a little work. Apparently Fisher’s method is great for spam filtering, and is used by the SpamBayes Outlook plugin (which is also written in Python).
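
To make the "online learner" idea concrete, here is a minimal sketch of a Naive Bayes classifier that only keeps counts, so each new example is a cheap update rather than a full retrain. This is not the book's code and not NLTK's API; the class name, method names, and the add-one smoothing are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Toy count-based Naive Bayes that can be updated one example at a time."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, features, label):
        # Learning is just incrementing counts, so it can happen at any time.
        self.label_counts[label] += 1
        for feature in features:
            self.feature_counts[label][feature] += 1
            self.vocab.add(feature)

    def classify(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = OnlineNaiveBayes()
nb.update(["cheap", "pills"], "spam")
nb.update(["meeting", "tomorrow"], "ham")
first_guess = nb.classify(["cheap"])

# New data arrives later; no retraining from scratch is needed.
nb.update(["lunch", "meeting"], "ham")
later_guess = nb.classify(["lunch"])
```

The point of the design is that the model's state is nothing but counters, which is what makes incremental updates possible.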

Second, I found Chapter 9, which covers Support Vector Machines and Kernel Methods, to be quite intuitive. It explains the idea by starting with examples of linear classification and its shortfalls. But then the examples show that by scaling the data in a particular way first, linear classification suddenly becomes possible. And the kernel trick is simply a neat and efficient way to reduce the amount of calculation necessary to train a classifier on scaled data.
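
A tiny sketch of both ideas, using made-up one-dimensional data rather than anything from the book. The first part shows data that no single threshold can split until it is squared. The second part shows a polynomial kernel producing the same number as an explicit feature mapping followed by a dot product, which is the calculation the kernel trick avoids. The function names and the choice of kernel are assumptions for illustration.

```python
import math

inner = [-1.0, 0.0, 1.0]        # one class sits near zero
outer = [-4.0, -3.0, 3.0, 4.0]  # the other class sits on both sides of it


def separable_by_threshold(a, b):
    """True if a single cut point puts all of a on one side and all of b on the other."""
    return max(a) < min(b) or max(b) < min(a)


raw_ok = separable_by_threshold(inner, outer)
squared_ok = separable_by_threshold([x * x for x in inner], [x * x for x in outer])


def phi(x):
    """Explicit feature map whose dot product equals the kernel below."""
    return (x * x, math.sqrt(2) * x, 1.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel: (x*y + 1)^2, computed without building phi."""
    return (x * y + 1.0) ** 2


explicit = dot(phi(3.0), phi(-4.0))
via_kernel = poly_kernel(3.0, -4.0)
```

In one dimension the saving is trivial, but the same identity holds when the explicit mapping would have far too many dimensions to compute directly.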

The final chapter summarizes all the key algorithms, and for many it includes commentary on their strengths and weaknesses. This seems like valuable reference material, especially for when you have a new data set to learn from, and you’re not sure which algorithms will help get the results you’re looking for. Overall, I found Programming Collective Intelligence to be an enjoyable read on my Kindle 3, and highly recommend it to anyone getting started with machine learning and Python, as well as anyone interested in a general survey of machine learning algorithms.

Bay Area NLP Meetup

This Thursday, July 7, 2011, will be the first meeting of the Bay Area NLP group, at Chomp HQ in San Francisco, where I will be giving a talk on NLTK titled “NLTK: the Good, the Bad, and the Awesome”. I’ll be sharing some of the things I’ve learned using NLTK, operating text-processing.com, and doing random consulting on natural language processing. I’ll also explain why NLTK-Trainer exists and how awesome it is for training NLP models. So if you’re in the area and have some time Thursday evening, come by and say hi.

Update on 07/10/2011: slides are online from my talk: NLTK: the Good, the Bad, and the Awesome.

Interview and Article about NLTK and Text-Processing

I recently did an interview with Zoltan Varju (@zoltanvarju) about Python, NLTK, and my demos & APIs at text-processing.com, which you can read here. There’s even a bit about Erlang & functional programming, as well as some insight into what I’ve been working on at Weotta. And last week, the text-processing.com API got a write up (and a nice traffic boost) from Garrett Wilkin (@garrettwilkin) on programmableweb.com.

Analyzing Tagged Corpora and NLTK Part of Speech Taggers

NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.

Analyze a Tagged Corpus

You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here are the tag counts for the treebank corpus:

$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags

  Tag      Count
=======  =========
#               16
$              724
''             694
,             4886
-LRB-          120
-NONE-        6592
-RRB-          126
.             3874
:              563
CC            2265
CD            3546
DT            8165
EX              88
FW               4
IN            9857
JJ            5834
JJR            381
JJS            182
LS              13
MD             927
NN           13166
NNP           9410
NNPS           244
NNS           6047
PDT             27
POS            824
PRP           1716
PRP$           766
RB            2822
RBR            136
RBS             35
RP             216
SYM              1
TO            2179
UH               3
VB            2554
VBD           3043
VBG           1460
VBN           2134
VBP           1321
VBZ           2125
WDT            445
WP             241
WP$             14
WRB            178
``             712
=======  =========

By default, analyze_tagged_corpus.py sorts by tags, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:

$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags

  Tag      Count
=======  =========
              7416
#               16
$              724
''             694
(              120
)              126
,             4886
.             3874
:              563
ADJ           6397
ADV           2993
CNJ           2265
DET           8192
EX              88
FW               4
L               13
MOD            927
N            19213
NP            9654
NUM           3546
P             9857
PRO           2698
S                1
TO            2179
UH               3
V             6000
VD            3043
VG            1460
VN            2134
WH             878
``             712
=======  =========
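
If you are curious what this kind of analysis boils down to, here is a rough sketch of counting tags and sorting them both ways. It is not the script's actual code; the toy tagged_words list stands in for what a corpus reader's tagged words would provide.

```python
from collections import Counter

# Hypothetical (word, tag) pairs standing in for a tagged corpus.
tagged_words = [
    ("The", "DT"), ("dog", "NN"), ("chased", "VBD"),
    ("the", "DT"), ("cat", "NN"), (".", "."),
]

total_words = len(tagged_words)
unique_words = len({word for word, _ in tagged_words})
tag_counts = Counter(tag for _, tag in tagged_words)

# Default ordering: alphabetical by tag.
by_tag = sorted(tag_counts.items())

# Ordering comparable to --sort count --reverse: highest count first.
by_count = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)

for tag, count in by_tag:
    print("%-7s  %9d" % (tag, count))
```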

Analyze Tagger Coverage

You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here are the results for the treebank corpus using NLTK’s default part-of-speech tagger:

$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger

  Tag      Found
=======  =========
#               16
$              724
''             694
,             4887
-LRB-          120
-NONE-        6591
-RRB-          126
.             3874
:              563
CC            2271
CD            3547
DT            8170
EX              88
FW               4
IN            9880
JJ            5803
JJR            386
JJS            185
LS              12
MD             927
NN           13166
NNP           9427
NNPS           246
NNS           6055
PDT             21
POS            824
PRP           1716
PRP$           766
RB            2800
RBR            130
RBS             33
RP             213
SYM              1
TO            2180
UH               3
VB            2562
VBD           3035
VBG           1458
VBN           2145
VBP           1318
VBZ           2124
WDT            440
WP             241
WP$             14
WRB            178
``             712
=======  =========

If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:

$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger

Accuracy: 0.995689
Unknown words: 440

  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               16          16  1.0            1.0
$              724         724  1.0            1.0
''             694         694  1.0            1.0
,             4887        4886  1.0            1.0
-LRB-          120         120  1.0            1.0
-NONE-        6591        6592  1.0            1.0
-RRB-          126         126  1.0            1.0
.             3874        3874  1.0            1.0
:              563         563  1.0            1.0
CC            2271        2265  1.0            1.0
CD            3547        3546  0.99895833333  0.99895833333
DT            8170        8165  1.0            1.0
EX              88          88  1.0            1.0
FW               4           4  1.0            1.0
IN            9880        9857  0.99130434782  0.95798319327
JJ            5803        5834  0.99134948096  0.97892938496
JJR            386         381  1.0            0.91489361702
JJS            185         182  0.96666666666  1.0
LS              12          13  1.0            0.85714285714
MD             927         927  1.0            1.0
NN           13166       13166  0.99166034874  0.98791540785
NNP           9427        9410  0.99477911646  0.99398073836
NNPS           246         244  0.99029126213  0.95327102803
NNS           6055        6047  0.99515235457  0.99722414989
PDT             21          27  1.0            0.66666666666
POS            824         824  1.0            1.0
PRP           1716        1716  1.0            1.0
PRP$           766         766  1.0            1.0
RB            2800        2822  0.99305555555  0.975
RBR            130         136  1.0            0.875
RBS             33          35  1.0            0.5
RP             213         216  1.0            1.0
SYM              1           1  1.0            1.0
TO            2180        2179  1.0            1.0
UH               3           3  1.0            1.0
VB            2562        2554  0.99142857142  1.0
VBD           3035        3043  0.990234375    0.98065764023
VBG           1458        1460  0.99650349650  0.99824868651
VBN           2145        2134  0.98852223816  0.99566473988
VBP           1318        1321  0.99305555555  0.98281786941
VBZ           2124        2125  0.99373040752  0.990625
WDT            440         445  1.0            0.83333333333
WP             241         241  1.0            1.0
WP$             14          14  1.0            1.0
WRB            178         178  1.0            1.0
``             712         712  1.0            1.0
=======  =========  ==========  =============  ==========

These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question “for each word that was given this tag, was it correct?”, while Recall answers the question “for all words that should have gotten this tag, did they get it?”. If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.
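
Here is a minimal sketch of per-tag precision and recall, computed token by token from two aligned tag sequences. It uses the textbook definitions and made-up data; the script itself may compute its numbers differently, so do not expect this to reproduce the table above exactly. The function name is hypothetical.

```python
from collections import Counter


def tag_precision_recall(actual_tags, found_tags):
    """Return {tag: (precision, recall)} for two aligned lists of tags."""
    found = Counter(found_tags)
    actual = Counter(actual_tags)
    correct = Counter(a for a, f in zip(actual_tags, found_tags) if a == f)

    metrics = {}
    for tag in set(found) | set(actual):
        # Precision: of the words given this tag, how many were right?
        precision = correct[tag] / found[tag] if found[tag] else None
        # Recall: of the words that should have this tag, how many got it?
        recall = correct[tag] / actual[tag] if actual[tag] else None
        metrics[tag] = (precision, recall)
    return metrics


actual = ["DT", "NN", "PDT", "PDT", "PDT", "JJ"]
found = ["DT", "NN", "PDT", "PDT", "DT", "JJ"]
metrics = tag_precision_recall(actual, found)
```

In this toy data every PDT the tagger produced is correct (precision 1.0), but one real PDT was tagged DT, so PDT recall drops to 2/3 while DT precision drops to 1/2.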

Training Part of Speech Taggers with NLTK Trainer

NLTK Trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.

Training Sequential Backoff Taggers

The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:

a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger

For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:

python train_tagger.py treebank --sequential ubt

You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
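
To illustrate what a backoff chain does, here is a toy sketch in plain Python. It is a simplified stand-in, not NLTK's tagger classes and not what train_tagger.py runs: each tagger looks up a word together with the previous n-1 tags, and defers to its backoff when it has never seen that context. The ContextTagger name and the tiny training set are assumptions for illustration.

```python
from collections import Counter, defaultdict


class ContextTagger:
    """Toy n-gram tagger that defers to a backoff tagger on unseen contexts."""

    def __init__(self, n, train_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self._context(word, tags[:i])][tag] += 1
        # Remember the most frequent tag for each context seen in training.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        prev = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev, word)

    def tag_one(self, word, history):
        ctx = self._context(word, history)
        if ctx in self.table:
            return self.table[ctx]
        if self.backoff is not None:
            return self.backoff.tag_one(word, history)
        return None

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_one(word, history))
        return list(zip(words, history))


train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Same order as "ubt": unigram first, then bigram, then trigram on top.
unigram = ContextTagger(1, train)
bigram = ContextTagger(2, train, backoff=unigram)
trigram = ContextTagger(3, train, backoff=bigram)

seen = trigram.tag(["the", "dog", "barks"])
unseen_context = trigram.tag(["a", "dog"])
```

In the second call, "a" is unknown to every tagger, and "dog" after an unknown tag is a context only the unigram tagger can answer, so the chain falls all the way back.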

Training Affix Taggers

The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes, and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3 character suffix, and another that uses a 2 character prefix, specify the --affix argument twice:

python train_tagger.py treebank --sequential aubt --affix -3 --affix 2

The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
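
The sign convention is easy to remember if you think of it as Python slicing. This is only a sketch of the idea, with a hypothetical helper name; the real AffixTagger has additional options that are not shown here.

```python
def affix_feature(word, n):
    """Return the prefix (n > 0) or suffix (n < 0) of length abs(n)."""
    if n == 0:
        raise ValueError("affix length must be non-zero")
    return word[:n] if n > 0 else word[n:]


# Mirrors --affix -3 --affix 2: a 3 character suffix, then a 2 character prefix.
affix_lengths = [-3, 2]
features = [affix_feature("running", n) for n in affix_lengths]
```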

Training Brill Taggers

To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 – Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:

python train_tagger.py treebank --sequential aubt --brill

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.

Training Classifier Based Taggers

Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don’t want this tagger to back off to a sequential backoff tagger, be sure to specify --sequential ''. Here’s an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:

python train_tagger.py treebank --sequential '' --classifier NaiveBayes

If you do want to back off to a sequential tagger, be sure to specify a cutoff probability, like so:

python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
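
The idea behind the cutoff can be sketched in a few lines: keep the classifier's best tag only when its probability reaches the cutoff, otherwise defer to the backoff tagger. The function below is a hypothetical illustration with made-up probabilities, not the tagger's actual code.

```python
def choose_tag(tag_probs, cutoff_prob, backoff_tag):
    """Pick the most probable tag, or the backoff's tag if confidence is too low."""
    best_tag = max(tag_probs, key=tag_probs.get)
    if tag_probs[best_tag] >= cutoff_prob:
        return best_tag
    return backoff_tag


# The classifier is confident here, so its own answer is kept.
confident = choose_tag({"NN": 0.7, "VB": 0.3}, 0.4, backoff_tag="VB")

# No tag reaches 0.4 here, so the backoff tagger's answer is used instead.
unsure = choose_tag({"NN": 0.35, "VB": 0.33, "JJ": 0.32}, 0.4, backoff_tag="JJ")
```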

Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.

Phonetic Feature Options

You can also include phonetic algorithm features using the following arguments:

--metaphone: Use metaphone feature
--double-metaphone: Use double metaphone feature
--soundex: Use soundex feature
--nysiis: Use NYSIIS feature
--caverphone: Use caverphone feature

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
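
As a rough illustration of why this can help, here is a simplified Soundex written from the standard description of the algorithm. It is not the advas implementation that nltk-trainer uses, and it skips some edge cases, but it shows two regional spellings collapsing to one code.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Simplified Soundex: first letter plus up to three digits, zero padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = CODES.get(letter, "")
        # Skip letters without a digit and adjacent repeats of the same digit.
        if digit and digit != prev:
            code += digit
        # h and w do not break up a run of same-coded letters; vowels do.
        if letter not in "hw":
            prev = digit
    return (code + "000")[:4]


american = soundex("color")
british = soundex("colour")
```

Both spellings produce the same code, so a feature built from it looks identical to the tagger regardless of which variant appears in the text.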

A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.

Spelling Replacers in Microsoft Speller Challenge

Microsoft/Bing recently introduced its Speller Challenge, and I immediately thought about using my spelling replacer code from Chapter 2, Replacing and Correcting Words, in Python Text Processing with NLTK Cookbook. The API is now online, and can be accessed by doing a GET request to http://text-processing.com/api/spellcorrect/?runID=replacers&q=WORD. With an Expected F1 of ~0.5, I’m currently at number 12 on the Leaderboard, though I don’t expect that position to last long (I was at 10 when I first wrote this). I’m actually quite surprised the score is as high as it is considering the simplicity / lack of sophistication – it means there’s merit in replacing repeating characters and/or that Enchant generally gives decent spelling suggestions when controlled by edit distance. Here’s an outline of the code, which should make sense if you’re familiar with the replacers module from Replacing and Correcting Words in Python Text Processing with NLTK Cookbook:

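# The replacers module is the one from Chapter 2 of the cookbook.
from replacers import RepeatReplacer, SpellingReplacer

repeat_replacer = RepeatReplacer()
spelling_replacer = SpellingReplacer()

def replacer_suggest(word):
    # First try removing repeating characters.
    suggest = repeat_replacer.replace(word)

    # If that changed nothing, fall back to an Enchant spelling suggestion.
    if suggest == word:
        suggest = spelling_replacer.replace(word)

    # Return a single suggestion with a weight of 1.0.
    return [(suggest, 1.0)]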

Python Text Processing with NLTK Cookbook Chapter 2 Errata

It has come to my attention that there are two errors in Chapter 2, Replacing and Correcting Words of Python Text Processing with NLTK Cookbook. My thanks to the reader who went out of their way to verify my mistakes and send in corrections.

In Lemmatizing words with WordNet, on page 29, under How it works…, I said that “cooking” is not a noun and does not have a lemma. In fact, cooking is a noun, and as such is its own lemma. Of course, “cooking” is also a verb, and the verb form has the lemma “cook”.

In Removing repeating characters, on page 35, under How it works…, I explained the repeat_regexp match groups incorrectly. The actual match grouping of the word “looooove” is (looo)(o)o(ve) because the pattern matching is greedy. The end result is still correct.
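
You can check the grouping yourself with a few lines of Python. The pattern below is my assumption about the general shape of repeat_regexp (any characters, one character, that same character again, then the rest); see the book for the exact pattern and replacement string.

```python
import re

# Assumed shape of the pattern, written out here for illustration only.
repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

word = "l" + "o" * 5 + "ve"  # "looooove"

# The first group is greedy, so it grabs as much as it can while still
# leaving two identical characters for (\w)\2 to match.
groups = repeat_regexp.match(word).groups()
print(groups)

# Keeping the groups but dropping the bare \2 removes one repeated character.
shorter = repeat_regexp.sub(r'\1\2\3', word)
print(shorter)
```

Each substitution pass removes a single repeated character, which is why the end result is the same regardless of how the groups split up the word.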

NLTK Default Tagger CoNLL2000 Tag Coverage

Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.

NLTK Default Tagger Performance on CoNLL2000

The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.

  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               46          47  1              1
$             2122        2134  1              0.6
''            1811        1809  1              1
(                0         351  None           0
)                0         358  None           0
,            13160       13160  1              1
-LRB-          351           0  0              None
-NONE-          59           0  0              None
-RRB-          358           0  0              None
.            10800       10802  1              1
:             1288        1285  0.7143         1
CC            6589        6586  0.6875         0.7333
CD           10325       10233  0.972          0.9919
DT           22301       22355  0.7826         1
EX             229         254  1              1
FW               1          42  1              0.0455
IN           27798       27835  0.7315         0.7899
JJ           15370       16049  0.7372         0.7303
JJR           1114        1055  0.5412         0.575
JJS            611         451  0.6912         0.7966
LS              13           0  0              None
MD            2616        2637  0.7143         0.75
NN           38023       36789  0.7345         0.8441
NNP          24967       24690  0.8752         0.9421
NNPS           589         550  0.4553         0.3684
NNS          17068       16653  0.8572         0.9527
PDT             24          65  0.6667         1
POS           2224        2203  0.6667         1
PRP           4620        4634  0.8438         0.7941
PRP$          2292        2302  0.6364         1
RB            7681        7961  0.8076         0.8582
RBR            288         392  0.5            0.3684
RBS             90         240  0.5            0.1667
RP             634          95  0.1176         1
SYM              0           6  None           0
TO            6257        6259  1              0.75
UH               2          17  1              0.1111
VB            6681        7286  0.9042         0.8313
VBD           8501        8424  0.7521         0.8605
VBG           3730        4000  0.8493         0.8603
VBN           5763        5867  0.8164         0.8721
VBP           3232        3407  0.6754         0.6638
VBZ           5224        5561  0.7273         0.6906
WDT           1156        1157  0.6            0.5
WP             637         639  1              1
WP$             38          39  1              1
WRB            566         571  0.9            0.75
``            1855        1854  0.6667         1
=======  =========  ==========  =============  ==========

Unknown Words in CoNLL2000

The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words combined with a “-”. You might think this can be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.

Missing Symbols and Rare Tags

The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the rarer tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportions. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.

NLTK Default Tagger Treebank Tag Coverage

For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.

NLTK Default Tagger Performance on Treebank

Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn’t have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).

  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               16          16  1              1
$              724         724  1              1
''             694         694  1              1
,             4887        4886  1              1
-LRB-          120         120  1              1
-NONE-        6591        6592  1              1
-RRB-          126         126  1              1
.             3874        3874  1              1
:              563         563  1              1
CC            2271        2265  1              1
CD            3547        3546  0.999          0.999
DT            8170        8165  1              1
EX              88          88  1              1
FW               4           4  1              1
IN            9880        9857  0.9913         0.958
JJ            5803        5834  0.9913         0.9789
JJR            386         381  1              0.9149
JJS            185         182  0.9667         1
LS              12          13  1              0.8571
MD             927         927  1              1
NN           13166       13166  0.9917         0.9879
NNP           9427        9410  0.9948         0.994
NNPS           246         244  0.9903         0.9533
NNS           6055        6047  0.9952         0.9972
PDT             21          27  1              0.6667
POS            824         824  1              1
PRP           1716        1716  1              1
PRP$           766         766  1              1
RB            2800        2822  0.9931         0.975
RBR            130         136  1              0.875
RBS             33          35  1              0.5
RP             213         216  1              1
SYM              1           1  1              1
TO            2180        2179  1              1
UH               3           3  1              1
VB            2562        2554  0.9914         1
VBD           3035        3043  0.9902         0.9807
VBG           1458        1460  0.9965         0.9982
VBN           2145        2134  0.9885         0.9957
VBP           1318        1321  0.9931         0.9828
VBZ           2124        2125  0.9937         0.9906
WDT            440         445  1              0.8333
WP             241         241  1              1
WP$             14          14  1              1
WRB            178         178  1              1
``             712         712  1              1
=======  =========  ==========  =============  ==========

Unknown Words in Treebank

Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it’s not that bad, since it’s only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.