Tag Archives: nlp

Training Part of Speech Taggers with NLTK Trainer

NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.

Training Sequential Backoff Taggers

The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:

a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger

For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:

python train_tagger.py treebank --sequential ubt

You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
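To make the backoff idea concrete, here is a minimal pure-Python sketch of a backoff chain. It is not NLTK's implementation: the LookupTagger class, its hand-written tables, and the assumption that each tagger in the sequence backs off to the one before it are all illustrative.

```python
class LookupTagger:
    """Hypothetical tagger: looks up a context in a table, else defers to a backoff."""

    def __init__(self, table, context_size, backoff=None):
        self.table = table                # maps (previous tags..., word) to a tag
        self.context_size = context_size  # how many previous tags form the context
        self.backoff = backoff

    def tag_one(self, word, history):
        # Context is the last `context_size` tags followed by the current word.
        previous = tuple(history[-self.context_size:]) if self.context_size else ()
        tag = self.table.get(previous + (word,))
        if tag is None and self.backoff is not None:
            # No entry for this context, so ask the next tagger in the chain.
            tag = self.backoff.tag_one(word, history)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_one(word, history))
        return list(zip(words, history))


# A "ub" style chain: the bigram tagger backs off to the unigram tagger.
unigram = LookupTagger({("the",): "DT", ("to",): "TO", ("race",): "NN"}, context_size=0)
bigram = LookupTagger({("TO", "race"): "VB"}, context_size=1, backoff=unigram)

print(bigram.tag(["the", "race"]))
print(bigram.tag(["to", "race"]))
```

The bigram table only knows "race" after a TO tag, so every other word falls through to the unigram table. A word that no tagger knows ends up with a tag of None.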

Training Affix Taggers

The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes, and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3-character suffix, and another that uses a 2-character prefix, specify the --affix argument twice:

python train_tagger.py treebank --sequential aubt --affix -3 --affix 2

The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
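The sign convention for --affix N can be sketched in a few lines of Python. The affix function below is a hypothetical helper for illustration only; a real AffixTagger also handles details such as very short words, which this sketch ignores.

```python
def affix(word, n):
    """Return the n-character prefix (n > 0) or suffix (n < 0) of a word."""
    if n == 0:
        raise ValueError("affix length must be non-zero")
    # Positive lengths slice from the front, negative lengths from the end.
    return word[:n] if n > 0 else word[n:]


print(affix("running", -3))  # suffix, as with --affix -3
print(affix("running", 2))   # prefix, as with --affix 2
```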

Training Brill Taggers

To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 – Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:

python train_tagger.py treebank --sequential aubt --brill

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.

Training Classifier Based Taggers

Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don’t want this tagger to back off to a sequential backoff tagger, be sure to specify --sequential ''. Here’s an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:

python train_tagger.py treebank --sequential '' --classifier NaiveBayes

If you do want to back off to a sequential tagger, be sure to specify a cutoff probability, like so:

python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
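Here is a rough sketch of what a cutoff probability does, assuming the classifier can report a probability for each tag. The function and the toy classifier below are hypothetical stand-ins, not the nltk-trainer or NLTK code.

```python
def tag_with_cutoff(word, prob_classify, backoff_tag, cutoff_prob):
    """Use the classifier's best tag only if it is confident enough."""
    probs = prob_classify(word)        # assumed to return {tag: probability}
    best = max(probs, key=probs.get)
    if probs[best] >= cutoff_prob:
        return best
    # The classifier is unsure, so defer to the sequential backoff tagger.
    return backoff_tag(word)


def toy_prob_classify(word):
    # Hypothetical classifier: confident about -ing words, unsure otherwise.
    if word.endswith("ing"):
        return {"VBG": 0.9, "NN": 0.1}
    return {"NN": 0.35, "JJ": 0.33, "VB": 0.32}


def toy_backoff_tag(word):
    # Hypothetical stand-in for a trained ubt tagger.
    return {"quick": "JJ"}.get(word, "NN")


print(tag_with_cutoff("running", toy_prob_classify, toy_backoff_tag, 0.4))
print(tag_with_cutoff("quick", toy_prob_classify, toy_backoff_tag, 0.4))
```

With a cutoff of 0.4, "running" keeps the classifier's tag, while "quick" is handed to the backoff tagger because the best probability is only 0.35.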

Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.

Phonetic Feature Options

You can also include phonetic algorithm features using the following arguments:

--metaphone: Use metaphone feature
--double-metaphone: Use double metaphone feature
--soundex: Use soundex feature
--nysiis: Use NYSIIS feature
--caverphone: Use caverphone feature

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
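To show how different spellings can collapse to one phonetic code, here is a simplified American Soundex written for illustration. It is not the advas code that nltk-trainer uses, and the example words are my own.

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            codes[letter] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        # Skip uncoded letters and repeats of the previous code.
        if digit and digit != prev:
            result += digit
        # h and w do not separate letters that share a code; vowels do.
        if ch not in "hw":
            prev = digit
    # Pad with zeros and truncate to the usual four characters.
    return (result + "000")[:4]


print(soundex("color"), soundex("colour"))
print(soundex("realize"), soundex("realise"))
```

Both spellings in each pair produce the same code, so a tagger using this feature would see them as the same phonetic value.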

A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.

Python Text Processing with NLTK Cookbook Chapter 2 Errata

It has come to my attention that there are two errors in Chapter 2, Replacing and Correcting Words of Python Text Processing with NLTK Cookbook. My thanks to the reader who went out of their way to verify my mistakes and send in corrections.

In Lemmatizing words with WordNet, on page 29, under How it works…, I said that “cooking” is not a noun and does not have a lemma. In fact, cooking is a noun, and as such is its own lemma. Of course, “cooking” is also a verb, and the verb form has the lemma “cook”.

In Removing repeating characters, on page 35, under How it works…, I explained the repeat_regexp match groups incorrectly. The actual match grouping of the word “looooove” is (looo)(o)o(ve) because the pattern matching is greedy. The end result is still correct.
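You can check the grouping yourself with Python's re module. The pattern below is my assumption of what repeat_regexp looks like, since this post does not quote it.

```python
import re

# Assumed pattern: any prefix, one character, that same character again, any suffix.
repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

match = repeat_regexp.match('looooove')
print(match.groups())  # ('looo', 'o', 've')

# One substitution pass drops a single repeated character.
print(repeat_regexp.sub(r'\1\2\3', 'looooove'))
```

The first group is greedy, so it grabs as much of the word as it can while still leaving a doubled character for the second group and the backreference.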

NLTK Default Tagger CoNLL2000 Tag Coverage

Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.

NLTK Default Tagger Performance on CoNLL2000

The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.

Tag Found Actual Precision Recall
# 46 47 1 1
$ 2122 2134 1 0.6
'' 1811 1809 1 1
( 0 351 None 0
) 0 358 None 0
, 13160 13160 1 1
-LRB- 351 0 0 None
-NONE- 59 0 0 None
-RRB- 358 0 0 None
. 10800 10802 1 1
: 1288 1285 0.7143 1
CC 6589 6586 0.6875 0.7333
CD 10325 10233 0.972 0.9919
DT 22301 22355 0.7826 1
EX 229 254 1 1
FW 1 42 1 0.0455
IN 27798 27835 0.7315 0.7899
JJ 15370 16049 0.7372 0.7303
JJR 1114 1055 0.5412 0.575
JJS 611 451 0.6912 0.7966
LS 13 0 0 None
MD 2616 2637 0.7143 0.75
NN 38023 36789 0.7345 0.8441
NNP 24967 24690 0.8752 0.9421
NNPS 589 550 0.4553 0.3684
NNS 17068 16653 0.8572 0.9527
PDT 24 65 0.6667 1
POS 2224 2203 0.6667 1
PRP 4620 4634 0.8438 0.7941
PRP$ 2292 2302 0.6364 1
RB 7681 7961 0.8076 0.8582
RBR 288 392 0.5 0.3684
RBS 90 240 0.5 0.1667
RP 634 95 0.1176 1
SYM 0 6 None 0
TO 6257 6259 1 0.75
UH 2 17 1 0.1111
VB 6681 7286 0.9042 0.8313
VBD 8501 8424 0.7521 0.8605
VBG 3730 4000 0.8493 0.8603
VBN 5763 5867 0.8164 0.8721
VBP 3232 3407 0.6754 0.6638
VBZ 5224 5561 0.7273 0.6906
WDT 1156 1157 0.6 0.5
WP 637 639 1 1
WP$ 38 39 1 1
WRB 566 571 0.9 0.75
`` 1855 1854 0.6667 1

Unknown Words in CoNLL2000

The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words combined with a “-”. You might think this can be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.

Missing Symbols and Rare Tags

The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the rarer tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportion. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.

NLTK Default Tagger Treebank Tag Coverage

For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.

NLTK Default Tagger Performance on Treebank

Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn’t have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).

Tag Found Actual Precision Recall
# 16 16 1 1
$ 724 724 1 1
'' 694 694 1 1
, 4887 4886 1 1
-LRB- 120 120 1 1
-NONE- 6591 6592 1 1
-RRB- 126 126 1 1
. 3874 3874 1 1
: 563 563 1 1
CC 2271 2265 1 1
CD 3547 3546 0.999 0.999
DT 8170 8165 1 1
EX 88 88 1 1
FW 4 4 1 1
IN 9880 9857 0.9913 0.958
JJ 5803 5834 0.9913 0.9789
JJR 386 381 1 0.9149
JJS 185 182 0.9667 1
LS 12 13 1 0.8571
MD 927 927 1 1
NN 13166 13166 0.9917 0.9879
NNP 9427 9410 0.9948 0.994
NNPS 246 244 0.9903 0.9533
NNS 6055 6047 0.9952 0.9972
PDT 21 27 1 0.6667
POS 824 824 1 1
PRP 1716 1716 1 1
PRP$ 766 766 1 1
RB 2800 2822 0.9931 0.975
RBR 130 136 1 0.875
RBS 33 35 1 0.5
RP 213 216 1 1
SYM 1 1 1 1
TO 2180 2179 1 1
UH 3 3 1 1
VB 2562 2554 0.9914 1
VBD 3035 3043 0.9902 0.9807
VBG 1458 1460 0.9965 0.9982
VBN 2145 2134 0.9885 0.9957
VBP 1318 1321 0.9931 0.9828
VBZ 2124 2125 0.9937 0.9906
WDT 440 445 1 0.8333
WP 241 241 1 1
WP$ 14 14 1 1
WRB 178 178 1 1
`` 712 712 1 1
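The four columns above can be sketched in plain Python. This is a simplified token-by-token calculation for illustration, not the code in analyze_tagger_coverage.py, and the tag sequences are made up.

```python
from collections import Counter


def tag_metrics(found_tags, actual_tags):
    """Return {tag: (found, actual, precision, recall)} for aligned tag lists."""
    found = Counter(found_tags)
    actual = Counter(actual_tags)
    # A token counts as correct when the tagger's tag equals the actual tag.
    correct = Counter(f for f, a in zip(found_tags, actual_tags) if f == a)

    metrics = {}
    for tag in sorted(set(found) | set(actual)):
        # Precision is undefined if the tagger never produced the tag,
        # and recall is undefined if the tag never occurs in the corpus.
        precision = correct[tag] / found[tag] if found[tag] else None
        recall = correct[tag] / actual[tag] if actual[tag] else None
        metrics[tag] = (found[tag], actual[tag], precision, recall)
    return metrics


found = ["DT", "NN", "VB", "NN"]
actual = ["DT", "NN", "NN", "NN"]
for tag, row in tag_metrics(found, actual).items():
    print(tag, *row)
```

Here VB is a false positive (precision 0, recall None), and the word it was given to is a false negative for NN, which pulls NN recall below 1.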

Unknown Words in Treebank

Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it’s not that bad, since it’s only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.

Hierarchical Classification

Hierarchical classification is an obscure but simple concept. The idea is that you arrange two or more classifiers in a hierarchy such that the classifiers lower in the hierarchy are only used if a higher classifier returns an appropriate result.

For example, the text-processing.com sentiment analysis demo uses hierarchical classification by combining a subjectivity classifier and a polarity classifier. The subjectivity classifier is first, and determines whether the text is objective or subjective. If the text is objective, then a label of neutral is returned, and the polarity classifier is not used. However, if the text is subjective (or polar),  then the polarity classifier is used to determine if the text is positive or negative.

Hierarchical Sentiment Classification Model

Hierarchical classification is a useful way to combine multiple binary classifiers, if you have a hierarchy of labels that can be modeled as a binary tree. In this model, each branch of the tree either continues on to a new pair of branches, or stops, and at each branching you use a classifier to determine which branch to take.
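The two-level sentiment model described above can be sketched as follows. The keyword-based classifiers are toy stand-ins I made up for illustration; the demo itself uses trained classifiers.

```python
def classify_sentiment(text, subjectivity, polarity):
    """Run the polarity classifier only when the text is subjective."""
    if subjectivity(text) == "objective":
        return "neutral"
    return polarity(text)


POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"awful", "hate", "boring"}


def toy_subjectivity(text):
    # Hypothetical rule: any opinion word makes the text subjective.
    words = set(text.lower().split())
    return "subjective" if words & (POSITIVE | NEGATIVE) else "objective"


def toy_polarity(text):
    # Hypothetical rule: compare counts of positive and negative words.
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return "pos" if pos >= neg else "neg"


print(classify_sentiment("the movie opens on friday", toy_subjectivity, toy_polarity))
print(classify_sentiment("i love this wonderful movie", toy_subjectivity, toy_polarity))
print(classify_sentiment("an awful and boring movie", toy_subjectivity, toy_polarity))
```

Because the classifiers are passed in as functions, the same branching logic works with any pair of binary classifiers.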

Python Text Processing with NLTK Book Reviews

If you’ve been considering buying Python Text Processing with NLTK 2.0 Cookbook, but haven’t yet, below are a couple of reviews that may help convince you how awesome it is :)

Jaganadh says in his review of Python Text Processing with NLTK Cookbook at Jaggu’s World:

The eight chapter a revolutionary one which deals with Distributed data processing and handling large scale data with NLTK. (…) This chapter will be really helpful for industry people who is looking for to adopt NLTK in to NLP projects.

I give 9 out of 10 for the book. Natural Language Processing students, teachers, professional hurry and bag a copy of this book.

Sum-Wai says in his review of Python Text Processing with NLTK Cookbook at Tips Tank:

I like it where in each recipe, the author provides extra knowledge on the particular problem, like how a problem can be enhance and solve in another way, or what we need to do if the problem on hand changed, and some extra technical tips, which is very nice and useful.

If you’re thinking about the O’Reilly’s NLTK book – Natural Language Processing with Python, IMHO this book and the O’Reilly NLTK book complements each other. The O’Reilly NLTK book focuses more on getting you to know NLP and the features and usage of NLTK , while Python Text Processing with NLTK teaches us how we would implement NLP/NLTK with tools like MongoDB into solving real world problems.

And Neil Kodner, @neilkod, says:

I’m loving python text processing with nltk cookbook by @japerk, its an excellent companion to the O’Reilly NLTK book

Christmas is coming up, and who doesn’t think about python text processing during the holidays?

If you want a reviewer copy to write your own review, contact Packt at reviewrequest@packtpub.com. And if you do write a review and want to let me know about it, leave a comment here, or contact me on twitter.

The Beginning of Python Text Processing with NLTK Cookbook

It all started with an email to the baypiggies mailing list. An acquisition editor for Packt was looking for authors to expand their line of python cookbooks. For some reason I can’t remember, I thought they wanted to put together a multi-author cookbook, where each author contributes a few recipes. That sounded doable, because I’d already written a number of articles that could serve as the basis for a few recipes. So I replied with links to the following articles:

The reply back was:

The next step is to come up with around 8-14 topics/chapters and around 80-100 recipes for the book as a whole.

My first reaction was “WTF?? No way!” But luckily, I didn’t send that email. Instead, I took a couple days to think it over, and realized that maybe I could come up with that many recipes, if I broke my knowledge down into small pieces. I also decided to choose recipes that I didn’t already know how to write, and use them as motivation for learning & research. So I replied back with a list of 92 recipes, and got to work. Not surprisingly, the original list of 92 changed significantly while writing the book, and I believe the final recipe count is 81.

I was keenly aware that there’d be some necessary overlap with the original NLTK book, Natural Language Processing with Python. But I did my best to minimize that overlap, and to present a different take on similar content. And there are a number of recipes that (as far as I know) you can’t find anywhere else, the largest group of which can be found in Chapter 6, Transforming Chunks and Trees. I’m very pleased with the result, and I hope everyone who buys the book is too. I’d like to think that Python Text Processing with NLTK 2.0 Cookbook is the practical companion to the more teaching-oriented Natural Language Processing with Python.

If you’d like a taste of the book, check out the online sample chapter (pdf) Chapter 3, Custom Corpora, which details how many of the included corpus readers work, how to use them, and how to create your own corpus readers. The last recipe shows you how to create a corpus reader on top of MongoDB, and it should be fairly easy to modify for use with any other database.

Packt has also published two excerpts from Chapter 8, Distributed Processing and Handling Large Datasets, which are partially based on those original 2 articles:

Python Text Processing with NLTK Cookbook

My new book, Python Text Processing with NLTK 2.0 Cookbook, has been published. You can find it at both Packt and Amazon. For those of you that pre-ordered it, thank you, and I hope you receive your copy soon.

The Packt page has a lot more details, including the Table of Contents and a sample chapter (pdf). The sample chapter is Chapter 3, Creating Custom Corpora, which covers the following:

  • creating your own corpora
  • using many of the included corpus readers
  • creating custom corpus readers
  • creating a corpus reader on top of MongoDB

I hope you find Python Text Processing with NLTK Cookbook useful, informative, and maybe even fun.

Training Binary Text Classifiers with NLTK Trainer

NLTK-Trainer (available on GitHub and Bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifier.py script provides a command-line interface for training & evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I’ll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should check out or download nltk-trainer if you want to run the examples yourself.

NLTK Movie Reviews Corpus

To run the code, we need to make sure everything is set up for training. The most important thing is installing the NLTK data (and of course, you’ll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download/install by running sudo python -m nltk.downloader movie_reviews. This command will ensure that the movie_reviews corpus is downloaded and/or located in an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.

Training a Naive Bayes Classifier

Now we can use train_classifier.py to replicate the results from the first article on text classification for sentiment analysis with a naive bayes classifier. The complete command is:

python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews

Here’s an explanation of each option:

  • --instances files: this says that each file is treated as an individual instance, so that each feature set will contain word: True for each word in a file
  • --fraction 0.75: we’ll use 75% of the files in each category for training, and the remaining 25% of the files for testing
  • --show-most-informative 10: show the 10 most informative words
  • --no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier
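As a rough illustration of the --instances files and --fraction options, here is a sketch of the feature set and the train/test split. The helper names are hypothetical, and the way train_classifier.py actually picks its training instances may differ.

```python
def bag_of_words(words):
    """Build a feature set with word: True for each word in a file."""
    return {word: True for word in words}


def split_instances(instances, fraction):
    """Split one category's instances into training and testing lists."""
    cutoff = int(len(instances) * fraction)
    return instances[:cutoff], instances[cutoff:]


print(bag_of_words(["a", "fine", "film", "a"]))

# Stand-in for the files of one category.
files = ["review%d.txt" % i for i in range(1000)]
train, test = split_instances(files, 0.75)
print(len(train), len(test))
```

With 1000 files per category, a fraction of 0.75 gives 750 training and 250 testing instances per category.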

If you cd into the nltk-trainer directory and then run the above command, your output should look like this:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
          finest = True              pos : neg    =     13.4 : 1.0
      astounding = True              pos : neg    =     11.0 : 1.0
          avoids = True              pos : neg    =     11.0 : 1.0
          inject = True              neg : pos    =     10.3 : 1.0
       strongest = True              pos : neg    =     10.3 : 1.0
       stupidity = True              neg : pos    =     10.2 : 1.0
           damon = True              pos : neg    =      9.8 : 1.0
            slip = True              pos : neg    =      9.7 : 1.0
          temple = True              pos : neg    =      9.7 : 1.0
          regard = True              pos : neg    =      9.7 : 1.0

If you refer to the article on measuring precision and recall of a classifier, you’ll see that the numbers are slightly different. We also ended up with a different top 10 most informative features. This is due to train_classifier.py choosing slightly different training instances than the code in the previous articles. But the results are still basically the same.

Filtering Stopwords

Let’s try it again, but this time we’ll filter out stopwords (the default is no stopword filtering):

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846

As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with “wh” can be valuable features, and removing them is what causes the dip in accuracy.

High Information Feature Selection

There are two options that allow you to restrict which words are used by their information gain:

  • --max_feats 10000 will use the 10,000 most informative words, and discard the rest
  • --min_score 3 will use all words whose score is at least 3, and discard any words with a lower score
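The two options amount to filtering a table of word scores. Here is a minimal sketch, assuming the scores have already been calculated; the select_words function and the example scores are hypothetical, not taken from nltk-trainer.

```python
def select_words(word_scores, max_feats=None, min_score=None):
    """Keep the highest scoring words, limited by count and/or minimum score."""
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        # Discard every word whose score is below the threshold.
        ranked = [(word, score) for word, score in ranked if score >= min_score]
    if max_feats is not None:
        # Keep only the top scoring words.
        ranked = ranked[:max_feats]
    return {word for word, score in ranked}


scores = {"finest": 13.4, "stupidity": 10.2, "film": 2.1, "the": 0.3}
print(sorted(select_words(scores, max_feats=2)))
print(sorted(select_words(scores, min_score=3)))
print(sorted(select_words(scores, min_score=1, max_feats=10)))
```

Words that are not selected are simply left out of each feature set before training.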

Here are the results of using --max_feats 10000:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059

The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670

Bigram Features

To include bigram features (pairs of adjacent words), use the --bigrams option. This is different from finding significant collocations, as all bigrams are considered using the nltk.util.bigrams function. Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%:

  $ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
  2 labels: ['neg', 'pos']
  calculating word scores
  28075 words meet min_score and/or max_feats
  1500 training feats, 500 testing feats
  training a NaiveBayes classifier
  accuracy: 0.970000
  neg precision: 0.979592
  neg recall: 0.960000
  neg f-measure: 0.969697
  pos precision: 0.960784
  pos recall: 0.980000
  pos f-measure: 0.970297
  10 most informative features
  Most Informative Features
                finest = True              pos : neg    =     13.4 : 1.0
     ('matt', 'damon') = True              pos : neg    =     13.0 : 1.0
  ('a', 'wonderfully') = True              pos : neg    =     12.3 : 1.0
('everything', 'from') = True              pos : neg    =     12.3 : 1.0
      ('witty', 'and') = True              pos : neg    =     11.0 : 1.0
            astounding = True              pos : neg    =     11.0 : 1.0
                avoids = True              pos : neg    =     11.0 : 1.0
     ('most', 'films') = True              pos : neg    =     11.0 : 1.0
                inject = True              neg : pos    =     10.3 : 1.0
         ('show', 's') = True              pos : neg    =     10.3 : 1.0

Of course, the “Bourne bias” is still present with the ('matt', 'damon') bigram, but you can’t argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:

bigrams don’t matter much when using only high information words

In fact, bigrams can make a huge difference, but you can’t restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what’s significant and what isn’t.
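To make "include all of them" concrete, here is a small sketch of a feature set that contains every word plus every adjacent pair of words. It uses zip instead of nltk.util.bigrams so that it runs without NLTK, and the function name is hypothetical.

```python
def words_and_bigrams(words):
    """Build a feature set from all words and all adjacent word pairs."""
    feats = {word: True for word in words}
    # zip pairs each word with the word that follows it.
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats


feats = words_and_bigrams(["matt", "damon", "is", "witty", "and", "sharp"])
for name in feats:
    print(name)
```

The scoring step then treats each tuple like any other feature name, which is why bigrams such as ('matt', 'damon') show up alongside single words in the most informative features.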

Announcing Text Processing APIs

If you liked the NLTK demos, then you’ll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you’d like to do more, please fill out this survey to let me know what your needs are.