The Beginning of Python Text Processing with NLTK Cookbook
It all started with an email to the baypiggies mailing list. An acquisition editor for Packt was looking for authors to expand their line of Python cookbooks. For some reason I can't remember, I thought they wanted to put together a multi-author cookbook, where each author contributes a few recipes. That sounded doable, because I'd already written a number of articles that could serve as the basis for a few recipes. So I replied with links to the following articles:
The reply back was:
The next step is to come up with around 8-14 topics/chapters and around 80-100 recipes for the book as a whole.
My first reaction was "WTF?? No way!" But luckily, I didn't send that email. Instead, I took a couple days to think it over, and realized that maybe I could come up with that many recipes, if I broke my knowledge down into small pieces. I also decided to choose recipes that I didn't already know how to write, and use them as motivation for learning & research. So I replied back with a list of 92 recipes, and got to work. Not surprisingly, the original list of 92 changed significantly while writing the book, and I believe the final recipe count is 81.
I was keenly aware that there'd be some necessary overlap with the original NLTK book, Natural Language Processing with Python. But I did my best to minimize that overlap, and to present a different take on similar content. And there are a number of recipes that (as far as I know) you can't find anywhere else, the largest group of which can be found in Chapter 6, Transforming Chunks and Trees. I'm very pleased with the result, and I hope everyone who buys the book is too. I'd like to think that Python Text Processing with NLTK 2.0 Cookbook is the practical companion to the more teaching-oriented Natural Language Processing with Python.
If you'd like a taste of the book, check out the online sample chapter (pdf) Chapter 3, Custom Corpora, which details how many of the included corpus readers work, how to use them, and how to create your own corpus readers. The last recipe shows you how to create a corpus reader on top of MongoDB, and it should be fairly easy to modify for use with any other database.
Packt has also published two excerpts from Chapter 8, Distributed Processing and Handling Large Datasets, which are partially based on those original 2 articles:
- Python Text Processing with NLTK: Storing Frequency Distributions in Redis
- Using Execnet for Parallel and Distributed Processing with NLTK
Python Text Processing with NLTK Cookbook
My new book, Python Text Processing with NLTK 2.0 Cookbook, has been published. You can find it at both Packt and Amazon. For those of you that pre-ordered it, thank you, and I hope you receive your copy soon.
The Packt page has a lot more details, including the Table of Contents and a sample chapter (pdf). The sample chapter is Chapter 3, Creating Custom Corpora, which covers the following:
- creating your own corpora
- using many of the included corpus readers
- creating custom corpus readers
- creating a corpus reader on top of MongoDB
I hope you find Python Text Processing with NLTK Cookbook useful, informative, and maybe even fun.
Training Binary Text Classifiers with NLTK Trainer
NLTK-Trainer (available on GitHub and Bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifier.py script provides a command-line interface for training & evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I'll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should check out or download nltk-trainer if you want to run the examples yourself.
NLTK Movie Reviews Corpus
To run the code, we need to make sure everything is set up for training. The most important thing is installing the NLTK data (and of course, you'll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download/install by running sudo python -m nltk.downloader movie_reviews. This command will ensure that the movie_reviews corpus is downloaded and/or located in an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.
Training a Naive Bayes Classifier
Now we can use train_classifier.py to replicate the results from the first article on text classification for sentiment analysis with a naive bayes classifier. The complete command is:
python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
Here's an explanation of each option:
- --instances files: this says that each file is treated as an individual instance, so that each feature set will contain word: True for each word in a file
- --fraction 0.75: we'll use 75% of the files in each category for training, and the remaining 25% of the files for testing
- --show-most-informative 10: show the 10 most informative words
- --no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier
If you cd into the nltk-trainer directory and then run the above command, your output should look like this:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
finest = True pos : neg = 13.4 : 1.0
astounding = True pos : neg = 11.0 : 1.0
avoids = True pos : neg = 11.0 : 1.0
inject = True neg : pos = 10.3 : 1.0
strongest = True pos : neg = 10.3 : 1.0
stupidity = True neg : pos = 10.2 : 1.0
damon = True pos : neg = 9.8 : 1.0
slip = True pos : neg = 9.7 : 1.0
temple = True pos : neg = 9.7 : 1.0
regard = True pos : neg = 9.7 : 1.0
If you refer to the article on measuring precision and recall of a classifier, you'll see that the numbers are slightly different. We also ended up with a different top 10 most informative features. This is due to train_classifier.py choosing slightly different training instances than the code in the previous articles. But the results are still basically the same.
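To see how the reported numbers relate to each other, here is a small sketch that rebuilds them from four test-set counts. The counts are my own reconstruction, inferred from the reported recall values and the 250 test files per label; train_classifier.py does not print them, and the helper names are hypothetical.

```python
from __future__ import division

# Counts inferred from the reported recall values (assumption, not script output)
neg_as_neg = 119   # negative reviews labeled neg (0.476 * 250)
neg_as_pos = 131   # negative reviews labeled pos
pos_as_pos = 244   # positive reviews labeled pos (0.976 * 250)
pos_as_neg = 6     # positive reviews labeled neg


def precision(true_hits, false_hits):
    """Fraction of the predictions for a label that were correct."""
    return true_hits / (true_hits + false_hits)


def recall(true_hits, misses):
    """Fraction of a label's actual instances that were found."""
    return true_hits / (true_hits + misses)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


total = neg_as_neg + neg_as_pos + pos_as_pos + pos_as_neg
accuracy = (neg_as_neg + pos_as_pos) / total

neg_p = precision(neg_as_neg, pos_as_neg)
neg_r = recall(neg_as_neg, neg_as_pos)
pos_p = precision(pos_as_pos, neg_as_pos)
pos_r = recall(pos_as_pos, pos_as_neg)

print('accuracy: %f' % accuracy)
print('neg precision: %f' % neg_p)
print('neg recall: %f' % neg_r)
print('neg f-measure: %f' % f_measure(neg_p, neg_r))
print('pos precision: %f' % pos_p)
print('pos recall: %f' % pos_r)
print('pos f-measure: %f' % f_measure(pos_p, pos_r))
```

The low neg recall and the low pos precision come from the same 131 negative reviews that were labeled pos, which is why the two numbers move together in the later experiments.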
Filtering Stopwords
Let's try it again, but this time we'll filter out stopwords (the default is no stopword filtering):
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846
As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with "wh" can be valuable features, and removing them is what causes the dip in accuracy.
High Information Feature Selection
There are two options that allow you to restrict which words are used by their information gain:
- --max_feats 10000 will use the 10,000 most informative words, and discard the rest
- --min_score 3 will use all words whose score is at least 3, and discard any words with a lower score
Here are the results of using --max_feats 10000:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059
The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670
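For intuition about what the two options do, here is a rough sketch of scoring words against labels and then applying either cutoff. It assumes a chi-square score over a word-by-label contingency table; the scoring nltk-trainer actually applies may differ, and score_words, select_words and best_word_feats are hypothetical names used only for this illustration.

```python
from __future__ import division
from collections import Counter, defaultdict


def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Chi-square for a 2x2 table of word vs. label counts.

    n_ii: occurrences of the word under the label
    n_ix: occurrences of the word under all labels
    n_xi: all word occurrences under the label
    n_xx: all word occurrences under all labels
    """
    n_io = n_ix - n_ii            # this word, other labels
    n_oi = n_xi - n_ii            # other words, this label
    n_oo = n_xx - n_ix - n_oi     # other words, other labels
    denom = n_ix * n_xi * (n_io + n_oo) * (n_oi + n_oo)
    if denom == 0:
        return 0.0
    return n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denom


def score_words(labeled_words):
    """Sum each word's chi-square score over the labels it appears under."""
    word_counts = Counter()
    label_word_counts = defaultdict(Counter)
    for label, words in labeled_words:
        for word in words:
            word_counts[word] += 1
            label_word_counts[label][word] += 1
    total = sum(word_counts.values())
    scores = defaultdict(float)
    for label, counts in label_word_counts.items():
        label_total = sum(counts.values())
        for word, n_ii in counts.items():
            scores[word] += chi_sq(n_ii, word_counts[word], label_total, total)
    return dict(scores)


def select_words(scores, max_feats=None, min_score=None):
    """Keep words above min_score, then at most max_feats of the best ones."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        ranked = [(w, s) for w, s in ranked if s >= min_score]
    if max_feats is not None:
        ranked = ranked[:max_feats]
    return set(w for w, s in ranked)


# Toy data standing in for the movie_reviews corpus
docs = [
    ('pos', ['a', 'great', 'great', 'film']),
    ('neg', ['a', 'bad', 'bad', 'film']),
]
scores = score_words(docs)
best_words = select_words(scores, min_score=1)


def best_word_feats(words):
    """Bag-of-words features restricted to the selected words."""
    return dict((word, True) for word in words if word in best_words)


print(sorted(best_words))
print(best_word_feats(['a', 'great', 'film']))
```

In this toy corpus, "a" and "film" occur equally often under both labels, so they score zero and are dropped, while "great" and "bad" survive either cutoff.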
Bigram Features
To include bigram features (pairs of adjacent words), use the --bigrams option. This is different than finding significant collocations, as all bigrams are considered using the nltk.util.bigrams function. Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
28075 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.970000
neg precision: 0.979592
neg recall: 0.960000
neg f-measure: 0.969697
pos precision: 0.960784
pos recall: 0.980000
pos f-measure: 0.970297
10 most informative features
Most Informative Features
finest = True pos : neg = 13.4 : 1.0
('matt', 'damon') = True pos : neg = 13.0 : 1.0
('a', 'wonderfully') = True pos : neg = 12.3 : 1.0
('everything', 'from') = True pos : neg = 12.3 : 1.0
('witty', 'and') = True pos : neg = 11.0 : 1.0
astounding = True pos : neg = 11.0 : 1.0
avoids = True pos : neg = 11.0 : 1.0
('most', 'films') = True pos : neg = 11.0 : 1.0
inject = True neg : pos = 10.3 : 1.0
('show', 's') = True pos : neg = 10.3 : 1.0
Of course, the "Bourne bias" is still present with the ('matt', 'damon') bigram, but you can't argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:
bigrams don't matter much when using only high information words
In fact, bigrams can make a huge difference, but you can't restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what's significant and what isn't.
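As a minimal sketch of "include all of them", the feature extractor below adds every adjacent word pair to the bag of words, leaving it to a later scoring step to decide which features to keep. The function name is hypothetical, and this is an illustration rather than nltk-trainer's own code.

```python
def word_and_bigram_feats(words):
    """Bag-of-words features plus every adjacent word pair."""
    words = list(words)
    feats = dict((word, True) for word in words)
    # zip(words, words[1:]) pairs each word with its successor
    for bigram in zip(words, words[1:]):
        feats[bigram] = True
    return feats


feats = word_and_bigram_feats(['not', 'a', 'great', 'film'])
for name in sorted(feats, key=str):
    print(name)
```

Single words are stored as string keys and bigrams as tuple keys, which matches how they appear in the most informative features listing.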
Announcing Text Processing APIs
If you liked the NLTK demos, then you'll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you'd like to do more, please fill out this survey to let me know what your needs are.
Announcing Python NLTK Demos
If you want to see what NLTK can do, but don't want to go through the effort of installation and learning how to use it, then check out my Python NLTK demos.
They currently demonstrate the following functionality:
- part-of-speech tagging with the default NLTK pos tagger
- chunking and named entity recognition with the default NLTK chunker
- sentiment analysis with a combination of a naive bayes classifier and a maximum entropy classifier, both trained on the movie reviews corpus
If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.
Other Natural Language Processing Demos
Here's a list of similar resources on the web:
- A demo of the Stanford Parser with a javascript API: Natural-language Parsing For The Web
- A demo of the FreeLing language analysis suite: FreeLing Demo
- Emotional identification from text: EmoLib
Text Classification for Sentiment Analysis – Stopwords and Collocations
Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I'll be evaluating two modifications of the word_feats feature extraction method:
- filter out stopwords
- include bigram collocations
To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a Naive Bayes Classifier.
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    # one (features, label) pair per review file
    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # integer division keeps the cutoffs usable as slice indexes
    negcutoff = len(negfeats)*3//4
    poscutoff = len(posfeats)*3//4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    # collect the test instance ids under their actual and predicted labels
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision: %s' % nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall: %s' % nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision: %s' % nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall: %s' % nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()
Baseline Bag of Words Feature Extraction
Here's the baseline feature extractor for bag of words feature selection.
def word_feats(words):
    return dict([(word, True) for word in words])

evaluate_classifier(word_feats)
The results are the same as in the previous articles, but I've included them here for reference:
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
Stopword Filtering
Stopwords are words that are generally considered useless. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 english stopwords. Let's see what happens when we filter out these words.
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)
And the results for a stopword filtered bag of words are:
accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472
Accuracy went down .2%, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.
Bigram Collocations
As mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", which is a negative expression that the bag of words model could interpret as positive since it sees "great" as a separate word.
To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of 2 words, basically whether the bigram occurs about as frequently as each individual word.
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)
After some experimentation, I found that using the 200 best bigrams from each file produced great results:
accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
('matt', 'damon') = True pos : neg = 12.3 : 1.0
('give', 'us') = True neg : pos = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
avoids = True pos : neg = 11.7 : 1.0
('absolutely', 'no') = True neg : pos = 10.6 : 1.0
Yes, you read that right, Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:
- accuracy is up almost 9%
- pos precision has increased over 10% with only a 4% drop in recall
- neg recall has increased over 21% with just under a 4% drop in precision
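The changes listed above are differences in percentage points between the baseline results and the bigram results. A quick check, using the numbers copied from the two result listings in this article:

```python
# Metrics copied from the baseline and bigram result listings
baseline = {'accuracy': 0.728, 'pos precision': 0.651595744681, 'pos recall': 0.98,
            'neg precision': 0.959677419355, 'neg recall': 0.476}
bigrams = {'accuracy': 0.816, 'pos precision': 0.753205128205, 'pos recall': 0.94,
           'neg precision': 0.920212765957, 'neg recall': 0.692}

# Positive delta means the bigram features improved the metric
deltas = dict((name, bigrams[name] - baseline[name]) for name in baseline)

for name in sorted(deltas):
    print('%-14s %+.1f points' % (name, deltas[name] * 100))
```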
So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it's significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.
Text Classification for Sentiment Analysis – Naive Bayes Classifier
Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.
NLTK Naive Bayes Classification
NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.
Bag of Words Feature Extraction
All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:
def word_feats(words):
    return dict([(word, True) for word in words])
Training Set vs Test Set and Accuracy
The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
Training and Testing the Naive Bayes Classifier
Here's the complete python code for training and testing a Naive Bayes Classifier on the movie review corpus.
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# one (features, label) pair per review file
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# integer division keeps the cutoffs usable as slice indexes
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
And the output is:
train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
Linguistic and Natural Language Processing Links
A number of links related to natural language processing and linguistics:
- What’s the Difference Between Stemming and Lemmatization? - Ask Dr. Search
- A List of Social Tagging Datasets Made Available for Research
- Social Signaling and Language Use
- Lexical Growth in the Blogosphere
- Spelling correction using the Python Natural Language Toolkit (nltk)
- OPUS - an open source parallel corpus
- Evaluating POS Taggers: The Contenders
- Text Analytics Wiki
Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers
In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.
NLTK Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
import itertools
from nltk.corpus import brown, conll2000, treebank

# integer division keeps each cutoff usable as a slice index
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 // 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 // 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 // 3

brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
    brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
    brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

treebank_cutoff = len(treebank.tagged_sents()) * 2 // 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
Naive Bayes Classifier Taggers
There are 3 new taggers referenced below:
- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents)
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.
The raubt tagger is the same as from part 2, and braubt is from part 3.
postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
Conclusions
The above results are quite interesting, and lead to a few conclusions:
- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the braubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
- Using a brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the naive bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.
Classification Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economical. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:
| Tagger | secs/pass |
|---|---|
| raubt | 0.00005 |
| braubt | 0.00009 |
| cpos | 0.02219 |
| bcpos | 0.02259 |
| postag | 0.01241 |
This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
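The words/sec figures follow directly from the table, since each pass tags the same 6-word sentence. A short sketch of the arithmetic, with the secs/pass values copied from the table above:

```python
from __future__ import division

# secs/pass values copied from the timing table
secs_per_pass = {'raubt': 0.00005, 'braubt': 0.00009, 'cpos': 0.02219,
                 'bcpos': 0.02259, 'postag': 0.01241}

# the timed sentence has 6 words
words_per_pass = len('And now for something completely different'.split())

words_per_sec = dict((name, words_per_pass / secs)
                     for name, secs in secs_per_pass.items())
speedup = secs_per_pass['cpos'] / secs_per_pass['braubt']

for name in sorted(words_per_sec):
    print('%-7s %9.0f words/sec' % (name, words_per_sec[name]))
print('braubt is %.1f times faster than cpos' % speedup)
```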
Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with a nltk.data accessible path with a .pickle suffix for the load method.
import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
# the setup statement loads the pickled tagger once, outside the timed code
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print('timing postag 1000 times')
spent = t.timeit(number=1000)
print('took %.5f secs/pass' % (spent / 1000))
File Size
There's also a significant difference in the file size of the pickled taggers (trained on treebank):
| Tagger | Size |
|---|---|
| raubt | 272K |
| braubt | 273K |
| cpos | 3.8M |
| bcpos | 3.8M |
| postag | 8.2M |
Fin
I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.
NLTK Classifier Based Chunker Accuracy
The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare its accuracy versus my previous tagger based chunker.
Tag Chunker
I already covered how to train a tagger based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I'll use this Unigram-Bigram Chunker as a baseline for comparison below.
Classifier Chunker
A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It's very similar to the TagChunker in that the Chunker class is really a wrapper around a Classifier based part-of-speech tagger. And both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.
Classifier Tagger
The ClassifierTagger below is an abstracted version of what's described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed to the classifier class's train method.
from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
    '''Abstracted from "Training Classifier-Based Chunkers" section of
    http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
    '''
    def __init__(self, feature_extractor, classifier):
        self.feature_extractor = feature_extractor
        self.classifier = classifier

    def tag(self, sent):
        history = []

        for i, word in enumerate(sent):
            featureset = self.feature_extractor(sent, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)

        return zip(sent, history)

    @classmethod
    def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
        train_set = []

        for tagged_sent in train_sents:
            untagged_sent = untag(tagged_sent)
            history = []

            for i, (word, tag) in enumerate(tagged_sent):
                featureset = feature_extractor(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)

        classifier = classifier_cls.train(train_set, **kwargs)
        return cls(feature_extractor, classifier)
Classifier Chunker
The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed in to ClassifierTagger.train().
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

# subclass the imported ChunkParserI directly, since nltk itself is not imported here
class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, *args, **kwargs):
        tag_sents = [tree2conlltags(sent) for sent in train_sents]
        train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
        self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent: return None
        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
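The two list comprehensions in ClassifierChunker only reshape data, which is easier to see on a toy sentence. This sketch uses a hand-written sentence in the (word, pos, iob) triple form instead of calling tree2conlltags, so it runs without NLTK; the sentence and its tags are made up for illustration.

```python
# One sentence as (word, pos, iob) triples, the form tree2conlltags produces
conll_sent = [('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'B-VP')]

# Training form: the (word, pos) pair acts as the "word", the IOB tag as its "tag"
train_sent = [((w, t), c) for (w, t, c) in conll_sent]

# After tagging, flatten back to triples, the form conlltags2tree expects
triples = [(w, t, c) for ((w, t), c) in train_sent]

print(train_sent[0])
print(triples == conll_sent)
```

Treating the (word, pos) pair as the token is what lets a part-of-speech style tagger be reused to predict chunk tags.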
Feature Extractors
Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.
def pos(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos}

def pos_word(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
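Since the eight extractors differ only in which of three optional features they include, they could also be generated from a single factory. The sketch below is one possible way to do that, not how the evaluation was actually run; make_feature_extractor and its use_prev, use_next and use_word flags are hypothetical names introduced for illustration.

```python
def make_feature_extractor(use_prev=False, use_next=False, use_word=False):
    """Build an extractor with the same signature as the functions above."""
    def extract(sent, i, history):
        word, pos = sent[i]
        features = {'pos': pos}  # the current pos tag is always included

        if use_word:
            features['word'] = word
        if use_prev:
            features['prevpos'] = sent[i-1][1] if i > 0 else '<START>'
        if use_next:
            features['nextpos'] = sent[i+1][1] if i < len(sent) - 1 else '<END>'

        return features
    return extract


# Equivalent in output to prev_next_pos_word() and pos() respectively.
pnpw = make_feature_extractor(use_prev=True, use_next=True, use_word=True)
p = make_feature_extractor()

sent = [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
print(pnpw(sent, 0, []))
print(pnpw(sent, 2, []))
print(p(sent, 1, []))
```

Note that none of these extractors look at history; they only use the pos tags that are already present in the input sentence.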
Training
Now that we have all the pieces, we can put them together with training.
NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I'm guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, training stops early and the rest of your code continues with the classifier as trained so far.
from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
    min_lldelta=0.01, max_iter=10)
Accuracy
I ran the above training code for each feature extractor defined above, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-Axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).
One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.
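If you want to run this kind of comparison on your own data without any tooling, a rough sketch of a tag-level accuracy measure follows. It assumes accuracy means the fraction of tokens whose predicted IOB tag matches the gold tag, which may not be exactly how the numbers in the charts were computed; iob_accuracy is a hypothetical helper, not an NLTK function, and NLTK's own chunk scoring also reports precision and recall.

```python
def iob_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted IOB tag equals the gold IOB tag.

    Both arguments are lists of sentences, each sentence being a list of
    (word, pos, iob) triples of the same length.
    """
    correct = 0
    total = 0

    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, _, gold_tag), (_, _, pred_tag) in zip(gold, pred):
            if gold_tag == pred_tag:
                correct += 1
            total += 1

    return correct / total if total else 0.0


gold = [[('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'),
         ('barks', 'VBZ', 'O'), ('loudly', 'RB', 'O')]]
pred = [[('the', 'DT', 'B-NP'), ('dog', 'NN', 'B-NP'),
         ('barks', 'VBZ', 'O'), ('loudly', 'RB', 'O')]]

print(iob_accuracy(gold, pred))  # 0.75: three of four tags match
```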
Custom Training Data
Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable thru manual annotation.
