Announcing Python NLTK Demos
If you want to see what NLTK can do, but don't want to go thru the effort of installation and learning how to use it, then check out my Python NLTK demos.
It currently demonstrates the following functionality:
- part-of-speech tagging with the default NLTK pos tagger
- chunking and named entity recognition with the default NLTK chunker
- sentiment analysis with a combination of a naive bayes classifier and a maximum entropy classifier, both trained on the movie reviews corpus
If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.
Other Natural Language Processing Demos
Here's a list of similar resources on the web:
- A demo of the Stanford Parser with a javascript API: Natural-language Parsing For The Web
- A demo of the FreeLing language analysis suite: FreeLing Demo
- Emotional identification from text: EmoLib
Text Classification for Sentiment Analysis – Stopwords and Collocations
Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I'll be evaluating two modifications of the word_feats feature extraction method:
- filter out stopwords
- include bigram collocations
To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a Naive Bayes Classifier.
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4
    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
Baseline Bag of Words Feature Extraction
Here's the baseline feature extractor for bag of words feature selection.
def word_feats(words):
    return dict([(word, True) for word in words])
evaluate_classifier(word_feats)
The results are the same as in the previous articles, but I've included them here for reference:
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
Stopword Filtering
Stopwords are words that are generally considered useless. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let's see what happens when we filter out these words.
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])
evaluate_classifier(stopword_filtered_word_feats)
And the results for a stopword filtered bag of words are:
accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472
Accuracy went down .2%, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.
Bigram Collocations
As mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", which is a negative expression that the bag of words model could interpret as positive since it sees "great" as a separate word.
To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of 2 words, basically whether the bigram occurs about as frequently as each individual word.
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
evaluate_classifier(bigram_word_feats)
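To get a feel for what a chi-square score measures, here is a minimal pure-Python sketch (Python 3) that does not use NLTK at all. It builds the 2x2 contingency table for one bigram from a token list and computes Pearson's chi-square statistic from it. The function name bigram_chi_sq and the toy sentence are made up for illustration, and NLTK's own implementation may differ in its details.

```python
from collections import Counter

def bigram_chi_sq(tokens, w1, w2):
    """Pearson chi-square score for the bigram (w1, w2) in a token list."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    counts = Counter(bigrams)
    n_ii = counts[(w1, w2)]                                    # w1 followed by w2
    n_ix = sum(c for (a, _), c in counts.items() if a == w1)   # w1 in the first slot
    n_xi = sum(c for (_, b), c in counts.items() if b == w2)   # w2 in the second slot
    # Remaining cells of the 2x2 contingency table
    n_io = n_ix - n_ii             # w1 followed by some other word
    n_oi = n_xi - n_ii             # some other word followed by w2
    n_oo = n - n_ii - n_io - n_oi  # bigrams involving neither position
    denom = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    if denom == 0:
        return 0.0
    return n * (n_ii * n_oo - n_io * n_oi) ** 2 / denom

# Hypothetical toy text: "not" is always followed by "great"
tokens = 'not great not great the movie was not great at all'.split()
strong = bigram_chi_sq(tokens, 'not', 'great')  # words that always occur together
weak = bigram_chi_sq(tokens, 'great', 'the')    # words that co-occur only once
```

In this toy text the pair ('not', 'great') gets a much higher score than ('great', 'the'), which is the kind of ranking nbest relies on when it picks the top n bigrams.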
After some experimentation, I found that using the 200 best bigrams from each file produced great results:
accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
('matt', 'damon') = True pos : neg = 12.3 : 1.0
('give', 'us') = True neg : pos = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
avoids = True pos : neg = 11.7 : 1.0
('absolutely', 'no') = True neg : pos = 10.6 : 1.0
Yes, you read that right, Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:
- accuracy is up almost 9%
- pos precision has increased over 10% with only 4% drop in recall
- neg recall has increased over 21% with just under 4% drop in precision
So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it's significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.
Text Classification for Sentiment Analysis – Precision and Recall
Accuracy is not the only metric for evaluating the effectiveness of a classifier. Two other useful metrics are precision and recall. These two metrics can provide much greater insight into the performance characteristics of a binary classifier.
Classifier Precision
Precision measures the exactness of a classifier. A higher precision means fewer false positives, while a lower precision means more false positives. This is often at odds with recall, as an easy way to improve precision is to decrease recall.
Classifier Recall
Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision because it gets increasingly harder to be precise as the sample space increases.
F-measure Metric
Precision and recall can be combined to produce a single metric known as F-measure, which is the weighted harmonic mean of precision and recall. I find F-measure to be about as useful as accuracy. Or in other words, compared to precision & recall, F-measure is mostly useless, as you'll see below.
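The three definitions above can be sketched in a few lines of plain Python 3. For a single label, reference is the set of item ids that truly carry the label and test is the set of ids the classifier assigned to it. These helper functions are an illustration of the definitions, not NLTK's code, and the example sets are made up.

```python
def precision(reference, test):
    """Fraction of the items the classifier labeled that really have the label."""
    if not test:
        return None
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of the items that really have the label that the classifier found."""
    if not reference:
        return None
    return len(reference & test) / len(reference)

def f_measure(reference, test, alpha=0.5):
    """Weighted harmonic mean of precision and recall (alpha=0.5 weights them equally)."""
    p = precision(reference, test)
    r = recall(reference, test)
    if p is None or r is None:
        return None
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)

# Hypothetical example: items 0-3 are truly 'pos', the classifier said 2, 3 and 4 are 'pos'
reference = {0, 1, 2, 3}
test = {2, 3, 4}
p = precision(reference, test)   # 2 of the 3 labeled items are correct
r = recall(reference, test)      # 2 of the 4 true items were found
f = f_measure(reference, test)
```

Item 4 is a false positive (it lowers precision), while items 0 and 1 are false negatives (they lower recall).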
Measuring Precision and Recall of a Naive Bayes Classifier
The NLTK metrics module provides functions for calculating all three metrics mentioned above. But to do so, you need to build 2 sets for each classification label: a reference set of correct values, and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes Classifier. This time, instead of measuring accuracy, we'll collect reference values and observed values for each label (pos or neg), then use those sets to calculate the precision, recall, and F-measure of the naive bayes classifier. The actual values collected are simply the index of each featureset using enumerate.
import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
Precision and Recall for Positive and Negative Reviews
I found the results quite interesting:
pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364
So what does this mean?
- Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few false negatives in the pos class.
- But, a file given a pos classification is only 65% likely to be correct. Not so good precision leads to 35% false positives for the pos label.
- Any file that is identified as neg is 96% likely to be correct (high precision). This means very few false positives for the neg class.
- But many files that are neg are incorrectly classified. Low recall causes 52% false negatives for the neg label.
- F-measure provides no useful information. There's no insight to be gained from having it, and we wouldn't lose any knowledge if it was taken away.
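These percentages can be cross-checked with a little arithmetic. The sketch below (Python 3) assumes the test set holds 250 pos and 250 neg files, the last quarter of each 1000-file class, and rebuilds the underlying counts from the two recall figures. The variable names are only for illustration.

```python
n_pos = n_neg = 250  # assumed: the last quarter of each 1000-file class
pos_recall, neg_recall = 0.98, 0.476

true_pos = round(pos_recall * n_pos)  # pos files classified as pos
true_neg = round(neg_recall * n_neg)  # neg files classified as neg
false_neg = n_pos - true_pos          # pos files classified as neg
false_pos = n_neg - true_neg          # neg files classified as pos

# Precision divides by everything the classifier assigned to that label
pos_precision = true_pos / (true_pos + false_pos)
neg_precision = true_neg / (true_neg + false_neg)
accuracy = (true_pos + true_neg) / (n_pos + n_neg)
```

Under that assumption, 131 of the 250 neg files end up labeled pos. That one number accounts for both the low neg recall and the mediocre pos precision, and the derived precision values agree with the output above.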
Improving Results with Better Feature Selection
One possible explanation for the above results is that people use normally positive words in negative reviews, but the word is preceded by "not" (or some other negative word), such as "not great". And since the classifier uses the bag of words model, which assumes every word is independent, it cannot learn that "not great" is a negative. If this is the case, then these metrics should improve if we also train on multiple words, a topic I'll explore in a future article.
Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn't know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using sentiment rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I'll also explore in a future article.
If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.
Text Classification for Sentiment Analysis – Naive Bayes Classifier
Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.
NLTK Naive Bayes Classification
NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.
Bag of Words Feature Extraction
All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:
def word_feats(words): return dict([(word, True) for word in words])
Training Set vs Test Set and Accuracy
The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
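To make the [(feats, label)] form and the 3/4 split concrete, here is a small Python 3 sketch that needs neither NLTK nor the corpus. The toy "reviews" are invented stand-ins for the corpus files, four per class instead of 1000.

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Hypothetical toy reviews standing in for the corpus files
neg_docs = [['boring', 'plot'], ['bad', 'acting'], ['bad', 'bad', 'movie'], ['dull']]
pos_docs = [['great', 'movie'], ['great', 'great', 'cast'], ['fun'], ['superb', 'acting']]

negfeats = [(word_feats(doc), 'neg') for doc in neg_docs]
posfeats = [(word_feats(doc), 'pos') for doc in pos_docs]

# Integer division keeps the cutoff usable as a slice index in Python 3
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
```

Note that a repeated word such as 'bad' shows up only once in its feature dictionary: this model records whether a word is present, not how often it occurs.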
Training and Testing the Naive Bayes Classifier
Here's the complete python code for training and testing a Naive Bayes Classifier on the movie review corpus.
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
And the output is:
train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers
In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.
Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
import itertools
from nltk.corpus import brown, conll2000, treebank
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 / 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 / 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 / 3
brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')
treebank_cutoff = len(treebank.tagged_sents()) * 2 / 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
Classifier Taggers
There are 3 new taggers referenced below:
- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents)
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.
The raubt tagger is the same as from part 2, and braubt is from part 3.
postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
Conclusions
The above results are quite interesting, and lead to a few conclusions:
- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the braubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
- Using BrillTagger will nearly always increase the accuracy of your initial tagger, but not by much.
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than NaiveBayesClassifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.
Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economical. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:
| Tagger | secs/pass |
|---|---|
| raubt | 0.00005 |
| braubt | 0.00009 |
| cpos | 0.02219 |
| bcpos | 0.02259 |
| postag | 0.01241 |
This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
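The words/sec figures follow directly from the table, assuming each pass tags the six-token sentence 'And now for something completely different' used in the timing code. A quick Python 3 check of the arithmetic:

```python
n_words = 6  # assumed: tokens in 'And now for something completely different'

# secs/pass values copied from the table above
secs_per_pass = {
    'raubt': 0.00005,
    'braubt': 0.00009,
    'cpos': 0.02219,
    'bcpos': 0.02259,
    'postag': 0.01241,
}

# Words tagged per second for each tagger
words_per_sec = {name: n_words / secs for name, secs in secs_per_pass.items()}

# How many times faster braubt is than cpos
speedup = secs_per_pass['cpos'] / secs_per_pass['braubt']
```

Truncating the results gives 66666 words/sec for braubt, 270 for cpos, 483 for postag, and a speedup of 246.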
Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with a nltk.data accessible path with a .pickle suffix for the load method.
import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)
File Size
There's also a significant difference in the file size of the pickled taggers (trained on treebank):
| Tagger | Size |
|---|---|
| raubt | 272K |
| braubt | 273K |
| cpos | 3.8M |
| bcpos | 3.8M |
| postag | 8.2M |
Fin
I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.
Python Logging Filters
The python logging package provides a Filter class that can be used for filtering log records. This is a simple way to ensure that a logger or handler will only output desired log messages. Here's an example filter that only allows INFO messages to be logged:
import logging
class InfoFilter(logging.Filter):
    def filter(self, rec):
        return rec.levelno == logging.INFO
Configuring Python Logging Filters
Filters can be added to a logger instance or a handler instance using the addFilter(filt) method. For a logger, the best time to do this is probably right after calling getLogger, like so:
log = logging.getLogger()
log.addFilter(InfoFilter())
What about adding a filter to a handler? If you're programmatically configuring handlers with addHandler(hdlr), then you can do the same thing by calling addFilter(filt) on the handler instance. But if you're using fileConfig to configure handlers and loggers, it's a little bit harder. Unfortunately, the logging configuration format does not support adding filters. And it's not always clear which logger the handler instances are attached to in the logger hierarchy. So the simplest way to add a filter to a handler in this case is to subclass the handler:
class InfoHandler(logging.StreamHandler):
    def __init__(self, *args, **kwargs):
        # StreamHandler must be referenced through the logging module
        logging.StreamHandler.__init__(self, *args, **kwargs)
        self.addFilter(InfoFilter())
Then in your file config, make sure to set the class value for your custom handler to a complete code path for import:
[handler_infohandler]
class=mypackage.mylogging.InfoHandler
level=INFO
Now your handler will only handle the log records that pass your custom filter. As long as your handlers aren't changing much, the above method is much more reusable than having to call addFilter(filt) every time a new logger is instantiated.
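For the programmatic case, where you create the handler yourself and pass it to addHandler, a minimal Python 3 sketch looks like this. The logger name filter_demo and the in-memory stream are only there to keep the example self-contained and easy to inspect.

```python
import io
import logging

class InfoFilter(logging.Filter):
    def filter(self, rec):
        return rec.levelno == logging.INFO

# Write to an in-memory stream so the result can be inspected afterwards
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(InfoFilter())   # the filter lives on the handler, not the logger

log = logging.getLogger('filter_demo')  # hypothetical logger name
log.setLevel(logging.DEBUG)             # let every level reach the handler
log.propagate = False                   # keep records away from the root logger
log.addHandler(handler)

log.debug('dropped')
log.info('kept')
log.warning('dropped too')

output = stream.getvalue()
```

Only the INFO record reaches the stream. Because the filter is attached to the handler, other handlers on the same logger would still see every record.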
Python Point-in-Polygon with Shapely
Shapely is an offshoot of the GIS-Python project that provides spatial geometry functions independent of any geo-enabled database. In particular, it makes python point-in-polygon calculations very easy.
Creating a Polygon
First, you need to create a polygon. If you already have an ordered list of coordinate points that define a closed ring, you can create a Polygon directly, like so:
from shapely.geometry import Polygon
poly = Polygon(((0, 0), (0, 1), (1, 1), (1, 0)))
But what if you just have a bunch of points in no particular order? Then you can create a MultiPoint geometry and get the convex hull polygon.
from shapely.geometry import MultiPoint
# coords is a list of (x, y) tuples
poly = MultiPoint(coords).convex_hull
Point-in-Polygon
Now that you have a polygon, determining whether a point is inside it is very easy. There are 2 ways to do it:
- point.within(poly)
- poly.contains(point)
point should be an instance of the Point class, and poly is of course an instance of Polygon. within and contains are the converse of each other, so whichever method you use is entirely up to you.
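If you're curious what a point-in-polygon test computes, here is a plain-Python ray-casting sketch (Python 3). This is not how Shapely does it, since Shapely hands the work to the GEOS library. The function name point_in_polygon is made up, and points lying exactly on the boundary are not handled carefully.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count the edges crossed by a ray going right from (x, y).

    ring is an ordered list of (x, y) vertices; the last vertex is joined
    back to the first. An odd number of crossings means the point is inside.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# The same unit square used for the Polygon example above
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
center_inside = point_in_polygon(0.5, 0.5, square)
far_inside = point_in_polygon(2, 0.5, square)
```

In practice you would still reach for within or contains, which also work with holes and other geometry types.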
Overlapping Polygons
In addition to point-in-polygon, you can also determine whether shapely geometries overlap each other. poly.within(poly) and poly.contains(poly) can be used to determine if one polygon is completely within another polygon. For partial overlaps, you can use the intersects method, or call intersection to get the overlapping area as a polygon.
There's a lot more you can do with this very useful python geometry package, so take a look at the Shapely Manual as well as some usage examples.
Python and Django Testing and Continuous Integration Links
Django Continuous Integration:
- Continuous Integration with Django and Hudson CI (Day 1)
- Django continuous integration with Hudson and Nose
- jbalogh's django-nose
Python Testing:
NLTK Classifier Based Chunker Accuracy
The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare its accuracy versus my previous tagger based chunker.
Tag Chunker
I already covered how to train a tagger based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I'll use this Unigram-Bigram Chunker as a baseline for comparison below.
Classifier Chunker
A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It's very similar to the TagChunker in that the Chunker class is really a wrapper around a Classifier based part-of-speech tagger. And both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.
Classifier Tagger
The ClassifierTagger below is an abstracted version of what's described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed to the classifier constructor.
from nltk.tag import TaggerI, untag
class ClassifierTagger(TaggerI):
    '''Abstracted from "Training Classifier-Based Chunkers" section of
    http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
    '''
    def __init__(self, feature_extractor, classifier):
        self.feature_extractor = feature_extractor
        self.classifier = classifier
    def tag(self, sent):
        history = []
        for i, word in enumerate(sent):
            featureset = self.feature_extractor(sent, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sent, history)
    @classmethod
    def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = feature_extractor(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        classifier = classifier_cls.train(train_set, **kwargs)
        return cls(feature_extractor, classifier)
Classifier Chunker
The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed in to ClassifierTagger.train().
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree
# ChunkParserI is imported by name above, so subclass it directly
class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, *args, **kwargs):
        tag_sents = [tree2conlltags(sent) for sent in train_sents]
        train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
        self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)
    def parse(self, tagged_sent):
        if not tagged_sent: return None
        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
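The two list comprehensions in the chunker only reshape tuples. The Python 3 sketch below shows that reshaping on a made-up sentence. The helper names to_tagger_format and to_conll_format are hypothetical and exist only for this illustration.

```python
def to_tagger_format(conll_sent):
    """(word, pos, chunk) triples -> ((word, pos), chunk) pairs for training."""
    return [((w, t), c) for (w, t, c) in conll_sent]

def to_conll_format(tagged_chunks):
    """((word, pos), chunk) pairs -> (word, pos, chunk) triples for conlltags2tree."""
    return [(w, t, c) for ((w, t), c) in tagged_chunks]

# Hypothetical sentence with IOB chunk tags
conll_sent = [('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('sat', 'VBD', 'B-VP')]

pairs = to_tagger_format(conll_sent)
round_trip = to_conll_format(pairs)
```

From the tagger's point of view, each (word, pos) pair plays the role of the "word" and the IOB chunk tag plays the role of the "tag", which is why a tagger can be reused as a chunker.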
Feature Extractors
Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.
def pos(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos}
def pos_word(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos, 'word': word}
def prev_pos(sent, i, history):
    word, pos = sent[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]
    return {'pos': pos, 'prevpos': prevpos}
def prev_pos_word(sent, i, history):
    word, pos = sent[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]
    return {'pos': pos, 'prevpos': prevpos, 'word': word}
def next_pos(sent, i, history):
    word, pos = sent[i]
    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]
    return {'pos': pos, 'nextpos': nextpos}
def next_pos_word(sent, i, history):
    word, pos = sent[i]
    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]
    return {'pos': pos, 'nextpos': nextpos, 'word': word}
def prev_next_pos(sent, i, history):
    word, pos = sent[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]
    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]
    return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}
def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]
    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]
    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
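To see what the classifier actually receives, here is the richest extractor applied to every position of a short, made-up tagged sentence (Python 3, no NLTK required). The history argument is unused by these extractors, so an empty list is passed.

```python
def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]
    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]
    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]
    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}

# Hypothetical part-of-speech tagged sentence
sent = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]

# One featureset per token; these extractors ignore history
featuresets = [prev_next_pos_word(sent, i, []) for i in range(len(sent))]
```

The first token gets '<START>' as its previous tag and the last token gets '<END>' as its next tag, so the sentence boundaries become features in their own right.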
Training
Now that we have all the pieces, we can put them together with training.
NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I'm guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, you can stop the training and continue.
from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier
train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
min_lldelta=0.01, max_iter=10)
Accuracy
I ran the above training code for each feature extractor defined above, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-Axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).
One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.
Custom Training Data
Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable thru manual annotation.




