Text Classification for Sentiment Analysis – Precision and Recall
Accuracy is not the only metric for evaluating the effectiveness of a classifier. Two other useful metrics are precision and recall. These two metrics can provide much greater insight into the performance characteristics of a binary classifier.
Classifier Precision
Precision measures the exactness of a classifier. A higher precision means fewer false positives, while a lower precision means more false positives. This is often at odds with recall, as an easy way to improve precision is to decrease recall.
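As a minimal sketch of that definition (plain Python rather than the NLTK implementation; the function name and the toy sets are invented for illustration), precision is the fraction of items labeled positive that really are positive:

```python
def precision(reference, predicted):
    """Fraction of predicted positives that are true positives."""
    if not predicted:
        return None  # undefined when nothing was labeled positive
    return len(reference & predicted) / len(predicted)

# Hypothetical document ids: 4 are truly positive, 5 were labeled positive.
reference = {1, 2, 3, 4}
predicted = {2, 3, 4, 5, 6}

true_positives = reference & predicted    # {2, 3, 4}
false_positives = predicted - reference   # {5, 6}
print(precision(reference, predicted))    # 3 / 5 = 0.6
```

Every false positive grows the denominator without growing the numerator, which is why fewer false positives means higher precision.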
Classifier Recall
Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision because it gets increasingly hard to be precise as the sample space increases.
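The tension between the two metrics can be seen in a small sketch. The helper names and document ids below are made up for the example; the point is only that a classifier that labels everything positive gets perfect recall at the cost of precision:

```python
def recall(reference, predicted):
    """Fraction of the truly positive items that were found."""
    if not reference:
        return None  # undefined when there are no positives to find
    return len(reference & predicted) / len(reference)

def precision(reference, predicted):
    """Fraction of predicted positives that are truly positive."""
    if not predicted:
        return None
    return len(reference & predicted) / len(predicted)

all_docs = set(range(10))
reference = {0, 1, 2, 3}   # hypothetical truly positive documents

cautious = {0, 1}          # labels few documents positive
greedy = all_docs          # labels every document positive

print(recall(reference, cautious), precision(reference, cautious))  # 0.5 1.0
print(recall(reference, greedy), precision(reference, greedy))      # 1.0 0.4
```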
F-measure Metric
Precision and recall can be combined to produce a single metric known as F-measure, which is the weighted harmonic mean of precision and recall. I find F-measure to be about as useful as accuracy. In other words, compared to precision and recall, F-measure is mostly useless, as you'll see below.
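For reference, here is a sketch of the weighted harmonic mean, assuming a weight alpha of 0.5 so that precision and recall count equally. The function is written out by hand for illustration and is not copied from any library:

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# Toy values: the harmonic mean sits below the arithmetic mean of 0.75.
print(f_measure(0.5, 1.0))   # about 0.667

# The pos precision and recall reported in the results section of this article.
print(f_measure(0.651595744681, 0.98))
```

Because a single number hides which of the two inputs is the weak one, very different precision/recall pairs can produce similar F-measures.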
Measuring Precision and Recall of a Naive Bayes Classifier
The NLTK metrics module provides functions for calculating all three metrics mentioned above. But to do so, you need to build two sets for each classification label: a reference set of correct values, and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes classifier. This time, instead of measuring accuracy, we'll collect reference values and observed values for each label (pos or neg), then use those sets to calculate the precision, recall, and F-measure of the Naive Bayes classifier. The values collected are simply the index of each featureset, obtained with enumerate.
import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag of words: every word in the review becomes a boolean feature.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Train on 3/4 of each category; integer division keeps the cutoffs valid as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)                  # index i belongs to the correct label
    observed = classifier.classify(feats)
    testsets[observed].add(i)              # index i was assigned the observed label

print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
print('pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos']))
print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
print('neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg']))
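To see what ends up in those sets, here is a toy version of the collection loop that uses hand-written labels instead of a trained classifier. The gold and predicted lists are invented for illustration; only the defaultdict-and-enumerate pattern comes from the code above.

```python
import collections

# Hypothetical gold labels and classifier outputs for six test items.
gold = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
predicted = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)       # where item i really belongs
    testsets[observed].add(i)   # where the classifier put it

print(dict(refsets))
print(dict(testsets))

# The same set arithmetic the metrics rely on, written out by hand.
hits = refsets['pos'] & testsets['pos']
pos_precision = len(hits) / len(testsets['pos'])
pos_recall = len(hits) / len(refsets['pos'])
print(pos_precision, pos_recall)
```

Item 2 is a false negative for pos and item 3 is a false positive for pos, so both metrics come out at two out of three.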
Precision and Recall for Positive and Negative Reviews
I found the results quite interesting:
pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364
So what does this mean?
- Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few false negatives in the pos class.
- But a file given a pos classification is only 65% likely to be correct. This mediocre precision means that 35% of the files labeled pos are false positives.
- Any file that is identified as neg is 96% likely to be correct (high precision). This means very few false positives for the neg class.
- But many files that are neg are incorrectly classified as pos. With recall at only 48%, 52% of the neg files are false negatives for the neg label.
- F-measure provides no useful information. There's no insight to be gained from having it, and we wouldn't lose any knowledge if it were taken away.
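The percentages in the list above can be turned back into raw counts. This sketch assumes the test set holds 250 pos and 250 neg files (a quarter of the 1,000 reviews per category in the movie_reviews corpus); the code in this article does not print those totals, so treat them as an assumption.

```python
# Assumed test set size per class (not printed by the article's code).
n_pos = n_neg = 250

pos_recall = 0.98
neg_recall = 0.476

pos_as_pos = round(pos_recall * n_pos)   # pos files labeled pos
neg_as_neg = round(neg_recall * n_neg)   # neg files labeled neg
pos_as_neg = n_pos - pos_as_pos          # false negatives for pos
neg_as_pos = n_neg - neg_as_neg          # false positives for pos

pos_precision = pos_as_pos / (pos_as_pos + neg_as_pos)
neg_precision = neg_as_neg / (neg_as_neg + pos_as_neg)

print(pos_as_pos, pos_as_neg, neg_as_pos, neg_as_neg)   # 245 5 131 119
print(pos_precision, neg_precision)
```

Under that assumption the recomputed precisions agree with the reported ones, and the counts show the imbalance directly: 131 neg files are mislabeled pos, against only 5 pos files mislabeled neg.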
Improving Results with Better Feature Selection
One possible explanation for the above results is that people use normally positive words in negative reviews, but precede them with "not" (or some other negative word), as in "not great". And since the classifier uses the bag of words model, which assumes every word is independent, it cannot learn that "not great" is negative. If this is the case, then these metrics should improve if we also train on multiple words, a topic I'll explore in a future article.
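A minimal sketch of what training on multiple words could look like: a feature extractor that adds adjacent word pairs alongside the single words, so that "not great" becomes a feature of its own. The name bigram_word_feats is hypothetical, and including every bigram is a simplification rather than a tuned approach.

```python
def bigram_word_feats(words):
    """Bag-of-words features plus every adjacent word pair."""
    words = list(words)
    feats = {word: True for word in words}
    for pair in zip(words, words[1:]):
        feats[pair] = True   # e.g. ('not', 'great')
    return feats

feats = bigram_word_feats(['this', 'movie', 'is', 'not', 'great'])
print(('not', 'great') in feats)   # True
print('great' in feats)            # True
```

The single word "great" is still present, but the classifier now also has a feature that only fires when "great" is negated.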
Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn't know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using sentiment-rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I'll also explore in a future article.
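As a rough illustration of scoring words by information gain, the sketch below measures how much knowing whether a word is present reduces uncertainty about the label. The function names and the document counts are invented for the example; a real run would count word occurrences across the training files.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(pos_with, neg_with, n_pos, n_neg):
    """Entropy of the label minus its entropy once word presence is known."""
    total = n_pos + n_neg
    with_word = [pos_with, neg_with]
    without_word = [n_pos - pos_with, n_neg - neg_with]
    remainder = 0.0
    for group in (with_word, without_word):
        size = sum(group)
        if size:  # skip empty groups to avoid dividing by zero
            remainder += size / total * entropy(group)
    return entropy([n_pos, n_neg]) - remainder

# Hypothetical counts over 100 pos and 100 neg documents.
neutral = information_gain(50, 50, 100, 100)     # appears equally in both classes
sentiment = information_gain(80, 10, 100, 100)   # appears mostly in pos documents
print(neutral, sentiment)
```

A word that shows up equally often in both classes scores zero and is a candidate for removal, while a word concentrated in one class scores higher and is worth keeping.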
If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.





May 17th, 2010 - 08:56
Nothing much to add, Jacob, just wanted to let you know that I really appreciate these posts on natural language processing, good stuff. It's an exciting but underappreciated field of study. Cheers!
May 17th, 2010 - 10:12
Thanks Stijn. I'm hoping my posts help increase NLP appreciation, or at least awareness of how to do it effectively. Looks like you're trying to do the same for information design, and now I have a new blog to explore.
June 26th, 2010 - 10:24
Thank you Jacob! Your posts are wonderful.
June 28th, 2010 - 10:41
You're welcome Marcos, glad you like the posts.
August 5th, 2010 - 00:04
Thanks Jacob, good stuff here. I'm really wanting to do this on some of my own data but am scratching my head as to how my sentiment data needs to be formatted to read in the sentiment id. Any idea how this works?
August 5th, 2010 - 14:53
The movie reviews corpus is in 2 directories, one for “pos” and another for “neg”. Then it uses the CategorizedCorpusReader to specify the categories based on which directory each file is in. So I'd read up on the CategorizedCorpusReader at http://nltk.googlecode.com/svn/trunk/doc/api/nl... and look at some of the other categorized corpora for examples (brown, reuters) to figure out what would work best for organizing your own data.
August 12th, 2010 - 19:56
Thanks Jacob. After I posted this, I found that info as well. I am using the CategorizedPlaintextCorpusReader. It looks like I use the constructor to create my own reader. I am adding another category, neutral, to the mix.