Accuracy is not the only metric for evaluating the effectiveness of a classifier. Two other useful metrics are precision and recall. These two metrics can provide much greater insight into the performance characteristics of a binary classifier.
Classifier Precision
Precision measures the exactness of a classifier. Higher precision means fewer false positives, while lower precision means more false positives. Precision is often at odds with recall, since an easy way to improve precision is to decrease recall.
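To make the definition concrete, here is a minimal sketch using hypothetical counts (not numbers from this article's classifier): precision is the fraction of pos classifications that are actually pos.

```python
# hypothetical counts for the pos label
true_positives = 80   # classified pos, actually pos
false_positives = 20  # classified pos, actually neg

precision = float(true_positives) / (true_positives + false_positives)
print 'precision: %.2f' % precision  # 0.80
```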
Classifier Recall
Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision, because it gets increasingly harder to be precise as the sample space increases.
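Continuing the same hypothetical counts, recall is the fraction of files that really are pos which the classifier manages to find:

```python
# hypothetical counts for the pos label
true_positives = 80   # classified pos, actually pos
false_negatives = 40  # actually pos, but classified neg

recall = float(true_positives) / (true_positives + false_negatives)
print 'recall: %.2f' % recall  # ~0.67
```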
F-measure Metric
Precision and recall can be combined to produce a single metric known as F-measure, which is the weighted harmonic mean of precision and recall. I find F-measure to be about as useful as accuracy. Or in other words, compared to precision & recall, F-measure is mostly useless, as you’ll see below.
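When precision and recall are weighted equally, F-measure reduces to their plain harmonic mean. A quick sanity check using the hypothetical values from the sketches above, with equal weighting:

```python
precision = 0.80  # hypothetical values carried over from the sketches above
recall = 0.67

# with equal weighting, F-measure is the harmonic mean of precision and recall
f_measure = 2 * precision * recall / (precision + recall)
print 'F-measure: %.2f' % f_measure  # ~0.73
```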
Measuring Precision and Recall of a Naive Bayes Classifier
The NLTK metrics module provides functions for calculating all three metrics mentioned above. But to do so, you need to build two sets for each classification label: a reference set of correct values, and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes classifier. This time, instead of measuring accuracy, we'll collect reference values and observed values for each label (pos or neg), then use those sets to calculate the precision, recall, and F-measure of the Naive Bayes classifier. The actual values stored in each set are simply the index of each featureset, obtained with enumerate.
```python
import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)

# build a reference set of correct labels and a test set of observed labels,
# keyed by label, using the featureset index as the value
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])
```

Precision and Recall for Positive and Negative Reviews
I found the results quite interesting:
```
pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364
```

So what does this mean?
- Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few false negatives in the pos class.
- But a file given a pos classification is only 65% likely to be correct. The mediocre precision means roughly 35% of the files labeled pos are false positives.
- Any file that is identified as neg is 96% likely to be correct (high precision). This means very few false positives for the neg class.
- But many files that are actually neg are incorrectly classified. The low recall means about 52% of the neg files end up as false negatives (see the quick check after this list).
- F-measure provides no useful information. There’s no insight to be gained from having it, and we wouldn’t lose any knowledge if it were taken away.
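The percentages in that list fall straight out of the numbers above: the share of pos classifications that are wrong is one minus pos precision, and the share of neg files that get missed is one minus neg recall.

```python
pos_precision = 0.651595744681
neg_recall = 0.476

# roughly 35% of files classified as pos are actually neg
print 'wrong pos classifications: %.0f%%' % ((1 - pos_precision) * 100)
# roughly 52% of neg files are missed (classified as pos)
print 'missed neg files: %.0f%%' % ((1 - neg_recall) * 100)
```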
Improving Results with Better Feature Selection
One possible explanation for the above results is that people use normally positive words in negative reviews, but preceded by “not” (or some other negating word), as in “not great”. And since the classifier uses the bag of words model, which assumes every word is independent, it cannot learn that “not great” is negative. If this is the case, then these metrics should improve if we also train on multiple words, a topic I’ll explore in a future article.
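As a rough sketch of that idea (this bigram variant of word_feats is just an illustration here, not code from a follow-up article), the feature extractor could emit each adjacent word pair as a feature alongside the single words, so that “not great” becomes its own feature:

```python
from nltk.util import bigrams

def bigram_word_feats(words):
    # single words, as before
    feats = dict([(word, True) for word in words])
    # plus each adjacent word pair, so 'not great' is distinct from 'great'
    feats.update(dict([(bigram, True) for bigram in bigrams(words)]))
    return feats
```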
Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn’t know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using sentiment rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I’ll also explore in a future article.
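Here is a hedged sketch of what that could look like, using NLTK’s chi-square association measure as a stand-in for information gain to score how strongly each word is tied to either label (the choice of scoring function and the 10000-word cutoff are my own assumptions, not settled choices from this article):

```python
import collections
from nltk.collocations import BigramAssocMeasures
from nltk.corpus import movie_reviews

# count each word overall and per label
word_fd = collections.Counter()
label_word_fd = {'pos': collections.Counter(), 'neg': collections.Counter()}

for label in ('pos', 'neg'):
    for word in movie_reviews.words(categories=[label]):
        word_fd[word.lower()] += 1
        label_word_fd[label][word.lower()] += 1

pos_count = sum(label_word_fd['pos'].values())
neg_count = sum(label_word_fd['neg'].values())
total_count = pos_count + neg_count

# score each word by how strongly it associates with either label
word_scores = {}
for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                           (freq, pos_count), total_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                           (freq, neg_count), total_count)
    word_scores[word] = pos_score + neg_score

# keep only the highest scoring words (the 10000 cutoff is arbitrary)
bestwords = set(sorted(word_scores, key=word_scores.get, reverse=True)[:10000])

def best_word_feats(words):
    # drop everything except the high-information words
    return dict([(word, True) for word in words if word in bestwords])
```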
If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.