Text Classification for Sentiment Analysis – Precision and Recall

Accuracy is not the only metric for evaluating the effectiveness of a classifier. Two other useful metrics are precision and recall. These two metrics can provide much greater insight into the performance characteristics of a binary classifier.

Classifier Precision

Precision measures the exactness of a classifier. Higher precision means fewer false positives, while lower precision means more false positives. Precision is often at odds with recall: an easy way to improve precision is to sacrifice recall.
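
In confusion-matrix terms, precision is the number of true positives divided by everything the classifier labeled positive. A minimal sketch (the function and counts here are illustrative, not part of NLTK):

```python
def precision(true_pos, false_pos):
    # Of everything labeled positive, what fraction actually was positive?
    return true_pos / float(true_pos + false_pos)

print(precision(90, 10))  # 90 correct out of 100 positive labels -> 0.9
```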

Classifier Recall

Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision, because it gets increasingly harder to be precise as the sample space increases.
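
Similarly, recall is the number of true positives divided by everything that actually is positive. Another illustrative sketch:

```python
def recall(true_pos, false_neg):
    # Of everything actually positive, what fraction did the classifier find?
    return true_pos / float(true_pos + false_neg)

print(recall(90, 60))  # found 90 of 150 actual positives -> 0.6
```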

F-measure Metric

Precision and recall can be combined to produce a single metric known as F-measure, which is the weighted harmonic mean of precision and recall. I find F-measure to be about as useful as accuracy alone. In other words, compared to precision & recall, F-measure is mostly useless, as you’ll see below.
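
To make "weighted harmonic mean" concrete, here is a sketch of the computation. NLTK's f_measure uses an alpha of 0.5 by default, which reduces to the familiar balanced F1 score, 2 * P * R / (P + R):

```python
def f_measure(precision, recall, alpha=0.5):
    # Weighted harmonic mean: alpha weights precision, (1 - alpha) weights recall.
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Using the pos precision and recall reported below:
print(f_measure(0.651595744681, 0.98))  # ~0.782747603834
```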

Measuring Precision and Recall of a Naive Bayes Classifier

The NLTK metrics module provides functions for calculating all three metrics mentioned above. But to do so, you need to build two sets for each classification label: a reference set of correct values, and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes classifier. This time, instead of measuring accuracy, we’ll collect reference values and observed values for each label (pos or neg), then use those sets to calculate the precision, recall, and F-measure of the Naive Bayes classifier. The values collected in each set are simply the index of each featureset, obtained using enumerate.

import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
	return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
	refsets[label].add(i)
	observed = classifier.classify(feats)
	testsets[observed].add(i)

print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
print 'pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos'])
print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
print 'neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg'])

Precision and Recall for Positive and Negative Reviews

I found the results quite interesting:

pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364

So what does this mean?

  1. Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few false negatives in the pos class.
  2. But a file given a pos classification is only 65% likely to be correct. This mediocre precision means roughly 35% of pos classifications are false positives.
  3. Any file that is identified as neg is 96% likely to be correct (high precision). This means very few false positives for the neg class.
  4. But many files that are actually neg are incorrectly classified. Low recall means about 52% of neg files are false negatives.
  5. F-measure provides no useful information here. There’s no insight to be gained from having it, and we wouldn’t lose any knowledge if it were taken away.
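
All four numbers are views of a single confusion matrix, and since the test set is the last quarter of the movie_reviews corpus (250 pos and 250 neg files), it’s easy to verify them by hand. A quick arithmetic check in plain Python:

```python
pos_total, neg_total = 250, 250  # last 1/4 of 1000 files per label

tp = int(round(0.98 * pos_total))              # pos files correctly labeled pos: 245
labeled_pos = int(round(tp / 0.651595744681))  # total files labeled pos: 376
fp = labeled_pos - tp                          # neg files mislabeled as pos: 131
tn = neg_total - fp                            # neg files correctly labeled neg: 119

print(tn / float(neg_total))            # 0.476, the reported neg recall
print(tn / float(tn + pos_total - tp))  # 119/124 ~ 0.9597, the reported neg precision
```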

Improving Results with Better Feature Selection

One possible explanation for the above results is that people often use normally positive words in negative reviews, but with the word preceded by “not” (or some other negative word), as in “not great”. And since the classifier uses the bag of words model, which assumes every word is independent, it cannot learn that “not great” is negative. If this is the case, then these metrics should improve if we also train on multiple words, a topic I’ll explore in a future article.
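
As a rough sketch of that idea (plain Python here; NLTK has its own collocation helpers): include adjacent word pairs alongside the single words, so a pair like ('not', 'great') becomes a feature the classifier can weight on its own.

```python
def bigram_word_feats(words):
    # Single-word features, as in word_feats above...
    feats = dict((word, True) for word in words)
    # ...plus adjacent pairs, so negations like ('not', 'great') are visible.
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats

feats = bigram_word_feats(['not', 'great', 'acting'])
# feats now contains 'not', 'great', 'acting', ('not', 'great'), ('great', 'acting')
```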

Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn’t know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using sentiment rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I’ll also explore in a future article.
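
The filtering step itself is trivial once you have a scored word list; scoring words by information gain is the hard part, and the subject of that future article. A sketch, where good_words stands in for such a list:

```python
def best_word_feats(words, good_words):
    # Keep only words believed to carry sentiment; drop neutral filler.
    return dict((word, True) for word in words if word in good_words)

good_words = set(['great', 'awful', 'terrible', 'wonderful'])  # hypothetical scored list
print(best_word_feats(['the', 'plot', 'was', 'awful'], good_words))  # {'awful': True}
```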

If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.

  • Nothing much to add, Jacob, just wanted to let you know that I really appreciate these posts on natural language processing, good stuff. It's an exciting but underappreciated field of study. Cheers!

  • Thanks Stijn. I'm hoping my posts help increase NLP appreciation, or at least awareness of how to do it effectively. Looks like you're trying to do the same for information design, and now I have a new blog to explore 🙂


  • Marcos

    Thank you Jacob! Your posts are wonderful.

  • You're welcome Marcos, glad you like the posts.


  • Adam P Leary

    Thanks Jacob, good stuff here. I'm really wanting to do this on some of my own data but am scratching my head as to how my sentiment data needs to be formatted to read in the sentiment id. Any idea how this works?

  • The movie reviews corpus is in 2 directories, one for “pos” and another for “neg”. Then it uses the CategorizedCorpusReader to specify the categories based on which directory each file is in. So I'd read up on the CategorizedCorpusReader at http://nltk.googlecode.com/svn/trunk/doc/api/nl… and look at some of the other categorized corpora for examples (brown, reuters) to figure out what would work best for organizing your own data.

  • Adam P Leary

    Thanks Jacob. After I posted this, I found that info as well. I am using the CategorizedPlaintextCorpusReader. It looks like I use the constructor to create my own reader. I am adding another category, neutral, to the mix.

  • guest

    Thanks for the information. But I'd like to know how we can categorize a text, i.e., how can we find whether a text (given by the user) is pos or neg?

  • Once you have a trained classifier, then for every piece of text you want to classify, get the bag of words and pass that into the classifier, like classifier.classify(word_feats(text)). This will return one of the known labels, such as pos or neg.

  • Wow … this blog is a gem. Looking forward to getting much more of your perspective through the links.

  • How do you decide on the weights for harmonic mean?

  • Thanks, glad you like it.

  • The default weight is 0.5 (see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics-module.html#f_measure for more details). But as I said above, I find the harmonic mean pretty useless compared to accuracy, precision, and recall, so I’ve never had reason to question or modify the default.

  • DJ

    Hi, I'd like to know whether we can classify text as pos, neutral, or neg, instead of just pos and neg?

  • Sure, you just need a corpus of neutral text. My sentiment demo at http://text-processing.com/demo/sentiment/ uses movie descriptions from the subjectivity dataset at http://www.cs.cornell.edu/people/pabo/movie-review-data/ to determine if text is subjective/polar or objective/neutral.

  • DJ

    Thank you, but sorry for troubling you again. Since I am new to this, I just want to make sure what I am doing is correct. I downloaded the subjectivity datasets into a new directory called rotten_imdb under nltk_data/corpora/ and changed the file names to neutral.txt and polar.txt. Then I created a reader and categorized the new corpora as polar_neutral_review = CategorizedPlaintextCorpusReader(root, '.*txt', cat_pattern='(\w+)\.txt'). Then I used the same code given for text classification using the bigram algorithm, except that I changed the following line, trainfeats = negfeats + posfeats, since there is only one file per category. Up to this point it doesn't give any error, but after training, when I categorize the user-given text, it always returns 'polar'. I have no idea where I went wrong. Need help!

  • Have you measured the precision & recall of the trainfeats? I’m pretty sure it’ll be skewed one way, and you’ll have to follow the instructions in the next article of this series: use only high information words. Or, since you have the files named appropriately, you should also be able to use train_classifier.py in github.com/japerk/nltk-trainer with options like --sents --min_score 3 to train a classifier.

  • Suresh

    Nice blog, would you like to share positive/negative libraries?

  • Thanks. What libraries are you thinking of? I just use NLTK to train objects using my tools at http://github.com/japerk/nltk-trainer.

  • Sam

    Hi Jacob! Thank you for your brilliant tutorials, they are really helpful =)
    I have couple of questions.

    1) When training the classifier with smaller corpus sometimes the most informative features function shows words with the value ‘None’. According to NLTK’s documentation: “The feature value ‘None’ is reserved for unseen feature values” but using your example I get None values for words which are in both training and test sets. Why? I don’t get it.

    2) To implement neutrality you just used another classifier with the subjectivity dataset, right?

    Thanks in advance!!

  • Hi Sam,

    1) I haven’t seen None before, but it doesn’t sound like it should be there. Could you post the output, and maybe the code you use to train?

    2) Exactly, and I outlined the method in http://streamhacker.com/2011/01/05/hierarchical-classification/

  • Sam

    Wow, what a quick response, Jacob! Thanks!

    1) Yes, it’s pretty simple. Just reduce a lot your training and test sets. In the classifier shown here (http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/) just add these below line #9.

    negids = negids[:10]
    posids = posids[:10]

    We’re reducing our sets to 10 files for each label. Obviously this is not what we want in a real environment, and it doesn’t happen when training on a relatively large corpus, but it caught my attention to see ‘None’ values for words that appear in both the training and test sets. Shouldn’t they all be True?

    2) Great! I have to make a similar implementation using the Spanish language, so it’s time for me to compile a neutral corpus 🙂

  • So what I think is going on is that words with None have been seen in one of the categories, but not the other, and so not seeing them becomes an indicator of the category to choose. For example, “really = None pos : neg = 3.7 : 1.0” would mean that if “really” is not in a featureset, it’s more likely to be positive (since the classifier only saw it in negative training examples).

  • Schillermika


    I’m classifying text as either vulgar or clean. I ran the metrics on the data several times and got slightly different results each time. For example, notice that every score has changed slightly from the first to the second evaluation, even though it’s the exact same data. Any idea why this might be?

    Accuracy: 0.875968992248
    dirty precision: 0.965517241379
    dirty recall: 0.8
    clean precision: 0.802816901408
    clean recall: 0.966101694915
    Most Informative Features
                           ! = True            Dirty : Clean  =     19.8 : 1.0
                           . = None            Dirty : Clean  =      9.2 : 1.0
                         ass = True            Dirty : Clean  =      6.8 : 1.0
                        even = True            Clean : Dirty  =      6.1 : 1.0
                        feel = True            Clean : Dirty  =      5.7 : 1.0
                   something = True            Clean : Dirty  =      5.3 : 1.0
                        your = True            Dirty : Clean  =      5.2 : 1.0
                         had = True            Dirty : Clean  =      5.2 : 1.0
                       would = True            Clean : Dirty  =      4.7 : 1.0
                       being = True            Clean : Dirty  =      4.5 : 1.0

    Accuracy: 0.744186046512
    dirty precision: 1.0
    dirty recall: 0.521739130435
    clean precision: 0.645161290323
    clean recall: 1.0
    Most Informative Features
                           ! = True            Dirty : Clean  =     19.5 : 1.0
                          be = True            Clean : Dirty  =      8.7 : 1.0
                         don = True            Clean : Dirty  =      7.9 : 1.0
                         get = True            Clean : Dirty  =      6.2 : 1.0
                         not = True            Clean : Dirty  =      6.2 : 1.0
                   something = True            Clean : Dirty  =      6.2 : 1.0
                        this = True            Clean : Dirty  =      6.2 : 1.0
                          by = True            Clean : Dirty  =      6.2 : 1.0
                       would = True            Clean : Dirty  =      6.2 : 1.0
                       never = True            Clean : Dirty  =      6.2 : 1.0
                         off = True            Dirty : Clean  =      6.1 : 1.0
                       being = True            Clean : Dirty  =      5.4 : 1.0
                         did = True            Clean : Dirty  =      5.4 : 1.0
                        feel = True            Clean : Dirty  =      5.2 : 1.0
                           . = None            Dirty : Clean  =      5.2 : 1.0

  • The only reason this would be happening is if the training set is changing. Maybe you changed the fraction? Or there’s some randomness being introduced somewhere?

  • Schillermika

    What do you mean by change the fraction? I don’t really see any bugs that might be causing this either. Here’s the code I’m running. Sorry if it gets all jumbled looking.

    import nltk
    import collections
    import random
    import nltk.metrics
    from nltk.corpus.reader import PlaintextCorpusReader
    from nltk.corpus.util import LazyCorpusLoader

    dirty_teen_corpus = LazyCorpusLoader('cookbook', PlaintextCorpusReader, ['dirty_teen.txt'])
    clean_teen_corpus = LazyCorpusLoader('cookbook', PlaintextCorpusReader, ['clean_teen.txt'])

    raw_dataset = ([(sentence, "Dirty") for sentence in dirty_teen_corpus.sents()] +
                   [(sentence, "Clean") for sentence in clean_teen_corpus.sents()])

    def sexy_text(sentence):
        return dict([(word.lower(), True) for word in sentence])

    featuresets = [(sexy_text(sentence), label) for (sentence, label) in raw_dataset]
    random.shuffle(featuresets)
    train_feats, test_feats = featuresets[:270], featuresets[270:]
    classifier = nltk.NaiveBayesClassifier.train(train_feats)

    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(test_feats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'Accuracy:', nltk.classify.accuracy(classifier, test_feats)
    print 'dirty precision:', nltk.metrics.precision(refsets['Dirty'], testsets['Dirty'])
    print 'dirty recall:', nltk.metrics.recall(refsets['Dirty'], testsets['Dirty'])
    print 'clean precision:', nltk.metrics.precision(refsets['Clean'], testsets['Clean'])
    print 'clean recall:', nltk.metrics.recall(refsets['Clean'], testsets['Clean'])
    print classifier.show_most_informative_features(15)

  • Here’s your problem: random.shuffle(featuresets)
    If you do this before splitting the train_feats & test_feats, then of course you’ll be getting different results, because the training features are different each time.

  • Schillermika

    Why didn’t I see that? Thanks! Really good blog, btw. I’ve learned more practical stuff on here and through your book than anywhere else.

  • Fahd

    Thanks Jac,
    I found this article very helpful.

    I wonder if there is a way to calculate the overall precision, recall and F-measure for all classes.

    I believe this would be very helpful instead of calculating the average of these measures.


  • You can do this with binary classifiers, if you assume one class is the positive class, and the other is the negative class (this isn’t referring to sentiment, but positive as in true, and negative as in false). Then you count the number of true positives, false positives, and false negatives, and calculate the precision and recall as defined at https://en.wikipedia.org/wiki/Precision_and_recall#Definition_.28classification_context.29


  • virendhar

    In “Improving Results with Better Feature Selection”, you mentioned that sentiment shifts caused by negative words, as in “not great”, would be handled by training on multiple words/n-grams. Have you had a chance to do that? Let me know where I can refer to your work. Thanks.

  • I covered this a bit in the post on collocations: http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/
    But you can also just use nltk.util.bigrams(words) to generate bigrams for a list of words, then use those bigrams as features.

  • ClGuy

    One question. This returns the precision and recall of the classifier on each single label (pos & neg). What if I want to measure the overall precision and recall of the system?

  • You can average the precision and recall for each label.

  • DottorEuler

    Hi Jacob,

    outstanding tutorial. You make NLTK easy for “human beginners”.
    I’m trying to generate a ROC curve after the analysis; however, so far the most approachable library for NLTK is PyROC, and even then it is hard to use because of the never-ending incompatibility between lists/strings/dicts.

    What do you suggest me in order to generate a ROC curve?

  • Take a look at scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
    It doesn’t generate any charts, but maybe you can produce a CSV for Excel or some other charting program.


  • K-IFY

    Hi Jacob,
    Thanks so much for the support you are rendering via your blog.

    I have a couple of questions to ask you.

    1. We are working on a similar text classification for sentiment analysis problem, on pull request comments, with positive, negative, and neutral labels. Using NLTK libraries and the bigram techniques explained here, can we split our dataset into three files (positive_f, negative_f, and neutral_f), each containing only the corpus comments, for the purpose of text mining?

    2. If we want to plot the distribution of the said labels across the dataset, or aggregate the data by any attribute in the dataset, how do we accomplish these using your guidelines?
    Thanks in advance.

  • Hi,

    1. Yes, and if you separate each comment with blank lines, then it should be easy to treat each comment as a separate instance, maybe by using one of NLTK's corpus readers and the paragraphs() method.

    2. I’m not sure about plotting, lots of people do that differently, but matplotlib is fairly popular.

  • K-IFY

    Thanks so much, and sorry for the late response.

    Regarding question 1, I understand from your code that I could use one dataset instance to account for all three classes, as follows:

    from nltk.corpus import dataset_name

    negids = dataset_name.fileids('neg')
    posids = dataset_name.fileids('pos')

    neuids = dataset_name.fileids('neu')…
    Thus, we could plot the distribution of the said labels across the dataset, or aggregate the data by any attribute in the dataset, using the Matplotlib library you suggested.

    I’m a greenhorn in Machine Learning and Python language. I will appreciate it if you could give me more guidance on how to go about it.

    Again, thanks in advance.

  • I don’t know matplotlib, so I tend to stick with simple counting, such as “how many pos fileids?”

  • Bhushan Tembhurne

    The code is not working!
    The previous code to compute sentiment analysis using a Naive Bayes classifier worked fine for me, but when I tried to run this code it did not work. It gives the error "name 'nltk' is not defined".
    When I add "import nltk" to the code, it returns another error: AttributeError: 'module' object has no attribute 'precision'.
    Please resolve the problem ASAP!

  • Maybe your code is wrong. I don’t see any code above that is just “import nltk”.

  • Nikos Spatiotis

    I had the same problem… you have to insert this line at the beginning: from nltk.metrics import precision, recall, f_measure and then change the print statements to:

    print 'pos precision:', precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', recall(refsets['pos'], testsets['pos'])
    print 'pos F-measure:', f_measure(refsets['pos'], testsets['pos'])
    print 'neg precision:', precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', recall(refsets['neg'], testsets['neg'])
    print 'neg F-measure:', f_measure(refsets['neg'], testsets['neg'])