Text Classification for Sentiment Analysis – Naive Bayes Classifier
Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.
NLTK Naive Bayes Classification
NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.
Bag of Words Feature Extraction
All of the NLTK classifiers work with feature sets, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:
def word_feats(words): return dict([(word, True) for word in words])
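As a quick illustration, here's roughly what the extractor returns for a short token list. The sample tokens are made up for this sketch and aren't taken from the corpus. Note that repeated words collapse into a single key, so this model records word presence, not word frequency.

```python
def word_feats(words):
    # Map every token to True; duplicate tokens collapse into one key.
    return dict([(word, True) for word in words])

# Hypothetical token list, not taken from the movie reviews corpus.
sample_tokens = ['a', 'truly', 'great', 'movie', ',', 'truly']
sample_feats = word_feats(sample_tokens)
print(sample_feats)
```

Punctuation tokens such as ',' become features too, since nothing is filtered out at this stage.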
Training Set vs Test Set and Accuracy
The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
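To make the split and the accuracy measure concrete, here's a minimal sketch in plain Python 3 that doesn't need NLTK. The helper names split_three_quarters and accuracy are hypothetical, the instances are placeholders standing in for the real reviews, and the hand-rolled accuracy function only approximates what I understand nltk.classify.util.accuracy to report: the fraction of test instances whose predicted label matches the gold label.

```python
def split_three_quarters(instances):
    # Integer division keeps the cutoff usable as a slice index.
    cutoff = len(instances) * 3 // 4
    return instances[:cutoff], instances[cutoff:]

def accuracy(predicted, gold):
    # Fraction of predictions that match the gold-standard labels.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Placeholder instances standing in for the 1000 files in each category.
negfeats = [({'bad': True}, 'neg')] * 1000
posfeats = [({'good': True}, 'pos')] * 1000

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)
trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test
print(len(trainfeats), len(testfeats))

# Toy check of the accuracy calculation: 3 of 4 labels agree.
toy_accuracy = accuracy(['pos', 'neg', 'pos', 'pos'], ['pos', 'neg', 'neg', 'pos'])
print(toy_accuracy)
```

Splitting each category separately keeps the training and test sets balanced between pos and neg.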
Training and Testing the Naive Bayes Classifier
Here's the complete Python code for training and testing a Naive Bayes classifier on the movie reviews corpus.
# Requires the movie_reviews corpus, e.g. via nltk.download('movie_reviews').
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag of words: every word becomes a feature with the value True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# Build one (feats, label) pair per review file.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Integer division keeps the cutoffs usable as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

# First 3/4 of each category for training, the rest for testing.
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
And the output is:
train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
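For anyone wondering how to read the ratios in that output: each one compares how likely a word is to appear in reviews of one label versus the other. The sketch below is a rough approximation, not NLTK's exact computation. The function name presence_ratio is hypothetical, the counts are invented for illustration, and the add-0.5 smoothing over two outcomes (present or absent) is an assumption about how the probabilities are estimated.

```python
def presence_ratio(pos_count, neg_count, pos_total, neg_total, smoothing=0.5):
    # Smoothed estimate of P(word present | label) for each label.
    # Assumption: add-0.5 smoothing over two outcomes (present / absent).
    p_pos = (pos_count + smoothing) / (pos_total + 2 * smoothing)
    p_neg = (neg_count + smoothing) / (neg_total + 2 * smoothing)
    # Report the larger probability over the smaller, with its direction.
    if p_pos >= p_neg:
        return 'pos : neg', p_pos / p_neg
    return 'neg : pos', p_neg / p_pos

# Invented counts: a word seen in 30 of 750 positive and 2 of 750 negative reviews.
direction, ratio = presence_ratio(30, 2, 750, 750)
print('%s = %.1f : 1.0' % (direction, ratio))
```

A word that shows up in only a handful of reviews can still earn a large ratio, which may be part of why words like "vulnerable" and "avoids" make the list.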





June 6th, 2010 - 18:51
Dear Sir,
Is there a way to determine polarity at the word sense level?
Thanking you
June 6th, 2010 - 19:45
Possibly, but it'd probably require a dictionary mapping sense to polarity, and of course you'd need to know the sense of the words first.
June 7th, 2010 - 18:11
Thanks for your reply,
Like you said, you need the "senses of the words first". Would you have a classifier that you could post the code for? Alternatively, instead of finding the senses, SentiWordNet could be used straight away, as its senses are already marked with polarity.
June 7th, 2010 - 19:54
I don't have any word sense disambiguation code, and I've never used SentiWordNet, so I'm afraid I can't help you there. But please let me know if you're able to figure something out and get it working.
June 8th, 2010 - 03:49
Sure, I am working on that these days.
July 31st, 2010 - 21:26
Very useful series. I'm wondering: can the naive Bayes classifier be trained with term weights (i.e. tf-idf), or must you use a boolean word model (presence or absence of the term)?
August 1st, 2010 - 20:59
The NLTK naive Bayes classifier would not work with term weights, or at least it would treat each distinct weight as a separate feature value. However, I believe there's an algorithm for multinomial naive Bayes that could take frequencies or weights into account, if you want to implement it yourself. But research shows that simple word presence is about as good as a multinomial model.