StreamHacker Weotta be Hacking

May 10, 2010

Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
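
As a quick check of what this produces, here is a minimal sketch that applies the function to a short token list. The sample tokens are made up for illustration and are not taken from the corpus.

```python
def word_feats(words):
    # Map each token to True; duplicate tokens collapse into one key.
    return dict([(word, True) for word in words])

# Hypothetical sample tokens, chosen only for illustration.
sample = ['a', 'great', 'great', 'movie']
feats = word_feats(sample)
print(feats)
```

Because the result is a dictionary, a repeated word such as 'great' is recorded once, so the features capture presence rather than frequency.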

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
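
The split and the accuracy measure described above can be sketched in plain Python. This is only an illustration of the idea, not NLTK's implementation: split_three_quarters, accuracy, always_pos, and the placeholder feature sets are hypothetical names and data introduced here.

```python
def split_three_quarters(items):
    # Integer division keeps the cutoff usable as a slice index.
    cutoff = len(items) * 3 // 4
    return items[:cutoff], items[cutoff:]

def accuracy(classify, gold):
    # Fraction of (feats, label) pairs whose predicted label matches the gold label.
    correct = sum(1 for feats, label in gold if classify(feats) == label)
    return correct / float(len(gold))

# Placeholders standing in for the 1000 negative and 1000 positive reviews.
negfeats = [({'bad': True}, 'neg')] * 1000
posfeats = [({'good': True}, 'pos')] * 1000

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)
trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

# Hypothetical classifier that always answers 'pos'.
always_pos = lambda feats: 'pos'
baseline = accuracy(always_pos, testfeats)
print('baseline accuracy: %s' % baseline)
```

Since the test set is balanced, a classifier that always guesses one label scores 0.5, which is a useful floor to compare a trained classifier against.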

Training and Testing the Naive Bayes Classifier

Here's the complete Python code for training and testing a Naive Bayes Classifier on the movie review corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag of words: every word that appears maps to True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# Build one (feats, label) pair per review file.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Train on the first 3/4 of each category; integer division keeps the cutoffs valid as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
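
To make the ratios in the output above more concrete: each one compares how likely a word is to appear in reviews of one class versus the other. The following is a minimal sketch of that idea using raw document fractions on made-up data. It does not reproduce NLTK's smoothed probability estimates, and likelihood_ratio and the toy reviews are hypothetical, introduced here only for illustration.

```python
def likelihood_ratio(word, labeled_feats, label_a, label_b):
    # Fraction of documents with the given label that contain the word.
    def fraction(label):
        docs = [feats for feats, l in labeled_feats if l == label]
        return sum(1 for feats in docs if word in feats) / float(len(docs))
    # Assumes the word occurs at least once under label_b (no smoothing here).
    return fraction(label_a) / fraction(label_b)

# Toy data: 'magnificent' appears in 6 of 10 pos reviews and 2 of 10 neg reviews.
toy = (
    [({'magnificent': True}, 'pos')] * 6 + [({'fine': True}, 'pos')] * 4 +
    [({'magnificent': True}, 'neg')] * 2 + [({'dull': True}, 'neg')] * 8
)
ratio = likelihood_ratio('magnificent', toy, 'pos', 'neg')
print('pos : neg = %.1f : 1.0' % ratio)
```

Without smoothing, a word that never occurs in one class would cause a division by zero, which is one reason real classifiers use smoothed estimates.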

  • anonymous

    Can you tell me how to compute cosine similarity for these reviews when we are creating a dictionary of these features?

  • http://streamhacker.com/ Jacob Perkins

    It’s not exactly cosine similarity, but I wrote about using information to eliminate low information features at http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/.

  • http://www.facebook.com/sonia.gupta.1004837 Sonia Gupta

    Consider a sentence that contains a word with multiple senses. Using SentiWordNet, I get multiple scores for that word, so how can I calculate whether it is positive or negative? Can you elaborate? If there is any way to do this, please let me know; I am not able to use SentiWordNet for this reason.

  • http://streamhacker.com/ Jacob Perkins

    You need to look into Word Sense Disambiguation: https://en.wikipedia.org/wiki/Word_sense_disambiguation

  • http://www.facebook.com/ashanghavi Amar Shanghavi

    Dear Jacob, thank you for such a great intro to NLTK. I am reading your book closely too to get a better understanding of text analysis. I would like to know if there is already a pre existing corpus for news (tv transcript or print) which has been classified by positive and negative. I would like to do some sentiment analysis of tv news transcripts and wanted to start from an existing database before I create my own classifications (as a first pass).

  • http://streamhacker.com/ Jacob Perkins

    Hi Amar,

    I don’t know of any news sentiment corpus, but you might want to look into “corpus bootstrapping”, which is a way to create your own custom corpus based on existing corpora and/or models. Here’s a presentation I gave on the topic: http://www.slideshare.net/japerk/corpus-bootstrapping-with-nltk

  • http://www.facebook.com/ashanghavi Amar Shanghavi

    Dear Jacob,

    I have tried working with the code you wrote in your book, but I get stuck at one point that will not execute, and I am not sure why. When I try to run the negation replacer, I get the following message:

    ‘AntonymReplacer’ object has no attribute ‘replace_negations’

    I am sure I have copied everything exactly as your code.

    Thanks

  • http://streamhacker.com/ Jacob Perkins

    On Page 42, the AntonymReplacer class is defined with 2 methods: replace & replace_negations. Based on the error message, you either did not define the replace_negations method, or defined it incorrectly.

  • chaoprokia

    Is it possible to use this to train 3 classes?

    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')
    neuIds = movie_reviews.fileids('neu')

  • http://streamhacker.com/ Jacob Perkins

    Sure, NLTK classifiers work with any number of classes, but most classifiers tend to get less accurate as you go beyond 2 classes.
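
For illustration, here is a minimal sketch of three-class training with NaiveBayesClassifier. The training data is made up, and 'neu' is a hypothetical third label (the movie reviews corpus itself only has pos and neg categories), so this shows the mechanics rather than a realistic model.

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# Made-up training data; 'neu' is a hypothetical third label.
train = [
    (word_feats(['great', 'wonderful']), 'pos'),
    (word_feats(['great', 'fun']), 'pos'),
    (word_feats(['awful', 'boring']), 'neg'),
    (word_feats(['awful', 'dull']), 'neg'),
    (word_feats(['okay', 'average']), 'neu'),
    (word_feats(['okay', 'watchable']), 'neu'),
]

classifier = NaiveBayesClassifier.train(train)
labels = sorted(classifier.labels())
prediction = classifier.classify(word_feats(['great']))
print(labels)
print(prediction)
```

The training call is the same as in the two-class case; only the set of labels in the training pairs changes.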

  • Praveen Gr

    Can anyone help me with Sentiment Analysis code or link which gives considerably good result ?

  • tarik setia

    How can I use NLTK to calculate a priori probabilities and the probability of each word in the feature set?

  • http://streamhacker.com/ Jacob Perkins

    The probability module has many useful functions & classes for calculating probabilities: http://nltk.org/api/nltk.html#module-nltk.probability
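
As one possible starting point, here is a minimal sketch using FreqDist and ConditionalFreqDist from that module on made-up documents. It computes unsmoothed relative frequencies over tokens, which is an assumption made for simplicity and is not necessarily what NaiveBayesClassifier estimates internally.

```python
from nltk.probability import FreqDist, ConditionalFreqDist

# Made-up labeled documents, for illustration only.
docs = [
    (['great', 'movie'], 'pos'),
    (['great', 'fun'], 'pos'),
    (['fun'], 'pos'),
    (['awful', 'movie'], 'neg'),
]

# Count how often each label occurs across documents.
label_fd = FreqDist(label for words, label in docs)

# Count word tokens separately for each label.
word_cfd = ConditionalFreqDist(
    (label, word) for words, label in docs for word in words
)

prior_pos = label_fd.freq('pos')             # share of documents labeled 'pos'
p_great_pos = word_cfd['pos'].freq('great')  # share of 'pos' tokens that are 'great'
print('P(pos) = %.2f' % prior_pos)
print('P(great | pos) = %.2f' % p_great_pos)
```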