Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it’s often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I’ll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True. Here’s the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
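As a quick illustration, here is what the extractor returns for a short, made-up token list (the sample words are hypothetical, not taken from the corpus). Repeated words collapse into a single key, so the model records only whether a word is present, not how often it occurs.

```python
def word_feats(words):
    # Map every word to True; repeated words collapse into one key.
    return dict([(word, True) for word in words])

# Hypothetical tokens, for illustration only.
sample = ['a', 'great', 'great', 'movie']
feats = word_feats(sample)
print(sorted(feats.items()))
# prints [('a', True), ('great', True), ('movie', True)]
```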

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We’ll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of ‘pos’ or ‘neg’. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
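The split arithmetic and the accuracy measure can be sketched without NLTK. In this minimal sketch the feature dictionaries are empty placeholders, and `accuracy` is a hypothetical helper written for illustration. It follows the idea behind nltk.classify.util.accuracy (the fraction of test instances whose predicted label matches the gold label) but is not NLTK's implementation.

```python
# Placeholder labeled instances standing in for the real (feats, label) pairs.
negfeats = [({}, 'neg')] * 1000
posfeats = [({}, 'pos')] * 1000

# Integer division keeps the cutoffs usable as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

def accuracy(predicted, gold):
    # Hypothetical helper: fraction of predictions that match the gold labels.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / float(len(gold))

gold = [label for (_, label) in testfeats]
# A trivial "always pos" guesser gets exactly half of this balanced test set right.
always_pos = ['pos'] * len(gold)
baseline = accuracy(always_pos, gold)
print('train: %d, test: %d, baseline accuracy: %.1f'
      % (len(trainfeats), len(testfeats), baseline))
```

Because the test set is balanced, 0.5 is the score a classifier would get by always guessing one label, which makes it a useful reference point when reading any reported accuracy.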

Training and Testing the Naive Bayes Classifier

Here’s the complete Python code for training and testing a Naive Bayes Classifier on the movie reviews corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
# The corpus must be installed first: nltk.download('movie_reviews')

def word_feats(words):
    # Boolean bag of words: every word present maps to True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (feats, label) pair per review file.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Integer division so the cutoffs can be used as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
# Prints the "Most Informative Features" table shown in the output.
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are “vulnerable” and “avoids”. Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we’re able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.

  • Vandana

    When I am tagging hindi text file then i am getting correct output as below:

    ???? Unk
    ?? Unk
    ???? Unk
    ???? Unk
    ???? Unk
    ?? QFNUM
    ??? Unk
    ?? VAUX

    but when writing this output to the file, it is writing as follows:

    ? ? ? ?Unk
    ? ?Unk
    ? ? ? ?Unk
    ? ? ? ?Unk
    ? ? ? ?Unk
    ? ?QFNUM
    ? ? ?Unk
    ? ?VAUX

    and because it is writing character wise (not word wise), when I am using IndianCorpusReader on the following data:

    ? ? ? ?/PREP ? ?/PREP ? ? ? ?/VFM ? ? ? ?/NN ? ? ? ?/NNP ? ?/QFNUM ? ? ?/PRP ? ?/VAUX

    to find words and tagged sentences by using built in function then it gives no error but wrong output as follows:

    >>> reader.words()[0:10]
    [‘?’,’ ?’,’?’,’ ?/PREP’,’?’,’ ?/PREP’,’?’,’ ?’,’?’,’ ?/VFM’…]

    but we want the output in the following format:


    What is the problem and how to debug it?

  • http://streamhacker.com/ Jacob Perkins

    That exception is an indication that you are not using dictionaries as featuresets, and are instead using strings.

  • http://streamhacker.com/ Jacob Perkins

    It looks like the tokenization is incorrect. NLTK has a bunch of different word tokenization options, many of which are demoed at http://text-processing.com/demo/tokenize/. I’d suggest trying your text there to see if any of them tokenize the words correctly.

  • Manjunath Nadagouda N

    hi Jacob,
    Could you please share which tagger and tokenizer you used in your nltk-demos for the Hindi language?

    i need this information for my research purpose.
    thanks in advance.

  • http://streamhacker.com/ Jacob Perkins

    The tokenizer is nltk.tokenize.wordpunct_tokenize and the tagger I trained on the hindi.pos file in NLTK’s indian corpus, using train_tagger.py from https://github.com/japerk/nltk-trainer