Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it’s often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I’ll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with feature sets, which can be simple dictionaries mapping a feature name to a feature value. For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True. Here’s the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
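
For example, calling it on a handful of tokens just maps each word to True, giving something like this:

word_feats(['the', 'plot', 'was', 'great'])
# {'the': True, 'plot': True, 'was': True, 'great': True}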

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We’ll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tuples of the form [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of ‘pos’ or ‘neg’. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.

Training and Testing the Naive Bayes Classifier

Here’s the complete Python code for training and testing a Naive Bayes classifier on the movie reviews corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
	return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# use 3/4 of the reviews for training, the rest for testing
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are “vulnerable” and “avoids”. Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we’re able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.

  • Jerome Kuebler

    Thanks for the tutorial! I went through and troubleshot it for python 3.6.1. The final, working code is at https://github.com/jerkuebler/Sentiment

  • Cool, thanks for posting the code

  • Adarsh Kumar

    What if I want to analyse my own sentence for this polarity check?
    I’ve tokenized the input and passed the list to word_feats. The thing is, I’m getting a lot of ambiguous results such as: ‘like’ is negative, ‘dislike’ is positive.
    I might say some of the sentences give good results.
    But although the accuracy is up to 81% after bigram collocations, what is the reason behind getting so many unexpected outputs?

  • Adarsh Kumar

    # Run once, to store the trained classifier
    import pickle

    save_clf = open("Sentiment.pkl", "wb")
    pickle.dump(classifier, save_clf)
    save_clf.close()

    But after we pickle the classifier, it takes more time to load than even training it normally. What is the reason?

  • If your pickle file is very large, then loading from disk can take a while, depending on your disk speed. You’re measuring CPU speed vs disk speed in this comparison.
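
    For loading it back, a minimal sketch (assuming the Sentiment.pkl file written above) might look like this:

    import pickle

    # Load the previously pickled classifier from disk
    with open('Sentiment.pkl', 'rb') as f:
        classifier = pickle.load(f)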

  • This really depends on the training data. Try using the prob_classify() method of the classifier to see the strength of each classification label.
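
    For example, a rough sketch of checking the label strengths for a single word, assuming the word_feats function and the classifier trained in the post:

    # Probability distribution over labels for one feature set
    probs = classifier.prob_classify(word_feats(['like']))
    for label in probs.samples():
        print(label, round(probs.prob(label), 3))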

  • Adarsh Kumar

    Where can I get better sentiment data that gives a significant amount of correct outputs? Please suggest.
    I’ve been looking this up recently but have only found one file of negative words and one file of positive words here: https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/tree/master/data/opinion-lexicon-English

  • rando

    Great tutorial, I actually learned a lot from this. Can I get a prediction on the polarity of a text if I input a string such as:

    “The beer was amazing. But the hangover was horrible. My boss was not happy.” How can I get a polarity for this text?

  • rando

    I am stuck with an issue where I cannot categorize a text into polarities. I have trained a Naive Bayes classifier but I have no idea how to categorize the input text. Any idea on how that can be done?

  • Adarsh Kumar

    # Store the trained classifier in a variable.
    context = classifier.classify(word_feats(p))  # p is the list of tokenized words
    if str(context) == 'pos':
        print("Positive statement")
    elif str(context) == 'neg':
        print("Negative statement")

  • rando

    Thank you for replying. It categorizes all words as negative; it always goes to the elif branch. 🙁

  • Adarsh Kumar

    Which dataset did you use for the classification?

  • rando

    I am using a dataset similar to the IMDB movie reviews, with reviews in text files categorized into negative and positive folders.

  • rando

    Can I know whether your program that analyzes sentences for the polarity check implements the same logic, where it analyzes externally added text for categorization?

  • Adarsh Kumar

    Send your code to iamadarshkumar@gmail.com. I’ll see if I can debug that.

  • You’d have to split the text into sentences, then analyze each sentence separately
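
    A rough sketch of that, assuming the word_feats function and classifier from the post (the punkt tokenizer data needs to be downloaded first via nltk.download('punkt')):

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "The beer was amazing. But the hangover was horrible. My boss was not happy."
    for sentence in sent_tokenize(text):
        # Classify each sentence on its own
        feats = word_feats(word_tokenize(sentence.lower()))
        print(sentence, '->', classifier.classify(feats))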

  • rando

    Thank you for your response. When I try to classify a sentence, it always predicts the sentence as negative even if the sentence is clearly positive. Any idea why this happens?
    P.S. I am using a dataset which has two main categories with many text files in them; basically it is like the movie review dataset found in the NLTK corpus, but with my own text files added.

  • Classification really depends on the training data and model(s) used. Usually bigram features work better. I’d suggest using the most_informative_features method to see what features are strongest one way or the other.
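
    For reference, a bigram feature extractor along these lines is a common approach; this is just a sketch, and the chi-squared scoring plus the 200-bigram cutoff are example choices:

    import itertools
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
        # Keep the single words plus the n highest-scoring bigrams as features
        bigram_finder = BigramCollocationFinder.from_words(words)
        bigrams = bigram_finder.nbest(score_fn, n)
        return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])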

  • rando

    This is what I am getting from the most_informative_features method. None of the top features are from the negative folder; all of them are from positive. But all predictions by classify() are negative.

  • This sounds like a data/feature issue that only you can debug, but here are some questions to think about: Is your input text to classify similar to the training text? What was the model’s accuracy during training? Do you have a lot more negative examples in the training data than positive?

  • rando

    The model was reaching the upper 80s. That’s what made me wonder about the false predictions. Yes, I think that may be the reason, because I created the negative dataset and it has a higher number of sentences than the positive dataset. Thanks again, I will look into it.

  • rando

    Hi Jacob, thanks to you, by adapting your code to my problem I am getting really good predictions for files which are either positive or negative.
    But the problem I have now is: when I add a file with a mix of positive and negative sentiments, is there a way I can show that this file had a positive sentiment of this percentage and a negative sentiment of this percentage, in addition to predicting the overall sentiment?

  • Glad you’re getting good results now. For a new file, you can do some simple counting in Python: total count of items, count of positive, count of negative. Then you can calculate the percentage of each.
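
    Something along these lines, using a hypothetical sentiment_percentages helper that classifies each sentence with the word_feats function and classifier from the post:

    from nltk.tokenize import sent_tokenize, word_tokenize

    def sentiment_percentages(text):
        # Label each sentence, then report the share of pos and neg labels
        labels = [classifier.classify(word_feats(word_tokenize(s.lower())))
                  for s in sent_tokenize(text)]
        total = len(labels)
        return {'pos': labels.count('pos') / total, 'neg': labels.count('neg') / total}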