streamhacker.com Weotta be Hacking

10 May 2010

Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
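To make the feature format concrete, here is a small usage sketch. The token list is made up for illustration and is not taken from the corpus. Note that a repeated word collapses into a single key, which is what makes this a boolean (presence only) model rather than a frequency model.

```python
def word_feats(words):
    # Every word becomes a feature name with the value True.
    return dict([(word, True) for word in words])

# Hypothetical token list, not taken from the movie reviews corpus.
tokens = ['a', 'great', 'great', 'movie']
feats = word_feats(tokens)

# 'great' occurs twice in the input but only once in the features.
print(feats)
print(len(feats))
```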

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
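The split arithmetic and the accuracy measure can be sketched without NLTK. In the sketch below, split_3_4 and accuracy are hypothetical helpers written for illustration; the accuracy function is a stand-in for what nltk.classify.util.accuracy reports (the fraction of test instances whose predicted label matches the gold label), not NLTK's own code. The instances are placeholders, not real reviews.

```python
def split_3_4(items):
    # Integer division keeps the cutoff usable as a slice index.
    cutoff = len(items) * 3 // 4
    return items[:cutoff], items[cutoff:]

def accuracy(predicted, gold):
    # Fraction of predictions that match the gold-standard labels.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / float(len(gold))

# Placeholder (feats, label) pairs standing in for the 1000 + 1000 reviews.
neg = [({'word%d' % i: True}, 'neg') for i in range(1000)]
pos = [({'word%d' % i: True}, 'pos') for i in range(1000)]

neg_train, neg_test = split_3_4(neg)
pos_train, pos_test = split_3_4(pos)
train = neg_train + pos_train
test = neg_test + pos_test
print('train on %d instances, test on %d instances' % (len(train), len(test)))

gold = [label for (feats, label) in test]
# A made-up predictor that always answers 'neg' gets exactly half of
# this balanced test set right.
predicted = ['neg'] * len(gold)
print('accuracy: %s' % accuracy(predicted, gold))
```

Because the test set is balanced, 0.5 is the score to beat: any classifier that does no better than that has learned nothing useful.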

Training and Testing the Naive Bayes Classifier

Here's the complete Python code for training and testing a Naive Bayes Classifier on the movie review corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # bag of words: every word is a feature name with the value True
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# one (feats, label) pair per review file
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# integer division, so the cutoffs are valid slice indexes
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

# first 3/4 of each category for training, the rest for testing
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
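The ratios in the output above can be read as "how much more likely is this word to appear in a review of one class than the other". Here is a rough sketch of that calculation. It assumes add-0.5 smoothing over two outcomes (word present or word absent), which is an assumption about the estimator rather than something verified against NLTK here. The function names and the counts are hypothetical, chosen only so the arithmetic is easy to follow; they are not the real corpus counts.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    # Add-gamma smoothing over `bins` outcomes (word present / word absent).
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    # How much more likely the word is to appear in class A than in class B.
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word appears in 22 of 750 positive training
# reviews and in 1 of 750 negative ones. NOT the real corpus counts.
ratio = likelihood_ratio(22, 750, 1, 750)
print('pos : neg = %.1f : 1.0' % ratio)  # pos : neg = 15.0 : 1.0
```

Under these assumptions, a word only needs to be rare in one class and moderately common in the other to rank as highly informative, which may be why words like "vulnerable" and "avoids" show up alongside the obvious adjectives.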

Comments (8) Trackbacks (0)
  1. Dear Sir,

    Is there a way that polarity can be determined on sense level?
    Thanking you

  2. Possibly, but it'd probably require a dictionary mapping sense to polarity, and of course you'd need to know the sense of the words first.

  3. Thanks for your reply,
    As you said, you'd need to know the “senses of the words first”. Would you have a classifier that you could post the code for? Alternatively, instead of finding the senses, SentiWordNet could be used straight away, as its senses are already marked with polarity.

  4. I don't have any word sense disambiguation code, and I've never used SentiWordNet, so I'm afraid I can't help you there. But please let me know if you're able to figure something out and get it working.

  5. Sure I am working on that these days

  6. Sure I am working on that these days

  7. Very useful series. I'm wondering: can the naive Bayes classifier be trained with term weights (i.e. tf-idf), or must you use a boolean word model (presence or absence of the term)?

  8. The NLTK naive Bayes classifier would not work with term weights, or at least it would consider each weight as a separate feature value. However, I believe there's an algorithm for multinomial naive Bayes that could take frequencies or weights into account, if you want to implement it yourself. But research shows that simple word presence is about as good as a multinomial model.

