Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text classification, generally focused on identifying opinion polarity, and while it’s often not very accurate, even rough polarity labels can be useful. For simplicity (and because the training data is easily accessible) I’ll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featuresets, which can be simple dictionaries mapping a feature name to a feature value. For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True. Here’s the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
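
For example, passing in a tokenized sentence produces a dictionary where every distinct word maps to True (a quick illustration):

feats = word_feats(['the', 'plot', 'was', 'magnificent'])
# feats == {'the': True, 'plot': True, 'was': True, 'magnificent': True}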

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We’ll use 3/4 of them as the training set, and the rest as the test set, which gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tuples of the form [(feats, label)], where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of ‘pos’ or ‘neg’. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
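
For example, a single entry in the training or test list would look something like this (a hand-made illustration, not actual corpus data):

({'magnificent': True, 'cast': True, 'was': True}, 'pos')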

Training and Testing the Naive Bayes Classifier

Here’s the complete Python code for training and testing a Naive Bayes classifier on the movie reviews corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# build a (featureset, label) tuple for every review file
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# use 3/4 of each class for training, the rest for testing
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The ratio on each line shows how much more often the feature occurs under one label than the other: “magnificent = True” shows up about 15 times more often in positive training reviews than in negative ones. The only two words that seem a bit odd are “vulnerable” and “avoids”; perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we’re able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
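
The post stops at evaluation, but the trained classifier can also label new text. Here’s a minimal sketch, reusing classifier and word_feats from above and nltk.wordpunct_tokenize for word tokenization (the example review is made up):

import nltk

review = "This movie was magnificent, with an outstanding cast."
# lowercase to match the movie_reviews corpus tokens, then build the feature dict
feats = word_feats(nltk.wordpunct_tokenize(review.lower()))
print classifier.classify(feats)  # prints 'pos' or 'neg'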

  • Vandana

    When I am tagging a Hindi text file, I am getting the correct output, as below:

    ???? Unk
    ?? Unk
    ???? Unk
    ???? Unk
    ???? Unk
    ?? QFNUM
    ??? Unk
    ?? VAUX

    but when writing this output to a file, it is written as follows:

    ? ? ? ?Unk
    ? ?Unk
    ? ? ? ?Unk
    ? ? ? ?Unk
    ? ? ? ?Unk
    ? ?QFNUM
    ? ? ?Unk
    ? ?VAUX

    and because it is writing character-wise (not word-wise), when I use IndianCorpusReader on the following data:

    ? ? ? ?/PREP ? ?/PREP ? ? ? ?/VFM ? ? ? ?/NN ? ? ? ?/NNP ? ?/QFNUM ? ? ?/PRP ? ?/VAUX

    to find words and tagged sentences using the built-in functions, it gives no error but the wrong output, as follows:

    >>> reader.words()[0:10]
    ['?', ' ?', '?', ' ?/PREP', '?', ' ?/PREP', '?', ' ?', '?', ' ?/VFM', …]

    but we want the output in the following format:

    ['????', '??', '????', '????', …]

    What is the problem and how to debug it?

  • http://streamhacker.com/ Jacob Perkins

    That exception is an indication that you are not using dictionaries as featuresets, and are instead using strings.

  • http://streamhacker.com/ Jacob Perkins

    It looks like the tokenization is incorrect. NLTK has a bunch of different word tokenization options, many of which are demoed at http://text-processing.com/demo/tokenize/. I’d suggest trying your text there to see if any of them tokenize the words correctly.

  • Manjunath Nadagouda N

    Hi Jacob,
    please can you share which tagger and tokenizer you used in your nltk-demos for the Hindi language.

    I need this information for my research.
    Thanks in advance.

  • http://streamhacker.com/ Jacob Perkins

    The tokenizer is nltk.tokenize.wordpunct_tokenize, and the tagger was trained on the hindi.pos file in NLTK’s indian corpus, using train_tagger.py from https://github.com/japerk/nltk-trainer

  • Manjunath Nadagouda N

    Hi Jacob,

    I tried all the tagging methods that are specified in your cookbook. There was no problem in using them.

    I have read all the posts with respect to tagging.

    In my previous question, you said to use train_tagger.py for tagging.

    I have all the requirements you specified in the readme.rst file.

    But the problem is I am not able to use your train_tagger.py source file because it’s giving an error. Currently I am keeping the file in the /home/manjunath/nltk_data directory. Can you suggest where I am going wrong?

    Thanks in advance.

  • Manjunath Nadagouda N

    Hi Jacob,
    I want to use your train_tagger.py with an SVM classifier for training on Hindi data. How can I achieve this with train_tagger.py?
    Thanks

  • http://streamhacker.com/ Jacob Perkins

    scikit-learn provides a few SVM classifiers. If you have it installed, then when you run train_tagger.py --help, you’ll see a long list of available classifiers, including sklearn.LinearSVC.

  • http://streamhacker.com/ Jacob Perkins

    I have no idea what might be wrong without seeing an error report. https://github.com/japerk/nltk-trainer is the best place to report an issue.

  • neetika narang

    trainf = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos')
    ]
    test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos')
    ]

    cl = NaiveBayesClassifier(trainf) gives me the error:
    cl = NaiveBayesClassifier(trainf)
    TypeError: __init__() takes exactly 3 arguments (2 given)
    >>>

    How do I solve it? I have no idea, as I’m just a beginner.

  • http://streamhacker.com/ Jacob Perkins

    It’s NaiveBayesClassifier.train(trainf)

  • Mahesh Sreekumar

    Hi, very nice tutorial. I would really like to know the steps to perform sentiment analysis using a Naive Bayes classifier for a set of webpages (offline pages converted to text) which I give as input. (I need to compute line-by-line sentiment polarity.) Can that be done? Please help me out.

  • neetika narang

    Sir, after I wrote this: NaiveBayesClassifier.train(trainf)
    it gives me this error:
    in train
    for fname, fval in featureset.items():
    AttributeError: 'str' object has no attribute 'items'

  • http://streamhacker.com/ Jacob Perkins

    That’s because you’re using strings instead of feature dictionaries. trainf and test should have elements that look like this: ({"I": True, "feel": True, "amazing": True}, "pos")
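
    A minimal sketch of that conversion (assuming the word_feats function from the post above and NLTK’s wordpunct_tokenize; trainf is the list of (sentence, label) pairs from the earlier comment):

    import nltk
    from nltk.classify import NaiveBayesClassifier

    # turn each (sentence, label) pair into a (feature dict, label) pair
    train_feats = [(word_feats(nltk.wordpunct_tokenize(text)), label) for (text, label) in trainf]
    classifier = NaiveBayesClassifier.train(train_feats)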

  • neetika narang

    OK, thank you so much sir,
    and I want to ask one more thing.
    To calculate the accuracy of the Naive Bayes classifier, I wrote this code:

    Testtweet = 'Larry is my best friend'
    from nltk.classify.util import accuracy
    print(accuracy(classifier, Testtweet))

    and it gave me this error in accuracy:
    results = classifier.classify_many([fs for (fs,l) in gold])
    ValueError: need more than 1 value to unpack

    Where am I wrong? Can you help please?

  • http://streamhacker.com/ Jacob Perkins

    The accuracy function takes a list just like the list you pass into train(). It should be a list of (featureset, label) tuple pairs.
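
    The same conversion as above applies before calling accuracy, since it needs (featureset, label) pairs rather than a single string (a short sketch reusing word_feats and nltk.wordpunct_tokenize; test is the list from the earlier comment):

    from nltk.classify.util import accuracy

    # build gold-standard (featureset, label) pairs from the labeled sentences
    test_feats = [(word_feats(nltk.wordpunct_tokenize(text)), label) for (text, label) in test]
    print accuracy(classifier, test_feats)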

  • http://streamhacker.com/ Jacob Perkins

    webpage -> text -> [sentence]
    Then for each sentence, transform that into a feature dictionary using the word_feats function above. You can pass that into the classifier’s classify() function to get a sentiment label.
    To get the sentences & words, you’ll need to use sentence tokenization & word tokenization. nltk.sent_tokenize(text) and nltk.wordpunct_tokenize(sentence).
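
    A minimal sketch of that pipeline (assuming the trained classifier and the word_feats function from the post above; text stands in for the extracted webpage text):

    import nltk

    # webpage text -> sentences -> per-sentence sentiment label
    for sentence in nltk.sent_tokenize(text):
        feats = word_feats(nltk.wordpunct_tokenize(sentence))
        print sentence, '->', classifier.classify(feats)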

  • Mahesh Sreekumar

    Thank you sir. As I was discussing this with my friend, he was thinking about doing sentiment analysis on Google search results for a particular keyword. If that is the case, how should I design the feature dictionary? What will the training data and the classifier look like, and how will their tags (pos/neg) be assigned? Can that be done? I am a starter. Please help me.

  • http://streamhacker.com/ Jacob Perkins

    You will need to manually categorize search results to create your own training data. The easiest approach is to have a structured data set similar to NLTK’s movie_reviews corpus. Then you can train a classifier in a very similar manner to what I’ve shown above.

  • Mahesh Sreekumar

    Sorry for the late reply. Sir, I am working on Google search results, so I have to categorize them manually, right? Now I think in this scenario unsupervised learning will be best. Will it be? If so, which unsupervised learning method is best suited here?

  • http://streamhacker.com/ Jacob Perkins

    Take a look at topic modeling or LDA

  • Mahesh Sreekumar

    Thank you very much sir. Thanks for your reply. I will go through it first.

  • n.a.s

    Hi, I tried to follow the same approach here with a different dataset and different features. At the end I got an accuracy of 1. What might be wrong? Thanks

  • Thanh Tran

    Hi Jacob Perkins. Thanks for sharing this. I would like to build a sentiment classifier. Could you share the link to your corpus (training + testing data)? Thank you very much.

  • http://streamhacker.com/ Jacob Perkins

    The easiest thing to do is download the NLTK data & use the movie reviews corpus, as demonstrated above. Here are instructions on downloading the data: http://www.nltk.org/data.html
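
    For example, from a Python shell (assuming NLTK is already installed):

    import nltk
    # fetches the movie_reviews corpus into your nltk_data directory
    nltk.download('movie_reviews')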

  • Shashank Sharma

    Could you please explain the most informative features? How are we getting these values: magnificent = True pos : neg = 15.0 : 1.0

  • http://streamhacker.com/ Jacob Perkins

    It’s like odds in betting, so in the case of “magnificent = True”, there’s a 15:1 chance it’s positive.

  • soham

    Could you please elaborate, and explain the same from the concept of Naive Bayes? How are we getting these values?

  • erica

    Hi, I read your tutorial about the multi-label classifier for the Reuters corpus. Using your code, I want to predict the category of the latest Reuters news retrieved from the Reuters website. I changed my training and test set, and made a multi-classifier by following your steps. However, after the line
    >>> multi_classifier = MultiBinaryClassifier(*classifiers.items())
    I don’t know how to write a predict function for my new multi_test_feats.
    The composition of my multi_test_feats is (feats, labels), where labels are None in this case.

    Could you please teach me how to do the prediction for a new article, instead of doing evaluation using the test data with corresponding categories?

    Thank you so much!

  • http://streamhacker.com/ Jacob Perkins

    Every classifier has a classify() method that takes a single featureset, like feats. For the MultiBinaryClassifier, you’ll get back a set of labels.
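
    For instance (a two-line sketch, assuming multi_classifier and a feature dict feats built the same way as in the post above):

    labels = multi_classifier.classify(feats)
    print labels  # a set of category labels, possibly empty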

  • Mei Silviana Saputri

    Hi,
    I have a problem when I try to load my own corpus. The file structure of the corpus is pos_1.txt, pos_2.txt, … pos_n.txt and
    neg_1.txt, neg_2.txt, … neg_n.txt

    My code is like this (I followed the example in chapter 3 of your book):

    newcorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(root, r'(\w+)_[1-300].txt', cat_pattern=r'(\w+)_[1-300].txt')

    but when I call newcorpus.fileids('neg') and newcorpus.fileids('pos'), it only shows these files: neg_1.txt, neg_2.txt, neg_3.txt and pos_1.txt, pos_2.txt, pos_3.txt

    Can you help me decide the correct pattern to use in cat_pattern so all the files can be loaded?

    Thanks

  • http://streamhacker.com/ Jacob Perkins

    Instead of doing [1-300], try \d+
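
    For example, a sketch of the corrected reader (assuming root is the directory containing the pos_*.txt and neg_*.txt files):

    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # \d+ matches any number of digits, so neg_1.txt through neg_300.txt all load
    newcorpus = CategorizedPlaintextCorpusReader(root, r'(\w+)_\d+\.txt', cat_pattern=r'(\w+)_\d+\.txt')
    print len(newcorpus.fileids('neg')), len(newcorpus.fileids('pos'))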