Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it’s often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I’ll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True. Here’s the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
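
As a quick illustration (the sample sentence is made up, and a plain `str.split()` stands in for a real tokenizer), here is what the function returns for a short list of words. Repeated words collapse into a single key, which is what makes this a boolean bag of words rather than a count-based one.

```python
def word_feats(words):
    # Map every distinct word to True; duplicates collapse into one key.
    return {word: True for word in words}

# Hypothetical input: whitespace splitting is only a stand-in for tokenization.
words = "a truly magnificent film , truly".split()
feats = word_feats(words)
print(feats)
```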

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We’ll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of ‘pos’ or ‘neg’. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
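
The split arithmetic can be sketched on its own, without loading the corpus. The placeholder featuresets below are invented purely to get lists of the right length; integer division (`//`) keeps each cutoff usable as a slice index.

```python
# Placeholder labeled featuresets: 1000 per category, mirroring the corpus sizes.
negfeats = [({"placeholder": True}, "neg")] * 1000
posfeats = [({"placeholder": True}, "pos")] * 1000

# First 3/4 of each category for training, the remaining 1/4 for testing.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print(len(trainfeats), len(testfeats))
```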

Training and Testing the Naive Bayes Classifier

Here’s the complete Python code for training and testing a Naive Bayes Classifier on the movie review corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Boolean bag of words: every word present maps to True.
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (featureset, label) pair per review file.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Integer division so the cutoffs can be used as slice indices.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are “vulnerable” and “avoids”. Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we’re able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.

  • Vandana

    When I am tagging hindi text file then i am getting correct output as below:

    ???? Unk
    ?? Unk
    ???? Unk
    ???? Unk
    ???? Unk
    ?? QFNUM
    ??? Unk
    ?? VAUX

    but when writing this output to the file, it is writing as follows:

    ? ? ? ?Unk
    ? ?Unk
    ? ? ? ?Unk
    ? ? ? ?Unk
    ? ? ? ?Unk
    ? ?QFNUM
    ? ? ?Unk
    ? ?VAUX

    and because it is writing character wise (not word wise), when I am using IndianCorpusReader on the following data:

    ? ? ? ?/PREP ? ?/PREP ? ? ? ?/VFM ? ? ? ?/NN ? ? ? ?/NNP ? ?/QFNUM ? ? ?/PRP ? ?/VAUX

    to find words and tagged sentences by using built in function then it gives no error but wrong output as follows:

    >>> reader.words()[0:10]
    [‘?’,’ ?’,’?’,’ ?/PREP’,’?’,’ ?/PREP’,’?’,’ ?’,’?’,’ ?/VFM’…]

    but we want the output in the following format:


    What is the problem and how to debug it?

  • That exception is an indication that you are not using dictionaries as featuresets, and are instead using strings.

  • It looks like the tokenization is incorrect. NLTK has a bunch of different word tokenization options, many of which are demoed at http://text-processing.com/demo/tokenize/. I’d suggest trying your text there to see if any of them tokenize the words correctly.

  • Manjunath Nadagouda N

    hi Jacob,
    please can you share with me that which tagger and tokenizer you have used in your nltk-demos for hindi language.

    i need this information for my research purpose.
    thanks in advance.

  • The tokenizer is nltk.tokenize.wordpunct_tokenize and the tagger I trained on the hindi.pos file in NLTK’s indian corpus, using train_tagger.py from https://github.com/japerk/nltk-trainer

  • Manjunath Nadagouda N

    hi Jacob,

    i tried all the tagging methods which are specified in your cookbook. there was no problem in using them.

    i have read all the post with respect to tagging.

    in my previous question as you said to use train_tagger.py for tagging.

    i have all the requirements as you specified in readme.rst file.

    but the problem is i am not able to use your train_tagger.py source file because its giving error. currently i am keeping the file in /home/manjunath/nltk_data directory. can you suggest me where i am going wrong??

    thanks in advance.

  • Manjunath Nadagouda N

    hi Jacob
    I want to use your train_tagger.py for svm classifier for training on Hindi data. how can i achieve with your train_tagger.py.

  • scikit-learn provides a few SVM classifiers, and if you have it installed, then if you run train_tagger.py --help, you’ll see a long list of available classifiers, including sklearn.LinearSVC.

  • I have no idea what might be wrong without seeing an error report. https://github.com/japerk/nltk-trainer is the best place to report an issue.

  • neetika narang

    trainf = [
    (‘I love this sandwich.’, ‘pos’),
    (‘This is an amazing place!’, ‘pos’),
    (‘I feel very good about these beers.’, ‘pos’),
    (‘This is my best work.’, ‘pos’)
    ]
    test = [
    (‘The beer was good.’, ‘pos’),
    (‘I do not enjoy my job’, ‘neg’),
    (“I ain’t feeling dandy today.”, ‘neg’),
    (“I feel amazing!”, ‘pos’)
    ]

    c1=NaiveBayesClassifier(trainf) gives me the error– cl = NaiveBayesClassifier(trainf)
    TypeError: __init__() takes exactly 3 arguments (2 given)

    how do i solve it. i have no idea as m just a beginner

  • It’s NaiveBayesClassifier.train(trainf)

  • Mahesh Sreekumar

    Hi, very nice tutorial. I would really like to know what the steps are to perform sentiment analysis using a naive bayes classifier for a set of webpages (offline pages converted to text) which I give as input. (I need to perform line by line sentiment polarity.) Can that be done? Please help me out.

  • neetika narang

    Sir after i wrote this – NaiveBayesClassifier.train(trainf)
    it gives me this error—
    in train
    for fname, fval in featureset.items():
    AttributeError: ‘str’ object has no attribute ‘items’

  • That’s because you’re using strings instead of feature dictionaries. trainf and test should have elements that look like this: ({“I”: True, “feel”: True, “amazing”: True}, “pos”)
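
    A minimal sketch of that conversion, assuming simple lowercased whitespace splitting (an NLTK tokenizer would handle punctuation better) and a hypothetical helper name `to_labeled_feats`:

```python
def word_feats(words):
    return {word: True for word in words}

def to_labeled_feats(labeled_sentences):
    # Turn (sentence, label) pairs into (feature dict, label) pairs.
    return [(word_feats(sentence.lower().split()), label)
            for sentence, label in labeled_sentences]

# Two of the sentences from the comment above.
raw_train = [
    ("I love this sandwich.", "pos"),
    ("I do not enjoy my job", "neg"),
]
trainf = to_labeled_feats(raw_train)
print(trainf[0])
```

    The resulting list has the shape that NaiveBayesClassifier.train() expects.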

  • neetika narang

    ok thanku so much sir,,
    and i want to ask one more thing..
    To calculate the accuracy of naive .. i wrote this code

    Testtweet = ‘Larry is my best friend’
    from nltk.classify.util import accuracy

    and it gave me error as in accuracy
    results = classifier.classify_many([fs for (fs,l) in gold])
    ValueError: need more than 1 value to unpack

    where am i wrong can u help please …

  • The accuracy function takes a list just like the list you pass into train(). It should be a list of tuple pairs of (featureset, label).
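
    In other words, accuracy is the fraction of gold-labeled featuresets the classifier gets right. The sketch below re-implements that idea in plain Python for illustration only; `accuracy_sketch` and the keyword-based `toy_classify` are made-up stand-ins, not NLTK’s code or a trained model.

```python
def accuracy_sketch(classify, gold):
    # gold is a list of (featureset, label) pairs, like the list given to train().
    correct = sum(1 for feats, label in gold if classify(feats) == label)
    return correct / len(gold)

def toy_classify(feats):
    # Stand-in for classifier.classify(): 'pos' if a positive cue word is present.
    return "pos" if {"best", "amazing"} & set(feats) else "neg"

gold = [
    ({"larry": True, "is": True, "my": True, "best": True, "friend": True}, "pos"),
    ({"i": True, "feel": True, "amazing": True}, "pos"),
    ({"i": True, "do": True, "not": True, "enjoy": True, "my": True, "job": True}, "neg"),
    ({"the": True, "beer": True, "was": True, "good": True}, "pos"),
]
print(accuracy_sketch(toy_classify, gold))
```

    A single string such as Testtweet cannot be unpacked into (featureset, label) pairs, which is what the ValueError is reporting.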

  • webpage -> text -> [sentence]
    Then for each sentence, transform that into a feature dictionary using the word_feats function above. You can pass that into the classifier’s classify() function to get a sentiment label.
    To get the sentences & words, you’ll need to use sentence tokenization & word tokenization. nltk.sent_tokenize(text) and nltk.wordpunct_tokenize(sentence).
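
    A rough sketch of that pipeline, with two regular expressions standing in for the NLTK tokenizers so it runs without any downloads, and a hypothetical `toy_classify` in place of a trained classifier’s classify() method:

```python
import re

def word_feats(words):
    return {word: True for word in words}

def simple_sent_tokenize(text):
    # Crude stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # Stand-in for nltk.wordpunct_tokenize: runs of word characters or punctuation.
    return re.findall(r"\w+|[^\w\s]+", sentence)

def classify_sentences(text, classify):
    # Return a (sentence, label) pair for every sentence in the text.
    return [(s, classify(word_feats(simple_word_tokenize(s))))
            for s in simple_sent_tokenize(text)]

def toy_classify(feats):
    # Hypothetical rule-based stand-in for classifier.classify(feats).
    return "neg" if "ludicrous" in feats else "pos"

text = "The cast is outstanding. The plot is ludicrous!"
for sentence, label in classify_sentences(text, toy_classify):
    print(label, sentence)
```

    With NLTK installed, the two stand-in tokenizers can be swapped for nltk.sent_tokenize and nltk.wordpunct_tokenize, and toy_classify for classifier.classify.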

  • Mahesh Sreekumar

    Thank you sir. As I was discussing this with my friend, he was talking about finding sentiment for Google search results for a particular keyword. If that is the case, how should I design the feature dictionary? What would the training data and the classifier look like, and what would their tags (pos/neg) be? Can that be done? I am a starter. Please help me.

  • You will need to manually categorize search results to create your own training data. The easiest is to have a structured data set similar to NLTK’s movie_reviews corpus. Then you can train a classifier in a very similar manner to what I’ve shown above.

  • Mahesh Sreekumar

    Sorry for the late reply. Sir, I am working on Google search results, so I have to manually categorize them, right? Now I think that in this scenario unsupervised learning would be best. Would it? If so, which unsupervised learning method would be best suited here?

  • Take a look at topic modeling or LDA

  • Mahesh Sreekumar

    Thank you very much sir. Thanks for your reply. I will go through it first.

  • n.a.s

    Hi, I tried to follow the same approach here with a different dataset and different features. At the end I got accuracy = 1. What might be wrong? Thanks

  • Thanh Tran

    Hi Jacob Perkins. Thanks for sharing this, I would like to build a Sentiment Classifier. Could you share me the link to your corpus (training + testing data). Thank you very much

  • The easiest thing to do is download the NLTK data & use the movie reviews corpus, as demonstrated above. Here are instructions on downloading the data: http://www.nltk.org/data.html

  • Shashank Sharma

    could u please explain me the most descriptive features, how are we getting these values magnificent = True pos : neg = 15.0 : 1.0

  • It’s like odds in betting: in the case of “magnificent = True”, the word is 15 times more likely to show up in a positive review than in a negative one.
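
    More precisely, each ratio compares how likely the feature is under one label versus the other, P(word | pos) / P(word | neg), estimated from training counts. The sketch below uses invented counts (not the real corpus counts) and assumes add-0.5 smoothing over two possible feature values (present or absent). As far as I know that matches NLTK’s default estimator, but treat it as an assumption.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    # Smoothed estimate of P(feature=True | label); gamma and bins are assumptions.
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: the word appears in 22 of 750 positive training reviews
# and in 1 of 750 negative training reviews.
p_pos = smoothed_prob(22, 750)
p_neg = smoothed_prob(1, 750)
ratio = p_pos / p_neg
print("pos : neg = %.1f : 1.0" % ratio)
```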

  • soham

    could please elaborate, and explain the same from the concept of naive bayes- how are we getting these values

  • erica

    Hi, I read your tutorial about the multi classifier for the Reuters corpus. By using your code, I want to predict the category of the latest Reuters news retrieved from the Reuters website. I changed my training and test set, and made a multi-classifier by following your steps. However, after the line
    >>> multi_classifier = MultiBinaryClassifier(*classifiers.items())
    I don’t know how to write a predict function for my new multi_test_feats.
    The composition of my multi_test_feats is (feats, labels), where labels are None in this case.

    Could you please teach me how to do the prediction for a new article instead of doing evaluation by using the test data with corresponding categories?

    Thank you so much!

  • Every classifier has a classify() method that takes a single featureset, like feats. For the MultiBinaryClassifier, you’ll get back a set of labels.

  • Mei Silviana Saputri

    I have a problem when I tried to load my own corpus. The structure file of corpus is pos_1.txt, pos_2.txt,…pos_n.txt

    my code is like this (I followed the example on chapter 3 on your book):

    newcorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(root, r'(\w+)_[1-300].txt', cat_pattern=r'(\w+)_[1-300].txt')

    but when I called newcorpus.fileids(‘neg’) and newcorpus.fileids(‘pos’), it only shows these files: neg_1.txt, neg_2.txt, neg_3.txt and pos_1.txt, pos_2.txt, pos_3.txt

    Can you help me to decide the correct index number should be use in cat_pattern so all files can be loaded?


  • Instead of doing [1-300], try \d+
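
    The reason is that [1-300] is a character class that matches a single character (one of 0, 1, 2 or 3), not the numbers 1 through 300. A quick check with Python’s re module, using made-up file names and an escaped dot before txt:

```python
import re

broken = re.compile(r"(\w+)_[1-300]\.txt$")
fixed = re.compile(r"(\w+)_\d+\.txt$")

# Hypothetical file names in the style described above.
names = ["pos_1.txt", "pos_3.txt", "pos_4.txt", "neg_27.txt", "neg_300.txt"]

print([n for n in names if broken.match(n)])   # only single digits 0-3 match
print([n for n in names if fixed.match(n)])    # any run of digits matches
print(fixed.match("neg_27.txt").group(1))      # the captured category name
```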

  • keevee09

    Thanks Jacob for a very good article (‘pos’)
    I have a naive question: given a sample series of news articles, published over a period of years from one media source, would nltk provide the ability to assess the rate of “dumbing down” (or the opposite or no change!) of a sample article with respect to time. This is a common complaint about media in general and it would be worthwhile having some verifiable method of deciding the truth to this statement.

  • What you’re looking for is some kind of reading-level metric. NLTK could help here – you analyze grammar patterns using part-of-speech tags, count the number of words in a sentence, and maybe figure out which words are more common/complicated using wordnet. I’ve seen studies like this before, so I’d recommend doing some research on evaluating reading-level complexity.
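
    As a very rough starting point, and only a sketch rather than an established readability formula, two of the simplest signals are average sentence length and average word length. The regex tokenization and the function name `simple_complexity` are assumptions made for illustration.

```python
import re

def simple_complexity(text):
    # Split into sentences after ., ! or ?, then pull out alphabetic words.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "words_per_sentence": len(words) / len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words),
    }

sample = "The cat sat. It was a remarkably complicated situation."
print(simple_complexity(sample))
```

    Computing these numbers per article and comparing them across publication dates would give a first look at any trend.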

  • keevee09

    Thanks Jacob, wordnet is new to me as I read my way through the nltk book and test the examples given. After a day’s online research I see I have a lot to go over.
    This site:
    should give me a quick comparison between existing tests.

  • fievelk

    Hi! I’ve got a doubt about your feature extraction process.
    It seems that you extract all word features from all the documents (so you turn each document into a feature vector) and then you split these feature vectors into a train and a test set. Isn’t this unusual? If I’m not mistaken we usually only extract features from the train set, and we have to pretend that we never saw the test set before.

    For example, if the word “Chewbacca” never appeared in the train set we won’t have its feature {…, Chewbacca: True, …} in the feature representation of a new document. Am I wrong?
    Wouldn’t it be better to split TRAIN/TEST sets _before_ extracting features, and only extract features from the train set?

  • That’s not really a problem here, because the test set isn’t polluting the train set. It is a problem when it comes to calculating information gain, which someone rightly pointed out in the comments at http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/

  • fievelk

    Yes, probably it doesn’t make a big difference in this case. I think it becomes more relevant when we implement a complex modular system that we can feed with different types of features.
    However, here is the link of the comment you were talking about: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/#comment-165561803

    Thank you 🙂

  • Anjum Akseer

    hy,we want to develop an algo through which we can sentiment of articles?can u help us??

  • Shubham Ringne

    hello sir, can you please share the code for implementing a simple one sentence sentiment classifier

  • That’s pretty much what this post is about. Have you tried using the above code?

  • Sami Nizam Afridi

    Hi Jacob, I used Stanford dependency parser to get Nouns and adjective modifiers or adverbs modifier and verbs, for example My university is great, i got {university, Great} now i want to use these adjectives and adverbs to calculate sentiment polarity for those nouns, can you explain how can i do this ?

  • Hi Sami, a simple way to do it is to find a sentiment wordlist. Something that will have values like “great,1” or “bad,-1”. Then you can lookup the adjective/adverb in the wordlist. Alternatively, you could train a classifier, as I explain in this article, and then pass in the adjective/adverb as a featureset (like {“great”: True}).
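
    The wordlist approach can be sketched in a few lines. The tiny lexicon and the `noun_polarity` helper are invented for illustration; a real sentiment wordlist would be much larger.

```python
# Hypothetical mini-lexicon: word -> polarity score.
SENTIMENT_WORDS = {"great": 1, "good": 1, "bad": -1, "terrible": -1}

def noun_polarity(pairs, lexicon=SENTIMENT_WORDS):
    # pairs: (noun, modifier) tuples, e.g. taken from a dependency parse.
    scores = {}
    for noun, modifier in pairs:
        # Unknown modifiers contribute 0 (neutral).
        scores[noun] = scores.get(noun, 0) + lexicon.get(modifier.lower(), 0)
    return scores

pairs = [("university", "Great"), ("food", "bad"), ("food", "terrible")]
print(noun_polarity(pairs))
```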

  • Srikanth Poolla

    Hi , I am actually a newbie and to understand what your code does, i tried executing your code but it gives me the following error :
    at this step : classifier = NaiveBayesClassifier.train(trainfeats)

    Traceback (most recent call last):
    File “E:\btech project\test.py”, line 18, in <module>
    classifier = NaiveBayesClassifier.train(trainfeats)
    File “C:\Python34\lib\site-packages\nltk\classify\naivebayes.py”, line 196, in train
    for fname, fval in featureset.items():
    AttributeError: ‘tuple’ object has no attribute ‘items’

    suggest any changes to be made .

  • That error indicates your featureset is not a dictionary, but a tuple. If you look at the code above, you can see that the word_feats function returns a dictionary, and the trainfeats list should be a list of (dictionary, label) pairs.