StreamHacker Weotta be Hacking

10 May 2010

Text Classification for Sentiment Analysis – Naive Bayes Classifier

Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.

NLTK Naive Bayes Classification

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

Bag of Words Feature Extraction

All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:

def word_feats(words):
    return dict([(word, True) for word in words])
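As a quick illustration, here is what the extractor returns for a short word list. The sample tokens are made up for this sketch and are not taken from the corpus. Repeated words collapse into a single key, so only presence is recorded, not frequency.

```python
def word_feats(words):
    # Every word becomes a feature name whose value is True.
    return dict([(word, True) for word in words])

# Hypothetical sample tokens, not taken from the corpus.
sample = ['a', 'great', 'great', 'movie']
feats = word_feats(sample)
print(feats)
```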

Training Set vs Test Set and Accuracy

The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
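The split arithmetic can be sketched with stand-in data. The helper name split_three_quarters and the toy feature dicts are assumptions for illustration only; the point is that taking 3/4 of each category separately yields 750 + 750 training instances and 250 + 250 test instances.

```python
def split_three_quarters(items):
    # Integer division so the cutoff can be used as a slice index.
    cutoff = len(items) * 3 // 4
    return items[:cutoff], items[cutoff:]

# Stand-ins for the 1000 negative and 1000 positive labeled instances.
negfeats = [({'bad': True}, 'neg')] * 1000
posfeats = [({'good': True}, 'pos')] * 1000

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)
trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test
print(len(trainfeats), len(testfeats))
```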

Training and Testing the Naive Bayes Classifier

Here's the complete Python code for training and testing a Naive Bayes Classifier on the movie review corpus.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # Bag of words: every word becomes a feature name with the value True.
    return dict([(word, True) for word in words])

# File ids for each category of the movie reviews corpus.
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# One (feats, label) pair per review file.
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# Use 3/4 of each category for training. Integer division keeps the
# cutoffs usable as slice indices (plain / gives a float on Python 3).
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy: %s' % nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

And the output is:

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
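To make the accuracy figure concrete: 0.728 on 500 test instances corresponds to 364 correctly labeled reviews. The sketch below shows the kind of computation I understand nltk.classify.util.accuracy to perform, namely the fraction of test instances whose predicted label equals the gold label. manual_accuracy is a hypothetical helper name, and the classifier passed in only needs a classify(feats) method.

```python
def manual_accuracy(classifier, gold):
    # gold is a list of (feats, label) pairs, like testfeats in the article.
    correct = sum(1 for feats, label in gold
                  if classifier.classify(feats) == label)
    return correct / float(len(gold))

# 0.728 accuracy on 500 test instances means this many correct labels:
print(round(0.728 * 500))
```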

  • Pulkit

    Dear Sir,

    Is there a way that polarity can be determined on sense level?
    Thanking you

  • http://streamhacker.com/ Jacob Perkins

    Possibly, but it'd probably require a dictionary mapping sense to polarity, and of course you'd need to know the sense of the words first.

  • Pulkit

    Thanks for your reply,
    Like you said, one would need the “senses of the words first”. Would you have a classifier that you can post the code for? Or, as a second approach, instead of finding the senses, SentiWordNet could be used straight away, as its senses are already marked with polarity.

  • http://streamhacker.com/ Jacob Perkins

    I don't have any word sense disambiguation code, and I've never used SentiWordNet, so I'm afraid I can't help you there. But please let me know if you're able to figure something out and get it working.

  • Pulkit Kathuria

    Sure I am working on that these days

  • http://pulse.yahoo.com/_YU5KVC5PTJR22LOKDJIN32VMPU gunzip

    Very useful series. I'm wondering: can the naive Bayes classifier be trained with term weights (i.e. tf-idf), or must you use a boolean word model (presence or absence of the term)?

  • http://streamhacker.com/ Jacob Perkins

    The NLTK bayes classifier would not work with term weights, or at least it would consider each weight as a separate feature value. However, I believe there's an algorithm for multi-nomial naive bayes that could take into account frequencies or weights, if you want to implement it yourself. But research shows that simple word presence is about as good as a multi-nomial model.
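To illustrate the point about weights being treated as separate feature values, compare boolean features with count-valued ones. A classifier that handles values as discrete symbols would see ('great', 1) and ('great', 2) as unrelated observations rather than as points on a scale. count_feats is a hypothetical helper for this sketch, not part of NLTK.

```python
from collections import Counter

def bool_feats(words):
    # Presence only: the model used in the article.
    return dict((word, True) for word in words)

def count_feats(words):
    # Raw term counts; each distinct count is a distinct feature value.
    return dict(Counter(words))

words = ['great', 'great', 'plot']
print(bool_feats(words))
print(count_feats(words))
```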

  • Clement Levallois

    Hi,
    Thanks for these great tutorials – I bought the book to get more of it!

    I try to run the above code on my own corpus (called “wordlists”).
    This data is two sets of files, each in a separate folder. Files with a positive connotation have “pos” in their name, those with negative sentiment have “neg” in theirs.

    I import this data, following section 2.1 at http://nltk.googlecode.com/svn/trunk/doc/book/ch02.html

    Then, I run the code above but get into this:
    wordlists.fileids('neg')

    TypeError: fileids() takes exactly 1 argument (2 given)

    Beyond fixing this error, what would really help me would be to know where “movie_reviews” is located on my disk, so that I could exactly replicate its data structure (names of files and folders, etc.) – but with my own data. Could you just tell me where movie_reviews is installed?

  • http://streamhacker.com/ Jacob Perkins

    Hi Clement,

    If you’ve installed the nltk data, following the instructions at http://www.nltk.org/data, the movie_reviews will be in /usr/share/nltk_data/corpora on Linux/Unix, or C:\nltk_data\corpora on Windows. It’s organized with 2 subdirectories, one for pos, the other for neg.

    In your case, if you don’t change the filenames, you’ll have to tell the CategorizedPlaintextCorpusReader how to determine the categories from the filenames using the cat_pattern kwarg. See http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.api.CategorizedCorpusReader-class.html#__init__. Chapter 3 of my book should also have some examples. Then you can call fileids(categories=['neg']) to get the fileids for every file in the ‘neg’ category.
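A minimal sketch of the filename-to-category idea, using only the standard re module. It assumes, as I understand the reader to work, that cat_pattern is matched against each fileid and the first group becomes the category. The pattern and the file names below are hypothetical; adjust them to your own layout.

```python
import re

# Hypothetical pattern: capture 'pos' or 'neg' where it first appears in the fileid.
CAT_PATTERN = r'.*?(pos|neg)'

def category_for(fileid):
    # Return the captured category, or None when the name contains neither.
    match = re.match(CAT_PATTERN, fileid)
    return match.group(1) if match else None

for fileid in ['folder1/review_pos_001.txt', 'folder2/neg_042.txt']:
    print(fileid, category_for(fileid))
```

The same pattern string could then be passed as cat_pattern=CAT_PATTERN when constructing a CategorizedPlaintextCorpusReader. Note that a loose pattern like this one would also match unrelated names containing "pos" or "neg" (for example "position"), so a stricter pattern is safer on real data.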

  • http://www.clementlevallois.net Clement Levallois

    Thanks a lot!
    I have since figured out that I could already download chapter 3 for free, and it explains how this category-making works. Still working on it… such exciting stuff! Thanks for sharing!

  • http://www.clementlevallois.net Clement Levallois

    Hi Jacob again,

    Sorry to use this comment section as a forum, but I am stuck and I suppose it is very easy to solve. Once I have trained the classifier, I run it on my test corpus, and finally on my “real” uncategorized corpus, which consists of a list of documents (files). I’d like to display the results for each file: whether it has been classified as neg or pos (ideally, with a measure of confidence). For the above example on movie reviews, what would such code look like? Thx!

  • http://streamhacker.com/ Jacob Perkins

    It all depends on what you mean by “display”. If you just want to manually verify results, I think the easiest method is to output a csv with 2 columns: fileid, label. You could also output the probability of each label using the results from prob_classify(). Then you’d have 3 columns: fileid, neg, pos and each row would have a percent for each label.

  • http://www.clementlevallois.net Clement Levallois

    Exactly! And that is what I tried to do, but I just get error messages. With the example above on movie reviews, could you tell me which lines of code I should write to get a csv with these three columns? (But “print” on screen would be enough!)
    (It is surely so easy that it looks silly to ask, but for the uneducated users out there, once we have tutorials on training and evaluating classifiers, we just need this essential last extra bit to get running: the code to get the actual results on a corpus!)
    Thanks again for your patience! :-)

  • http://streamhacker.com/ Jacob Perkins

    To print out the label of each file, after you’ve trained the classifier, the code would be:

    for fileid in movie_reviews.fileids():
        feats = word_feats(movie_reviews.words(fileids=[fileid]))
        label = classifier.classify(feats)
        print fileid, label

    To create a csv, first you open a file & create a csv.writer, then replace the print line with writer.writerow([fileid, label]). Does that answer your question?
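The csv idea from this thread, including the per-label probabilities, could be sketched as follows. write_predictions is a hypothetical helper name; it assumes a trained classifier whose prob_classify() returns a distribution with max() and prob(label) methods, and it hard-codes the 'neg' and 'pos' labels used in the article.

```python
import csv

def write_predictions(classifier, docs, out_file):
    # docs: iterable of (fileid, feats) pairs; out_file: an open text file.
    writer = csv.writer(out_file)
    writer.writerow(['fileid', 'label', 'neg', 'pos'])
    for fileid, feats in docs:
        dist = classifier.prob_classify(feats)
        # dist.max() is the most likely label; prob() gives each label's probability.
        writer.writerow([fileid, dist.max(), dist.prob('neg'), dist.prob('pos')])
```

With the names from the article, docs could be built as (f, word_feats(movie_reviews.words(fileids=[f]))) for each f in movie_reviews.fileids(), and out_file could be a file opened for writing (on Python 3, opened with newline='' as the csv module recommends).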

  • http://www.clementlevallois.net Clement Levallois

    It does, thank you so much.

  • http://hq-recovery.com Dmitry Chaplinsky

    Funny that using only half of the dataset for training increases the accuracy to 0.811, and even 1/4 of the data produces a better result than 3/4.

  • http://streamhacker.com/ Jacob Perkins

    Without any pruning, adding more data increases noise, decreasing accuracy. But if you read the last article in the series, http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/, you’ll see that with good pruning the results get much better.

  • http://philgo20.com/ philgo20

    Do you have a clue on where to start with classifying into multiple categories?
    I mean not positive/negative classification, but one or more of 280 categories?

  • http://streamhacker.com/ Jacob Perkins

    One of the most common ways of doing it is to train 1 binary classifier for each category, with negative training examples coming from all other categories. Then you combine all the binary classifiers into a multi-label classifier. I cover this using the reuters corpus in my book, at the end of the Text Classification chapter. There’s also a bunch of research papers out there on multi-label classification.
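The one-binary-classifier-per-category approach described in this reply could be sketched like this. Both function names are hypothetical, and this is not the implementation from the book; it only shows how the per-category training sets and the combined prediction might be assembled. Each binary classifier is assumed to return True or False from classify().

```python
def binary_trainsets(labeled_feats):
    # labeled_feats: list of (feats, labels), where labels is a set of category names.
    categories = set()
    for feats, labels in labeled_feats:
        categories.update(labels)
    trainsets = {}
    for category in categories:
        # Positive when the document carries this category, negative otherwise.
        trainsets[category] = [(feats, category in labels)
                               for feats, labels in labeled_feats]
    return trainsets

def multi_label_classify(classifiers, feats):
    # classifiers: dict mapping category -> trained binary classifier.
    return set(cat for cat, clf in classifiers.items() if clf.classify(feats))
```

Each list in the returned dict has the same [(feats, label)] shape as trainfeats in the article, so it could be handed to a classifier's train method, one category at a time.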

  • Alex

    Where can I get the movie reviews corpus that has 1000 positive files and 1000 negative files? I have a naive Bayes classifier and would like to use this for training.

  • http://streamhacker.com/ Jacob Perkins

    The movie_reviews corpus is included with NLTK, but you can also get it from http://www.cs.cornell.edu/people/pabo/movie-review-data/

  • http://twitter.com/aditya_herlamba Aditya Herlambang

    Pardon my stupid question, but does the bag of words model mean the same thing as using boolean word feature extraction?

  • http://streamhacker.com/ Jacob Perkins

    I haven’t encountered that term, but it sounds like the same thing, where you just store True for every word found.

  • Kellegher

    Hi Jacob, I’ve got a corpus of 1000 negative files and 850 positive files. I want my training set and test set to be the same length for each category. Is there a way of randomly selecting an amount of negative files, say 650, for training purposes and 200 for testing as opposed to manually taking out the excess negative files? Want my analysis to be completely unbiased

  • http://streamhacker.com/ Jacob Perkins

    What you could do is limit the fileids used. For example, in my code above, instead of negids = movie_reviews.fileids('neg') you could do negids = movie_reviews.fileids('neg')[:850]. Then both pos & neg have an equal number of files, and you can divide them up for training & testing however you want.
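The slice suggested in this reply always keeps the first 850 files. If a random selection is wanted, as the question asks, one option is to downsample the larger category with the standard random module. balanced_ids is a hypothetical helper and the fileids are stand-ins; the fixed seed is my own choice so that the same files are picked on every run.

```python
import random

def balanced_ids(negids, posids, seed=0):
    # Downsample the larger category at random so both end up the same size.
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    size = min(len(negids), len(posids))
    return rng.sample(list(negids), size), rng.sample(list(posids), size)

# Stand-in fileids: 1000 negative and 850 positive, as in the question.
negids = ['neg/%d.txt' % i for i in range(1000)]
posids = ['pos/%d.txt' % i for i in range(850)]
neg_sel, pos_sel = balanced_ids(negids, posids)
print(len(neg_sel), len(pos_sel))
```

The two selected lists can then be cut into training and test portions, for example 650 and 200 files per category.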

  • Kellegher

    thanks for your help Jacob

  • Heroes del Silencio

    Hi Jacob, I'm trying to do sentiment analysis on tweets, and I want to use your code. I’m using the tweepy API like this:

    api = tweepy.API()
    busqueda = api.search("movie", 'en', '', '50')

    That gets the tweets. Then I want to know whether each one is positive or negative, in the following way:

    for tweet in busqueda:
        twe = tweet.text.encode('utf-8')
        feats = word_feats(twe.words())
        label = classifier.classify(feats)
        print twe, label

    using your code, but I got this error:

    Traceback (most recent call last):
      File "cl.py", line 27, in <module>
        feats = word_feats(twe.words())
    AttributeError: 'str' object has no attribute 'words'

    Could you help, please? Or is there a better way? Thanks.

  • http://streamhacker.com/ Jacob Perkins

    You will need to tokenize the words in the tweet. The easiest way to do it is something like this: word_feats(nltk.tokenize.word_tokenize(twe))

    You will have to import nltk.tokenize first. Also take a look at http://text-processing.com/demo/tokenize/ for more tokenization options.
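A sketch of turning a raw tweet string into a feature dict. To keep it free of extra downloads, it swaps in a plain regular-expression tokenizer (runs of word characters, or runs of punctuation) in place of nltk.tokenize.word_tokenize, so the tokens will differ somewhat from what the reply recommends. simple_tokenize is a hypothetical helper name, and the lowercasing is my own addition, on the assumption that the training corpus is lowercase.

```python
import re

def word_feats(words):
    return dict([(word, True) for word in words])

def simple_tokenize(text):
    # Split into runs of word characters or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text.lower())

tweet = "I love this movie. It is awesome!"
feats = word_feats(simple_tokenize(tweet))
print(sorted(feats))
```

The resulting dict has the same {word: True} shape the classifier was trained on, so it can be passed straight to classifier.classify(feats).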

  • Heroes del Silencio

    Thanks for your answer. I have another question: in your code, do you know how I can eliminate the less important words and keep only the most relevant ones? Thanks, hopefully you can help me.

  • http://streamhacker.com/ Jacob Perkins

    Please see my followup article on this topic: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/

    However, the words that are significant for movie reviews are likely to be different than words that are significant for tweets, so for best results, you will probably want your own corpus to learn from.

  • Ritvik Mathur

    Hi, Nice Explanation! I am working on a similar project and wanted to know if there is a way to save the trained model somehow and then be able to use/reload it later to classify news data that I input? Because right now every time I run the script it takes a long time to train the classifier since the training set is huge (300K samples).

  • http://streamhacker.com/ Jacob Perkins

    Yes, just pickle the trained classifier to a file, then reload/unpickle later. If you store the classifier in a nltk_data directory, you can also use nltk.data.load to load & unpickle the classifier.
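Pickling as described in this reply could look like the sketch below, using only the standard pickle module. The helper names and the file path are hypothetical. As with any pickle file, only load files you created yourself or otherwise trust.

```python
import pickle

def save_classifier(classifier, path):
    # Serialize the trained classifier to disk in binary mode.
    with open(path, 'wb') as f:
        pickle.dump(classifier, f)

def load_classifier(path):
    # Restore a previously saved classifier without retraining.
    with open(path, 'rb') as f:
        return pickle.load(f)
```

A typical flow would be to call save_classifier(classifier, 'sentiment.pickle') once after training, then load_classifier('sentiment.pickle') at the start of later runs.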

  • Spin_maker

    Hi,
    The explanation is worth seeing. Can you shed some light on how we can implement car evaluation using the naive Bayes classification algorithm?

  • http://streamhacker.com/ Jacob Perkins

    It all depends on what you want to classify. But whatever it is, you need a training corpus, ideally structured similarly to the movie_reviews corpus. Once you’ve got that, you can train a classifier in a very similar way.

  • http://twitter.com/dieselboris jorrit

    Hi, I am about to use your API for my thesis. Could you perhaps point me to a published (scientific) article which explains the underlying model of your implementation?

    Also: would you perhaps know about an API which returns sentiment as valence (pos vs neg) and arousal (calm vs activated)? I would like to do a comparison between the two types!

    Thanks a lot, and for all your work on NLTK!

  • http://streamhacker.com/ Jacob Perkins

    The underlying model is an ensemble of binary NaiveBayes and MaximumEntropy classifiers, set up in a hierarchy of neutral-polar, then pos-neg. I’m sure there are papers on those topics, but there’s nothing specific I referenced to create the API.

    I don’t know of any API that does arousal in addition to valence. If you find one, please let me know.

  • Hiral

    Hi, I am working on the same positive/negative sentiment analysis. I want to use
    naive Bayes, but I don't know the technique. Please guide me from the basics. Thank you.

  • Abhisek052

    Hi, how can we implement other machine learning techniques, such as SVMs or decision trees, for sentiment analysis?

  • http://streamhacker.com/ Jacob Perkins

    NLTK has had decision tree support for quite a while, and the latest release includes an SVM classifier. Both can be used similarly to the NaiveBayesClassifier.

  • Joe C

    HI Jacob, 

    I was wondering if you could clarify some stuff here. If I had a sentence: 
    pos_sent = 'I love pizza. It is awesome.'

    and I tokenize:

    tolk_posset = word_tokenize(pos_sent) #Change to tok to rem punc

    Where in this code do I feed the tokenized sentence so that I can estimate the sentiment of each token, and where do I get the result?

    I messed around with replacing your “testfeats” with my tolk_posset but I can’t seem to get it to work. Any clarification would be awesome. 

    BTW, your site is the shit. 

    - Joe

  • http://streamhacker.com/ Jacob Perkins

    Thanks Joe,

    You should call word_feats(tolk_posset) which will transform a list of words into a dict that looks like {word: True}

    Then, you can pass that dict to the classify() method of a trained classifier to get the sentiment.

  • Joe C

    Thanks! Got it working.

  • Bill

    Hey Jacob,

    Great write up, but I had a quick question.

    What is the best way to display the neutrality and polarity of a test case?

    Thanks a lot!

  • http://streamhacker.com/ Jacob Perkins

    I’d recommend doing prob_classify(), which gives a ProbDist with the confidence/probabilities of each class, then using that to show how confident your system is for each label.

  • anonymous

    def word_feats(words): return dict([(word, True) for word in words])

    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]

    Can anyone explain what exactly he is doing in this piece of code? As I understand it, one dictionary is created for negative words and another for positive words. Can you highlight why you are creating dictionaries of negative and positive words?

  • http://streamhacker.com/ Jacob Perkins

    Quoting from the article, “All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we’ll use a simplified bag of words model where every word is a feature name with a value of True.”
