Text Classification for Sentiment Analysis – Stopwords and Collocations

Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I’ll be evaluating two modifications of the word_feats feature extraction method:

  1. filter out stopwords
  2. include bigram collocations

To do this effectively, we’ll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we’ll use these features to train a Naive Bayes Classifier.

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	negcutoff = len(negfeats)*3/4
	poscutoff = len(posfeats)*3/4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
	classifier.show_most_informative_features()

Baseline Bag of Words Feature Extraction

Here’s the baseline feature extractor for bag of words feature selection.

def word_feats(words):
	return dict([(word, True) for word in words])
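To see what this produces, here's a quick example with made-up tokens (just a sketch, not part of the evaluation code):

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Every token becomes a feature set to True; duplicate tokens collapse
# into a single key.
feats = word_feats(['a', 'truly', 'magnificent', 'film'])
print(feats)
# → {'a': True, 'truly': True, 'magnificent': True, 'film': True}
```

This is the "bag of words" model in its simplest form: word presence only, with frequency and order thrown away.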


The results are the same as in the previous articles, but I’ve included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

Stopword Filtering

Stopwords are words that are generally considered useless. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let’s see what happens when we filter out these words.

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
	return dict([(word, True) for word in words if word not in stopset])
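For illustration, here's the same filter with a tiny hardcoded stopset standing in for NLTK's English list (a sketch, so it runs without the corpus download):

```python
# A few words standing in for the full NLTK English stopword list.
stopset = set(['the', 'a', 'is', 'of'])

def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

# Stopwords disappear from the feature dict; content words remain.
print(stopword_filtered_word_feats(['the', 'plot', 'is', 'a', 'mess']))
# → {'plot': True, 'mess': True}
```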


And the results for a stopword filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

Accuracy went down 0.2%, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.

Bigram Collocations

As mentioned at the end of the article on precision and recall, it’s possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like “not great”, which is a negative expression that the bag of words model could interpret as positive since it sees “great” as a separate word.

To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the association between 2 words: roughly, whether the bigram occurs more often than the individual word frequencies would predict by chance.

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
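To make the chi-square scoring more concrete, here's a stdlib-only sketch of the standard 2x2 chi-square statistic that BigramAssocMeasures.chi_sq is based on, with made-up counts (the n_ii/n_ix/n_xi/n_xx names follow NLTK's convention):

```python
def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Pearson's chi-square score for a bigram (w1, w2).

    n_ii: count of the bigram (w1, w2)
    n_ix: count of w1 as the first word of any bigram
    n_xi: count of w2 as the second word of any bigram
    n_xx: total number of bigrams
    """
    # Fill out the rest of the 2x2 contingency table.
    n_io = n_ix - n_ii                 # w1 followed by a word other than w2
    n_oi = n_xi - n_ii                 # w2 preceded by a word other than w1
    n_oo = n_xx - n_ii - n_io - n_oi   # bigrams involving neither
    num = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    den = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return float(num) / den

# A pair that almost always occurs together scores very high...
print(chi_sq(8, 10, 9, 10000))   # roughly 7100: a strong collocation

# ...while a pair that co-occurs exactly as often as chance predicts scores 0.
print(chi_sq(1, 100, 100, 10000))
# → 0.0
```

The nbest(score_fn, n) call simply ranks all bigrams by this score and keeps the top n.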


After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

Yes, you read that right: Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:

  • accuracy is up almost 9%
  • pos precision has increased over 10% with only 4% drop in recall
  • neg recall has increased over 21% with just under 4% drop in precision

So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it’s significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I’ll try trimming down the single word features to only include significant words.

  • Because your negative class has so few examples, the significant words in that class would naturally convey the most information towards choosing that class.

    Also, you may want to re-evaluate your metrics, given that the positive class has so many more examples than the other classes. For example, if I made a classifier that simply chose the positive class, it would be ~95% accurate.

  • corn

    Ah…that makes sense actually. Thanks.

  • quad

    Hey, the size of our handcrafted feature vector is (100,5). The output of TfidfVectorizer weighting is (100,12000), and then we are using hstack to combine these two arrays. There is an error which says "expected 109123 features but only 12000 found". Will you be able to help us with this? Our code is like this –
    y = sparse.hstack(fe).tocsr()

    where fv is the handcrafted feature vector and fe=[ ]
    and all_words is an array containing all the words in our corpus.

  • I can’t help you with that, it’s an hstack-specific error I’ve never seen before.

  • mesh

    I want to find most co-occurring pairs of words frequency and their probability distributions and measure the distance among probability distributions. can you please help me to do it in python?

  • The code in this article demonstrates how to find collocations, which are co-occurring word pairs. You can also see the NLTK reference on collocations at http://www.nltk.org/api/nltk.html#module-nltk.collocations

  • mesh

    Thank you for your response and information.

  • Arul Verman

    Hello, nice reading your article. We have a corpus of about 40,000 files which has been put into POS, NEG and NEU classes. However, NEG suffers from very little data to train on, only about 3,000 files. We corrected the class imbalance by under-sampling the POS and NEU classes, which have more files than the NEG class, but that led to a decrease in accuracy. The classification has taken into account high information words and bag of words with maximum voting, utilizing Maxent IIS, Naive Bayes and Decision Tree, yet the accuracy is only about 0.60. The files contain raw tweets in the French language.

    I’ve tried NB Multinomial and Linear SVC; the accuracy is nearly the same, i.e. 0.60.

    What’s your suggestion on improving the accuracy?

    Arul Verman

  • Hi Arul,

    My first thought would be to examine the data. Are you sure that every tweet has been accurately classified? One thing you can do is train a classifier that reports probabilities, like NaiveBayes, then find the tweets where the classifier is very confident that the tweet’s class is different than its original class. It could be the classifier is correct, and the tweet is misclassified.

    And maybe the easiest thing to do is just get more NEG examples.
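
    Sketching that confidence check with made-up predictions (the probabilities stand in for what a probabilistic classifier like NaiveBayesClassifier would report via prob_classify):

```python
# (text, original_label, predicted_label, confidence) -- made-up output
# from a probabilistic classifier.
predictions = [
    ('superbe film', 'NEG', 'POS', 0.97),
    ('tres decevant', 'NEG', 'NEG', 0.88),
    ('pas mal', 'POS', 'NEG', 0.55),
]

# Flag tweets where the classifier confidently disagrees with the original
# label: these are the first candidates for manual re-labelling.
suspect = [(text, orig, pred) for text, orig, pred, conf in predictions
           if pred != orig and conf > 0.9]
print(suspect)
# → [('superbe film', 'NEG', 'POS')]
```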

  • Wilian

    Hi, Jacob! I’m doing my college final project and I’m using your example to train my classifier. But I don’t know how I’ll classify my sentences after training on my corpus using bigrams like the example you put here. For example, I know I have to use classifier.classify(something), but I don’t know what to put in “something”. Do I put in the tokenized sentence whose sentiment I want to know? Do I put classifier.classify(best_bigram_word_feats(sentence_tokenized))? Help me, please! Oh, one very important piece of information: I want to do sentiment analysis on tweets.

  • Hi Wilian, for classify(something), the something is a dictionary of words/strings that looks like {word: True}. That’s what bigram_word_feats() in the above post returns.
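
    For example, the dictionary for a tokenized tweet would look like this (made-up tweet; no NLTK needed just to see the shape):

```python
# A tokenized tweet (assume tokenization has already been done,
# e.g. with a tweet tokenizer).
tokens = ['not', 'great', ',', 'would', 'not', 'watch', 'again']

# Single-word features plus bigram features, in the {feature: True}
# shape that classifier.classify() expects.
bigrams = list(zip(tokens, tokens[1:]))
feats = dict((f, True) for f in tokens + bigrams)

print(('not', 'great') in feats)
# → True
```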

  • Wilian

    That’s what I thought! Thanks, Jacob!

  • Alan

    Hi I am trying to utilize this code in Python2.7, but I get this:

    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    AttributeError: 'module' object has no attribute 'precision'

    My nltk.metrics has only two attributes: nltk.metrics.alignment_error_rate and nltk.metrics.division. Is it about the installation? Can you help me, is it a version issue?

  • Hi Alan,

    It looks like the precision function has moved to nltk.metrics.scores: http://www.nltk.org/api/nltk.metrics.html#nltk.metrics.scores.precision

  • Alan

    Thank you! But this does not work either. My nltk.metrics does not have ‘scores’ or anything else besides ‘division’ and ‘alignment’.

  • Ok, I’m not sure what version of NLTK you have, but you’ll have to look at the API docs to find the function, and maybe upgrade your version too.

  • swati

    Hello, I want to train my data against positive and negative words. I don’t have any labelled training data, so I have collected positive and negative words and stored them in different txt files. Now my question is how I train my large test dataset against these positive and negative words. Also, as I have read that training data should be 80% and test data 20%, is what I want to do right or not? Please reply sir.. thank you

  • Hi, here’s an idea of how to proceed:
    1) Classify your texts using the positive & negative keywords. You can do simple counting.
    2) Keep only the texts that are very clearly positive or very negative.
    3) Manually review your classified texts to make sure they are correct.
    4) Train a normal text classifier using those texts.
    5) Use your classifier on the rest of your unlabelled texts, to find new positive or negative examples.
    6) Go to #3 until you have a good labelled set of texts & classifier.
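
    Step 1’s simple counting might look like this (the word lists and margin are made-up placeholders; a real run would use your collected keyword files):

```python
# Tiny placeholder keyword lists standing in for the collected txt files.
pos_words = set(['good', 'great', 'excellent', 'love'])
neg_words = set(['bad', 'awful', 'terrible', 'hate'])

def keyword_label(tokens, margin=2):
    """Label a text by keyword counts, keeping only clear cases (steps 1-2)."""
    score = sum(1 for t in tokens if t in pos_words) \
          - sum(1 for t in tokens if t in neg_words)
    if score >= margin:
        return 'pos'
    if score <= -margin:
        return 'neg'
    return None  # ambiguous: leave unlabelled for now

print(keyword_label('i love this great phone'.split()))  # → pos
print(keyword_label('it is a phone'.split()))            # → None
```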

  • swati

    Thank you sir.. I got your points. I want your suggestions on what I have tried doing.
    I tried doing text classification without any labelled data: in a testdata folder I store all the tweets (9 txt files) and in a traindata folder I store the user description data of those tweets (9 txt files). Then I applied stopword and string punctuation removal and word tokenization, then used the classifier, and I’m getting results with good accuracy.
    So is the way I tried doing it wrong?

  • No, it doesn’t sound like you did anything wrong. If you are getting good results another way, that’s great.


  • Adarsh Kumar

    While using the bigram collocations, is the “stopword_filtered_word_feats” method involved in any form? Because I don’t see any reference to the “stopword_filtered_word_feats” function.
    If not, then was it necessary to remove the stop words?

  • I defined the stopword_filtered_word_feats function in this post, as a way to filter out bigrams with stopwords.

  • swati

    Hello sir, according to your idea I have created a training dataset. Firstly I used an online training dataset (STS Gold, Sanders dataset) split into negative, positive and neutral files and tested it on my hybrid classifier, which gives accuracy of knn 0.81, svm 0.70, hybrid model 0.44 and precision of knn 0.85, svm 0.70, hybrid model 0.44.

    Then I created my own training dataset of Twitter tweets and separated the dataset into positive, negative, and neutral in three files. After that I did pre-processing of each file, in which I removed stopwords and slang. Tested on the hybrid classifier with 5-fold cv, I got accuracy of knn 0.76, svm 0.63, hybrid model 0.33 and precision of knn 0.81, svm 0.63, hybrid model 0.33.

    Now my questions are:
    1) The online training dataset has a neutral file larger than the positive and negative files. In my case the negative file is larger than the positive and neutral files. Does the file size matter here? Should I make my neutral file larger or not?
    2) My other question is: as my dataset is of pure Twitter tweets which contain hashtags (#) and @ also, should I remove these # and @?
    The above results on my training data are without removing the # and @.
    3) Are my above results good or not?

    I tested text sentiment analysis on both training datasets; the online one gives perfect results on whether the sentiment is positive, negative or neutral, but my own training dataset sometimes gives the wrong sentiment. I think it gives wrong results because my negative file is bigger.

    Please reply sir, I’m waiting for suggestions and guidance.

  • Hi,

    1) You want your training data to reflect the real world, and be relatively balanced in size. Classifiers tend to train better on balanced data.
    2) Yes, remove any words/syntax you can that doesn’t contribute to sentiment
    3) Your results seem ok, but I bet they’d get better with more, cleaner data.

  • swati

    Hi sir, I have done as you guided. I’m getting good results from my own created datasets, but my accuracy, precision, recall and f1_score are coming out similar. In the knn model my precision value is different but accuracy, f1_score and recall are the same, and in the SVM and hybrid models all four results come out the same.
    So where am I going wrong? Why are the results coming out the same?

  • That’s something I cannot answer; you’ll have to debug your model, try things like cross-fold validation, and maybe tweak your training parameters.

  • swati

    Thank you sir for your reply.. I’m already using cross validation with cv=5, so should I set a different cross validation value for accuracy, precision, recall and f1_score, or should it be the same for all? The other thing is, I will cross-check it again.

  • rando

    I am running into a small problem: I am getting high precision and low recall for positive words, and low precision and high recall for negative words. Any guidance?

    Positive precision: 1.0
    Positive recall: 0.06275579809004093
    Positive F-measure: 0.11810012836970478
    Negative precision: 0.517217146872804
    Negative recall: 1.0
    Negative F-measure: 0.6817971283001389

  • cv=5 is fine, another thing to consider is if you need more or different training data

  • Probably you don’t have enough training data, or your classes are imbalanced. It means there are some words that are definitely positive, but your classifier doesn’t have enough examples of other potentially positive words, so it’s defaulting to negative too often.

  • rando

    First of all, thank you for being so helpful! I am totally lost in getting the correct prediction…
    The problem is I will have to continue with the small dataset I have, but I will try to improve it like you said. BTW, can you please give me more information on my classes being imbalanced? Basically my dataset has two categories, Negative and Positive, with a total of 750 sentences across both classes.

  • So the classes may be balanced, but the word distribution may not be. And how sure are you that the sentences are correctly categorized? How hard is it to add more training data? How many new sentences could you add in 1 hour?
    Sometimes it actually takes much less time to update your training data than it takes to debug & improve the model.

  • rando

    Yes, true. I will update my training set as you recommend. Thank you again Jacob for all the valuable information.