streamhacker.com Weotta be Hacking

24 May 2010

Text Classification for Sentiment Analysis – Stopwords and Collocations

Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I'll be evaluating two modifications of the word_feats feature extraction method:

  1. filter out stopwords
  2. include bigram collocations

To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a Naive Bayes Classifier.

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
	# featx takes the words of one file and returns a feature dictionary
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	# train on the first 3/4 of each class, test on the rest
	# (integer division, so the cutoffs can be used as slice indexes)
	negcutoff = len(negfeats)*3//4
	poscutoff = len(posfeats)*3//4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	# refsets holds the true labels, testsets the predicted labels
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
	print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
	print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
	print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
	print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
	classifier.show_most_informative_features()
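
To make the refsets/testsets bookkeeping concrete, here is a minimal pure-Python sketch of set-based precision and recall. The helper functions and the toy index sets are my own illustration, not NLTK's code; as far as I know, nltk.metrics.precision and nltk.metrics.recall compute the same ratios from a reference set and a test set.

```python
def precision(reference, test):
    """Fraction of the items predicted for a label that truly have it."""
    if not test:
        return None  # nothing was predicted for this label
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of the items that truly have a label that were found."""
    if not reference:
        return None  # no true examples of this label
    return len(reference & test) / len(reference)

# Made-up indexes of test documents, purely for illustration.
ref_pos = {0, 1, 2, 3}      # documents whose true label is 'pos'
test_pos = {0, 1, 2, 4, 5}  # documents the classifier labeled 'pos'

pos_precision = precision(ref_pos, test_pos)  # 3 correct out of 5 predicted
pos_recall = recall(ref_pos, test_pos)        # 3 found out of 4 actual

print('pos precision:', pos_precision)
print('pos recall:', pos_recall)
```

In the evaluation function, the same idea is applied per label, with the sets holding the indexes of the test feature sets.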

Baseline Bag of Words Feature Extraction

Here's the baseline feature extractor for bag of words feature selection.

def word_feats(words):
	return dict([(word, True) for word in words])

evaluate_classifier(word_feats)

The results are the same as in the previous articles, but I've included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

Stopword Filtering

Stopwords are very common words, such as "the" and "of", that are generally considered to carry little useful information. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let's see what happens when we filter out these words.

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
	return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)

And the results for a stopword filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

Accuracy went down by 0.2 percentage points (0.728 to 0.726), and pos precision and neg recall dropped slightly as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.
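
One plausible reason, offered as an illustration rather than something measured here: stopword lists tend to include words like "not", which can matter for sentiment. The sketch below uses a tiny made-up stopword set (toy_stopset is an assumption, standing in for stopwords.words('english')) and a hypothetical make_stopword_filter helper to show what the filter removes.

```python
# Hypothetical mini stopword list; the article uses NLTK's English list.
toy_stopset = {'the', 'was', 'a', 'of', 'and', 'not'}

def word_feats(words):
    # Baseline extractor from the article.
    return dict([(word, True) for word in words])

def make_stopword_filter(stopset):
    # Build an extractor that skips any word found in stopset.
    def filtered_word_feats(words):
        return dict([(word, True) for word in words if word not in stopset])
    return filtered_word_feats

words = ['the', 'plot', 'was', 'not', 'great']
baseline = word_feats(words)
filtered = make_stopword_filter(toy_stopset)(words)

print(sorted(baseline))  # every distinct word is a feature
print(sorted(filtered))  # 'not' is gone along with 'the' and 'was'
```

With "not" filtered out, the remaining features for this sentence look the same as those of a positive sentence about a great plot.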

Bigram Collocations

As mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", which is a negative expression that the bag of words model could interpret as positive since it sees "great" as a separate word.
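
A small sketch of that hypothesis, using made-up word lists. The word_and_bigram_feats helper is hypothetical and simply adds every adjacent word pair, without any significance filtering.

```python
def word_feats(words):
    # Baseline extractor from the article.
    return dict([(word, True) for word in words])

def word_and_bigram_feats(words):
    # Hypothetical helper: single words plus every adjacent word pair.
    feats = word_feats(words)
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats

praise = ['a', 'great', 'movie']
complaint = ['not', 'great', 'at', 'all']

# With single words only, both texts share the feature 'great'.
shared_unigrams = set(word_feats(praise)) & set(word_feats(complaint))

# With bigrams, the complaint gains features the praise does not have.
complaint_only = set(word_and_bigram_feats(complaint)) - set(word_and_bigram_feats(praise))

print(shared_unigrams)
print(('not', 'great') in complaint_only)
```

The pair ('not', 'great') gives the classifier something that distinguishes the complaint, which the single word "great" cannot do.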

To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of 2 words, basically whether the bigram occurs about as frequently as each individual word.
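
To show roughly what such a scoring function does, here is a pure-Python sketch of the standard Pearson chi-square statistic on a 2x2 contingency table, built from one word-frequency count and one bigram-frequency count. The chi_sq and best_bigrams functions are my own simplified stand-ins and are not NLTK's implementation, so scores may differ in detail from BigramAssocMeasures.chi_sq.

```python
from collections import Counter

def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for a bigram (w1, w2).

    n_ii: count of the bigram, n_ix: count of w1,
    n_xi: count of w2, n_xx: total number of words.
    """
    n_io = n_ix - n_ii                # w1 followed by some other word
    n_oi = n_xi - n_ii                # w2 preceded by some other word
    n_oo = n_xx - n_ii - n_io - n_oi  # neither w1 first nor w2 second
    denom = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    if denom == 0:
        return 0.0  # degenerate table, treat as no association
    return n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denom

def best_bigrams(words, n=200):
    """Return the n highest scoring adjacent word pairs."""
    words = list(words)
    word_fd = Counter(words)                    # individual word frequencies
    bigram_fd = Counter(zip(words, words[1:]))  # bigram frequencies
    total = len(words)
    scored = [(chi_sq(count, word_fd[w1], word_fd[w2], total), (w1, w2))
              for (w1, w2), count in bigram_fd.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [bigram for _, bigram in scored[:n]]

words = ['matt', 'damon', 'is', 'great', 'and', 'matt', 'damon', 'is', 'funny']
top = best_bigrams(words, n=3)
print(top)
```

Two words that only ever appear together get the maximum score, while a pair that co-occurs no more often than chance would predict scores zero.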

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)

After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

Yes, you read that right, Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:

  • accuracy is up almost 9 percentage points
  • pos precision has increased by over 10 points with only a 4 point drop in recall
  • neg recall has increased by over 21 points with just under a 4 point drop in precision

So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it's significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.

Comments (10) Trackbacks (2)
  1. This is very interesting, thanks for writing it up. It seems, however, that general n-grams can't be chosen out of the box? That is, NLTK only offers a choice between bigrams and trigrams?

  2. That's true, but I think you could use nltk.collocations.AbstractCollocationFinder with FreqDists you create yourself. The harder part would be a generic ngram scoring function, but it looks like if you extended nltk.metrics.NgramAssocMeasures to implement _contingency and _marginals for ngrams, all the other scoring functions would work.

  3. Interesting! But in Bo Pang et al. 2008, Thumbs up? Sentiment Classification using Machine Learning Techniques, the unigrams presence feature is evaluated to be the best feature set. Using just the unigrams presence feature, Naive Bayes, Maxent and SVM classifiers can all get accuracies better than 80%. Adding the bigrams feature doesn't help. Why do you think the result of your comparison between feature sets is different from theirs?

  4. Sorry, a typo there. It should be Bo Pang et al. 2002…

  6. Hi Ryan, thanks for pointing me to that research. They used cross-validation of 3 folds over 1400 reviews, and made sure not to favor prolific reviewers, whereas each file of NLTK's movie_reviews corpus contains many reviews with no knowledge of who the reviewer is. I also fed the classifier 1500 reviews in a single batch, with no cross-validation, and chose the 200 best bigrams on a per-file basis using a chi-square score, whereas they apparently did not do any special bigram selection. As I mentioned at the end of the article, including all bigrams helps a little, but not much. So I think the key differences are most likely cross-validation and bigram selection.
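
For reference, k-fold cross-validation of the kind mentioned here could be sketched as follows. The k_fold_splits helper is hypothetical and uses a simple round-robin assignment; it is not the procedure from the cited paper.

```python
def k_fold_splits(items, k=3):
    """Yield (train, test) pairs so each item is tested exactly once."""
    items = list(items)
    # Round-robin assignment: item i goes to fold i % k.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Toy data standing in for a list of labeled feature sets.
splits = list(k_fold_splits(range(9), k=3))
for train, test in splits:
    print(len(train), len(test))
```

In the article's setting, the items would be the labeled feature sets, a classifier would be trained on each train portion, and the metrics would be averaged over the folds.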

  7. Hi Jacob,

    Your results with the stopwords filter made me curious; it seems counterintuitive. To know why, I tagged the stopwords with the POS. Next, for each POS category, I ran evaluate_classifier filtering only the stopwords in the corresponding category.

    The test's result was that from all stopwords only the adverbs and wh determiners seem to add information. Excluding them from the stopwords and running evaluate_classifier with the filter gave the following results:

    accuracy: 0.73
    pos precision: 0.653333333333
    pos recall: 0.98
    neg precision: 0.96
    neg recall: 0.48

    What do you think?

    Finally, thanks for the posts; they are very informative.

  8. Hi Pierre,

    Thanks for looking into the stopwords. I was definitely surprised when I got the original result, but your findings help make sense of it. The adverbs support verbs, and perhaps the wh determiners imply a rhetorical question (instead of a statement). Makes me think that generic stopwords lists can't be fully trusted in all contexts.
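
A rough sketch of the kind of filter described in this exchange. The tags below are hand-assigned Penn Treebank tags for a handful of words and are purely an assumption for illustration; the names tagged_stopwords, KEEP_TAGS and reduced_stopword_feats are hypothetical, and the real experiment tagged the full NLTK stopword list.

```python
# Hypothetical, hand-assigned POS tags for a few stopwords.
tagged_stopwords = {
    'the': 'DT', 'of': 'IN', 'and': 'CC', 'is': 'VBZ',
    'not': 'RB', 'very': 'RB', 'too': 'RB', 'which': 'WDT',
}

# Adverbs (RB) and wh-determiners (WDT) are kept as features.
KEEP_TAGS = {'RB', 'WDT'}

reduced_stopset = set(word for word, tag in tagged_stopwords.items()
                      if tag not in KEEP_TAGS)

def reduced_stopword_feats(words):
    # Same shape as the article's stopword filter, with a smaller stop set.
    return dict([(word, True) for word in words if word not in reduced_stopset])

feats = reduced_stopword_feats(['the', 'plot', 'is', 'not', 'very', 'good'])
print(sorted(feats))
```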
