Text Classification for Sentiment Analysis – Eliminate Low Information Features

When your classification model has hundreds or thousands of features, as is the case for text categorization, it’s a good bet that many (if not most) of the features are low information. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.

Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait till you see the results.

High Information Feature Selection

Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words:

evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0

Contrast this with the results from the first article on classification for sentiment analysis, where we used all the words as features:

evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

The accuracy is over 20 percentage points higher when using only the best 10000 words, pos precision has increased almost 24 points, and neg recall has improved by over 40 points. These are huge gains with no reduction in pos recall, and even a slight increase in neg precision. Here’s the full code I used to get these results, with an explanation below.

import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	negcutoff = len(negfeats)*3/4
	poscutoff = len(posfeats)*3/4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])

def word_feats(words):
	return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
	word_fd[word.lower()] += 1
	label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
	word_fd[word.lower()] += 1
	label_word_fd['neg'][word.lower()] += 1
# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
	pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
		(freq, pos_word_count), total_word_count)
	neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
		(freq, neg_word_count), total_word_count)
	word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
	return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	d = dict([(bigram, True) for bigram in bigrams])
	d.update(best_word_feats(words))
	return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)

Calculating Information Gain

To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word “magnificent” in a movie review is a strong indicator that the review is positive. That makes “magnificent” a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.

One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
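The argument layout of chi_sq can be confusing: the first argument is the word’s frequency within one class, the pair holds the word’s overall frequency and the class’s total word count, and the last argument is the total word count across both classes. Under the hood it is a 2x2 contingency table test. Here’s a rough plain-Python sketch of the same arithmetic, just to show what is being computed (this is an illustration, not NLTK’s actual source):

```python
def chi_sq(n_ii, n_ix, n_xi, n_xx):
	# Fill in the rest of the 2x2 contingency table from the marginals:
	#   n_ii = count of the word in this class
	#   n_ix = count of the word in all classes
	#   n_xi = count of all words in this class
	#   n_xx = count of all words in all classes
	n_io = n_ix - n_ii                 # the word, in the other class
	n_oi = n_xi - n_ii                 # other words, in this class
	n_oo = n_xx - n_ii - n_io - n_oi   # other words, in the other class
	num = float(n_xx) * (n_ii * n_oo - n_io * n_oi) ** 2
	den = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
	return num / den

# A word that shows up 9 times out of 10 in one class (1000 class words,
# 2000 words total) scores about 6.43, while an evenly split word scores 0.
print(chi_sq(9, 10, 1000, 2000))
print(chi_sq(5, 10, 1000, 2000))
```

The higher the score, the more skewed a word is toward one class, which is exactly what makes it a high information feature.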

Significant Bigrams

The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:

evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
       ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
          ('give', 'us') = True              neg : pos    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
    ('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

This shows that bigrams don’t add much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With bigrams, we get more uniform performance in each class; without them, precision and recall are less balanced. But the differences may depend on your particular data, so don’t assume these observations are always true.

Improving Feature Selection

The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the single best things you can do to improve classifier performance. It’s ok to throw away data if that data is not adding value. And it’s especially recommended when that data is actually making your model worse.
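To make the tuning knob concrete, here’s a toy sketch of the select-the-best-n step; the words and scores are invented for illustration, standing in for the chi_sq values computed above:

```python
# Hypothetical scores; in the real code these come from BigramAssocMeasures.chi_sq.
word_scores = {'magnificent': 15.0, 'avoids': 11.7, 'idiotic': 9.8,
               'movie': 0.3, 'the': 0.01}

def best_n_words(scores, n):
    # Sort by score, keep the top n, and return them as a set
    # so feature extraction is a fast membership test.
    best = sorted(scores.items(), key=lambda ws: ws[1], reverse=True)[:n]
    return set(w for w, s in best)

def best_word_feats(words, bestwords):
    return dict((word, True) for word in words if word in bestwords)

review = ['the', 'movie', 'avoids', 'being', 'idiotic']
# With n=2, only 'avoids' survives; low information words like 'the' are dropped.
print(best_word_feats(review, best_n_words(word_scores, 2)))  # {'avoids': True}
```

Shrinking or growing n is the main lever: the post’s 10000 worked well for movie reviews, but the sweet spot depends on your corpus, so it is worth evaluating a few values.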

  • hendra

    Hi jacob, I already using this code but now not working properly. Especially evaluate_classifier(best_bigram_word_feats). Could you give me an answer? Thanks

  • http://streamhacker.com/ Jacob Perkins

    Can you provide details on what’s not working & what happens? It’s very hard to help without any context.

  • hendra

    AttributeError: 'FreqDist' object has no attribute 'inc', but I try to replace your code in this part:

    for word in movie_reviews.words(categories=['pos']):

    for word in movie_reviews.words(categories=['neg']):

    and then working but I have different results with you, Am I wrong with this code?

    because I already test with my own corpora accuracy from 90% and then drop to 78%(after evaluating best words + bigram chi_sq word features)

  • http://streamhacker.com/ Jacob Perkins

    The latest version of NLTK doesn’t have FreqDist.inc. Instead, it’s a subclass of Python’s collections.Counter, so replace word_fd.inc(word.lower()) with word_fd[word.lower()] += 1

  • hendra

    Thank you jacob your code is working, how about to change from naive bayes to sklearn with SVC classifier? I already trying but error in array.

  • http://streamhacker.com/ Jacob Perkins

    There’s a SklearnClassifier in NLTK for doing this: http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.scikitlearn. You can use it thru train_classifier.py in https://github.com/japerk/nltk-trainer

  • hendra

    Thanks jacob, I will try to use sklearn.

  • hendra

    Hi jacob, could you give explanation to add confusion matrix from your code?
    Thank you.

  • hendra

    Hi Jacob, in this link http://streamhacker.com/2010/05/17/text-classification-sentiment-analysis-precision-recall/ I just only see precision, recall and F-measure in percent, then I try to call ref & test sets with:
    print (refsets['pos'], testsets['pos'])

    only arrays that appear, and I don’t know how to obtained True Positive value before becoming in percent. because I want to calculating accuracy, recall, precision and f-measure manually to cross check the results from nltk.

    below is the formula for accuracy, then I want to calculate this manually.
    Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population

    could I know these values with nltk? Σ True positive, Σ True negative?

    Thank you.

  • http://streamhacker.com/ Jacob Perkins

    refsets['pos'] is the set of all the doc ids that had label 'pos', whereas testsets['pos'] is the set of all the doc ids that were classified as 'pos'. So True Positives = refsets['pos'] & testsets['pos'], and True Negatives = refsets['neg'] & testsets['neg']

  • hendra

    Thank you jacob, I already do your code below :

    print 'True pos : ', refsets['pos'] & testsets['pos']
    print 'True neg : ', refsets['neg'] & testsets['neg']
    print 'False Pos : ', refsets['neg'] & testsets['pos']
    print 'False Neg : ', refsets['pos'] & testsets['neg']

    results :

    True pos : set([25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48])

    True neg : set([2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 15, 16, 17, 19, 20, 21, 23, 24])

    False Pos : set([0, 1, 11, 12, 14, 18, 22])

    False Neg : set([49, 39])

    Which values should I take from these results to be TP, TN, FP, FN?

    Thank you.

  • http://streamhacker.com/ Jacob Perkins

    The integers in the sets are references to the feature sets. In the post, they are created in the enumerate() loop. But you could also create a list of the feature sets, then use the integers as indexes into the list. The only reason to do a set of integers is because the feature sets are dicts, which are not hashable.

  • hendra

    Thank you jacob for your explanation, I will try to do it.