Text Classification for Sentiment Analysis – Eliminate Low Information Features
When your classification model has hundreds or thousands of features, as is the case for text categorization, it's a good bet that many (if not most) of the features are low information. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.
Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait till you see the results.
High Information Feature Selection
Using the same evaluate_classifier function as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words:
evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
Contrast this with the results from the first article on classification for sentiment analysis, where we use all the words as features:
evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
Accuracy is over 20 percentage points higher when using only the best 10000 words, pos precision has increased by almost 24 points, and neg recall has improved by over 40 points. These are huge increases with no reduction in pos recall and even a slight increase in neg precision. Here's the full code I used to get these results, with an explanation below.
# Written for Python 3 and NLTK 3.
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # Train on the first 3/4 of each class, test on the remaining 1/4.
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    # Index each test instance by its true label and by its predicted label.
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()

def word_feats(words):
    return dict([(word, True) for word in words])

print('evaluating single word features')
evaluate_classifier(word_feats)

# Count each word overall and within each class label.
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd[word.lower()] += 1
    label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

# Score every word by chi-square against each label and sum the two scores.
word_scores = {}

for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

# Keep the 10000 highest scoring words.
best = sorted(word_scores.items(), key=lambda ws: ws[1], reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])

print('evaluating best word features')
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d

print('evaluating best words + bigram chi_sq word features')
evaluate_classifier(best_bigram_word_feats)
Calculating Information Gain
To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word "magnificent" in a movie review is a strong indicator that the review is positive. That makes "magnificent" a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.
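To make the idea of "common in one class, rare in the other" concrete, here is a minimal sketch that compares a word's relative frequency in each class. The counts are made up for illustration and do not come from the movie_reviews corpus, and class_ratio with its add-0.5 smoothing is a hypothetical helper, not the exact calculation NLTK performs.

```python
from collections import Counter

# Toy, made-up token counts per class (not taken from movie_reviews).
pos_counts = Counter({"magnificent": 30, "the": 5000, "plot": 120})
neg_counts = Counter({"magnificent": 2, "the": 5100, "plot": 130})

def class_ratio(word, smoothing=0.5):
    """Ratio of a word's smoothed relative frequency in pos vs. neg."""
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    p_pos = (pos_counts[word] + smoothing) / pos_total
    p_neg = (neg_counts[word] + smoothing) / neg_total
    return p_pos / p_neg

for w in ("magnificent", "the", "plot"):
    # A ratio far from 1.0 marks a high information word;
    # a ratio near 1.0 marks a word that is common to both classes.
    print(w, round(class_ratio(w), 2))
```

With these toy counts, "magnificent" ends up with a ratio well above 1 while "the" stays close to 1, which is the distinction between high and low information words.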
One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
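The scoring steps just described can be sketched on a tiny made-up corpus without NLTK. This is only an illustration under stated assumptions: the chi_sq function below is a hand-written 2x2 chi-square that takes the same four counts as the n_ii, n_ix, n_xi, n_xx comments in the code above, the two "reviews" are invented, and the cutoff of 2 words stands in for the 10000 used in the post.

```python
from collections import Counter

def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Chi-square for a 2x2 table.

    n_ii: count of the word within the label
    n_ix: count of the word across all labels
    n_xi: count of all words within the label
    n_xx: count of all words across all labels
    """
    n_io = n_ix - n_ii                 # the word, in other labels
    n_oi = n_xi - n_ii                 # other words, in this label
    n_oo = n_xx - n_ii - n_io - n_oi   # other words, in other labels
    denom = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    if denom == 0:
        return 0.0
    return n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denom

# Invented toy data: one bag of words per label.
labeled_words = {
    "pos": "great great fun the the plot".split(),
    "neg": "dull dull boring the the plot".split(),
}

# Per-label counts (like ConditionalFreqDist) and overall counts (like FreqDist).
label_word_fd = {label: Counter(ws) for label, ws in labeled_words.items()}
word_fd = Counter()
for fd in label_word_fd.values():
    word_fd.update(fd)
total_word_count = sum(word_fd.values())

# Score each word against every label and sum the scores, as in the post.
word_scores = {}
for word, freq in word_fd.items():
    word_scores[word] = sum(
        chi_sq(fd[word], freq, sum(fd.values()), total_word_count)
        for fd in label_word_fd.values()
    )

# Keep the top 2 words and filter features by set membership.
best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:2])

def best_word_feats(words):
    return {word: True for word in words if word in best_words}

print(best_word_feats(["great", "the", "plot"]))
```

Words that are spread evenly across both labels, such as "the" and "plot" here, score zero and are filtered out of the feature dict.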
Significant Bigrams
The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:
evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
('matt', 'damon') = True pos : neg = 12.3 : 1.0
('give', 'us') = True neg : pos = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
avoids = True pos : neg = 11.7 : 1.0
('absolutely', 'no') = True neg : pos = 10.6 : 1.0
This shows that bigrams don't matter much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With the bigrams, we get more uniform performance in each class. Without bigrams, precision and recall are less balanced. But the differences may depend on your particular data, so don't assume these observations are always true.
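For readers who want to see what "balanced" precision and recall means per class, here is a small self-contained sketch. The precision and recall functions are hand-written stand-ins that work on sets of instance indexes the way refsets and testsets are used in evaluate_classifier, and the gold and predicted labels are invented to mimic a classifier that leans towards "pos".

```python
from collections import defaultdict

def precision(reference, test):
    """Fraction of instances predicted as a label that truly have it."""
    return len(reference & test) / len(test) if test else None

def recall(reference, test):
    """Fraction of instances that truly have a label and were predicted as it."""
    return len(reference & test) / len(reference) if reference else None

# Invented labels: the "classifier" calls almost everything pos.
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "pos", "pos", "pos", "neg"]

refsets = defaultdict(set)   # label -> indexes that truly have it
testsets = defaultdict(set)  # label -> indexes predicted as it
for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)
    testsets[observed].add(i)

for label in ("pos", "neg"):
    print(label, "precision:", precision(refsets[label], testsets[label]))
    print(label, "recall:", recall(refsets[label], testsets[label]))
```

In this toy case pos recall and neg precision are perfect while pos precision and neg recall are poor, the same lopsided pattern as the single word results. A more uniform classifier would bring the four numbers closer together.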
Improving Feature Selection
The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the single best things you can do to improve classifier performance. It's ok to throw away data if that data is not adding value. And it's especially recommended when that data is actually making your model worse.





June 16th, 2010 - 14:30
Why don't you just use a logistic regression classifier with L1 regularization?
June 16th, 2010 - 17:37
Because I have no idea what that is or how to use it
Care to explain?
June 16th, 2010 - 22:56
Latent semantic indexing (LSI) is another way of reducing high dimensionality while retaining the variability in the data.
June 18th, 2010 - 11:12
Yep, one of these days I'll get around to using and evaluating LSI/LSA.
June 20th, 2010 - 06:41
Logistic regression: I think you know what that is.
regularized regression: A different spin to something like step-wise regression, where you have some method of trimming down the size of the model that is learned. In this case, there is a penalty added to the objective function you are minimizing that penalizes your model for adding a new term (having non-zero coefficients).
L1 regularization: This is the type of penalty you are baking into the model. L1 boils down to being the sum of the absolute values of the coefficients. In practice, this has the tendency to set non-informative coefs to 0 (here is your pseudo information gain). L2 is the sum of the square of the coefficients (in practice, this doesn't work well in variable selection as it tends to include all coefs, but pushes them towards 0).
Further reading:
http://www-stat.stanford.edu/~tibs/lasso.html
You can also google for “elastic net” which includes a mixture of both L1 and L2 penalties.
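As a rough illustration of why an L1 penalty zeroes out weak coefficients while an L2 penalty only shrinks them, consider a single coefficient whose unpenalized estimate is z. The sketch below is a simplified one-dimensional case, not a logistic regression solver: l1_shrink and l2_shrink are hypothetical helper names, and the coefficient values and lam are made up.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w), known as soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)

# Made-up unpenalized coefficients: two strong, two weak.
unpenalized = [2.0, 0.3, -0.1, -1.5]
lam = 0.5

l1_coefs = [l1_shrink(z, lam) for z in unpenalized]
l2_coefs = [l2_shrink(z, lam) for z in unpenalized]

print("L1:", l1_coefs)
print("L2:", l2_coefs)
```

The weak coefficients vanish under L1, which acts as a built-in feature selector, while under L2 every coefficient gets smaller but none becomes exactly zero.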
June 20th, 2010 - 12:39
I was about to add a comment with exactly the same remark (Logistic Regression or Linear SVM + L1 or ElasticNet). Python developers, please note that you can use scikits.learn, which provides wrappers for liblinear for LR and linear SVM with an L1 regularizer, a C optimized implementation of the LASSO and elastic net using coordinate descent, univariate feature selection, and soon an implementation of LARS:
http://scikit-learn.sf.net
And especially:
http://scikit-learn.sourceforge.net/modules/glm...
http://github.com/fseoane/scikit-learn/blob/mas...
June 20th, 2010 - 15:38
Thanks Steve & Olivier. I've got some reading to do now, and more reasons to explore scikits.learn.
June 21st, 2010 - 01:36
For two classes pos_score would always be equal to neg_score, right? So there is no point in word_scores[word] = pos_score + neg_score ?
June 21st, 2010 - 06:55
They'll only be equal if the occurrence frequency of the word is the same for each class, and each class has the same total number of occurrences. But you're right on questioning adding the scores together (I was wondering if someone would do that). Adding the scores was more of a shortcut – it's probably better to get all the high scoring words for each class separately, then combine them for the final set of best words.
June 22nd, 2010 - 06:30
Here is an example of logistic regression using L2 and L1 penalty that emphasizes the automated feature selection of L1 w.r.t. L2:
http://github.com/ogrisel/scikit-learn/blob/a9a...
June 26th, 2010 - 07:05
I have had the same experience (using the Spanish language): by reducing features based on information gain, I obtain 99% accuracy between subjective/objective (10-fold cross validation) and 91% between three classes (neutral, negative, positive).
In this case my training set is reduced to ~1/3 (mainly on neutral), so the other 2/3 must be analyzed by a human.
June 28th, 2010 - 10:44
Thanks for sharing, good to get confirmation that this works on other languages.
July 6th, 2010 - 20:03
You may also want to try Non-negative Matrix Factorization (NMF), which in my experience works better than LSI.
July 6th, 2010 - 20:20
Thanks, hadn't heard of NMF. I'll have to look into it more, as the wikipedia article is especially opaque. I'm sure it'd make sense if I already understood it
July 7th, 2010 - 19:41
One thing I forgot: Chapter 10 of the book Programming Collective Intelligence has a Python implementation of NMF. Unfortunately, that part is not available online at Google Books, so you may need to borrow a physical copy.
I had hoped to do it (sentiment analysis) using R myself, but I have some more urgent matters to deal with
…
Thanks for sharing your great job!