Text Classification for Sentiment Analysis – Eliminate Low Information Features

When your classification model has hundreds or thousands of features, as is the case for text categorization, it’s a good bet that many (if not most) of the features are low information. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.

Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait till you see the results.

High Information Feature Selection

Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words:

evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0

Contrast this with the results from the first article on classification for sentiment analysis, where we use all the words as features:

evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

Accuracy is over 20 percentage points higher when using only the best 10000 words, pos precision has increased by almost 24 points, and neg recall has improved by over 40 points. These are huge gains, with no reduction in pos recall and even a slight increase in neg precision. Here’s the full code I used to get these results, with an explanation below.

import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	negcutoff = len(negfeats)*3/4
	poscutoff = len(posfeats)*3/4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
	classifier.show_most_informative_features()

def word_feats(words):
	return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
	word_fd.inc(word.lower())
	label_word_fd['pos'].inc(word.lower())

for word in movie_reviews.words(categories=['neg']):
	word_fd.inc(word.lower())
	label_word_fd['neg'].inc(word.lower())

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
	pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
		(freq, pos_word_count), total_word_count)
	neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
		(freq, neg_word_count), total_word_count)
	word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
	return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	d = dict([(bigram, True) for bigram in bigrams])
	d.update(best_word_feats(words))
	return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)
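
The code above is Python 2 with the older NLTK API (print statements, FreqDist.inc() and iteritems()). If you're running Python 3 and NLTK 3, the scoring section needs a few mechanical changes; here's a minimal adaptation sketch of just that part, which you should treat as an untested starting point rather than a drop-in replacement:

from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# NLTK 3's FreqDist is a Counter subclass, so counting replaces the old .inc() calls
for word in movie_reviews.words(categories=['pos']):
	word_fd[word.lower()] += 1
	label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
	word_fd[word.lower()] += 1
	label_word_fd['neg'][word.lower()] += 1

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

# .items() replaces .iteritems(), and Python 3 drops the tuple-unpacking lambda
for word, freq in word_fd.items():
	pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
		(freq, pos_word_count), total_word_count)
	neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
		(freq, neg_word_count), total_word_count)
	word_scores[word] = pos_score + neg_score

best = sorted(word_scores.items(), key=lambda ws: ws[1], reverse=True)[:10000]
bestwords = set(w for w, s in best)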

Calculating Information Gain

To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word “magnificent” in a movie review is a strong indicator that the review is positive. That makes “magnificent” a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.

One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
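
To make the chi_sq arguments concrete, here's a tiny standalone example for a single word. The counts are made up for illustration, not taken from the corpus:

from nltk.metrics import BigramAssocMeasures

n_ii = 150       # times the word appears in 'pos' reviews
n_ix = 170       # times the word appears in reviews of any class
n_xi = 750000    # total number of words in 'pos' reviews
n_xx = 1500000   # total number of words across all reviews

pos_score = BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx)
print(pos_score)  # the higher the score, the more strongly the word is tied to one class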

Significant Bigrams

The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:

evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
       ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
          ('give', 'us') = True              neg : pos    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
    ('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

This shows that bigrams don’t matter much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With the bigrams, we get more uniform performance in each class; without them, precision and recall are less balanced. But the differences may depend on your particular data, so don’t assume these observations are always true.

Improving Feature Selection

The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the best things you can do to improve classifier performance. It’s ok to throw away data if that data is not adding value, and it’s especially recommended when that data is actually making your model worse.

  • Alexandre Tp

    Why don't you just use a logistic regression classifier with L1 regularization?

  • http://streamhacker.com/ Jacob Perkins

    Because I have no idea what that is or how to use it :)
    Care to explain?

  • http://twitter.com/tommychheng tommy chheng

    Latent semantic indexing (LSI) is another way of reducing high dimensionality while retaining the variability in the data.

  • http://streamhacker.com/ Jacob Perkins

    Yep, one of these days I'll get around to using and evaluating lsi/lsa.

  • http://lianos.myopenid.com/ Steve Lianoglou

    Logistic regression: I think you know what that is.

    regularized regression: A different spin on something like step-wise regression, where you have some method of trimming down the size of the model that is learned. In this case, there is a penalty added to the objective function you are minimizing that penalizes your model for adding a new term (having non-zero coefficients).

    L1 regularization: This is the type of penalty you are baking into the model. L1 boils down to being the sum of the absolute values of the coefficients. In practice, this has the tendency to set non-informative coefs to 0 (here is your pseudo information gain). L2 is the sum of the square of the coefficients (in practice, this doesn't work well in variable selection as it tends to include all coefs, but pushes them towards 0).

    Further reading:

    http://www-stat.stanford.edu/~tibs/lasso.html

    You can also google for “elastic net” which includes a mixture of both L1 and L2 penalties.

  • http://ogrisel.myopenid.com/ Olivier Grisel

    I was about to add a comment with exactly the same remark (Logistic Regression or Linear SVM + L1 or ElasticNet). For the python developers, please know that you can use scikits.learn, which provides wrappers around liblinear for LR and linear SVM with an L1 regularizer, a C-optimized implementation of the LASSO and elastic net using coordinate descent, univariate feature selection, and soon an implementation of LARS:

    http://scikit-learn.sf.net

    And especially:

    http://scikit-learn.sourceforge.net/modules/glm
    http://github.com/fseoane/scikit-learn/blob/mas

  • http://streamhacker.com/ Jacob Perkins

    Thanks Steve & Olivier. I've got some reading to do now, and more reasons to explore scikits.learn.
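
    For anyone who wants to try the L1 approach described above, here's a minimal sketch using the current scikit-learn API rather than the old scikits.learn one (the vectorizer settings, solver and C value are arbitrary choices for illustration, not recommendations from this thread):

    from nltk.corpus import movie_reviews
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # raw review text and labels from the same corpus used in the post
    fileids = movie_reviews.fileids()
    texts = [movie_reviews.raw(fileids=[f]) for f in fileids]
    labels = [movie_reviews.categories(fileids=[f])[0] for f in fileids]

    # binary presence features, roughly matching the word_feats dicts above
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(texts)

    # the L1 penalty pushes the weights of uninformative words to exactly zero,
    # so feature selection falls out of training instead of being a separate step
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
    clf.fit(X, labels)
    print('%d features kept out of %d' % ((clf.coef_ != 0).sum(), X.shape[1]))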

  • Dmitry Chichkov

    For two classes pos_score would always be equal to neg_score, right? So there is no point in word_scores[word] = pos_score + neg_score ?

  • http://streamhacker.com/ Jacob Perkins

    They'll only be equal if the occurrence frequency of the word is the same for each class, and each class has the same total number of occurrences. But you're right on questioning adding the scores together (I was wondering if someone would do that :) . Adding the scores was more of a shortcut – it's probably better to get all the high scoring words for each class separately, then combine them for the final set of best words.

  • http://ogrisel.myopenid.com/ Olivier Grisel

    Here is an example of logistic regression using L2 and L1 penalty that emphasizes the automated feature selection of L1 w.r.t. L2:

    http://github.com/ogrisel/scikit-learn/blob/a9a

  • http://www.nektra.com wslh

    I have the same experience (with Spanish): reducing features based on information gain, I obtain 99% accuracy between subjective/objective (10-fold cross validation) and 91% between three classes (neutral, negative, positive).

    In this case my training set is reduced ~1/3 (mainly on neutral) so the other 2/3 must be analyzed by a human.

  • http://streamhacker.com/ Jacob Perkins

    Thanks for sharing, good to get confirmation that this works on other languages.

  • http://twitter.com/renjl0810 Jianli Ren

    Also you may want to try Non-Negative Matrix Factorization (NMF), which is better than LSI in my previous experience.

  • http://streamhacker.com/ Jacob Perkins

    Thanks, hadn't heard of NMF. I'll have to look into it more, as the wikipedia article is especially opaque. I'm sure it'd make sense if I already understood it :)

  • http://twitter.com/renjl0810 Jianli Ren

    One thing I forgot: in Chapter 10 of the book Programming Collective Intelligence you'll find a Python implementation of NMF. Unfortunately, this part is not available online at Google Books, so you may need to borrow a physical copy.

    I had hoped to do it (sentiment analysis) using R myself, but I have some more urgent matters to deal with :-(…

    Thanks for sharing your great work!

  • bighit

    Why don't you eliminate low information features when training?

  • http://streamhacker.com/ Jacob Perkins

    That’s exactly what I do: the only words that are used for training are the high information ones. Each word is checked to see if it’s in the bestwords set, and discarded if it’s not.

  • bighit

    yes, i got it. thanks!

  • http://twitter.com/henrikno Henrik Nordvik

    When calculating information gain, you’re using the whole data set, including the test set. Aren’t you supposed to use only data from the training set when training?

    I tested using only the training set to gather statistics, but it didn’t seem to affect the results much, just a tad lower accuracy.

    Also, check out “Delta TFIDF: An Improved Feature Space for Sentiment Analysis” for a similar method.

  • http://streamhacker.com/ Jacob Perkins

    Thanks for that paper reference, looks interesting.

    Perhaps it would be more correct to only use the training set for info gain stats, but I see those stats as a kind of global filter that allows you to determine significant features beforehand (as you can often do in non-text classification). The exact numbers aren’t used directly, but the relative values are used to determine which words are more significant, and that significance should theoretically be similar in the training set, testing set, and both combined. And that assumption appears to be correct based on your tests.
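
    For anyone who wants to check this themselves, here's a rough sketch of computing the frequency counts over the training files only (the first 3/4 of each category, matching the train/test split in the post, and using the same older NLTK API as the code above):

    pos_files = movie_reviews.fileids('pos')
    neg_files = movie_reviews.fileids('neg')
    train_pos = pos_files[:3 * len(pos_files) // 4]
    train_neg = neg_files[:3 * len(neg_files) // 4]

    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()

    for f in train_pos:
        for word in movie_reviews.words(fileids=[f]):
            word_fd.inc(word.lower())
            label_word_fd['pos'].inc(word.lower())

    for f in train_neg:
        for word in movie_reviews.words(fileids=[f]):
            word_fd.inc(word.lower())
            label_word_fd['neg'].inc(word.lower())

    # word_scores and bestwords are then computed exactly as before,
    # but the test reviews never influence which words are selected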

  • Sriram

    Hi,
    Thanks a lot for all the information. For a given test string, for example “This is a great movie”, if I need to measure the confidence level of the classification from the above three methods (best word features, bigrams, etc.), how do I do that? That way I can determine which output to take when I get different classification results from each method. Thanks once again for all the knowledge sharing.

  • http://streamhacker.com/ Jacob Perkins

    You’d need to train separate classifiers, one for each method, then instead of using the classify() function, call prob_classify() to get a ProbDist, which you can use to get the probability of each label.
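
    For example, a small sketch assuming you've trained a classifier with best_word_feats as in the post and kept a reference to it:

    feats = best_word_feats("this is a great movie".split())
    probdist = classifier.prob_classify(feats)
    print(probdist.max())        # most likely label, same as classify() would return
    print(probdist.prob('pos'))  # confidence that the review is positive
    print(probdist.prob('neg'))  # confidence that the review is negative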

  • Sriram

    Thanks a lot Jacob. I already had separate classifiers, but didn't know how to use prob_classify() to calculate the positive and negative scores. It works now. Thanks once again.

  • Sriram

    Hi Jacob,
    I am trying to implement the prob_classify() method in the MaxVoteClassifier class, which combines three classifiers (NaiveBayes, DecisionTree, Maxent) and does voting. I added the definition of prob_classify to the MaxVoteClassifier class, but I'm not sure how to pass feature_probdist to it. Appreciate your help.

  • http://streamhacker.com/ Jacob Perkins

    You can average the probabilities from the NaiveBayes and Maxent classifiers, but DecisionTree doesn’t support prob_classify, so you’ll have to hardcode a check and use 100% or 0% in your averaging.
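
    A rough sketch of that averaging idea (the three classifier arguments are assumed to be already trained; this isn't the MaxVoteClassifier code itself):

    def averaged_pos_prob(nb_classifier, me_classifier, dt_classifier, feats):
        # NaiveBayes and Maxent both support prob_classify()
        probs = [nb_classifier.prob_classify(feats).prob('pos'),
                 me_classifier.prob_classify(feats).prob('pos')]
        # DecisionTree doesn't, so count its hard vote as 1.0 or 0.0
        probs.append(1.0 if dt_classifier.classify(feats) == 'pos' else 0.0)
        return sum(probs) / len(probs)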

  • Sriram

    Thanks a lot Jacob. Let me try it out.

  • Alex Leykin

    Yes, this seems like a lookahead bias here. Bestwords should be computed only on the train set (first 3/4 of the reviews)

  • Nawafpower

    What if I have 5 classes? Let's say I have 5 folders, each containing text files for a specific writer, and I want to train the system on these 5 classes so that when I test on anonymous text it will tell me it is most likely class 3, for example. How can I do this using Python?

  • Fahd

    Hi Jacob,
    I’m working on text classification for Arabic. I’ve run into a strange situation and I haven’t yet figured out the reason behind it.

    I tested my dataset on both NaiveBayes-NLTK and (Multinomial Naive Bayes and Bernoulli Naive Bayes from Scikit Learn).

    Actually I found big differences in all metrics, i.e. precision, recall, f-measure.
    The differences are in favour of scikit-learn, even when using the most informative features as stated in this article.
    For example, the results are:

                Precision   Recall   F-measure
    NLTK            61        68        64
    Sklearn         84        85        84

    ((big differences!!))

    I believe that since all those classifiers are based on Naive Bayes, their results should be at least somewhat similar (and shouldn’t differ as much as seen above). Any idea about that?

    Fahd

  • http://streamhacker.com/ Jacob Perkins

    A multi-class classifier can be trained much like a binary classifier, so read up on NLTK and take a look at https://github.com/japerk/nltk-trainer and train_classifier.py
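
    A bare-bones sketch of the five-folder case described above (the 'writers' directory layout here is hypothetical, and nltk-trainer handles most of this for you):

    import os
    from nltk import word_tokenize
    from nltk.classify import NaiveBayesClassifier

    def word_feats(words):
        return dict((word, True) for word in words)

    labeled_feats = []
    for label in os.listdir('writers'):          # one sub-folder per writer/class
        folder = os.path.join('writers', label)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name)) as f:
                words = word_tokenize(f.read().lower())
            labeled_feats.append((word_feats(words), label))

    classifier = NaiveBayesClassifier.train(labeled_feats)
    # classifier.classify(word_feats(word_tokenize(anonymous_text.lower())))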

  • http://streamhacker.com/ Jacob Perkins

    Hi Fahd,

    I haven’t yet done experiments with the scikit-learn algorithms, and haven’t read the code, so the most I can say is that this is clearly a difference in implementation. NLTK’s NaiveBayes uses dictionary features that only match when the exact key & value are equal. Scikit-learn, on the other hand, has many different training options, including using frequency counts, and will actually do numeric differences instead of direct one-to-one comparisons. It also builds feature vectors instead of dictionaries, so I’m not too surprised that scikit-learn can do better estimation than NLTK.

  • Nadeem Mohsin

    I think you might want to update this post or put up a correction about your use of the entire dataset for scoring. As some other people pointed out years ago :P , that means you’re implicitly using knowledge of your test set, which is pretty much the cardinal sin of machine learning. If you restrict it to just the training set, the accuracy gains are relatively smaller. (Assuming my quick modification of your code isn’t bugged.) It’s still enough to make your point about dimensionality reduction, though.

    The reason I mentioned this is that a friend was learning stuff from here, and was confused by this. Given how useful and popular it is among the NLTK/scikit crowd, I think you need to be careful that people don’t pick up bad habits. :P

    Sad story: I remember making a similar mistake 4-5 years ago and thinking I had totally pwned sentiment analysis until I figured out what happened. I was getting 98% accuracy with some scoring function I can’t remember clearly now. Accuracy dropped to something like 80% when I fixed it. Massive geek-heartbreak followed.

    PS: Using BigramAssocMeasures for chi_square with the class labels is a pretty neat hack.

  • gogopg

    Can you incrementally train a Maxent Classifier??

  • http://streamhacker.com/ Jacob Perkins

    Not that I know of. Feature weights are learned over many iterations, and I’m not sure there’s a way to do that incrementally.

  • Rohan c prajapati

    Please can you suggest a link for Indian-language microblogging sentiment classification? Please, I really need it.

  • http://streamhacker.com/ Jacob Perkins

    I wish I could, but I don’t know of anything. You may have to create a corpus yourself.

  • Rojin

    Hi, I have a question regarding pickle files. How can I train a system and store it in a pickle file?
    Thanks

  • JayCee

    Hey, I’m trying out your algorithm and as I understand it, BigramAssocMeasures.chi_sq returns how informative the word is in relation to the ‘pos’ tagged words and in relation to the ‘neg’ tagged words. I have two related questions:
    1. Why do the two calls to BigramAssocMeasures.chi_sq for a word always return equal values for pos_score and neg_score?

    2. The chi_sq formula, as I see it, only considers the frequency of a single word in its calculation. So is it actually considering bigrams? If not, why is it a member of BigramAssocMeasures?

  • JayCee

    PS: I’m new to this, so please clarify if I’m misunderstanding things.

  • http://streamhacker.com/ Jacob Perkins

    1) Because the information gain is the same regardless of class. A word that is strongly “pos” is a good indication that the “pos” class should be chosen, and it is also a good indication that the “neg” class should not be chosen.

    2) The “bigram” in BigramAssocMeasures just means you’re comparing two things. There’s also a TrigramAssocMeasures for comparing 3 things.
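
    A quick way to see the symmetry from 1) is to run both calls on made-up counts (the numbers below are invented, not from the corpus):

    from nltk.metrics import BigramAssocMeasures

    c_pos, c_neg = 150, 20          # occurrences of one word in each class
    n_pos, n_neg = 750000, 750000   # total words in each class
    freq = c_pos + c_neg
    total = n_pos + n_neg

    pos_score = BigramAssocMeasures.chi_sq(c_pos, (freq, n_pos), total)
    neg_score = BigramAssocMeasures.chi_sq(c_neg, (freq, n_neg), total)
    print(pos_score == neg_score)  # True: it's the same 2x2 table with the columns swapped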

  • Angel Oswaldo Vázquez Patiño

    Very useful post.

    I am wondering whether there is a “general rule” for selecting the right number of features. I mean, you selected the top 10000 words after calculating chi_sq, but why select 10000 words/features? It seems that the right number of features depends on the classifier that is used (see reference, Forman 2004). For example, with SVM the performance stays the same beyond a certain number of features, but with Naive Bayes the performance gets compromised after that certain number of features.

    To know that number of features, is another study necessary, comparing the performance of a classifier with different numbers of features, or is there some other method?

    On the other hand, in your approach you are using the frequency of each word in the document/collection. Is that the same as using chi2 in sklearn? In short, I have the same question presented here: http://stackoverflow.com/questions/14573030/perform-chi-2-feature-selection-on-tf-and-tfidf-vectors
    I don’t understand why it says that chi^2 does not take into account the frequency of words in each document, but only takes into account whether the word/feature is present or not.

    Forman, G., 2004. A pitfall and solution in multi-class feature selection for text classification, in: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, p. 38.

  • http://streamhacker.com/ Jacob Perkins

    I don’t know that there’s a general rule for feature selection. But there are optimization methods that can do automatic feature selection, such as PCA. I think sklearn may have some functionality, and there are research papers on it too.

    I don’t know about chi2 in sklearn, but I’d assume they’re similar.
    The answer to the stackoverflow question says that frequencies are used.
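
    For comparison, here's a minimal scikit-learn version of chi-square feature selection on binary presence features (the vectorizer settings and k value are arbitrary choices for illustration):

    from nltk.corpus import movie_reviews
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    fileids = movie_reviews.fileids()
    texts = [movie_reviews.raw(fileids=[f]) for f in fileids]
    labels = [movie_reviews.categories(fileids=[f])[0] for f in fileids]

    # binary=True makes these presence/absence features, like the word_feats dicts above;
    # drop it to let chi2 see raw term frequencies instead
    X = CountVectorizer(binary=True).fit_transform(texts)
    X_best = SelectKBest(chi2, k=10000).fit_transform(X, labels)
    print(X_best.shape)  # (number of reviews, 10000)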