When your classification model has hundreds or thousands of features, as is the case for text categorization, it’s a good bet that many (if not most) of the features are low information. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.
Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait till you see the results.
High Information Feature Selection
Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words:
evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
         magnificent = True    pos : neg = 15.0 : 1.0
         outstanding = True    pos : neg = 13.6 : 1.0
           insulting = True    neg : pos = 13.0 : 1.0
          vulnerable = True    pos : neg = 12.3 : 1.0
           ludicrous = True    neg : pos = 11.8 : 1.0
              avoids = True    pos : neg = 11.7 : 1.0
         uninvolving = True    neg : pos = 11.7 : 1.0
          astounding = True    pos : neg = 10.3 : 1.0
         fascination = True    pos : neg = 10.3 : 1.0
             idiotic = True    neg : pos =  9.8 : 1.0
Contrast this with the results from the first article on classification for sentiment analysis, where we use all the words as features:
evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True    pos : neg = 15.0 : 1.0
         outstanding = True    pos : neg = 13.6 : 1.0
           insulting = True    neg : pos = 13.0 : 1.0
          vulnerable = True    pos : neg = 12.3 : 1.0
           ludicrous = True    neg : pos = 11.8 : 1.0
              avoids = True    pos : neg = 11.7 : 1.0
         uninvolving = True    neg : pos = 11.7 : 1.0
          astounding = True    pos : neg = 10.3 : 1.0
         fascination = True    pos : neg = 10.3 : 1.0
             idiotic = True    neg : pos =  9.8 : 1.0
The accuracy is over 20 percentage points higher when using only the best 10000 words, pos precision has increased by almost 24 points, and neg recall has improved by over 40 points. These are huge increases, with no reduction in pos recall and even a slight increase in neg precision. Here’s the full code I used to get these results, with an explanation below.
import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

def word_feats(words):
    return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd.inc(word.lower())
    label_word_fd['pos'].inc(word.lower())

for word in movie_reviews.words(categories=['neg']):
    word_fd.inc(word.lower())
    label_word_fd['neg'].inc(word.lower())

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)
Calculating Information Gain
To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word “magnificent” in a movie review is a strong indicator that the review is positive. That makes “magnificent” a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.
One of the best metrics for information gain is chi-square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, we first need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for the overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score each word with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put those words into a set, and use a set membership test in our feature selection function to keep only the words that appear in the set. Now each file is classified based on the presence of these high information words.
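To make the chi_sq call concrete, here’s a minimal sketch that scores a single word. The word and its counts are made up for illustration; in the full code above the real counts come from the FreqDist and ConditionalFreqDist built over the movie_reviews corpus.

from nltk.metrics import BigramAssocMeasures

# Hypothetical counts for the word 'magnificent':
#   n_ii: times the word appears in pos reviews
#   n_ix: times the word appears in the whole corpus
#   n_xi: total number of words in pos reviews
#   n_xx: total number of words in the whole corpus
n_ii = 98
n_ix = 105
n_xi = 750000
n_xx = 1500000

# chi_sq takes the in-class count, a tuple of (overall word count,
# total words in the class), and the total word count in the corpus
pos_score = BigramAssocMeasures.chi_sq(n_ii, (n_ix, n_xi), n_xx)
print 'pos chi-square score:', pos_score

In the full code, each word gets a pos score and a neg score computed this way, and the two are added together to give the word’s final score before sorting and taking the top 10000.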
Significant Bigrams
The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:
evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
          magnificent = True    pos : neg = 15.0 : 1.0
          outstanding = True    pos : neg = 13.6 : 1.0
            insulting = True    neg : pos = 13.0 : 1.0
           vulnerable = True    pos : neg = 12.3 : 1.0
    ('matt', 'damon') = True    pos : neg = 12.3 : 1.0
       ('give', 'us') = True    neg : pos = 12.3 : 1.0
            ludicrous = True    neg : pos = 11.8 : 1.0
          uninvolving = True    neg : pos = 11.7 : 1.0
               avoids = True    pos : neg = 11.7 : 1.0
 ('absolutely', 'no') = True    neg : pos = 10.6 : 1.0
This shows that bigrams don’t matter much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With the bigrams, we get more uniform performance in each class; without them, precision and recall are less balanced. But the differences may depend on your particular data, so don’t assume these observations are always true.
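If you want a single number per class that balances precision and recall, one option (not used in the code above) is NLTK’s f_measure, which works on the same kind of reference and test sets that evaluate_classifier builds. The sets below are made-up stand-ins for refsets['pos'] and testsets['pos']:

import nltk.metrics

# Hypothetical sets of test file indices: refset holds the files that truly
# belong to the class, testset holds the files the classifier assigned to it.
refset = set([0, 1, 2, 3, 4, 5, 6, 7])
testset = set([0, 1, 2, 3, 4, 5, 8, 9])

print 'precision:', nltk.metrics.precision(refset, testset)
print 'recall:', nltk.metrics.recall(refset, testset)
print 'F-measure:', nltk.metrics.f_measure(refset, testset)

A higher and more similar F-measure for pos and neg is another way of seeing the balance that precision and recall show separately.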
Improving Feature Selection
The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the best things you can do to improve classifier performance. It’s fine to throw away data if that data is not adding value, and it’s especially worthwhile when that data is actively making your model worse.
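As a starting point for your own experiments, here’s a rough sketch that reuses word_scores and evaluate_classifier from the full code above to re-run the evaluation with a few different cutoffs. The cutoff values are just examples; test what works for your data.

# Try several feature-count cutoffs; word_scores and evaluate_classifier
# are the ones defined in the full code listing above.
for cutoff in [1000, 5000, 10000, 15000]:
    best = sorted(word_scores.iteritems(), key=lambda (w, s): s, reverse=True)[:cutoff]
    bestwords = set([w for w, s in best])

    def feats(words, bestwords=bestwords):
        return dict([(word, True) for word in words if word in bestwords])

    print 'evaluating', cutoff, 'best word features'
    evaluate_classifier(feats)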