StreamHacker Weotta be Hacking

25Oct/1029

Training Binary Text Classifiers with NLTK Trainer

NLTK-Trainer (available github and bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifiers.py script provides a command-line interface for training & evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I'll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should checkout or download nltk-trainer if you want to run the examples yourself.

NLTK Movie Reviews Corpus

To run the code, we need to make sure everything is setup for training. The most important thing is installing the NLTK data (and of course, you'll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download/install by running sudo python -m nltk.downloader movie_reviews. This command will ensure that the movie_reviews corpus is downloaded and/or located in an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.

Training a Naive Bayes Classifier

Now we can use train_classifier.py to replicate the results from the first article on text classification for sentiment analysis with a naive bayes classifier. The complete command is:

python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews

Here's an explanation of each option:

  • --instances files: this says that each file is treated as an individual instance, so that each feature set will contain word: True for each word in a file
  • --fraction 0.75: we'll use 75% of the the files in each category for training, and the remaining 25% of the files for testing
  • --show-most-informative 10: show the 10 most informative words
  • --no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier

If you cd into the nltk-trainer directory and the run the above command, your output should look like this:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
          finest = True              pos : neg    =     13.4 : 1.0
      astounding = True              pos : neg    =     11.0 : 1.0
          avoids = True              pos : neg    =     11.0 : 1.0
          inject = True              neg : pos    =     10.3 : 1.0
       strongest = True              pos : neg    =     10.3 : 1.0
       stupidity = True              neg : pos    =     10.2 : 1.0
           damon = True              pos : neg    =      9.8 : 1.0
            slip = True              pos : neg    =      9.7 : 1.0
          temple = True              pos : neg    =      9.7 : 1.0
          regard = True              pos : neg    =      9.7 : 1.0

If you refer to the article on measuring precision and recall of a classifier, you'll see that the numbers are slightly different. We also ended up with a different top 10 most informative features. This is due to train_classifier.py choosing slightly different training instances than the code in the previous articles. But the results are still basically the same.

Filtering Stopwords

Let's try it again, but this time we'll filter out stopwords (the default is no stopword filtering):

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846

As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with "wh" can be valuable features, and removing them is what causes the dip in accuracy.

High Information Feature Selection

There's two options that allow you to restrict which words are used by their information gain:

  • --max_feats 10000 will use the 10,000 most informative words, and discard the rest
  • --min_score 3 will use all words whose score is at least 3, and discard any words with a lower score

Here's the results of using --max_feats 10000:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059

The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670

Bigram Features

To include bigram features (pairs of words that occur in a sentence), use the --bigrams option. This is different than finding significant collocations, as all bigrams are considered using the nltk.util.bigrams function. Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%!:

  $ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
  2 labels: ['neg', 'pos']
  calculating word scores
  28075 words meet min_score and/or max_feats
  1500 training feats, 500 testing feats
  training a NaiveBayes classifier
  accuracy: 0.970000
  neg precision: 0.979592
  neg recall: 0.960000
  neg f-measure: 0.969697
  pos precision: 0.960784
  pos recall: 0.980000
  pos f-measure: 0.970297
  10 most informative features
  Most Informative Features
                finest = True              pos : neg    =     13.4 : 1.0
     ('matt', 'damon') = True              pos : neg    =     13.0 : 1.0
  ('a', 'wonderfully') = True              pos : neg    =     12.3 : 1.0
('everything', 'from') = True              pos : neg    =     12.3 : 1.0
      ('witty', 'and') = True              pos : neg    =     11.0 : 1.0
            astounding = True              pos : neg    =     11.0 : 1.0
                avoids = True              pos : neg    =     11.0 : 1.0
     ('most', 'films') = True              pos : neg    =     11.0 : 1.0
                inject = True              neg : pos    =     10.3 : 1.0
         ('show', 's') = True              pos : neg    =     10.3 : 1.0

Of course, the "Bourne bias" is still present with the ('matt', 'damon') bigram, but you can't argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:

bigrams don't matter much when using only high information words

In fact, bigrams can make a huge difference, but you can't restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what's significant and what isn't.

16Jun/1039

Text Classification for Sentiment Analysis – Eliminate Low Information Features

When your classification model has hundreds or thousands of features, as is the case for text categorization, it's a good bet that many (if not most) of the features are low information. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.

Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait till you see the results.

High Information Feature Selection

Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words:

evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0

Contrast this with the results from the first article on classification for sentiment analysis, where we use all the words as features:

evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

The accuracy is over 20% higher when using only the best 10000 words and pos precision has increased almost 24% while neg recall improved over 40%. These are huge increases with no reduction in pos recall and even a slight increase in neg precision. Here's the full code I used to get these results, with an explanation below.

import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	negcutoff = len(negfeats)*3/4
	poscutoff = len(posfeats)*3/4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
			refsets[label].add(i)
			observed = classifier.classify(feats)
			testsets[observed].add(i)

	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
	classifier.show_most_informative_features()

def word_feats(words):
	return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
	word_fd.inc(word.lower())
	label_word_fd['pos'].inc(word.lower())

for word in movie_reviews.words(categories=['neg']):
	word_fd.inc(word.lower())
	label_word_fd['neg'].inc(word.lower())

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
	pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
		(freq, pos_word_count), total_word_count)
	neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
		(freq, neg_word_count), total_word_count)
	word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
	return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	d = dict([(bigram, True) for bigram in bigrams])
	d.update(best_word_feats(words))
	return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)

Calculating Information Gain

To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word "magnificent" in a movie review is a strong indicator that the review is positive. That makes "magnificent" a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.

One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.

Signficant Bigrams

The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:

evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
       ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
          ('give', 'us') = True              neg : pos    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
    ('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

This shows that bigrams don't matter much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With the bigrams, you we get more uniform performance in each class. Without bigrams, precision and recall are less balanced. But the differences may depend on your particular data, so don't assume these observations are always true.

Improving Feature Selection

The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the single best things you can do to improve classifier performance. It's ok to throw away data if that data is not adding value. And it's especially recommended when that data is actually making your model worse.

24May/1038

Text Classification for Sentiment Analysis – Stopwords and Collocations

Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I'll be evaluating two modifications of the word_feats feature extraction method:

  1. filter out stopwords
  2. include bigram collocations

To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a Naive Bayes Classifier.

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	negcutoff = len(negfeats)*3/4
	poscutoff = len(posfeats)*3/4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
			refsets[label].add(i)
			observed = classifier.classify(feats)
			testsets[observed].add(i)

	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
	classifier.show_most_informative_features()

Baseline Bag of Words Feature Extraction

Here's the baseline feature extractor for bag of words feature selection.

def word_feats(words):
	return dict([(word, True) for word in words])

evaluate_classifier(word_feats)

The results are the same as in the previous articles, but I've included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

Stopword Filtering

Stopwords are words that are generally considered useless. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 english stopwords. Let's see what happens when we filter out these words.

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
	return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)

And the results for a stopword filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

Accuracy went down .2%, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.

Bigram Collocations

As mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", which is a negative expression that the bag of words model could interpret as positive since it sees "great" as a separate word.

To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such chi-square. These scoring functions measure the collocation correlation of 2 words, basically whether the bigram occurs about as frequently as each individual word.

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)

After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

Yes, you read that right, Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result

  • accuracy is up almost 9%
  • pos precision has increased over 10% with only 4% drop in recall
  • neg recall has increased over 21% with just under 4% drop in precision

So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it's significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.

   
%d bloggers like this: