Training Binary Text Classifiers with NLTK Trainer

NLTK-Trainer (available on GitHub and Bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifier.py script provides a command-line interface for training and evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I’ll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should check out or download nltk-trainer if you want to run the examples yourself.

NLTK Movie Reviews Corpus

To run the code, we need to make sure everything is set up for training. The most important thing is installing the NLTK data (and of course, you’ll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download and install by running sudo python -m nltk.downloader movie_reviews. This command ensures that the movie_reviews corpus is downloaded to, or already present in, an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.

Training a Naive Bayes Classifier

Now we can use train_classifier.py to replicate the results from the first article on text classification for sentiment analysis with a Naive Bayes classifier. The complete command is:

python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews

Here’s an explanation of each option:

  • --algorithm NaiveBayes: train a Naive Bayes classifier
  • --instances files: this says that each file is treated as an individual instance, so that each feature set will contain word: True for each word in a file
  • --fraction 0.75: we’ll use 75% of the files in each category for training, and the remaining 25% of the files for testing
  • --show-most-informative 10: show the 10 most informative words
  • --no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier
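
To make the --instances files and --fraction 0.75 options more concrete, here is a minimal sketch of the idea in plain Python. It is not the code from train_classifier.py; the function names and the toy documents are hypothetical, and it only assumes that the split is taken per category, as described above. With 1,000 files per category, the same arithmetic gives the 1500 training and 500 testing feature sets reported in the output.

```python
def bag_of_words(words):
    """Map each distinct word to True (the "word: True" feature set)."""
    return {word: True for word in words}


def split_instances(instances, fraction=0.75):
    """Use the first `fraction` of the instances for training, the rest for testing."""
    cutoff = int(len(instances) * fraction)
    return instances[:cutoff], instances[cutoff:]


# Hypothetical toy documents standing in for the movie_reviews files.
docs = {
    "pos": [["a", "fine", "film"], ["great", "fun"], ["witty", "and", "fun"], ["finest", "cast"]],
    "neg": [["dull", "plot"], ["a", "mess"], ["stupidity", "abounds"], ["boring", "film"]],
}

train, test = [], []
for label, files in docs.items():
    # One feature set per file, paired with the file's category label.
    feats = [(bag_of_words(words), label) for words in files]
    label_train, label_test = split_instances(feats, fraction=0.75)
    train.extend(label_train)
    test.extend(label_test)

print(len(train), "training feats,", len(test), "testing feats")
```

Splitting within each category keeps the label balance the same in the training and testing sets.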

If you cd into the nltk-trainer directory and then run the above command, your output should look like this:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
          finest = True              pos : neg    =     13.4 : 1.0
      astounding = True              pos : neg    =     11.0 : 1.0
          avoids = True              pos : neg    =     11.0 : 1.0
          inject = True              neg : pos    =     10.3 : 1.0
       strongest = True              pos : neg    =     10.3 : 1.0
       stupidity = True              neg : pos    =     10.2 : 1.0
           damon = True              pos : neg    =      9.8 : 1.0
            slip = True              pos : neg    =      9.7 : 1.0
          temple = True              pos : neg    =      9.7 : 1.0
          regard = True              pos : neg    =      9.7 : 1.0

If you refer to the article on measuring precision and recall of a classifier, you’ll see that the numbers are slightly different. We also ended up with a different top 10 most informative features. This is due to train_classifier.py choosing slightly different training instances than the code in the previous articles. But the results are still basically the same.
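
As a quick sanity check on the Naive Bayes output above, the f-measure lines are consistent with the balanced harmonic mean of precision and recall. The sketch below assumes that standard definition (the f_measure helper is my own, not part of nltk-trainer) and plugs in the precision and recall values reported for each label.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the Naive Bayes output above.
neg_f = f_measure(0.952000, 0.476000)
pos_f = f_measure(0.650667, 0.976000)

print("neg f-measure: %f" % neg_f)
print("pos f-measure: %f" % pos_f)
```

The neg label shows why f-measure is useful here: precision is very high, but recall under 50% drags the combined score down.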

Filtering Stopwords

Let’s try it again, but this time we’ll filter out stopwords (the default is no stopword filtering):

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846

As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with “wh” can be valuable features, and removing them is what causes the dip in accuracy.
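
Here is a rough sketch of what stopword filtering does to a feature set. The stopword set below is a tiny hypothetical sample chosen for illustration (the real list is available from nltk.corpus.stopwords.words('english')), and the function name is my own rather than anything defined by nltk-trainer.

```python
# Hypothetical, deliberately tiny stopword set; note the "wh" words.
STOPWORDS = {"the", "a", "and", "of", "what", "which", "who"}


def bag_of_non_stopwords(words, stopwords=STOPWORDS):
    """Build a word: True feature set, skipping any word in `stopwords`."""
    return {word: True for word in words if word.lower() not in stopwords}


feats = bag_of_non_stopwords(["What", "a", "mess", "of", "a", "film"])
print(sorted(feats))
```

Words such as "what" never reach the classifier once they are filtered, so any signal they carried is lost.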

High Information Feature Selection

There are two options that let you restrict which words are used, based on their information gain:

  • --max_feats 10000 will use the 10,000 most informative words, and discard the rest
  • --min_score 3 will use all words whose score is at least 3, and discard any words with a lower score
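
The difference between the two options can be sketched as follows, assuming each word already has a numeric score. The select_words function and the scores are made up for illustration; this is not how train_classifier.py computes or stores its word scores.

```python
def select_words(word_scores, max_feats=None, min_score=None):
    """Return the set of words kept by a score threshold and/or a top-N cutoff."""
    # Keep words whose score is at least min_score (all words if no threshold).
    words = [w for w, s in word_scores.items() if min_score is None or s >= min_score]
    # Rank the survivors from highest to lowest score.
    words.sort(key=lambda w: word_scores[w], reverse=True)
    # Optionally keep only the max_feats highest-scoring words.
    if max_feats is not None:
        words = words[:max_feats]
    return set(words)


# Hypothetical scores, invented for this example.
scores = {"finest": 25.0, "stupidity": 18.5, "plot": 3.0, "movie": 0.4, "the": 0.1}

top_two = select_words(scores, max_feats=2)
at_least_three = select_words(scores, min_score=3)
print(sorted(top_two))
print(sorted(at_least_three))
```

A top-N cutoff fixes the vocabulary size, while a score threshold lets the data decide how many words are kept.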

Here are the results of using --max_feats 10000:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059

The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670

Bigram Features

To include bigram features (pairs of adjacent words), use the --bigrams option. This is different from finding significant collocations, because every bigram produced by the nltk.util.bigrams function is considered. Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%:

  $ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
  2 labels: ['neg', 'pos']
  calculating word scores
  28075 words meet min_score and/or max_feats
  1500 training feats, 500 testing feats
  training a NaiveBayes classifier
  accuracy: 0.970000
  neg precision: 0.979592
  neg recall: 0.960000
  neg f-measure: 0.969697
  pos precision: 0.960784
  pos recall: 0.980000
  pos f-measure: 0.970297
  10 most informative features
  Most Informative Features
                finest = True              pos : neg    =     13.4 : 1.0
     ('matt', 'damon') = True              pos : neg    =     13.0 : 1.0
  ('a', 'wonderfully') = True              pos : neg    =     12.3 : 1.0
('everything', 'from') = True              pos : neg    =     12.3 : 1.0
      ('witty', 'and') = True              pos : neg    =     11.0 : 1.0
            astounding = True              pos : neg    =     11.0 : 1.0
                avoids = True              pos : neg    =     11.0 : 1.0
     ('most', 'films') = True              pos : neg    =     11.0 : 1.0
                inject = True              neg : pos    =     10.3 : 1.0
         ('show', 's') = True              pos : neg    =     10.3 : 1.0

Of course, the “Bourne bias” is still present with the ('matt', 'damon') bigram, but you can’t argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:

bigrams don’t matter much when using only high information words

In fact, bigrams can make a huge difference, but you can’t restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what’s significant and what isn’t.
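
To illustrate what “include all of them” means, here is a minimal sketch that adds every adjacent word pair to a feature set alongside the single words. It uses zip to produce the same adjacent pairs that nltk.util.bigrams yields; the function name and the sample sentence are hypothetical, and score-based filtering would happen as a separate step.

```python
def bag_of_words_and_bigrams(words):
    """Build a feature set with word: True plus (word1, word2): True for every adjacent pair."""
    feats = {word: True for word in words}
    # zip(words, words[1:]) pairs each word with the one that follows it.
    for pair in zip(words, words[1:]):
        feats[pair] = True
    return feats


feats = bag_of_words_and_bigrams(["matt", "damon", "is", "witty", "and", "fun"])
print(len(feats), "features")
print(("matt", "damon") in feats, ("witty", "and") in feats)
```

Every pair starts out as a candidate feature, and pruning is left to the scoring step instead of a fixed-size collocation list.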