NLTK-Trainer (available on github and bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifier.py script provides a command-line interface for training and evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I’ll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should check out or download nltk-trainer if you want to run the examples yourself.
NLTK Movie Reviews Corpus
To run the code, we need to make sure everything is set up for training. The most important thing is installing the NLTK data (and of course, you’ll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download/install by running sudo python -m nltk.downloader movie_reviews. This command will ensure that the movie_reviews corpus is downloaded and/or located in an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.
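If you prefer to do this from Python instead of the command line, a minimal sketch (assuming a default NLTK data path) looks like this:

import nltk

# download the movie_reviews corpus into an NLTK data directory
nltk.download('movie_reviews')

# verify that the corpus is now accessible
from nltk.corpus import movie_reviews
print(movie_reviews.categories())   # ['neg', 'pos']
print(len(movie_reviews.fileids())) # 2000 review files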
Training a Naive Bayes Classifier
Now we can use train_classifier.py
to replicate the results from the first article on text classification for sentiment analysis with a naive bayes classifier. The complete command is:
python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
Here’s an explanation of each option:
--instances files: this says that each file is treated as an individual instance, so that each feature set will contain word: True for each word in the file (see the sketch after this list)
--fraction 0.75: we’ll use 75% of the files in each category for training, and the remaining 25% of the files for testing
--show-most-informative 10: show the 10 most informative words
--no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier
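To make the --instances files option concrete, here’s a minimal sketch of the kind of word: True feature set it describes, built for a single movie review file (the bag_of_words helper is mine, not nltk-trainer’s):

from nltk.corpus import movie_reviews

def bag_of_words(words):
    # every word in the file becomes a feature with the value True
    return dict((word, True) for word in words)

# one instance per file: the file's words become the feature set, its category the label
fileid = movie_reviews.fileids('pos')[0]
featureset = bag_of_words(movie_reviews.words(fileid))
label = 'pos'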
If you cd into the nltk-trainer directory and then run the above command, your output should look like this:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
finest = True pos : neg = 13.4 : 1.0
astounding = True pos : neg = 11.0 : 1.0
avoids = True pos : neg = 11.0 : 1.0
inject = True neg : pos = 10.3 : 1.0
strongest = True pos : neg = 10.3 : 1.0
stupidity = True neg : pos = 10.2 : 1.0
damon = True pos : neg = 9.8 : 1.0
slip = True pos : neg = 9.7 : 1.0
temple = True pos : neg = 9.7 : 1.0
regard = True pos : neg = 9.7 : 1.0
If you refer to the article on measuring precision and recall of a classifier, you’ll see that the numbers are slightly different, and the top 10 most informative features differ as well. This is because train_classifier.py chooses slightly different training instances than the code in the previous articles, but the results are still basically the same.
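For reference, here’s a minimal sketch of a 75/25 per-category file split that produces the same 1500/500 counts shown above. It simply takes training files from the front of each category’s file list; train_classifier.py may well select its instances differently, which is enough to shift the numbers:

from nltk.corpus import movie_reviews

def split_fileids(category, fraction=0.75):
    fileids = movie_reviews.fileids(category)
    cutoff = int(len(fileids) * fraction)
    # first 75% of files for training, the rest for testing
    return fileids[:cutoff], fileids[cutoff:]

train_pos, test_pos = split_fileids('pos')
train_neg, test_neg = split_fileids('neg')
print(len(train_pos) + len(train_neg), 'training files')  # 1500
print(len(test_pos) + len(test_neg), 'testing files')     # 500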
Filtering Stopwords
Let’s try it again, but this time we’ll filter out stopwords (the default is no stopword filtering):
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846
As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with “wh” can be valuable features, and removing them is what causes the dip in accuracy.
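Under the hood, filtering stopwords amounts to dropping any word that appears in NLTK’s English stopword list before building the feature set. A minimal sketch (not the script’s actual code):

from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))

def bag_of_non_stopwords(words, badwords=english_stopwords):
    # keep only the words that are not in the stopword list
    return dict((word, True) for word in words if word not in badwords)

Note that stopwords.words('english') includes words like "who", "which", and "why", which is consistent with Pierre’s point about losing valuable "wh" features.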
High Information Feature Selection
There are two options that allow you to restrict which words are used, based on their information gain (a scoring sketch follows the list):
--max_feats 10000 will use the 10,000 most informative words, and discard the rest
--min_score 3 will use all words whose score is at least 3, and discard any words with a lower score
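Both options depend on giving each word an information score. Here’s a minimal sketch of chi-square scoring in the spirit of the earlier article on eliminating low information features; nltk-trainer’s own scoring code may differ in its details:

from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import ConditionalFreqDist, FreqDist

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# count how often each word occurs overall and within each label
for label in movie_reviews.categories():
    for word in movie_reviews.words(categories=[label]):
        word_fd[word] += 1
        label_word_fd[label][word] += 1

total_count = word_fd.N()
word_scores = {}

for word, freq in word_fd.items():
    # sum the word's chi-square association with each label
    word_scores[word] = sum(
        BigramAssocMeasures.chi_sq(label_word_fd[label][word],
                                   (freq, label_word_fd[label].N()),
                                   total_count)
        for label in movie_reviews.categories())

# --min_score 3: keep every word scoring at least 3
high_info_words = set(w for w, s in word_scores.items() if s >= 3)
# --max_feats 10000: keep only the 10,000 best-scoring words
best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:10000])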
Here are the results of using --max_feats 10000:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059
The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670
Bigram Features
To include bigram features (pairs of words that occur in a sentence), use the --bigrams option. This is different from finding significant collocations: all bigrams are included, using the nltk.util.bigrams function.
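A minimal sketch of a feature extractor along those lines, combining single words with every bigram from nltk.util.bigrams (the helper name is mine, not nltk-trainer’s):

from nltk.util import bigrams

def bag_of_words_and_bigrams(words):
    # single words as features
    feats = dict((word, True) for word in words)
    # every bigram also becomes a feature, keyed by the word pair tuple
    feats.update((bigram, True) for bigram in bigrams(words))
    return feats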
Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
28075 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.970000
neg precision: 0.979592
neg recall: 0.960000
neg f-measure: 0.969697
pos precision: 0.960784
pos recall: 0.980000
pos f-measure: 0.970297
10 most informative features
Most Informative Features
finest = True pos : neg = 13.4 : 1.0
('matt', 'damon') = True pos : neg = 13.0 : 1.0
('a', 'wonderfully') = True pos : neg = 12.3 : 1.0
('everything', 'from') = True pos : neg = 12.3 : 1.0
('witty', 'and') = True pos : neg = 11.0 : 1.0
astounding = True pos : neg = 11.0 : 1.0
avoids = True pos : neg = 11.0 : 1.0
('most', 'films') = True pos : neg = 11.0 : 1.0
inject = True neg : pos = 10.3 : 1.0
('show', 's') = True pos : neg = 10.3 : 1.0
Of course, the “Bourne bias” is still present with the ('matt', 'damon') bigram, but you can’t argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:
bigrams don’t matter much when using only high information words
In fact, bigrams can make a huge difference, but you can’t restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what’s significant and what isn’t.
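To make the contrast concrete, here’s a minimal sketch of the two approaches: restricting bigram features to the 200 most significant collocations (as in the earlier articles) versus including every bigram and leaving it to the scoring step to keep the informative ones. The helper names are illustrative, not nltk-trainer’s:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.util import bigrams

def top_collocation_feats(words, n=200):
    # earlier approach: only the n most significant collocations become features
    finder = BigramCollocationFinder.from_words(words)
    return dict((bigram, True) for bigram in finder.nbest(BigramAssocMeasures.chi_sq, n))

def all_bigram_feats(words):
    # approach used here: include every bigram, then let feature scoring filter them
    return dict((bigram, True) for bigram in bigrams(words))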