NLTK-Trainer (available on GitHub and Bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifier.py script provides a command-line interface for training and evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I’ll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should check out or download nltk-trainer if you want to run the examples yourself.
NLTK Movie Reviews Corpus
To run the code, we need to make sure everything is set up for training. The most important thing is installing the NLTK data (and of course, you’ll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download/install by running sudo python -m nltk.downloader movie_reviews. This command will ensure that the movie_reviews corpus is downloaded and/or located in an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.
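If you prefer to do the download from inside Python, or just want to confirm the corpus is available, something like the following should work (a minimal sketch using the standard NLTK downloader and corpus reader):

import nltk

# download the movie_reviews corpus into an NLTK data directory
# (a no-op if it is already installed)
nltk.download('movie_reviews')

# sanity check: the corpus reader should report the two sentiment categories
from nltk.corpus import movie_reviews
print(movie_reviews.categories())    # ['neg', 'pos']
print(len(movie_reviews.fileids()))  # 2000 files, 1000 per category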
Training a Naive Bayes Classifier
Now we can use train_classifier.py to replicate the results from the first article on text classification for sentiment analysis with a naive bayes classifier. The complete command is:
python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
Here’s an explanation of each option:
--instances files: each file is treated as an individual instance, so each feature set will contain word: True for every word in the file (see the sketch after this list)
--fraction 0.75: we’ll use 75% of the files in each category for training, and the remaining 25% of the files for testing
--show-most-informative 10: show the 10 most informative words
--no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier
If you cd into the nltk-trainer directory and then run the above command, your output should look like this:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
finest = True pos : neg = 13.4 : 1.0
astounding = True pos : neg = 11.0 : 1.0
avoids = True pos : neg = 11.0 : 1.0
inject = True neg : pos = 10.3 : 1.0
strongest = True pos : neg = 10.3 : 1.0
stupidity = True neg : pos = 10.2 : 1.0
damon = True pos : neg = 9.8 : 1.0
slip = True pos : neg = 9.7 : 1.0
temple = True pos : neg = 9.7 : 1.0
regard = True pos : neg = 9.7 : 1.0
If you refer to the article on measuring precision and recall of a classifier, you’ll see that the numbers are slightly different. We also ended up with a different top 10 most informative features. This is due to train_classifier.py choosing slightly different training instances than the code in the previous articles. But the results are still basically the same.
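For reference, here is roughly how numbers like these can be produced with plain NLTK, along the lines of the earlier precision and recall article. The exact figures will depend on which files end up in the training and testing splits, and this is a sketch, not the nltk-trainer code itself:

import collections
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import precision, recall, f_measure

def bag_of_words(words):
    # same bag-of-words feature extractor as in the sketch above
    return dict((word, True) for word in words)

def label_instances(label):
    return [(bag_of_words(movie_reviews.words(fileid)), label)
            for fileid in movie_reviews.fileids(label)]

neg_instances = label_instances('neg')
pos_instances = label_instances('pos')
neg_cutoff = int(len(neg_instances) * 0.75)
pos_cutoff = int(len(pos_instances) * 0.75)

train_feats = neg_instances[:neg_cutoff] + pos_instances[:pos_cutoff]
test_feats = neg_instances[neg_cutoff:] + pos_instances[pos_cutoff:]

classifier = NaiveBayesClassifier.train(train_feats)

# collect reference and classified sets to compute per-label precision/recall
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_feats):
    refsets[label].add(i)
    testsets[classifier.classify(feats)].add(i)

print('neg precision:', precision(refsets['neg'], testsets['neg']))
print('neg recall:', recall(refsets['neg'], testsets['neg']))
print('pos f-measure:', f_measure(refsets['pos'], testsets['pos']))
classifier.show_most_informative_features(10)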
Filtering Stopwords
Let’s try it again, but this time we’ll filter out stopwords (the default is no stopword filtering):
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846
As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with “wh” can be valuable features, and removing them is what causes the dip in accuracy.
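Filtering stopwords before building the feature sets is simple with NLTK’s own stopwords corpus, and it also shows why Pierre’s point matters: the “wh” words are on the English list. Here’s a sketch of the filtering, assuming the stopwords corpus is installed (it is not the exact nltk-trainer code):

from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))

def bag_of_non_stopwords(words, stopset=english_stops):
    # same bag-of-words features as before, minus anything on the stopword list
    return dict((word, True) for word in words if word.lower() not in stopset)

# "wh" words are on the list, so they never make it into the feature sets
print('which' in english_stops, 'what' in english_stops)  # True True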
High Information Feature Selection
There are two options that allow you to restrict which words are used, based on their information gain:
--max_feats 10000: use the 10,000 most informative words, and discard the rest
--min_score 3: use all words whose score is at least 3, and discard any words with a lower score (see the scoring sketch after this list)
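Word scores of this kind can be computed with a chi-square test on word frequencies per label, much like the earlier article on eliminating low information features. Here’s a rough sketch of the idea (not the exact nltk-trainer code):

from nltk.corpus import movie_reviews
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

# count how often each word occurs overall and per label
for label in movie_reviews.categories():
    for word in movie_reviews.words(categories=[label]):
        word_fd[word.lower()] += 1
        label_word_fd[label][word.lower()] += 1

pos_count = label_word_fd['pos'].N()
neg_count = label_word_fd['neg'].N()
total_count = pos_count + neg_count

# score each word by how strongly it is associated with either label
word_scores = {}
for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(
        label_word_fd['pos'][word], (freq, pos_count), total_count)
    neg_score = BigramAssocMeasures.chi_sq(
        label_word_fd['neg'][word], (freq, neg_count), total_count)
    word_scores[word] = pos_score + neg_score

# keep either the N best words (max_feats) or every word above a cutoff (min_score)
best_10000 = sorted(word_scores, key=word_scores.get, reverse=True)[:10000]
min_score_3 = [w for w, s in word_scores.items() if s >= 3]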
Here are the results of using --max_feats 10000:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059
The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670
Bigram Features
To include bigram features (pairs of words that occur in a sentence), use the --bigrams option. This is different than finding significant collocations, as all bigrams are considered using the nltk.util.bigrams function. Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%:
$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
28075 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.970000
neg precision: 0.979592
neg recall: 0.960000
neg f-measure: 0.969697
pos precision: 0.960784
pos recall: 0.980000
pos f-measure: 0.970297
10 most informative features
Most Informative Features
finest = True pos : neg = 13.4 : 1.0
('matt', 'damon') = True pos : neg = 13.0 : 1.0
('a', 'wonderfully') = True pos : neg = 12.3 : 1.0
('everything', 'from') = True pos : neg = 12.3 : 1.0
('witty', 'and') = True pos : neg = 11.0 : 1.0
astounding = True pos : neg = 11.0 : 1.0
avoids = True pos : neg = 11.0 : 1.0
('most', 'films') = True pos : neg = 11.0 : 1.0
inject = True neg : pos = 10.3 : 1.0
('show', 's') = True pos : neg = 10.3 : 1.0
Of course, the “Bourne bias” is still present with the ('matt', 'damon') bigram, but you can’t argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:
bigrams don’t matter much when using only high information words
In fact, bigrams can make a huge difference, but you can’t restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what’s significant and what isn’t.
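For the curious, here is roughly what a combined word-plus-bigram feature extractor looks like, using the nltk.util.bigrams function mentioned above (a sketch of the idea, not the exact nltk-trainer code):

from nltk.corpus import movie_reviews
from nltk.util import bigrams

def bag_of_words_and_bigrams(words):
    # unigram features plus every adjacent word pair as a feature
    feats = dict((word, True) for word in words)
    feats.update((bigram, True) for bigram in bigrams(words))
    return feats

fileid = movie_reviews.fileids('pos')[0]
featureset = bag_of_words_and_bigrams(movie_reviews.words(fileid))
# bigram features show up as tuples, e.g. ('matt', 'damon')

The bigram tuples then go through the same scoring as single words, which is why the output above reports 28075 features meeting min_score: the score cutoff, not a fixed collocation count, decides which bigrams survive.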