Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?
Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the feats column for each algorithm. `bow` means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not. `int` means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with `bow` it would still get a 1). And `tfidf` means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
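To make those value types concrete, here's a tiny sketch (the words are made up for illustration) of what the `bow` and `int` feature dicts look like before SklearnClassifier converts them to numeric arrays; the `tfidf` values are derived later by the TfidfTransformer:

```python
# illustration only: feature dicts for the token list ['great', 'great', 'movie']
words = ['great', 'great', 'movie']

# bow: 1 if the word is present at all
bow_feats = dict((word, 1) for word in set(words))
# {'great': 1, 'movie': 1}

# int: raw word counts
int_feats = {}
for word in words:
    int_feats[word] = int_feats.get(word, 0) + 1
# {'great': 2, 'movie': 1}

# tfidf: the counts are re-weighted into floats by sklearn's TfidfTransformer,
# so there's no separate dict to build by hand
```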
All numbers were determined using nltk-trainer, specifically `python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75`. For `int` features, the option `--value-type int` was used, and for `tfidf` features, the options `--value-type float --tfidf` were used. This was with NLTK 2.0.3 and sklearn 0.12.1.
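If you'd rather use the SklearnClassifier directly instead of nltk-trainer, the setup looks roughly like the sketch below. It's a simplification, not nltk-trainer's actual code: the `bag_of_words` helper is made up for the example, and the 75/25 train/test split is omitted for brevity.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def bag_of_words(words):
    # bow features: every word present gets a 1
    return dict((word, 1) for word in words)

# (featureset, label) pairs from the movie_reviews corpus
labeled_feats = [(bag_of_words(movie_reviews.words(fileid)), label)
                 for label in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(label)]

# wrap a sklearn pipeline so the feature values get tf-idf weighting
# before they reach MultinomialNB
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('nb', MultinomialNB())])
classifier = SklearnClassifier(pipeline).train(labeled_feats)

print(classifier.classify(bag_of_words(['great', 'fun', 'movie'])))
```

SklearnClassifier handles the conversion from feature dicts to a sparse numeric matrix, so the pipeline only ever sees numbers.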
algorithm | feats | accuracy (%) | neg f-measure (%) | pos f-measure (%) |
---|---|---|---|---|
BernoulliNB | bow | 82.2 | 82.7 | 81.6 |
BernoulliNB | int | 82.2 | 82.7 | 81.6 |
BernoulliNB | tfidf | 82.2 | 82.7 | 81.6 |
GaussianNB | bow | 66.4 | 65.1 | 67.6 |
GaussianNB | int | 66.8 | 66.3 | 67.3 |
MultinomialNB | bow | 82.2 | 82.7 | 81.6 |
MultinomialNB | int | 81.2 | 81.5 | 80.1 |
MultinomialNB | tfidf | 81.6 | 83.0 | 80.0 |
LogisticRegression | bow | 85.6 | 85.8 | 85.4 |
LogisticRegression | int | 83.2 | 83.0 | 83.4 |
LogisticRegression | tfidf | 82.0 | 81.5 | 82.5 |
SVC | bow | 67.6 | 75.3 | 52.9 |
SVC | int | 67.8 | 71.7 | 62.6 |
SVC | tfidf | 50.2 | 0.8 | 66.7 |
LinearSVC | bow | 86.0 | 86.2 | 85.8 |
LinearSVC | int | 81.8 | 81.7 | 81.9 |
LinearSVC | tfidf | 85.8 | 85.5 | 86.0 |
NuSVC | bow | 85.0 | 85.5 | 84.5 |
NuSVC | int | 81.4 | 81.7 | 81.1 |
NuSVC | tfidf | 50.2 | 0.8 | 66.7 |
As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly, `int` and `tfidf` features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.
Bigrams
Below is a table showing the accuracy of the top 5 algorithms using just unigrams (the default, a.k.a. single words), and using unigrams + bigrams (pairs of words) with the option `--ngrams 1 2`. A sketch of this kind of feature extraction follows the table.
algorithm | unigrams | bigrams |
---|---|---|
BernoulliNB | 82.2 | 86.0 |
MultinomialNB | 82.2 | 86.0 |
LogisticRegression | 85.6 | 86.6 |
LinearSVC | 86.0 | 86.4 |
NuSVC | 85.0 | 85.2 |
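Here's a rough sketch of that kind of unigram + bigram feature extraction with plain NLTK. The helper name is made up, and nltk-trainer's own `--ngrams` handling may encode the features differently:

```python
from nltk.util import ngrams

def unigram_bigram_feats(words):
    # unigram features: each individual word
    feats = dict((word, 1) for word in words)
    # bigram features: each adjacent pair of words, keyed by the tuple
    for bigram in ngrams(words, 2):
        feats[bigram] = 1
    return feats

print(unigram_bigram_feats(['not', 'a', 'good', 'movie']))
# {'not': 1, 'a': 1, 'good': 1, 'movie': 1,
#  ('not', 'a'): 1, ('a', 'good'): 1, ('good', 'movie'): 1}
```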
Only BernoulliNB & MultinomialNB got a modest boost in accuracy, putting them on par with the rest of the algorithms. But we can do better than this using feature scoring.
Feature Scoring
As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option `--min_score SCORE` (and keeping the `--ngrams 1 2` option to get bigram features).
algorithm | score 1 | score 2 | score 3 |
---|---|---|---|
BernoulliNB | 62.8 | 97.2 | 95.8 |
MultinomialNB | 62.8 | 97.2 | 95.8 |
LogisticRegression | 90.4 | 91.6 | 91.4 |
LinearSVC | 89.8 | 91.4 | 90.2 |
NuSVC | 89.4 | 90.8 | 91.0 |
LogisticRegression, LinearSVC, and NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the BernoulliNB & MultinomialNB algorithms, which drop down significantly at `--min_score 1`, but then skyrocket up to 97% with `--min_score 2`. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
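To recap what that scoring looks like, below is a rough word-only sketch of chi-square feature scoring in the spirit of the earlier high information words article. It's simplified relative to what nltk-trainer does (bigram scoring is omitted), and it assumes a recent NLTK where FreqDist behaves like a Counter:

```python
from nltk.corpus import movie_reviews
from nltk.metrics import BigramAssocMeasures
from nltk.probability import ConditionalFreqDist, FreqDist

# count word frequencies overall and per label
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for label in movie_reviews.categories():
    for word in movie_reviews.words(categories=[label]):
        word_fd[word.lower()] += 1
        label_word_fd[label][word.lower()] += 1

pos_total = label_word_fd['pos'].N()
neg_total = label_word_fd['neg'].N()
total = pos_total + neg_total

# keep only words whose combined chi-square score meets the minimum
min_score = 2
high_info_words = set()

for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                           (freq, pos_total), total)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                           (freq, neg_total), total)
    if pos_score + neg_score >= min_score:
        high_info_words.add(word)
```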
Scikit-Learn
If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The scikit-learn classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.