Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?
Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the
feats column for each algorithm.
bow means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not.
int means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with
bow it would still get a 1). And
tfidf means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
All numbers were determined using nltk-trainer, specifically,
python train_classifier.py movie_reviews <span class="pre">--no-pickle</span> <span class="pre">--classifier</span> sklearn.ALGORITHM <span class="pre">--fraction</span> 0.75. For
int features, the option
<span class="pre">--value-type</span> int was used, and for
tfidf features, the options
<span class="pre">--value-type</span> float <span class="pre">--tfidf</span> were used. This was with NLTK 2.0.3 and sklearn 0.12.1.
|algorithm||feats||accuracy||neg f-measure||pos f-measure|
As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly,
tfidf features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.
Below is a table showing the accuracy of the top 5 algorithms using just
unigrams (the default, a.k.a single words), and using unigrams +
bigrams (pairs of words) with the option
<span class="pre">--ngrams</span> 1 2.
MultinomialNB got a modest boost in accuracy, putting them on-par with the rest of the algorithms. But we can do better than this using feature scoring.
As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option
<span class="pre">--min_score</span> SCORE (and keeping the
<span class="pre">--ngrams</span> 1 2 option to get bigram features).
|algorithm||score 1||score 2||score 3|
NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the
MultinomialNB algorithms, which drop down significantly at
<span class="pre">--min_score</span> 1, but then skyrocket up to 97% with
<span class="pre">--min_score</span> 2. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The Scikits classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.