Text Classification for Sentiment Analysis – NLTK + Scikit-Learn

Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?

Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the feats column for each algorithm. bow means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not. int means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with bow it would still get a 1). And tfidf means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
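
To make the distinction concrete, here is a minimal sketch (not nltk-trainer's actual code) of how a single tokenized review could be represented under each scheme, using sklearn's DictVectorizer and TfidfTransformer; the variable names are purely illustrative.

from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

words = ['a', 'great', 'movie', 'a', 'great', 'cast']

bow_feats = {word: 1 for word in set(words)}  # bow: 1 if present, 0 otherwise
int_feats = dict(Counter(words))              # int: raw word counts

# tfidf: start from the count features, then reweight them with TfidfTransformer
vec = DictVectorizer()
counts = vec.fit_transform([int_feats])       # sparse count matrix
tfidf_feats = TfidfTransformer().fit_transform(counts)

print(bow_feats)              # every word present gets a 1
print(int_feats)              # 'a' and 'great' get 2, 'movie' and 'cast' get 1
print(tfidf_feats.toarray())  # tf-idf weighted floats, one column per word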

All numbers were determined using nltk-trainer, specifically, python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75. For int features, the option --value-type int was used, and for tfidf features, the options --value-type float --tfidf were used. This was with NLTK 2.0.3 and sklearn 0.12.1.

algorithm           feats  accuracy  neg f-measure  pos f-measure
BernoulliNB         bow    82.2      82.7           81.6
BernoulliNB         int    82.2      82.7           81.6
BernoulliNB         tfidf  82.2      82.7           81.6
GaussianNB          bow    66.4      65.1           67.6
GaussianNB          int    66.8      66.3           67.3
MultinomialNB       bow    82.2      82.7           81.6
MultinomialNB       int    81.2      81.5           80.1
MultinomialNB       tfidf  81.6      83.0           80.0
LogisticRegression  bow    85.6      85.8           85.4
LogisticRegression  int    83.2      83.0           83.4
LogisticRegression  tfidf  82.0      81.5           82.5
SVC                 bow    67.6      75.3           52.9
SVC                 int    67.8      71.7           62.6
SVC                 tfidf  50.2      0.8            66.7
LinearSVC           bow    86.0      86.2           85.8
LinearSVC           int    81.8      81.7           81.9
LinearSVC           tfidf  85.8      85.5           86.0
NuSVC               bow    85.0      85.5           84.5
NuSVC               int    81.4      81.7           81.1
NuSVC               tfidf  50.2      0.8            66.7

As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly, int and tfidf features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.

Bigrams

Below is a table showing the accuracy of the top 5 algorithms using just unigrams (the default, a.k.a. single words), and using unigrams + bigrams (pairs of words) with the option --ngrams 1 2.

algorithm           unigrams  unigrams + bigrams
BernoulliNB         82.2      86.0
MultinomialNB       82.2      86.0
LogisticRegression  85.6      86.6
LinearSVC           86.0      86.4
NuSVC               85.0      85.2

Only BernoulliNB & MultinomialNB got a modest boost in accuracy, putting them on par with the rest of the algorithms. But we can do better than this using feature scoring.
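
For reference, here is a rough sketch of what unigram + bigram feature extraction looks like. It is not nltk-trainer's exact implementation, just an illustration of what --ngrams 1 2 adds to the feature set; the helper name is made up.

from nltk import bigrams

def unigram_bigram_feats(words):
    # unigram presence features, as in the bow column above
    feats = {word: 1 for word in words}
    # add a presence feature for each adjacent pair of words
    feats.update({bigram: 1 for bigram in bigrams(words)})
    return feats

print(unigram_bigram_feats(['not', 'a', 'great', 'movie']))
# includes ('not', 'a'), ('a', 'great'), ('great', 'movie') alongside the single words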

Feature Scoring

As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option --min_score SCORE (and keeping the --ngrams 1 2 option to get bigram features).

algorithm           min_score 1  min_score 2  min_score 3
BernoulliNB         62.8         97.2         95.8
MultinomialNB       62.8         97.2         95.8
LogisticRegression  90.4         91.6         91.4
LinearSVC           89.8         91.4         90.2
NuSVC               89.4         90.8         91.0

LogisticRegression, LinearSVC, and NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the BernoulliNB & MultinomialNB algorithms, which drop down significantly at --min_score 1, but then skyrocket up to 97% with --min_score 2. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
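
For anyone who hasn't read the earlier article, the word scoring works roughly like the sketch below: each word gets a chi-square score per label based on how unevenly it is distributed between pos and neg reviews, and words that don't clear the threshold are dropped. The exact thresholding that --min_score applies in nltk-trainer may differ slightly from this approximation.

from nltk.corpus import movie_reviews
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for label in movie_reviews.categories():
    for word in movie_reviews.words(categories=[label]):
        word_fd[word] += 1               # on NLTK 2.x, use word_fd.inc(word)
        label_word_fd[label][word] += 1

total_words = word_fd.N()
min_score = 2
high_info_words = set()

for word, freq in word_fd.items():
    # chi-square score of the word within each label vs its overall frequency
    scores = [BigramAssocMeasures.chi_sq(label_word_fd[label][word],
                                         (freq, label_word_fd[label].N()),
                                         total_words)
              for label in movie_reviews.categories()]
    if max(scores) >= min_score:         # keep only high information words
        high_info_words.add(word)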

Scikit-Learn

If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The scikit-learn classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.
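
If you want to skip nltk-trainer and use SklearnClassifier directly, the basic pattern looks roughly like this (the bag_of_words helper and the choice of LinearSVC are just for illustration, and there is no train/test split here):

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.corpus import movie_reviews
from sklearn.svm import LinearSVC

def bag_of_words(words):
    return {word: True for word in words}

# label every review in the corpus with its category (pos or neg)
labeled_feats = [(bag_of_words(movie_reviews.words(fileid)), label)
                 for label in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(label)]

# SklearnClassifier converts the feature dicts to sparse arrays internally
classifier = SklearnClassifier(LinearSVC())
classifier.train(labeled_feats)

print(classifier.classify(bag_of_words(['a', 'terrible', 'waste', 'of', 'time'])))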

  • Paul

    Hi Jacob,

    Firstly, thank you for all the great blog posts relating to NLTK. Being really new to this, they have gotten me up the learning curve much quicker than I imagined.

    I do have a question regarding running the scikit-learn classifier posted above. When trying to run the command:

    train_classifier.py movie_reviews --no-pickle --classifier sklearn.LinearSVC --fraction 0.75

    it returns the following error:

    train_classifier.py: error: argument --classifier/--algorithm: invalid choice: 'sklearn.LinearSVC'

    I've made a couple of mods to the train_classifier.py file by adding 'import sklearn' and 'import scipy', more out of curiosity than knowing the right way to fix the error.

    Could you point me in the right direction as to where I might be going wrong? I believe I have all the correct modules installed.

    TIA
    Paul

  • Jacob Perkins (http://streamhacker.com/)

    If sklearn.LinearSVC is an invalid choice, it means you probably need to upgrade NLTK and/or ensure that scikit-learn is fully installed. Everything should work if you can do the following without errors:
    >>> from nltk.classify import scikitlearn

  • Paul

    Hi Jacob. That didn’t work. Thanks for the pointer, I’ll go and hunt down the problem.

  • kpa

    Hi Jacob,

    I was just wondering if you knew whether there was a difference between the NaiveBayes algorithm you use in your other sentiment analysis post, and the MultinomialNB that scikit-learn proposes.

    More specifically – is the MultinomialNB with the bow binary features really multinomial?

    Cheers

  • Jacob Perkins (http://streamhacker.com/)

    The NLTK NaiveBayes and sklearn MultinomialNB usually have comparable accuracy and speed. The big difference is in memory usage, with sklearn being much more efficient.

    MultinomialNB with binary features is not really multinomial, but in my experience that doesn’t seem to matter. sklearn does have other Naive Bayes algorithms available, but I’ve found MultinomialNB performs best.

  • kpa

    Thank you for the quick response Jacob!

    I've run the above experiments with the NaiveBayes classifier, using the same parameters on the movie_reviews data, and I get the following results:

    NaiveBayes bow (equivalent to Bernoulli model): accuracy = 72.6

    NaiveBayes int (equivalent to Multinomial model): accuracy = 73.0

    The Multinomial and Bernoulli results are the same as the results you get.

    Now, NaiveBayes with bow features should be equivalent to the BernoulliNB classifier, yet there is almost a 10% difference in accuracy. Similarly, with int features it should be equivalent to MultinomialNB, yet again there is a 10% difference in accuracy.

    Do you know what could lead to this discrepancy?

    Cheers

  • Jacob Perkins (http://streamhacker.com/)

    The algorithms are implemented differently, with different data structures and lower level libraries. You would have to carefully examine the code for both to figure out what causes the difference.