Text Classification for Sentiment Analysis – NLTK + Scikit-Learn

Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?
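As a quick illustration, training with SklearnClassifier looks roughly like this (a sketch assuming NLTK >= 2.0.1 with scikit-learn installed; the tiny training set is made up for the example):

```python
from nltk.classify import SklearnClassifier
from sklearn.svm import LinearSVC

# NLTK-style labeled featuresets: (feature_dict, label) pairs
train_feats = [
    ({'great': True, 'movie': True}, 'pos'),
    ({'great': True, 'acting': True}, 'pos'),
    ({'boring': True, 'movie': True}, 'neg'),
    ({'boring': True, 'plot': True}, 'neg'),
]

# SklearnClassifier converts the feature dicts to sparse arrays internally
classifier = SklearnClassifier(LinearSVC()).train(train_feats)
print(classifier.classify({'great': True, 'plot': True}))
```

The wrapped estimator can be any scikit-learn classifier, which is what makes the comparisons below so easy to run.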

Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the feats column for each algorithm. bow means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not. int means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with bow it would still get a 1). And tfidf means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
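To make the bow vs. int distinction concrete, here is a minimal pure-Python sketch of the two feature extraction styles (the function names are my own, not nltk-trainer's actual code):

```python
def bag_of_words(words):
    # bow: a word gets 1 if present, no matter how often it occurs
    return {word: 1 for word in words}

def word_counts(words):
    # int: a word's value is its number of occurrences
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

words = ['great', 'movie', 'great', 'acting']
print(bag_of_words(words))  # {'great': 1, 'movie': 1, 'acting': 1}
print(word_counts(words))   # {'great': 2, 'movie': 1, 'acting': 1}
```

The tfidf case then takes counts like those from word_counts and rescales them with scikit-learn's TfidfTransformer.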

All numbers were determined using nltk-trainer, specifically, python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75. For int features, the option --value-type int was used, and for tfidf features, the options --value-type float --tfidf were used. This was with NLTK 2.0.3 and sklearn 0.12.1.

algorithm feats accuracy neg f-measure pos f-measure
BernoulliNB bow 82.2 82.7 81.6
BernoulliNB int 82.2 82.7 81.6
BernoulliNB tfidf 82.2 82.7 81.6
GaussianNB bow 66.4 65.1 67.6
GaussianNB int 66.8 66.3 67.3
MultinomialNB bow 82.2 82.7 81.6
MultinomialNB int 81.2 81.5 80.1
MultinomialNB tfidf 81.6 83.0 80.0
LogisticRegression bow 85.6 85.8 85.4
LogisticRegression int 83.2 83.0 83.4
LogisticRegression tfidf 82.0 81.5 82.5
SVC bow 67.6 75.3 52.9
SVC int 67.8 71.7 62.6
SVC tfidf 50.2 0.8 66.7
LinearSVC bow 86.0 86.2 85.8
LinearSVC int 81.8 81.7 81.9
LinearSVC tfidf 85.8 85.5 86.0
NuSVC bow 85.0 85.5 84.5
NuSVC int 81.4 81.7 81.1
NuSVC tfidf 50.2 0.8 66.7

As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly, int and tfidf features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.


Below is a table showing the accuracy of the top 5 algorithms using just unigrams (the default, a.k.a. single words), and using unigrams + bigrams (pairs of words) with the option --ngrams 1 2.

algorithm unigrams bigrams
BernoulliNB 82.2 86.0
MultinomialNB 82.2 86.0
LogisticRegression 85.6 86.6
LinearSVC 86.0 86.4
NuSVC 85.0 85.2

Only BernoulliNB & MultinomialNB got a modest boost in accuracy, putting them on par with the rest of the algorithms. But we can do better than this using feature scoring.
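The unigrams + bigrams extraction corresponds roughly to this sketch (again, the function name is illustrative, not nltk-trainer's actual code):

```python
def bag_of_ngrams(words, ngrams=(1, 2)):
    # emit a presence feature for every unigram and bigram
    feats = {}
    for n in ngrams:
        for i in range(len(words) - n + 1):
            feats[' '.join(words[i:i + n])] = 1
    return feats

print(bag_of_ngrams(['not', 'a', 'good', 'movie']))
# {'not': 1, 'a': 1, 'good': 1, 'movie': 1, 'not a': 1, 'a good': 1, 'good movie': 1}
```

Bigrams let the classifier see negations and other word pairs, like 'not good', that unigrams alone would miss.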

Feature Scoring

As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option --min_score SCORE (and keeping the --ngrams 1 2 option to get bigram features).

algorithm score 1 score 2 score 3
BernoulliNB 62.8 97.2 95.8
MultinomialNB 62.8 97.2 95.8
LogisticRegression 90.4 91.6 91.4
LinearSVC 89.8 91.4 90.2
NuSVC 89.4 90.8 91.0

LogisticRegression, LinearSVC, and NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the BernoulliNB & MultinomialNB algorithms, which drop down significantly at --min_score 1, but then skyrocket up to 97% with --min_score 2. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
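The filtering that --min_score performs can be sketched as follows: assuming you have already computed an information score per word (e.g. with chi-square, as in the earlier high information words post), only words at or above the threshold become features (the names and scores here are illustrative):

```python
def high_information_feats(words, word_scores, min_score):
    # keep only words whose information score meets the threshold
    return {w: 1 for w in words if word_scores.get(w, 0) >= min_score}

word_scores = {'great': 4.2, 'terrible': 3.8, 'movie': 0.1, 'the': 0.0}
print(high_information_feats(['the', 'movie', 'was', 'great'], word_scores, 2))
# {'great': 1}
```

Raising min_score shrinks the feature set to only the most label-discriminating words, which is exactly what the table above varies.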


If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The scikit-learn classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.
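For comparison, here is roughly what SklearnClassifier automates for you: a DictVectorizer turns the feature dicts into a sparse matrix before fitting (a sketch assuming scikit-learn is installed; the toy data is made up):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

train = [
    ({'great': 1, 'movie': 1}, 'pos'),
    ({'great': 1, 'acting': 1}, 'pos'),
    ({'terrible': 1, 'movie': 1}, 'neg'),
    ({'terrible': 1, 'plot': 1}, 'neg'),
]

vec = DictVectorizer()  # produces a sparse array, keeping memory usage low
X = vec.fit_transform([feats for feats, label in train])
y = [label for feats, label in train]

clf = MultinomialNB().fit(X, y)
print(clf.predict(vec.transform([{'great': 1, 'plot': 1}]))[0])  # pos
```

With SklearnClassifier, all of this bookkeeping (vectorizer, known features, label encoding) happens behind the familiar train/classify interface.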

  • Paul

    Hi Jacob,

    Firstly, thank you for all the great blog posts relating to NLTK. Being really new to this, they have got me up the learning curve much quicker than I imagined.

    I do have a question regarding running the scikit-learn classifier posted above. When trying to run the command:

    train_classifier.py movie_reviews --no-pickle --classifier sklearn.LinearSVC --fraction 0.75

    it returns the following error:

    train_classifier.py: error: argument --classifier/--algorithm: invalid choice: 'sklearn.LinearSVC'

    I’ve made a couple of mods to the train_classifier.py file by adding 'import sklearn' and 'import scipy', more out of curiosity than knowing the right answer to fix the error.

    Could you point me in the right direction as to where I might be going wrong? I believe I have all the correct modules installed.


  • If sklearn.LinearSVC is an invalid choice, it means you probably need to upgrade NLTK and/or ensure that scikit-learn is fully installed. Everything should work if you can do the following without errors:
    >>> from nltk.classify import scikitlearn

  • Paul

    Hi Jacob. That didn’t work. Thanks for the pointer, I’ll go and hunt down the problem.

  • kpa

    Hi Jacob,

    I was just wondering if you knew whether there was a difference between the NaiveBayes algorithm you use in your other sentiment analysis post, and the MultinomialNB that scikit-learn proposes.

    More specifically – is the MultinomialNB with the bow binary features really multinomial?


  • The NLTK NaiveBayes and sklearn MultinomialNB usually have comparable accuracy and speed. The big difference is in memory usage, with sklearn being much more efficient.

    MultinomialNB with binary features is not really multinomial, but in my experience that doesn’t seem to matter. sklearn does have other Naive Bayes algorithms available, but I’ve found MultinomialNB performs best.

  • kpa

    Thank you for the quick response Jacob!

    I’ve run the above experiments with the NaiveBayes classifier with the same parameters for the movie_reviews data and I get the following results:

    NaiveBayes bow (equivalent to Bernoulli model): accuracy = 72.6

    NaiveBayes int (equivalent to Multinomial model): accuracy = 73.0

    The Multinomial and Bernoulli results are the same as the results you get.

    Now, NaiveBayes with bow features should be equivalent to the BernoulliNB classifier, yet there is almost 10% difference in accuracy. Similarly for the int features, which should be equivalent to MultinomialNB, yet again there is a 10% difference in accuracy.

    Do you know what could lead to this discrepancy?


  • The algorithms are implemented differently, with different data structures and lower level libraries. You would have to carefully examine the code for both to figure out what causes the difference.


  • Try switching the arguments around so you pass --classifier first. I think there might be an argparse issue.

  • I am presently on NLTK 3.0.1 and scikit-learn 0.14.1 with the most recent github version of nltk-trainer, and I get somewhat different results from those reported. BernoulliNB with bigrams is reported to be 86.0 in accuracy; I get 71.8.

    $ ./train_classifier.py --classifier sklearn.BernoulliNB --ngrams 1 2 --fraction 0.75 --no-pickle movie_reviews
    loading movie_reviews
    2 labels: [u'neg', u'pos']
    using bag of words feature extraction
    1500 training feats, 500 testing feats
    training sklearn.BernoulliNB with {'alpha': 1.0}
    using dtype bool
    training sklearn.BernoulliNB classifier
    accuracy: 0.718000
    neg precision: 0.643799
    neg recall: 0.976000
    neg f-measure: 0.775835
    pos precision: 0.950413
    pos recall: 0.460000
    pos f-measure: 0.619946



  • Sandy

    Is it possible to see this file ($ ./train_classifier.py --classifier sklearn.BernoulliNB --ngrams 1 2 --fraction 0.75) that you mentioned, please?

  • uday sai

    When I write the following code:
    from nltk.classify.scikitlearn import SklearnClassifier
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    it results in the following error:
    Traceback (most recent call last):
    File "C:\Python27\uni.py", line 2, in
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    File "C:\Python27\lib\site-packages\sklearn\__init__.py", line 56, in
    from . import __check_build
    ImportError: cannot import name __check_build
    Can anybody help me on how to resolve it?

  • That looks like something is wrong in your scikit-learn installation. You may want to try reinstalling it with pip.