
25 Oct 2010

Training Binary Text Classifiers with NLTK Trainer

NLTK-Trainer (available on GitHub and Bitbucket) was created to make it as easy as possible to train NLTK text classifiers. The train_classifier.py script provides a command-line interface for training & evaluating classifiers, with a number of options for customizing text feature extraction and classifier training (run python train_classifier.py --help for a complete list of options). Below, I'll show you how to use it to (mostly) replicate the results shown in my previous articles on text classification. You should check out or download nltk-trainer if you want to run the examples yourself.

NLTK Movie Reviews Corpus

To run the code, we need to make sure everything is set up for training. The most important thing is installing the NLTK data (and of course, you'll need to install NLTK as well). In this case, we need the movie_reviews corpus, which you can download/install by running sudo python -m nltk.downloader movie_reviews. This command will ensure that the movie_reviews corpus is downloaded and/or located in an NLTK data directory, such as /usr/share/nltk_data on Linux, or C:\nltk_data on Windows. The movie_reviews corpus can then be found under the corpora subdirectory.
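
If you want to confirm the corpus is installed where NLTK can find it, a quick check from the Python shell looks something like this (a minimal sketch, assuming only NLTK and the movie_reviews data are installed):

    from nltk.corpus import movie_reviews

    # the corpus has two labels, with 1000 review files in each category
    print(movie_reviews.categories())         # ['neg', 'pos']
    print(len(movie_reviews.fileids('pos')))  # 1000
    print(len(movie_reviews.fileids('neg')))  # 1000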

Training a Naive Bayes Classifier

Now we can use train_classifier.py to replicate the results from the first article on text classification for sentiment analysis with a naive bayes classifier. The complete command is:

python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews

Here's an explanation of each option:

  • --instances files: this says that each file is treated as an individual instance, so that each feature set will contain word: True for each word in a file (see the sketch after this list)
  • --fraction 0.75: we'll use 75% of the files in each category for training, and the remaining 25% of the files for testing
  • --show-most-informative 10: show the 10 most informative words
  • --no-pickle: the default is to store a pickled classifier, but this option lets us do evaluation without pickling the classifier
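
To make the --instances files option concrete, each file is turned into a bag-of-words feature set, roughly like this (a sketch of the idea from the earlier articles, not the exact code inside train_classifier.py):

    from nltk.corpus import movie_reviews

    def bag_of_words(words):
        # every word in the instance becomes a key with the value True
        return dict((word, True) for word in words)

    # one feature set per file, which is what --instances files means
    fileid = movie_reviews.fileids('pos')[0]
    featureset = bag_of_words(movie_reviews.words(fileid))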

If you cd into the nltk-trainer directory and then run the above command, your output should look like this:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --show-most-informative 10 --no-pickle movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.726000
neg precision: 0.952000
neg recall: 0.476000
neg f-measure: 0.634667
pos precision: 0.650667
pos recall: 0.976000
pos f-measure: 0.780800
10 most informative features
Most Informative Features
          finest = True              pos : neg    =     13.4 : 1.0
      astounding = True              pos : neg    =     11.0 : 1.0
          avoids = True              pos : neg    =     11.0 : 1.0
          inject = True              neg : pos    =     10.3 : 1.0
       strongest = True              pos : neg    =     10.3 : 1.0
       stupidity = True              neg : pos    =     10.2 : 1.0
           damon = True              pos : neg    =      9.8 : 1.0
            slip = True              pos : neg    =      9.7 : 1.0
          temple = True              pos : neg    =      9.7 : 1.0
          regard = True              pos : neg    =      9.7 : 1.0

If you refer to the article on measuring precision and recall of a classifier, you'll see that the numbers are slightly different. We also ended up with a different top 10 most informative features. This is due to train_classifier.py choosing slightly different training instances than the code in the previous articles. But the results are still basically the same.

Filtering Stopwords

Let's try it again, but this time we'll filter out stopwords (the default is no stopword filtering):

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --filter-stopwords english movie_reviews
2 labels: ['neg', 'pos']
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.724000
neg precision: 0.944444
neg recall: 0.476000
neg f-measure: 0.632979
pos precision: 0.649733
pos recall: 0.972000
pos f-measure: 0.778846

As shown in text classification with stopwords and collocations, filtering stopwords reduces accuracy. A helpful comment by Pierre explained that adverbs and determiners that start with "wh" can be valuable features, and removing them is what causes the dip in accuracy.
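
For reference, stopword filtering just drops those words before the feature set is built, along these lines (a minimal sketch, assuming the NLTK stopwords corpus is installed; the function name is mine, not the script's):

    from nltk.corpus import stopwords

    english_stopwords = set(stopwords.words('english'))

    def bag_of_words_without_stopwords(words, badwords=english_stopwords):
        # discard any word in the stopword set before building the feature dict
        return dict((word, True) for word in words if word not in badwords)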

High Information Feature Selection

There are two options that let you restrict which words are used, based on their information gain (a sketch of one scoring approach follows the list):

  • --max_feats 10000 will use the 10,000 most informative words, and discard the rest
  • --min_score 3 will use all words whose score is at least 3, and discard any words with a lower score
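
To give an idea of where those scores come from, here's a rough sketch of the chi-square word scoring used in the earlier article on eliminating low information features (train_classifier.py's actual scoring may differ in the details):

    from nltk.corpus import movie_reviews
    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import FreqDist, ConditionalFreqDist

    # count word frequencies overall and per label
    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()

    for label in movie_reviews.categories():
        for word in movie_reviews.words(categories=[label]):
            word_fd[word.lower()] += 1
            label_word_fd[label][word.lower()] += 1

    pos_count = label_word_fd['pos'].N()
    neg_count = label_word_fd['neg'].N()
    total_count = pos_count + neg_count

    # a word scores highly if it is strongly associated with either label
    word_scores = {}
    for word, freq in word_fd.items():
        pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                               (freq, pos_count), total_count)
        neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                               (freq, neg_count), total_count)
        word_scores[word] = pos_score + neg_score

    # --min_score 3 keeps words whose score is at least 3, while
    # --max_feats 10000 keeps the 10,000 highest scoring words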

Here's the results of using --max_feats 10000:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --max_feats 10000 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
10000 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.888000
neg precision: 0.970874
neg recall: 0.800000
neg f-measure: 0.877193
pos precision: 0.829932
pos recall: 0.976000
pos f-measure: 0.897059

The accuracy is a bit lower than shown in the article on eliminating low information features, most likely due to the slightly different training & testing instances. Using --min_score 3 instead increases accuracy a little bit:

$ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 movie_reviews
2 labels: ['neg', 'pos']
calculating word scores
8298 words meet min_score and/or max_feats
1500 training feats, 500 testing feats
training a NaiveBayes classifier
accuracy: 0.894000
neg precision: 0.966825
neg recall: 0.816000
neg f-measure: 0.885033
pos precision: 0.840830
pos recall: 0.972000
pos f-measure: 0.901670

Bigram Features

To include bigram features (pairs of words that occur in a sentence), use the --bigrams option. This is different from finding significant collocations, as all bigrams are considered, using the nltk.util.bigrams function. Combining --bigrams with --min_score 3 gives us the highest accuracy yet, 97%:

  $ python train_classifier.py --algorithm NaiveBayes --instances files --fraction 0.75 --no-pickle --min_score 3 --bigrams --show-most-informative 10 movie_reviews
  2 labels: ['neg', 'pos']
  calculating word scores
  28075 words meet min_score and/or max_feats
  1500 training feats, 500 testing feats
  training a NaiveBayes classifier
  accuracy: 0.970000
  neg precision: 0.979592
  neg recall: 0.960000
  neg f-measure: 0.969697
  pos precision: 0.960784
  pos recall: 0.980000
  pos f-measure: 0.970297
  10 most informative features
  Most Informative Features
                finest = True              pos : neg    =     13.4 : 1.0
     ('matt', 'damon') = True              pos : neg    =     13.0 : 1.0
  ('a', 'wonderfully') = True              pos : neg    =     12.3 : 1.0
('everything', 'from') = True              pos : neg    =     12.3 : 1.0
      ('witty', 'and') = True              pos : neg    =     11.0 : 1.0
            astounding = True              pos : neg    =     11.0 : 1.0
                avoids = True              pos : neg    =     11.0 : 1.0
     ('most', 'films') = True              pos : neg    =     11.0 : 1.0
                inject = True              neg : pos    =     10.3 : 1.0
         ('show', 's') = True              pos : neg    =     10.3 : 1.0

Of course, the "Bourne bias" is still present with the ('matt', 'damon') bigram, but you can't argue with the numbers. Every metric is at 96% or greater, clearly showing that high information feature selection with bigrams is hugely beneficial for text classification, at least when using the NaiveBayes algorithm. This also goes against what I said at the end of the article on high information feature selection:

bigrams don't matter much when using only high information words

In fact, bigrams can make a huge difference, but you can't restrict them to just 200 significant collocations. Instead, you must include all of them, and let the scoring function decide what's significant and what isn't.
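
For reference, combining unigrams and bigrams into a single feature set looks roughly like this (a sketch; with --min_score, both kinds of features are then filtered by the scoring function):

    from nltk.util import bigrams

    def bag_of_words_and_bigrams(words):
        # every word and every adjacent word pair becomes a boolean feature
        feats = dict((word, True) for word in words)
        feats.update((bigram, True) for bigram in bigrams(words))
        return feats

    print(bag_of_words_and_bigrams(['the', 'plot', 'was', 'astounding']))
    # {'the': True, 'plot': True, 'was': True, 'astounding': True,
    #  ('the', 'plot'): True, ('plot', 'was'): True, ('was', 'astounding'): True}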


  • http://twitter.com/denzil_correa Denzil Correa

    Could you also provide a functionality for some kind of cross-validation? It may go a long way in helping us!

  • http://streamhacker.com/ Jacob Perkins

    Yes, good idea, I’ll add it to my TODO list.

  • http://twitter.com/dkisabaka D Kisabaka

    Using bigrams only in norm_words() yields a 98.2% accuracy on the movie_reviews corpus.

  • http://streamhacker.com/ Jacob Perkins

    That’s a pretty significant accuracy. Did you try it with the whole corpus, or just a fraction?

  • http://twitter.com/dkisabaka D Kisabaka

    I used the same parameters that give 97% above, but I changed the line in norm_words() from "return words + bigrams(words)" to "return bigrams(words)". So --bigrams would count bigrams only instead of bigrams + unigrams. This is a corpus-specific hack, but it works as well for a task like bigram-based language guessing on a small corpus.

  • http://streamhacker.com/ Jacob Perkins

    There may be a --ngrams option added soon so you can control whether to use unigrams, bigrams, or both. Then you won't have to do a custom "hack".

  • http://streamhacker.com/ Jacob Perkins

    Oops, just realized you're the same guy :) So if you can add that option, that'd be awesome.


  • Motaz

    Hello,

    I wonder why there are both bigrams and unigrams among the most informative features when you used bigram features?

    Does this procedure

    bigrams = bigram_finder.nbest(score_fn, n)

    produce both bigrams and unigrams?

    Thanks very much,

    Motaz

  • http://streamhacker.com/ Jacob Perkins

    The bigram finder will find the most significant bigrams, so you can use that, but it won’t find all of them. The way I do it above uses all bigrams, regardless of their significance. Bigrams can be very useful for classification, because the more significant ones are more informative.

  • Motaz

    Sorry for the typos. I will rephrase the question: does the procedure bigrams = bigram_finder.nbest(score_fn, n) produce unigrams (1-word features)? If not, why do the most informative features include single words (unigrams)? Thanks.

  • http://streamhacker.com/ Jacob Perkins

    No, the bigram finder will only produce bigrams. The reason my above results include bigrams & unigrams is because I combine the unigrams (words) with nltk.ngrams(words, 2) to produce unigrams + bigrams, which are then treated the same for classification, and to determine the most informative features.

  • ajab

    Thank you so much for this post, and for nltk-trainer, what an awesome tool!

    I’m using the n-gram feature in nltk-trainer to classify using a custom corpus of pos/neg internet comments. But while the classifier uses the n-grams, I’m not sure how to go about including n-grams when I’m actually processing new comments.

    probdist = classifier.prob_classify(word_feats(nltk.word_tokenize(lemmatize(text))))
    return round(probdist.prob('neg'), 3)

    Is there a way to indicate that I want n-grams included in prob_classify?

  • http://streamhacker.com/ Jacob Perkins

    Glad you like it :)

    To get ngrams for your classifier, use nltk.util.ngrams. ngrams(words, 2) will return bigram tuples, ngrams(words, 3) will return trigram tuples, etc. So inside of word_feats, you want to pass your list of words to nltk.util.ngrams for each size ngram you want, then add those ngrams to your feature dict, giving you something like {"word1": True, "word2": True, ("word1", "word2"): True}.
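
    A minimal sketch of that idea (word_feats here is just a hypothetical stand-in for your own feature function):

    from nltk.util import ngrams

    def word_feats(words, ngram_sizes=(2,)):
        # unigram features
        feats = dict((word, True) for word in words)
        # add a tuple feature for each requested ngram size (2 = bigrams, 3 = trigrams, ...)
        for n in ngram_sizes:
            feats.update((gram, True) for gram in ngrams(words, n))
        return feats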

  • ajab

    Amazing, thank you! :)

  • Paul

    Hi Jacob,

    Have you removed the --bigrams argument? When running the train classifier above with --bigrams I get the following error:

    train_classifier.py: error: unrecognized arguments: --bigrams

    Or I could well be missing something.

    TIA

  • http://streamhacker.com/ Jacob Perkins

    The --bigrams argument was replaced with a flexible --ngrams argument, as in --ngrams 1 2

  • Paul

    Super, thanks for the quick reply Jacob.

  • Jose Costa Pereira

    hi all (Jacob great tool, thanks for sharing this nice tutorial)

    I've managed to work with the most common features, and I'm now trying to work with more powerful classifiers (i.e. SVM).
    I'm able to perform training, but I'm experiencing some problems at the testing stage.

    In order to obtain classification predictions using the NB classifier from your example, I just load the classifier (previously saved), tokenize the document to classify, and then do:
    p = classifier.prob_classify(feats)
    You can then access the predicted probabilities by doing p.prob('neg') and p.prob('pos').
    My question is regarding other classifiers…

    An SVM classifier does not produce probabilities (at least directly). Can someone tell me how to access the distances to the margin when classifying a sample?
    prob_classify() is not available for these classifiers, which makes sense since they don't produce probabilities. Instead we use classify(), which returns a string ('neg' or 'pos'), but I'm afraid I can't access the decision values…
    In general, is there a standard way to access the decision values computed by the classifier?

    I’m a newbie in python, and this might be very simple question.
    thanks.
    -jose

  • http://streamhacker.com/ Jacob Perkins

    Hi Jose,

    It looks like you can get what you want from a scikit-learn SVM classifier: http://scikit-learn.org/stable/modules/svm.html#scores-and-probabilities

    But I’m not sure you can access this thru NLTK and its sklearn classifier wrapper.
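
    If you go straight to scikit-learn, a minimal sketch with toy data standing in for real feature extraction might look something like this (bypassing NLTK's wrapper entirely):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ['a wonderful film', 'an astounding movie', 'a boring mess', 'utterly stupid plot']
    labels = ['pos', 'pos', 'neg', 'neg']

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    svm = LinearSVC()
    svm.fit(X, labels)

    # signed distance of the sample to the separating hyperplane; the sign
    # tells you the predicted class, the magnitude how far it is from the margin
    print(svm.decision_function(vectorizer.transform(['what a wonderful plot'])))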

  • Waheed El Miladi

    I've found that if you use paragraphs (paras) as the --instances arg you can achieve 99.2% accuracy

  • http://streamhacker.com/ Jacob Perkins

    What arguments are you using? When I do "python train_classifier.py movie_reviews --classifier NaiveBayes --instances paras --fraction 0.75 --no-pickle --min_score 3 --ngrams 1 2" I only get 97.6% accuracy.

  • Waheed El Miladi

    python train_classifier.py --algorithm NaiveBayes --instances paras --fraction 0.75 --no-pickle --min_score 3 --ngrams 2 --show-most-informative 10 movie_reviews

  • http://streamhacker.com/ Jacob Perkins

    Interesting that bigrams by themselves are better than individual words, even when scoring is used. Thanks for the tip

  • suvirbhargav

    I want to extract all the possible info (all movie-related words) from movie reviews, so I started with the named entity example. Next, I'm thinking of training a named entity classifier for movie data (is that a good next step?). Is the movie review corpus in the NLTK data good enough to begin with? Or should
    Also, I would like to see how informative a "feature" is based on how well the "feature" represents a movie (based on genre). Any suggestions on how I should proceed with this?

  • http://streamhacker.com/ Jacob Perkins

    I don’t really understand what your goal is. Why do you want named entities? Can you give an example of a feature that you’re looking for?

  • suvirbhargav

    I want to work on user-written movie reviews and extract all the useful words from them (words like "this movie is " or "this movie is makes a "). I want to then collect all these words for a single movie from, let's say, 100 reviews. Once I have them, I want to repeat this word collection for other movies and, in the end, find the similarity between movies based on these sets of words.

  • http://streamhacker.com/ Jacob Perkins

    That looks like custom phrase extraction. What you need is a training corpus where every phrase has been annotated, similar to the treebank tagged/chunked corpus, where every phrase is surrounded by square brackets. Then you can train your own chunker / phrase extractor.

    But you could maybe skip all that and use search indexing techniques like TF/IDF & cosine similarity. If you can cleanly separate the reviews for each movie, then you can compare the reviews of movie 1 to the reviews of movie 2.
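
    A minimal sketch of that approach, with toy text standing in for each movie's combined reviews:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # one document per movie, made by joining all of that movie's reviews
    movie_docs = [
        'great plot and wonderful acting in a wonderful film',
        'wonderful acting and a great story',
        'boring plot and terrible acting',
    ]

    tfidf = TfidfVectorizer().fit_transform(movie_docs)
    sims = cosine_similarity(tfidf)
    # sims[i][j] is the cosine similarity between movie i and movie j
    print(sims)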

  • suvirbhargav

    Hi Jacob, thanks a lot for the reply. I'm going to try the search indexing techniques next. But since I've already finished some chapters of the NLTK book, I thought I'd finish the custom phrase extractor with NLTK. I'm getting this error when trying nltk-trainer's example of training IOB chunkers: http://pastebin.com/FUZbFek3
    I wanted to do this example so that I can later do it with my custom movie review training corpus.

  • http://streamhacker.com/ Jacob Perkins

    Your error looks like it has to do with a different version of scikit-learn. I’ve tested my code with 0.14.1. Let me know what version you have and perhaps I can fix it. And for errors like this, the best place to report them is https://github.com/japerk/nltk-trainer/issues/new
