Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers

In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.

With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK includes a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag function. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let’s see how a classifier tagger compares to the brill tagger.
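
For reference, the pre-trained tagger is what you get from the top-level pos_tag function (the same sentence is reused for the timing tests later in this post):

import nltk

tokens = nltk.word_tokenize('And now for something completely different')
print nltk.pos_tag(tokens)  # a list of (word, tag) tuples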

NLTK Training Sets

For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.

import itertools
from nltk.corpus import brown, conll2000, treebank

# train on 2/3 of each brown category, test on the remaining 1/3
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 // 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 // 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 // 3

brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
    brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
    brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

# conll2000 ships with a standard train/test split
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

# same 2/3 vs 1/3 split for treebank
treebank_cutoff = len(treebank.tagged_sents()) * 2 // 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]

Naive Bayes Classifier Taggers

There are 3 new taggers referenced below:

  • cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by calling ClassifierBasedPOSTagger(train=train_sents).
  • craubt is like cpos, but with the raubt tagger from part 2 as a backoff, created with ClassifierBasedPOSTagger(train=train_sents, backoff=raubt).
  • bcpos is a BrillTagger using cpos as its initial tagger instead of raubt. Construction of all three is sketched below.

The raubt tagger is the same as from part 2, and braubt is from part 3.

postag is NLTK’s pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
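
Here’s a minimal sketch of how these taggers can be constructed, assuming train_sents is one of the training sets above and raubt is the backoff chain from part 2; the brill templates are abbreviated to two of the set used in part 3:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.tag.brill import FastBrillTaggerTrainer, SymmetricProximateTokensTemplate, ProximateTagsRule

cpos = ClassifierBasedPOSTagger(train=train_sents)
craubt = ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)

# bootstrap a brill tagger from cpos, just as part 3 did from raubt
templates = [
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (2, 2)),
]
trainer = FastBrillTaggerTrainer(cpos, templates)
bcpos = trainer.train(train_sents, max_rules=200)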

Accuracy Evaluation

Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:

[Chart: brill vs classifier tagger accuracy on brown, conll2000, and treebank]
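
Each accuracy number in the chart comes from a call along these lines (shown for the brown test set; the same pattern applies to conll2000 and treebank):

for tagger in [raubt, braubt, cpos, craubt, bcpos, postag]:
    print tagger.evaluate(brown_test)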

Conclusions

The above results are quite interesting, and lead to a few conclusions:

  1. Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while achieving near 100% accuracy on treebank.
  2. A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
  3. The ClassifierBasedPOSTagger is not necessarily more accurate than the braubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
  4. Using a brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.

I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:

from nltk.tag.sequential import ClassifierBasedPOSTagger

cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
print cpos.evaluate(treebank_test)

The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the naive bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK developers will publish the training method so we can all know for sure.
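
For reference, that maxent run looked roughly like this (cpos_me is just an illustrative name; the default maxent algorithm is quite slow, as discussed in the comments below):

from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.corpus import treebank

cpos_me = ClassifierBasedPOSTagger(train=treebank.tagged_sents(),
    classifier_builder=MaxentClassifier.train)
print cpos_me.evaluate(treebank_test)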

Classification Efficiency

On the nltk-users list, there was a question about which tagger is the most computationally economical. I can’t tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:

Tagger   secs/pass
raubt    0.00005
braubt   0.00009
cpos     0.02219
bcpos    0.02259
postag   0.01241

This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3GB of RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66,666 words/sec, while cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
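
The words/sec figures are just the 6-word timing sentence divided by the per-pass times:

# 6-word timing sentence / secs per tag() call = words/sec
for name, secs in [('braubt', 0.00009), ('cpos', 0.02219), ('postag', 0.01241)]:
    print name, int(6 / secs), 'words/sec'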

Here’s the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with an nltk.data-accessible path ending in .pickle.

import nltk, timeit

text = nltk.word_tokenize('And now for something completely different')
# the setup statement loads the pickled tagger once, outside the timing loop
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
# the token list's repr is embedded directly into the timed statement
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)

File Size

There’s also a significant difference in the file size of the pickled taggers (trained on treebank):

Tagger   Size
raubt    272K
braubt   273K
cpos     3.8M
bcpos    3.8M
postag   8.2M
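
These sizes come from pickling each trained tagger to disk. A quick way to check, using cpos as an example filename:

import os, pickle

with open('cpos.pickle', 'wb') as f:
    pickle.dump(cpos, f)
print os.path.getsize('cpos.pickle'), 'bytes'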

Fin

I think there’s a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don’t even bother. In that case, stick with a simpler tagger that’s nearly as accurate and orders of magnitude faster.

  • tdflatline

    Did you try a Brill tagger with the MaxEnt classifier as the initial tagger? It does look like Brill is buying you a little over a percent over your original classifier tagger. That might almost explain the remaining gap.

    Also, I am willing to wager heavily that the primary reason nltk.pos_tag has such a high error rate on brown is because the tags are substantially different between brown and treebank, more so than any difference in the actual corpus material. Doing a simple set subtraction on the set of brown tags vs treebank tags makes me wonder how it had any accuracy on brown at all. Did you do any form of tag normalization?

  • http://streamhacker.com/ Jacob Perkins

    No, I did not try Brill with a MaxentClassifier tagger. It probably would give another percent of accuracy, but I don't think that's what the pre-trained tagger for pos_tag does, as the repr of the tagger is from the ClassifierBasedTagger.

    Yes, I'm sure the tag differences contributed significantly to the inaccuracy. I would have normalized them if I knew a simple method to do so, but I'm not sure that exists yet :) And then I saw on the nltk-users list that someone tried to train with brown & simplify_tags=True, and the results were even worse. The real point of testing against brown was to illustrate the importance of using the right training data, and I think that came across loud and clear :)

  • tdflatline

    Actually, I'm not sure if you have conclusively demonstrated the importance of training data. You need to compare apples to apples for that. Perhaps something like training on brown romance and testing against science_fiction, and/or different combinations of brown corpus categories.

    At any rate, that doesn't detract from your excellent work here, which definitely elegantly showcases what can be done with different taggers under different constraints, but it's perhaps something to consider once nltk.pos_tag's source is published. Agility and generalization for different datasets with the same tags would be a useful final data point.

  • Oli

    Hi Jacob,

    As I mentioned on the Google group, your post is very interesting. Thanks for it. I am still looking for the real implementation of pos_tag().

    You mentioned a different feature_detector. What do you think about the following idea for getting the same feature_detector as pos_tag()?

    # load the default tagger used by pos_tag
    t0 = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
    train_set = []
    for tagged_sent in train_sents:
        untagged_sent = nltk.tag.untag(tagged_sent)
        history = []
        for i, (word, tag) in enumerate(tagged_sent):
            featureset = t0.feature_detector(untagged_sent, i, history)
            train_set.append((featureset, tag))
            history.append(tag)

    Now you can use train_set for training the classifier with the same feature_detector as pos_tag(). Right?

    I would like to test it, but I ran into a problem. When I use the simple example you mentioned in your post:

    cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents(),
        classifier_builder=MaxentClassifier.train)

    It took about 1 hour for 1 iteration (out of 100). Even when I am only using 50 sents of the corpus for training, it's still taking about 10 minutes to finish all 100 iterations. So how did you train your ClassifierBasedPOSTagger on the whole corpus without waiting 1 week? :-)

    I hope my English is not too bad and you understand me :-)

    all the best
    Oli

  • http://streamhacker.com/ Jacob Perkins

    Hi Oli,

    Yes, the default MaxentClassifier algorithm is unfortunately slow. If you create a custom training function that calls MaxentClassifier.train with different parameters, you can speed it up. Take a look at http://nltk.googlecode.com/svn/trunk/doc/api/nl… for more details. I generally set min_lldelta to 0.01 for the default algorithm, and often stop training before it reaches 10 iterations by using Ctrl-C.

    Lately I've been using the 'megam' algorithm instead. It's much faster but requires installing the megam package: http://nltk.googlecode.com/svn/trunk/doc/api/to

    Great idea on using the feature extractor from the default tagger. I'll have to try that out.

    And your English is great; I would have assumed you were a native speaker if you hadn't mentioned anything :)

    Jacob

  • Oli

    Thanks for your fast answer, Jacob!

    I tried different things now, but without any results. I am facing different issues:

    - scipy-algorithms do not work for some reason
    - there are no binary releases of megam for windows…and I wasn't able to compile it on my machine

    So… I am not able to test the stuff, because the default algorithm is too slow. But thank you for the hint with Ctrl-C. You can also use max_iter=5 to limit the iterations.

    I am also a bit confused about how to init the ClassifierBasedPOSTagger. Do you build a MaxentClassifier and use it with classifier=YourMaxentClassifier, or do you use the classifier_builder function? It would be great if you could post your code for your testing.

    thanks in advance!
    Oli

  • http://streamhacker.com/ Jacob Perkins

    I've used classifier_builder=MaxentClassifier.train, or, to pass in custom parameters, something like

    cb = lambda toks: MaxentClassifier.train(toks, max_iter=5)

    and then classifier_builder=cb. I haven't tried any of the scipy algorithms; all I can recommend is to be sure numpy and scipy are correctly installed. That's too bad about megam, maybe someone on the mailing list can help out.

    I've been thinking about doing an evaluation of each of the training algorithms for speed and memory consumption. If I do that, I'll definitely post results & code.

  • Dannii

    You said: “A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.”

    This is probably because you didn’t set the classifier’s cutoff_prob parameter. Without it the tagger will never consult its backoff. I don’t use a backoff tagger with a classifier tagger, as anything else is only going to be less accurate.

  • http://streamhacker.com/ Jacob Perkins

    Yes, I found that cutoff_prob parameter later and did some experiments, coming to the same conclusion as you: a backoff tagger with a classifier based tagger generally doesn’t help.

  • Mvdeshpande28

    CAN ANYONE PLEASE SUGGEST ME THE BEST TAGGING METHOD.. I HAVE TO MAKE A PROJECT.. ANYONE PLZZZZZZZZZZZZZ HELP

  • http://streamhacker.com/ Jacob Perkins

    The ClassifierBasedPOSTagger is the most accurate method I know of.

  • Max

    Hi, Jacob! Thanks very much for your book and blog!
    I used ClassifierBasedPOSTagger for POS-tagging but switched to this one: http://www.infochimps.com/datasets/brown-simplifed-tags-part-of-speech-tagger-for-python-nltk after I saw it on your blog. Where can I get the list of all possible tags (simplified tags) that this tagger can assign to words?

  • http://streamhacker.com/ Jacob Perkins

    Hi Max, I just updated the infochimps page to list all the tags. The VB+ tags are pretty rare, but most of the rest are fairly common in the brown corpus.

  • Max

    Thanks a lot!
    I’ve got a question but to be more precise, let me explain a bit:
    I’m working on a solution that extracts meaningful pieces of information from text (phrases, names…) that will be used for further text mining stages.
    The current algorithm consists of the following stages, as I followed the book:
    #Tokenizing Text into Sentences
    #Tokenizing Sentences into Words
    #POS Tagging – using brown-simplifed-tags from infochimps
    #Chunking – currently using ClassifierChunker trained by Treebank chunked sentences
    #Stop-words filtering
    #Apply extra filters – remove words that don’t comply with word length requirements depending on POS and other
    #Stemming, lemmatization may also be added (I just don’t want to apply it to ALL the words as it “destroys” the meaning of important terms)
    Currently, I’m focused on POS-tagging and chunking.
    Is there a way to somehow combine regular expressions for chunking with ordinary regular expressions (for example to specify that a noun starts with a capital letter or parse constructions and exact words like “+ is to +”)?
    The only solution that comes to mind is to create a Part-of-Speech Tagged Word Corpus and assign “custom tags” to words that I need and then pass them to the chunker, but I’m not sure about such approach.
    I know that training a chunker is a better way (comparatively to defining rules) to extract chunks but there are a lot of cases I’m totally not satisfied with the chunks it returns. As I can’t fully rely on it, I think I could define some simple grammar rules manually and if they don’t return results, a trained chunker could be used.
    I’m new to NLTK and I would appreciate any suggestions to the overall process.

  • http://streamhacker.com/ Jacob Perkins

    Hi Max,

    Sounds like you’re clear on the overall structure, though you may want to think about removing the stemming & stop word filtering steps, as transforming chunks & words can often change the meaning. But that depends on your needs.

    To improve chunking, yes, you could define a few select chunk rules where you’re close to 100% sure that if they work, then it’s a good chunk, and if you get nothing, then use a trained chunker. In other words, you define a few high-precision regular expressions, and then rely on the trained chunker to find/recall chunks the manual chunker misses.

    The other thing I recommend is that if you’re using a treebank trained chunker, you should use a treebank trained tagger. Or consider training on conll2000 as well. Check out https://github.com/japerk/nltk-trainer for some scripts I made to make training easier.

    However, the best option (but also the most time consuming) is to create your own tagged & chunked corpus, then train a tagger & chunker on that. I recommend a bootstrap approach, where you’d use an existing tagger & chunker to create an initial corpus, then go in and hand-correct before training a custom tagger & chunker. This is the only way I know of to end up with highly accurate results, and also allows you to define custom chunk structures that aren’t found in treebank, because you have control of the training data.

  • Max

    Hi Jacob,

    Thanks for your fast reply! I was away (on vacation) but now I’m back to work. I’m following your recommendations, also learned NLTK Trainer and using for training.

    As I know, key-phrases extraction highly depends on the POS-tagger efficiency. The thing is that POS-tagger cannot work perfectly due to unknown words and words or phrases that have special meaning (names of trademarks, companies, products and so on).
    I want to use categorized phrases/words lists in order to parse information more efficiently. For example, I want to ensure that “General Electric” phrase won’t be broken or “iphone” won’t be assigned the “CD” POS-tag (as it is now in my application).
    So I want to assign correct POS-tags to a set of non-standard words. It’s clear about tagging single words. But when it comes to multiple-word phrases, I’m not sure about the solution. So I want some multi-word phrases to be kept as they are and treated just like single nouns on the chunking stage (as I think, it may improve chunking efficiency).

    I’ve been thinking about how to do this and that’s the most obvious solution I see:
    Before the stage of tokenizing sentences into words, the application finds all the occurrences from the lists (i.e. categorized lists of phrases/words) and replaces spaces with “_” (for example), so that “General Electric” becomes “General_Electric” and won’t be split into separate words (at least by TreebankWordTokenizer.tokenize()). This way, using my own POS tagger as the first in the chain will tag “General_Electric” as a single noun, and the chunking stage may be more successful. Also I’ll have to somehow remember that “General_Electric” refers to the original “General Electric” and belongs to the “Companies” list.
    OR
    Apply some function that will join 2 words “General” “Electric” into a single “General Electric”.

    That’s not the solution that I like. I’m sure Python or NLTK provide some suitable functionality. Could you suggest a better way to do such things? I would appreciate any thoughts.

  • http://streamhacker.com/ Jacob Perkins

    Hi Max,

    Joining phrases with “_” then splitting out later can definitely work, because it would let you use a manual dictionary UnigramTagger to define the tags, ensuring they don’t get missed. After chunking, it should be easy to split up any word with “_” in it.

    An alternative solution is to transform your list of phrases into a corpus of tagged & chunked phrases, then train a tagger and chunker on it. Basically, each line could look like “[General/NN Electric/NN …]”. The brackets are used by the BracketParseCorpusReader (used by treebank) to define noun phrases. This method can definitely work, but it may take a lot more effort on your part to tag & chunk every phrase.

  • Pritpal

    Can we use nltk pos_tag without training… is it already trained on the treebank corpus?

  • http://streamhacker.com/ Jacob Perkins

    Yes, nltk pos_tag uses a pre-trained tagger that was trained on the treebank corpus.

  • Iykeln

    Hi Jacob, thanks for the info I have gotten so far from your articles.
    I have two corpora. One is pre-tagged in some fashion and the other is a gold standard. I wish to use transformation-based learning to improve the pre-tagged corpus, which is where brill fits in. It seems brill will always do an initial pre-tagging of an untagged corpus. I have these questions – 1) How will I use Brill to solve this situation (using my pre-tagged corpus as the Brill’s temp. corpus)? 2) What is the pre-coded corpus from NLTK (in Brill)? 3) I am confused about how to make Brill templates (templates = [SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)), SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)), …]). Please can you explain this to me. Thanks.

  • http://streamhacker.com/ Jacob Perkins

    I’m not sure what you mean by pre-tagged. A corpus is either tagged or not tagged.

    In order to use a brill tagger, you must have an initial tagger, such as a UnigramTagger. This tagger should be trained on a tagged corpus, such as your gold standard corpus. Then you train the brill tagger using the same corpus & your initial tagger. To create the templates, you can follow the example in the previous post from this series: http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/

  • Florijan

    Hi Jacob,

    I am a bit confused about your evaluation method. It seems like you evaluated the default NLTK tagger on the brown corpus without any tag conversion… As far as I understand, the default POS tagger uses Penn tags, which differ from the Brown corpus tags, making a proper evaluation impossible. My testing seems to confirm this: I get 57% accuracy on a subset of the Brown corpus using the default POS tagger without any conversions, which is similar to your results.

    I am new to this, so perhaps I misunderstood something, please let me know!

    Best regards,
    F

  • http://streamhacker.com/ Jacob Perkins

    You’re correct about the default tagger, and I should have explained that the low accuracy on brown was due to the different tag sets. I was more concerned with evaluating training methods, so I included the default tagger as a reference to compare to, in order to demonstrate that it’s possible to train a tagger that’s just as good or better.