streamhacker.com Weotta be Hacking

3 Dec 2008

Part of Speech Tagging with NLTK – Part 3

In part 2, I showed how to produce a part-of-speech tagger using Ngram tagging in combination with Affix and Regex tagging, with accuracy approaching 90%. In part 3, I'll use the BrillTagger to get the accuracy up to and over 90%.

Brill Tagging

The BrillTagger is different from the previous taggers. For one, it's not a SequentialBackoffTagger, though it does use an initial tagger, which in our case will be the raubt_tagger from part 2. The BrillTagger uses the initial tagger to produce initial tags, then corrects those tags based on transformation rules. These rules are learned by training with the FastBrillTaggerTrainer and a set of rule templates. Here's an example, with templates copied from the demo() function in nltk/tag/brill.py. Refer to part 1 for the backoff_tagger function and the train_sents, and part 2 for the word_patterns.

import nltk.tag
from nltk.tag import brill

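# backoff_tagger and train_sents come from part 1; word_patterns comes from part 2.
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

# Each template generates candidate rules conditioned on the tags
# (ProximateTagsRule) or words (ProximateWordsRule) of nearby tokens;
# each tuple is a (start, end) range of offsets from the current token.
templates = [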
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]

# Learn at most 100 correction rules, stopping once no remaining rule scores at least 3.
trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)
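
To make the "initial tags, then corrections" idea concrete, here is a minimal pure-Python sketch of how a learned transformation rule might be applied. It does not use NLTK: the Rule tuple, the apply_rules helper, the sample sentence, and the single rule are all hypothetical illustrations, not rules the trainer actually learned.

```python
from collections import namedtuple

# Hypothetical rule shape: change from_tag to to_tag when the tag at
# position (i + offset) equals context_tag.
Rule = namedtuple("Rule", "from_tag to_tag offset context_tag")

def apply_rules(tagged, rules):
    """Apply each rule, in order, to a list of (word, tag) pairs."""
    tagged = list(tagged)  # work on a copy; the input is left untouched
    for rule in rules:
        # Collect matching positions first, then rewrite them, so one
        # rule's changes cannot trigger further matches of the same rule.
        hits = []
        for i, (word, tag) in enumerate(tagged):
            j = i + rule.offset
            if (tag == rule.from_tag and 0 <= j < len(tagged)
                    and tagged[j][1] == rule.context_tag):
                hits.append(i)
        for i in hits:
            tagged[i] = (tagged[i][0], rule.to_tag)
    return tagged

# Made-up output of an initial tagger that mis-tags "book" as a noun.
initial = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("book", "NN"),
           ("a", "DT"), ("flight", "NN")]

# "Change NN to VB if the previous tag is TO."
rules = [Rule("NN", "VB", -1, "TO")]

corrected = apply_rules(initial, rules)
print(corrected)
```

Only "book" is rewritten here, because it is the only NN directly preceded by a TO tag; "flight" keeps its initial tag.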

Brill Tagging Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score parameters, but be careful: increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or two. So here's how the braubt_tagger fares against the other taggers.
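
The comparison comes down to token-level accuracy on held-out tagged sentences. As a rough sketch of that measurement (not NLTK's own evaluation code), the hypothetical accuracy helper below accepts any function that maps a list of words to (word, tag) pairs. The toy_tag function and the gold sentences are made up for illustration.

```python
def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]   # strip the gold tags
        predicted = tag_fn(words)            # re-tag from scratch
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    return float(correct) / total if total else 0.0

# Stand-in tagger that calls everything a noun (illustrative only).
def toy_tag(words):
    return [(w, "NN") for w in words]

gold_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN")],
]

score = accuracy(toy_tag, gold_sents)
print(score)
```

With a trained NLTK tagger, tag_fn could be the tagger's tag method, for example accuracy(braubt_tagger.tag, test_sents), where test_sents is an assumed held-out set that was not used for training.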

Conclusion

There's certainly more you can do for part-of-speech tagging with nltk, but the braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your tagger to be accurate, you need to train it on a corpus similar to the text you'll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn't assume that a tagger trained on them will be accurate on a different corpus. For example, a tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.
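
As a toy illustration of why the training corpus matters, the sketch below trains a most-frequent-tag lookup (a stand-in for a unigram tagger, not NLTK's implementation) on a few made-up news-style sentences, then scores it on similar text and on made-up chat-style text. All sentences, tags, and helper names are hypothetical; the point is only the mechanism, which is that words the tagger never saw fall back to a default tag.

```python
from collections import Counter, defaultdict

def train_lookup(tagged_sents):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def score(lookup, tagged_sents, default="NN"):
    """Token accuracy; unseen words get the default tag."""
    pairs = [(w, t) for sent in tagged_sents for w, t in sent]
    hits = sum(1 for w, t in pairs if lookup.get(w.lower(), default) == t)
    return float(hits) / len(pairs)

# Made-up training and test data for two different "domains".
news_train = [
    [("the", "DT"), ("market", "NN"), ("fell", "VBD")],
    [("the", "DT"), ("bank", "NN"), ("rose", "VBD")],
]
news_test = [[("the", "DT"), ("bank", "NN"), ("fell", "VBD")]]
chat_test = [[("lol", "UH"), ("u", "PRP"), ("fell", "VBD")]]

lookup = train_lookup(news_train)
in_domain = score(lookup, news_test)
out_of_domain = score(lookup, chat_test)
print(in_domain, out_of_domain)
```

The out-of-domain score drops because "lol" and "u" never appeared in training, so they receive the default tag. Real corpora differ in tagsets and word usage as well as vocabulary, so the gap can have more than one cause.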

If you'd like to try to push the accuracy even higher, see part 4, where I compare the braubt_tagger to classifier-based taggers and nltk.tag.pos_tag.

Comments
  1. I played around with Brown/Treebank/conll2000 a little bit. Did you test with nltk.tag.pos_tag()? It loads a pickle to do the tagging. I’m asking because it seemed to perform comparably or better, and was already set up.

  2. I have not tested nltk.tag.pos_tag() (I’m pretty sure it wasn’t released when I wrote this series). I believe it was trained with most or all of the available corpora, which would definitely make it more accurate. However, it’ll only have high accuracy for text that’s similar to the corpora it was trained on. If you’re tagging text that has a lot of specialty/unique words and phrases, you’ll need to create your own training data for the training process in order to get accurate results.

  3. Hi! I’d like to cite this for my dissertation but I can’t find your name anywhere!

  4. Cool! What’s your topic?
    My name’s Jacob Perkins, and I should probably put it somewhere obvious :)

  5. I'm working on a class project and this article series saved me a lot of time and trouble. It's much more accessible than the NLTK documentation, which I now only had to use to understand some specific details. Thanks a lot!

  6. Glad this helped, thanks for the positive feedback.

  7. Hi Jacob,
    How can we find precision, recall, and F-measure using the Brill tagger, since Brill only displays accuracy? If we look at precision, its formula is:

    precision = (words correctly tagged by the system / total no. of words tagged by the tagger) x 100%

    So where do I get the words that were correctly tagged by nltk's Brill tagger?

    I've successfully run the Brill tagger on Urdu. The problem is that Brill takes the hexadecimal codes for the Urdu text and also displays the output in hexadecimal. But I need the precision, recall, and F-measure to quote in my final presentation and my MS thesis, please…
    I'm short on time, so please help me out.

  8. I think you'll have to collect the stats manually. You could write a function like accuracy that takes in a “gold standard” of tagged sentences. Untag each sentence and run your tagger over it and compare it to the gold sentence. Count all the correct tags along with the total tags, then when it's finished you can calculate precision.


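The manual approach described in the reply above could look roughly like the following sketch. The per_tag_scores helper and the sample tag lists are hypothetical, and the gold and predicted sequences are assumed to be aligned token for token (for example, after untagging each gold sentence and running the tagger over it).

```python
from collections import Counter

def per_tag_scores(gold_tags, pred_tags):
    """Return {tag: (precision, recall, f_measure)} for aligned tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_tags, pred_tags):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # this tag was predicted, but wrongly
            fn[gold] += 1   # an occurrence of the gold tag was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        predicted = tp[tag] + fp[tag]   # times the tagger output this tag
        actual = tp[tag] + fn[tag]      # times the tag appears in the gold data
        p = float(tp[tag]) / predicted if predicted else 0.0
        r = float(tp[tag]) / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[tag] = (p, r, f)
    return scores

# Made-up gold and predicted tags for five tokens.
gold = ["DT", "NN", "VBZ", "DT", "NN"]
pred = ["DT", "NN", "NN", "DT", "VB"]

scores = per_tag_scores(gold, pred)
for tag in sorted(scores):
    print(tag, scores[tag])
```

Because a tagger assigns exactly one tag to every token, overall precision and recall both equal plain accuracy; the per-tag breakdown is where the three measures start to differ.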
