streamhacker.com Weotta be Hacking

12Apr/1012

Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers

In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.

With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, then pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.

Training Sets

For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.

import itertools
from nltk.corpus import brown, conll2000, treebank

brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 / 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 / 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 / 3

brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
	brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
	brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

treebank_cutoff = len(treebank.tagged_sents()) * 2 / 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]

Classifier Taggers

There are 3 new taggers referenced below:

  • cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents)
  • craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)
  • bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.

The raubt tagger is the same as from part 2, and braubt is from part 3.

postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).

Accuracy Evaluation

Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:

brill vs classifier tagger accuracy chart

Conclusions

The above results are quite interesting, and lead to a few conclusions:

  1. Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
  2. A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
  3. The ClassifierBasedPOSTagger is not necessarily more accurate than the bcraubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
  4. Using BrillTagger will nearly always increase the accuracy of your initial tagger, but not by much.

I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:

cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)

The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than NaiveBayesClassifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.

Efficiency

On the nltk-users list, there was a question about which tagger is the most computationaly economic. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:

Taggersecs/pass
raubt0.00005
braubt0.00009
cpos0.02219
bcpos0.02259
postag0.01241

This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.

Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with a nltk.data accessible path with a .pickle suffix for the load method.

import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)

File Size

There's also a significant difference in the file size of the pickled taggers (trained on treebank):

TaggerSize
raubt272K
braubt273K
cpos3.8M
bcpos3.8M
postag8.2M

Fin

I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.

3Dec/0810

Part of Speech Tagging with NLTK – Part 3

In part 2, I showed how to produce a part-of-speech tagger using Ngram tagging in combination with Affix and Regex tagging, with accuracy approaching 90%. In part 3, I'll use the BrillTagger to get the accuracy up to and over 90%.

Brill Tagging

The BrillTagger is different than the previous taggers. For one, it's not a SequentialBackoffTagger, though it does use an initial tagger, which in our case will be the raubt_tagger from part 2. The BrillTagger uses the initial tagger to produce initial tags, then corrects those tags based on transformational rules. These rules are learned by training with the FastBrillTaggerTrainer and rules templates. Here's an example, with templates copied from the demo() function in nltk.tag.brill.py. Refer to part 1 for the backoff_tagger function and the train_sents, and part 2 for the word_patterns.

import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)

Brill Tagging Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful, as increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or 2. So here's how the braubt_tagger fares against the other taggers.

Conclusion

There's certainly more you can do for part-of-speech tagging with nltk, but the braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your tagger to be accurate, you need to train it on a corpus similar to the text you'll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn't assume that a tagger trained on them will be accurate on a different corpus. For example, a tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.

If you'd like to try to push the accuracy even higher, see part 4, where I compare the braubt_tagger to classifier based taggers, and nltk.tag.pos_tag.