In regexp and affix pos tagging, I showed how to produce a Python NLTK part-of-speech tagger using Ngram pos tagging in combination with Affix and Regex pos tagging, with accuracy approaching 90%. In part 3, I’ll use the brill tagger to get the accuracy up to and over 90%.
NLTK Brill Tagger
The BrillTagger is different from the previous part-of-speech taggers. For one, it’s not a SequentialBackoffTagger, though it does use an initial pos tagger, which in our case will be the raubt_tagger from part 2. The brill tagger uses the initial pos tagger to produce initial part-of-speech tags, then corrects those pos tags based on Brill transformational rules. These rules are learned by training the brill tagger with the FastBrillTaggerTrainer and rule templates. Here’s an example, with templates copied from the demo() function in nltk.tag.brill.py. Refer to ngram part of speech tagging for the backoff_tagger function and the train_sents, and to regexp part of speech tagging for the word_patterns.
import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1, 1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2, 2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1, 2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1, 3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1, 1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2, 2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1, 2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1, 3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1, 1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1, 1)),
]

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)
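Once trained, the braubt_tagger behaves like any other NLTK tagger. Here’s a minimal sketch, assuming the NLTK 2.x brill API used above and the braubt_tagger from the training step, that tags a tokenized sentence and prints the transformational rules the trainer learned:

# tag a tokenized sentence with the trained brill tagger
tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(braubt_tagger.tag(tokens))

# each learned rule rewrites a tag based on the surrounding tags or words
for rule in braubt_tagger.rules():
    print(rule)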
NLTK Brill Tagger Accuracy
So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful, as increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or two. So here’s how the braubt_tagger fares against the other NLTK part-of-speech taggers.
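If you want to reproduce the comparison, each tagger’s accuracy comes from NLTK’s evaluate() method on held-out sentences. A minimal sketch, assuming test_sents is a held-out slice of the same tagged corpus used to build train_sents in the earlier parts:

# accuracy of the backoff chain alone vs. the brill-corrected tagger
print(raubt_tagger.evaluate(test_sents))
print(braubt_tagger.evaluate(test_sents))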
Conclusion
There’s certainly more you can do for part-of-speech tagging with nltk & python, but the brill-tagger-based braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your pos tagger to be accurate, you need to train it on a corpus similar to the text you’ll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn’t assume that a pos tagger trained on them will be accurate on a different corpus. For example, a pos tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a pos tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.
If you’d like to try to push NLTK part of speech tagging accuracy even higher, see part 4, where I compare the brill tagger to classifier-based pos taggers and nltk.tag.pos_tag.