With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the Brill tagger.
NLTK Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
    import itertools
    from nltk.corpus import brown, conll2000, treebank

    # Brown: first 2/3 of each category for training, the rest for testing.
    # Floor division (//) keeps each cutoff an integer so it works as a slice index.
    brown_reviews = brown.tagged_sents(categories=['reviews'])
    brown_reviews_cutoff = len(brown_reviews) * 2 // 3
    brown_lore = brown.tagged_sents(categories=['lore'])
    brown_lore_cutoff = len(brown_lore) * 2 // 3
    brown_romance = brown.tagged_sents(categories=['romance'])
    brown_romance_cutoff = len(brown_romance) * 2 // 3

    brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
        brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
    brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
        brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

    # conll2000 ships with a standard train/test split.
    conll_train = conll2000.tagged_sents('train.txt')
    conll_test = conll2000.tagged_sents('test.txt')

    # Treebank: same 2/3 vs 1/3 split.
    treebank_cutoff = len(treebank.tagged_sents()) * 2 // 3
    treebank_train = treebank.tagged_sents()[:treebank_cutoff]
    treebank_test = treebank.tagged_sents()[treebank_cutoff:]
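The same 2/3 vs 1/3 split is repeated for every corpus, so it could be factored into a small helper. This is only a sketch: split_two_thirds is a name I made up, and the sample data is a stand-in for real tagged sentences such as brown.tagged_sents(categories=['reviews']).

```python
def split_two_thirds(sents):
    """Split a sequence into (train, test) at the 2/3 mark."""
    # Floor division keeps the cutoff an integer, so it is a valid slice index.
    cutoff = len(sents) * 2 // 3
    return sents[:cutoff], sents[cutoff:]


# Hypothetical stand-in data: nine one-word tagged sentences.
sents = [[('word%d' % i, 'NN')] for i in range(9)]
train, test = split_two_thirds(sents)
print('%d train / %d test' % (len(train), len(test)))
```

Slicing rather than shuffling keeps the split deterministic, which matters when you want to compare several taggers against exactly the same test sentences.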
Naive Bayes Classifier Taggers
There are 3 new taggers referenced below:
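- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents).
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt).
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.
postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).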
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
The above results are quite interesting, and lead to a few conclusions:
- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the bcraubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
- Using a Brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.
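For reference, the accuracy being compared here is a per-token measure: strip the tags from the test sentences, re-tag them, and count how many tags match the originals. As far as I understand it, that is what the evaluate method does. Here's a minimal sketch of the idea, where token_accuracy and the toy always_nn tagger are hypothetical names of mine, and the tagger is assumed to be any callable that takes a list of words and returns (word, tag) pairs.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]   # strip the gold tags
        predicted = tag_fn(words)            # list of (word, tag) pairs
        for (_, gold_tag), (_, guess_tag) in zip(gold, predicted):
            if gold_tag == guess_tag:
                correct += 1
            total += 1
    return float(correct) / total if total else 0.0


# Toy tagger (hypothetical): tags every word as a noun.
def always_nn(words):
    return [(word, 'NN') for word in words]


gold = [
    [('the', 'DT'), ('dog', 'NN')],
    [('cats', 'NNS'), ('sleep', 'VBP'), ('food', 'NN')],
]
print('%.2f' % token_accuracy(always_nn, gold))
```

Because the score is per token rather than per sentence, a tagger that misses one word in every sentence can still look very accurate.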
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:
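    from nltk.tag.sequential import ClassifierBasedPOSTagger

    # Train on the full treebank corpus, then evaluate on the same test split as before.
    cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
    cpos.evaluate(treebank_test)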
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the Naive Bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.
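To make "feature detector" concrete: as far as I know, ClassifierBasedTagger accepts a feature_detector argument, a function that takes the sentence tokens, the index of the current word, and the tags assigned so far, and returns a dict of features for the classifier. Here's a minimal sketch of what one could look like. The function name and the particular features are my own guesses for illustration, not the ones used by NLTK's pre-trained tagger (which I don't know).

```python
def simple_features(tokens, index, history):
    """Build a feature dict for tokens[index].

    tokens  -- the untagged words of the sentence
    index   -- position of the word being tagged
    history -- tags already assigned to tokens[:index]
    """
    word = tokens[index]
    return {
        'word': word.lower(),
        'suffix3': word[-3:].lower(),
        'is_capitalized': word[:1].isupper(),
        'is_digit': word.isdigit(),
        'prev_word': tokens[index - 1].lower() if index > 0 else '<START>',
        'prev_tag': history[index - 1] if index > 0 else '<START>',
    }


tokens = ['The', 'dog', 'barked']
features = simple_features(tokens, 2, ['DT', 'NN'])
for name in sorted(features):
    print('%s = %r' % (name, features[name]))
```

Assuming the argument works the way I described, a function like this would be passed as ClassifierBasedTagger(train=train_sents, feature_detector=simple_features), and swapping features in and out is where most of the experimentation would happen.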
On the nltk-users list, there was a question about which tagger is the most computationally economical. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than braubt. So I ran timeit on the tag method of each tagger, and got the following results:
This was run with Python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3 GB of RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, whereas cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier-based tagger if speed is an issue.
    import nltk, timeit

    text = nltk.word_tokenize('And now for something completely different')
    # The setup string runs once before timing, so loading the tagger is not measured.
    setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
    # The token list is formatted into the statement as a list literal.
    t = timeit.Timer('tagger.tag(%s)' % text, setup)

    print('timing postag 1000 times')
    spent = t.timeit(number=1000)
    print('took %.5f secs/pass' % (spent / 1000))
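The words/sec and "times faster" figures quoted above are just arithmetic on the secs/pass timings. Here's a rough sketch of that conversion. The helper names time_tagger and speedup are mine, and dummy_tag is a stand-in so the example runs without any trained taggers; you'd swap in braubt.tag, cpos.tag, and so on.

```python
import timeit


def time_tagger(tag_fn, tokens, number=1000):
    """Return (secs_per_pass, words_per_sec) for tag_fn on one token list."""
    timer = timeit.Timer(lambda: tag_fn(tokens))
    secs_per_pass = timer.timeit(number=number) / float(number)
    return secs_per_pass, len(tokens) / secs_per_pass


def speedup(fast_words_per_sec, slow_words_per_sec):
    """How many times faster the first tagger is than the second."""
    return float(fast_words_per_sec) / slow_words_per_sec


# Stand-in tagger (hypothetical): tags every token as a noun.
def dummy_tag(tokens):
    return [(token, 'NN') for token in tokens]


tokens = 'And now for something completely different'.split()
secs, wps = time_tagger(dummy_tag, tokens, number=100)
print('%.7f secs/pass, %.0f words/sec' % (secs, wps))

# Using the words/sec figures quoted above for braubt and cpos:
print('%.1f times faster' % speedup(66666, 270))
```

Passing a callable to timeit.Timer avoids formatting the token list into a statement string, at the cost of a small function-call overhead on every pass.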
There's also a significant difference in the file size of the pickled taggers (trained on treebank):
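One way to measure this yourself is to pickle each tagger in memory and count the bytes. The sketch below uses plain dicts as stand-ins for trained taggers, and pickled_size is a hypothetical helper name of mine; the numbers it prints say nothing about the real taggers.

```python
import pickle


def pickled_size(obj):
    """Size in bytes of obj when serialized with the highest pickle protocol."""
    return len(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))


# Stand-in objects (hypothetical); with NLTK you would pass trained taggers.
small_model = {'the': 'DT'}
large_model = dict(('word%d' % i, 'NN') for i in range(1000))

for name, model in [('small', small_model), ('large', large_model)]:
    print('%s: %d bytes' % (name, pickled_size(model)))
```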
I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.