In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK includes a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag function. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.
NLTK Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
```python
import itertools
from nltk.corpus import brown, conll2000, treebank

brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 / 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 / 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 / 3

brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
    brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
    brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

treebank_cutoff = len(treebank.tagged_sents()) * 2 / 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
```
Naive Bayes Classifier Taggers
There are 3 new taggers referenced below:

- `cpos` is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing `ClassifierBasedPOSTagger(train=train_sents)`.
- `craubt` is like `cpos`, but has the `raubt` tagger from part 2 as a backoff tagger, by doing `ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)`.
- `bcpos` is a BrillTagger using `cpos` as its initial tagger instead of `raubt`.

The `raubt` tagger is the same as from part 2, and `braubt` is from part 3. `postag` is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using `nltk.data.load(nltk.tag._POS_TAGGER)`.
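As a minimal sketch of how the classifier taggers above are constructed (the tiny inline training set here is purely illustrative, standing in for the corpus splits; a real `raubt` backoff tagger from part 2 would be passed where the comment indicates):

```python
from nltk.tag.sequential import ClassifierBasedPOSTagger

# Tiny stand-in training set; in practice this would be brown_train,
# conll_train, or treebank_train from above.
train_sents = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
               [('a', 'DT'), ('dog', 'NN'), ('ran', 'VBD')]]

# cpos: default NaiveBayesClassifier, no backoff.
cpos = ClassifierBasedPOSTagger(train=train_sents)

# craubt would additionally pass the raubt tagger from part 2 as backoff:
# craubt = ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)

print(cpos.tag(['the', 'dog']))
```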
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
Conclusions
The above results are quite interesting, and lead to a few conclusions:

- Training data is hugely significant when it comes to accuracy. This is why `postag` takes a huge nose dive on `brown`, while at the same time getting near 100% accuracy on `treebank`.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since `cpos` accuracy is exactly the same as `craubt` accuracy across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the `bcraubt` tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below), and so may not be worth the tradeoff in efficiency.
- Using a brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.
I was also surprised at how much more accurate `postag` was compared to `cpos`. Thinking that `postag` was probably trained on the full treebank corpus, I did the same, and re-evaluated:

```python
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)
```
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the NaiveBayesClassifier, and/or the use of a different feature detector. I tried again with `classifier_builder=MaxentClassifier.train` and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK developers will publish the training method so we can all know for sure.
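For reference, here's a rough sketch of what swapping in the Maxent classifier looks like. The tiny inline training set stands in for the full treebank corpus, and `max_iter` is capped (and `trace` silenced) purely to keep the sketch fast; neither was part of my actual run:

```python
from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

# Tiny stand-in training set; the real experiment used treebank.tagged_sents().
train_sents = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
               [('the', 'DT'), ('dog', 'NN'), ('ran', 'VBD')]]

# classifier_builder lets you substitute MaxentClassifier for the default
# NaiveBayesClassifier. max_iter=2 just keeps this example quick.
builder = lambda toks: MaxentClassifier.train(toks, trace=0, max_iter=2)
maxent_tagger = ClassifierBasedPOSTagger(train=train_sents,
                                         classifier_builder=builder)
print(maxent_tagger.tag(['the', 'cat']))
```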
Classification Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economical. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the `cpos` tagger took a lot longer than `raubt` or `braubt`. So I ran timeit on the tag method of each tagger, and got the following results:
| Tagger | secs/pass |
|---|---|
| raubt | 0.00005 |
| braubt | 0.00009 |
| cpos | 0.02219 |
| bcpos | 0.02259 |
| postag | 0.01241 |
This was run with Python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. `braubt` is over 246 times faster than `cpos`! To put it another way, `braubt` can process over 66666 words/sec, where `cpos` can only do 270 words/sec and `postag` only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
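The words/sec figures fall straight out of the table: the timed sentence has 6 words, so divide 6 by each secs/pass value. A quick check:

```python
# The timed sentence ('And now for something completely different') has 6 words.
words = 6
secs_per_pass = {
    'raubt': 0.00005,
    'braubt': 0.00009,
    'cpos': 0.02219,
    'bcpos': 0.02259,
    'postag': 0.01241,
}

# Derive throughput in words per second for each tagger.
throughput = {name: words / secs for name, secs in secs_per_pass.items()}
for name in sorted(throughput):
    print('%s: %d words/sec' % (name, throughput[name]))
```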
Here's the code for timing `postag`. You can do the same thing for any other pickled tagger by replacing `nltk.tag._POS_TAGGER` with an nltk.data accessible path that has a .pickle suffix.

```python
import nltk, timeit

text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)

print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)
```
File Size
There’s also a significant difference in the file size of the pickled taggers (trained on treebank):
| Tagger | Size |
|---|---|
| raubt | 272K |
| braubt | 273K |
| cpos | 3.8M |
| bcpos | 3.8M |
| postag | 8.2M |
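If you want to measure this yourself, pickle the trained tagger to disk and check the file size. A stdlib-only sketch (a plain dict stands in for the trained tagger here, so the actual size will obviously differ):

```python
import os
import pickle
import tempfile

# A dict stands in for a trained tagger object in this sketch.
tagger = {'model': 'stand-in'}

# Dump the object to a .pickle file and report its size on disk.
path = os.path.join(tempfile.mkdtemp(), 'tagger.pickle')
with open(path, 'wb') as f:
    pickle.dump(tagger, f)

print('%d bytes' % os.path.getsize(path))
```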
Fin
I think there’s a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don’t even bother. In that case, stick with a simpler tagger that’s nearly as accurate and orders of magnitude faster.