StreamHacker Weotta be Hacking

21Mar/113

Training Part of Speech Taggers with NLTK Trainer

NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.

Training Sequential Backoff Taggers

The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:

a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger

For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 - Ngram Taggers, you could do the following:

python train_tagger.py treebank --sequential ubt

You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).

Training Affix Taggers

The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes, and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3 character suffix, and another that uses a 2 character prefix, specify the --affix argument twice:

python train_tagger.py treebank --sequential aubt --affix -3 --affix 2

The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.

Training Brill Taggers

To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 - Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:

python train_tagger.py treebank --sequential aubt --brill

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which defaults to 1, using the --template_bounds argument.

Training Classifier Based Taggers

Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don't want this tagger to backoff to a sequential backoff tagger, be sure to specify --sequential ''. Here's an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 - Classifier Taggers:

python train_tagger.py treebank --sequential '' --classifier NaiveBayes

If you do want to backoff to a sequential tagger, be sure to specify a cutoff probability, like so:

python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4

Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.

Phonetic Feature Options

You can also include phonetic algorithm features using the following arguments:

--metaphone: Use metaphone feature
--double-metaphone: Use double metaphone feature
--soundex: Use soundex feature
--nysiis: Use NYSIIS feature
--caverphone: Use caverphone feature

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithm have been copied from the advas project (which appears to be abandoned).

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.

A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.

12Apr/1024

Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers

In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.

With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, then pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.

NLTK Training Sets

For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.

import itertools
from nltk.corpus import brown, conll2000, treebank

brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 / 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 / 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 / 3

brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
	brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
	brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))

conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')

treebank_cutoff = len(treebank.tagged_sents()) * 2 / 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]

Naive Bayes Classifier Taggers

There are 3 new taggers referenced below:

  • cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents)
  • craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)
  • bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.

The raubt tagger is the same as from part 2, and braubt is from part 3.

postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).

Accuracy Evaluation

Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:

brill vs classifier tagger accuracy chart

Conclusions

The above results are quite interesting, and lead to a few conclusions:

  1. Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
  2. A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
  3. The ClassifierBasedPOSTagger is not necessarily more accurate than the bcraubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
  4. Using brill tagger will nearly always increase the accuracy of your initial tagger, but not by much.

I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:

cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)

The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than the naive bayes classifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.

Classification Efficiency

On the nltk-users list, there was a question about which tagger is the most computationaly economic. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:

Tagger secs/pass
raubt 0.00005
braubt 0.00009
cpos 0.02219
bcpos 0.02259
postag 0.01241

This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.

Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with a nltk.data accessible path with a .pickle suffix for the load method.

import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)

File Size

There's also a significant difference in the file size of the pickled taggers (trained on treebank):

Tagger Size
raubt 272K
braubt 273K
cpos 3.8M
bcpos 3.8M
postag 8.2M

Fin

I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.

   
%d bloggers like this: