Training Part of Speech Taggers with NLTK Trainer

NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.

Training Sequential Backoff Taggers

The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:

a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger

For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:

python train_tagger.py treebank --sequential ubt

You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
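Under the hood, --sequential ubt corresponds to chaining NLTK's sequential taggers together via their backoff parameter, with taggers listed earlier trained first and sitting deeper in the chain. A minimal sketch, using toy training data in place of the treebank corpus:

```python
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

# toy training data; in practice you'd use e.g. treebank.tagged_sents()
train_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

# each tagger falls back to the previously trained one when it has no answer
unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)
trigram = TrigramTagger(train_sents, backoff=bigram)

print(trigram.tag(['the', 'dog', 'sleeps']))
```

The final TrigramTagger is the head of the chain: it is tried first, and unseen contexts fall through to the bigram and then the unigram tagger.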

Training Affix Taggers

The --sequential argument also recognizes the letter a, which inserts an AffixTagger into the backoff chain. If you do not specify the --affix argument, a single AffixTagger with a 3-character suffix is included. You can change this by specifying one or more --affix N options, where N should be a positive number for a prefix and a negative number for a suffix. For example, to train an aubt tagger with two AffixTaggers, one using a 3-character suffix and the other a 2-character prefix, specify the --affix argument twice:

python train_tagger.py treebank --sequential aubt --affix -3 --affix 2

The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
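In NLTK terms, this maps to the affix_length parameter of AffixTagger, where a negative value means a suffix and a positive value a prefix. A rough sketch of the front of such a chain, with illustrative toy data:

```python
from nltk.tag import AffixTagger, UnigramTagger

train_sents = [
    [('walking', 'VBG'), ('quickly', 'RB')],
    [('talking', 'VBG'), ('slowly', 'RB')],
]

# affix_length=-3: use the last 3 characters (a suffix) as context
suffix3 = AffixTagger(train_sents, affix_length=-3)
# affix_length=2: use the first 2 characters (a prefix) as context
prefix2 = AffixTagger(train_sents, affix_length=2, backoff=suffix3)
unigram = UnigramTagger(train_sents, backoff=prefix2)

# 'jumping' is unseen, but its 'ing' suffix was seen with VBG
print(unigram.tag(['jumping', 'walking']))
```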

Training Brill Taggers

To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 – Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:

python train_tagger.py treebank --sequential aubt --brill

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change these with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.
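Note that in current NLTK (3.x), the FastBrillTaggerTrainer used in the earlier post has been replaced by nltk.tag.brill_trainer.BrillTaggerTrainer. A rough equivalent, assuming that newer API and using toy data in place of treebank:

```python
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('runs', 'VBZ')],
]

# the baseline tagger plays the role of the sequential backoff chain
baseline = UnigramTagger(train_sents)

# brill24() is one of NLTK's predefined rule template sets;
# max_rules and min_score mirror --max_rules and --min_score
trainer = BrillTaggerTrainer(baseline, brill24())
brill_tagger = trainer.train(train_sents, max_rules=200, min_score=2)

print(brill_tagger.tag(['the', 'dog', 'runs']))
```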

Training Classifier Based Taggers

Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don't want this tagger to back off to a sequential backoff tagger, be sure to specify --sequential ''. Here's an example of training a NaiveBayesClassifier-based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:

python train_tagger.py treebank --sequential '' --classifier NaiveBayes

If you do want to back off to a sequential tagger, be sure to specify a cutoff probability, like so:

python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
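The cutoff probability corresponds to the cutoff_prob parameter of NLTK's ClassifierBasedPOSTagger: when the classifier's best guess falls below that probability, the tagger defers to its backoff. A small sketch with toy data (a DefaultTagger stands in for the full ubt chain here):

```python
from nltk.tag import DefaultTagger
from nltk.tag.sequential import ClassifierBasedPOSTagger

train_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

# low-confidence classifier guesses fall through to the backoff tagger
tagger = ClassifierBasedPOSTagger(
    train=train_sents,
    backoff=DefaultTagger('NN'),
    cutoff_prob=0.4,
)

print(tagger.tag(['the', 'dog', 'barks']))
```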

Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.

Phonetic Feature Options

You can also include phonetic algorithm features using the following arguments:

--metaphone: Use metaphone feature
--double-metaphone: Use double metaphone feature
--soundex: Use soundex feature
--nysiis: Use NYSIIS feature
--caverphone: Use caverphone feature

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).
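To see why phonetic codes make useful features, here is a simplified American Soundex (an illustration, not the advas implementation): different regional spellings often collapse to the same code, so the classifier sees one feature value for both.

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to 3 digits."""
    codes = {
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2',
        'q': '2', 's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3', 'l': '4',
        'm': '5', 'n': '5', 'r': '6',
    }
    word = word.lower()
    digits = []
    prev = codes.get(word[0], '')
    for char in word[1:]:
        code = codes.get(char, '')
        # skip vowels and collapse adjacent duplicate codes
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0].upper() + ''.join(digits) + '000')[:4]

# regional spelling variants map to the same phonetic code
print(soundex('color'), soundex('colour'))  # both C460
```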

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.

A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.

  • Max

    Hi Jacob,

    Currently my POS-tagging procedure goes through a chain of backoff POS-taggers, which includes:
    - a chain of pre-trained POS-taggers that are loaded from a pickle file
    - Unigram POS-tagger(s) and RegexpTagger(s) that are not pre-trained but created at program initialization (as it's a server app, it starts only once, so it's acceptable to spend some time on initialization) and need to be placed at different positions in the POS-tagger chain (some need to go at certain positions within the pre-trained chain).

    I want to know if there is a proper way of creating a POS-tagger chain from various POS-taggers (loaded from pickle or just created) so that the order of elements is clearly visible in the code that initializes the chain (i.e. I've got N taggers and I want to put them into a POS-tagger chain in an order that I define).

    I need this because in some cases I'd like to experiment with changing the order of elements in the POS-tagger chain, e.g. add/remove/insert after a certain position (like adding a RegexpTagger after the AffixTagger and before the DefaultTagger). I want all these changes to be fast and clean every time I need them.
    (I should note that this can of course be done before saving a tagger to pickle, but then I would need to re-train it every time I change something in a RegexpTagger, and if the training data needs to be filtered or pre-parsed before being supplied as the train= parameter, it all becomes a bit messy and time-consuming. To sum up, I want some taggers to be pre-trained, but not all.)

    The thing is that a backoff can only be supplied when creating a tagger (or can it?). How should I supply a backoff tagger to one loaded from pickle?
    Currently I can access the POS-tagger chain elements directly like this:
    pos_tagger_chain._initial_tagger._taggers[0]=…
    This way new elements can be added, and it seems to work, but:
    - it's a messy, ugly solution
    - I'm not sure it works correctly (though it seems to)
    - it will stop working if these class properties ever become read-only or similar.

    I suppose NLTK provides a solution, I just haven’t found it.
    Could you suggest anything about this?

    I hope you understand my question.

    Thanks in advance.

  • http://streamhacker.com/ Jacob Perkins

    Hi Max, it sounds like you already figured it out: modify the _taggers list. NLTK doesn't provide an API for modifying the backoff chain, so this is the only way to do it. One alternative is to make your own API by subclassing SequentialBackoffTagger and modifying tag_one() (see http://nltk.org/_modules/nltk/tag/sequential.html#SequentialBackoffTagger), but you'd still be accessing the _taggers list, so it's really not that different.
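    A minimal sketch of such a helper, assuming the private _taggers layout that NLTK's SequentialBackoffTagger currently uses (being a private attribute, it could change in future NLTK versions):

```python
from nltk.tag import DefaultTagger, RegexpTagger

def insert_backoff(chain_head, new_tagger, position):
    """Splice new_tagger into an existing backoff chain at the given
    position by rebuilding each member's _taggers list, which is where
    SequentialBackoffTagger stores the full chain."""
    taggers = list(chain_head._taggers)
    taggers.insert(position, new_tagger)
    # each tagger's _taggers must be the sub-chain starting at itself
    for i, tagger in enumerate(taggers):
        tagger._taggers = taggers[i:]
    return taggers[0]

# e.g. add a number-matching RegexpTagger between an existing tagger
# and its DefaultTagger fallback
head = RegexpTagger([(r'.*ing$', 'VBG')], backoff=DefaultTagger('NN'))
head = insert_backoff(head, RegexpTagger([(r'^\d+$', 'CD')]), 1)

print(head.tag(['running', '42', 'dog']))
```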

  • Max

    Thanks for your fast reply!
    Now I know that’s an acceptable solution.