NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using the train_tagger.py script.
Training Sequential Backoff Taggers
The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the letters u, b, and t, standing for UnigramTagger, BigramTagger, and TrigramTagger respectively.
For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:
python train_tagger.py treebank --sequential ubt
You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
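To make the backoff idea concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the NgramTagger class, build_chain, tag, and the toy training sentences are all hypothetical names made up for illustration. It also assumes that each letter trains a tagger whose backoff is the tagger built from the letters before it, so with ubt the trigram tagger is consulted first and the unigram tagger last.

```python
from collections import Counter, defaultdict


class NgramTagger:
    """Toy tagger: looks up (previous n-1 tags, word) and backs off on a miss."""

    def __init__(self, n, train_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            history = []
            for word, tag in sent:
                counts[self._context(word, history)][tag] += 1
                history.append(tag)
        # Keep only the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        # A unigram tagger (n == 1) ignores the tag history entirely.
        prev = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev, word)

    def tag_one(self, word, history):
        tag = self.table.get(self._context(word, history))
        if tag is None and self.backoff is not None:
            return self.backoff.tag_one(word, history)
        return tag


def build_chain(letters, train_sents):
    """Assumed behaviour: each letter wraps the chain built so far as its backoff."""
    sizes = {"u": 1, "b": 2, "t": 3}
    tagger = None
    for letter in letters:
        tagger = NgramTagger(sizes[letter], train_sents, backoff=tagger)
    return tagger


def tag(tagger, words):
    history = []
    for word in words:
        history.append(tagger.tag_one(word, history))
    return list(zip(words, history))


# Hypothetical toy data standing in for a real corpus such as treebank.
TRAIN_SENTS = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("run", "NN")],
    [("they", "PRP"), ("run", "VBP")],
]

if __name__ == "__main__":
    chain = build_chain("ubt", TRAIN_SENTS)
    print(tag(chain, ["they", "run"]))
    print(tag(chain, ["dog", "runs"]))
```

In this sketch, "dog runs" never appears at the start of a training sentence, so the trigram lookup misses and the chain falls through to the bigram and unigram tables. A word that no tagger has seen ends up with a tag of None.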
Training Affix Taggers
The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3-character suffix and another that uses a 2-character prefix, specify the --affix argument twice:
python train_tagger.py treebank --sequential aubt --affix -3 --affix 2
The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
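The sign convention for N can be sketched in a few lines of plain Python. The function names here are hypothetical, and the min_stem_length check is an assumption modeled on how I understand NLTK's AffixTagger to skip words that are too short, so treat it as an illustration rather than the library's code.

```python
def affix_of(word, affix_length, min_stem_length=2):
    """Return the prefix (positive length) or suffix (negative length) of a word.

    Returns None when the word is too short to leave a stem of at least
    min_stem_length characters (an assumed rule, see the note above).
    """
    if affix_length == 0:
        raise ValueError("affix_length must be non-zero")
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    if affix_length > 0:
        return word[:affix_length]
    return word[affix_length:]


def affix_contexts(word, affix_lengths):
    """One context per --affix value, in the order the values were given."""
    return [affix_of(word, n) for n in affix_lengths]


if __name__ == "__main__":
    # Mirrors "--affix -3 --affix 2": a 3-character suffix, then a 2-character prefix.
    print(affix_contexts("walking", [-3, 2]))
```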
Training Brill Taggers
To train a Brill tagger that starts from a sequential backoff tagger, add the --brill argument:
python train_tagger.py treebank --sequential aubt --brill
The default training options are a maximum of 200 rules with a minimum score of 2. You can change the minimum score with the --min_score argument, and the maximum number of rules and the rule template bounds (which default to 1) each have their own argument as well.
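To show what the rule limit and minimum score control, here is a heavily simplified sketch of transformation-based rule learning in plain Python. It is not NLTK's Brill trainer: the only rule shape is "change tag A to B when the previous tag is C" (roughly a template bound of 1), and every function name and the toy data are hypothetical.

```python
def apply_rule(rule, tags):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def score(rule, current, gold):
    """Number of errors fixed minus number of correct tags broken."""
    total = 0
    for cur_sent, gold_sent in zip(current, gold):
        new_sent = apply_rule(rule, cur_sent)
        for old, new, target in zip(cur_sent, new_sent, gold_sent):
            if old == new:
                continue
            if new == target:
                total += 1
            elif old == target:
                total -= 1
    return total


def train_rules(initial, gold, max_rules=200, min_score=2):
    """Greedily pick the best-scoring rule until a limit is reached."""
    current = [list(sent) for sent in initial]
    rules = []
    while len(rules) < max_rules:
        # Propose one candidate rule per remaining error.
        candidates = {
            (cur[i], target[i], cur[i - 1])
            for cur, target in zip(current, gold)
            for i in range(1, len(cur))
            if cur[i] != target[i]
        }
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda r: score(r, current, gold))
        if score(best, current, gold) < min_score:
            break
        rules.append(best)
        current = [apply_rule(best, sent) for sent in current]
    return rules


if __name__ == "__main__":
    # Tags from a hypothetical initial tagger versus the correct tags.
    initial = [["TO", "NN"], ["TO", "NN"], ["DT", "NN"]]
    gold = [["TO", "VB"], ["TO", "VB"], ["DT", "NN"]]
    print(train_rules(initial, gold))
```

A higher minimum score keeps only rules that fix many errors, while a lower maximum rule count stops training sooner; both trade accuracy against training time and rule list size.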
Training Classifier Based Taggers
Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don’t want this tagger to back off to a sequential backoff tagger, be sure to specify --sequential ''. Here’s an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
If you do want to back off to a sequential tagger, be sure to specify a cutoff probability, like so:
python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
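The role of the cutoff probability can be sketched as follows. This is a plain Python illustration, not the tagger's actual code: choose_tag, tag_word, the probability values, and the dictionary-based backoff are all hypothetical, and the sketch assumes the classifier's best guess is kept only when its probability reaches the cutoff.

```python
def choose_tag(tag_probs, cutoff_prob=None):
    """Return the most probable tag, or None if it falls below the cutoff."""
    best = max(tag_probs, key=tag_probs.get)
    if cutoff_prob is not None and tag_probs[best] < cutoff_prob:
        return None
    return best


def tag_word(word, tag_probs, backoff, cutoff_prob=None):
    """Use the classifier's tag when it is confident enough, else ask the backoff."""
    tag = choose_tag(tag_probs, cutoff_prob)
    if tag is None:
        return backoff(word)
    return tag


if __name__ == "__main__":
    # Hypothetical backoff: a lookup table standing in for a sequential tagger.
    backoff = {"run": "NN"}.get
    uncertain = {"VB": 0.38, "NN": 0.35, "JJ": 0.27}
    print(tag_word("run", uncertain, backoff, cutoff_prob=0.4))
    print(tag_word("run", uncertain, backoff))
```

Without a cutoff the classifier always answers, so the backoff tagger would never be consulted, which is why the cutoff matters when combining the two.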
Any of the NLTK classification algorithms, such as MEGAM, can be used for the --classifier argument, and every algorithm other than NaiveBayes has specific training options that can be customized.
Phonetic Feature Options
You can also include phonetic algorithm features. There is one argument for each of the following features: metaphone, double metaphone, soundex, NYSIIS, and caverphone.
These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).
I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
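As an illustration of how spelling variants can collapse to one phonetic code, here is a small sketch of the classic American Soundex algorithm in plain Python. It is written from the standard description of Soundex, not copied from advas, so details may differ from the version nltk-trainer uses. The add_soundex_feature helper and the "soundex" feature key are hypothetical.

```python
# Map each consonant to its Soundex digit; vowels, h, w, and y have no digit.
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code, or '' if the word has no letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        # Skip letters that repeat the previous digit.
        if digit and digit != prev:
            code += digit
        # h and w do not separate letters that share a digit; vowels do.
        if ch not in "hw":
            prev = digit
    return (code + "000")[:4]


def add_soundex_feature(features, word):
    """Hypothetical helper: add the code alongside a tagger's other features."""
    features = dict(features)
    features["soundex"] = soundex(word)
    return features


if __name__ == "__main__":
    for variant in ("color", "colour"):
        print(variant, add_soundex_feature({"word": variant}, variant))
```

Both "color" and "colour" receive the same code here, so a classifier that sees the code as a feature has one less difference between the two spellings to learn around.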
A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load and use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.