NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.
Training Sequential Backoff Taggers
The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:
a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger
For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:
python train_tagger.py treebank --sequential ubt
You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
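To make the idea of a backoff chain concrete, here is a minimal pure-Python sketch. It does not use NLTK: the classes ToyDefaultTagger and ToyNgramTagger, the build_chain helper, and the default tag "NN" are all hypothetical names chosen for illustration, and the sketch assumes the chain is built left to right, so that the last letter becomes the first tagger consulted.

```python
from collections import Counter, defaultdict


class ToyDefaultTagger:
    """End of the chain: always answers with one fixed tag."""

    def __init__(self, tag):
        self.default_tag = tag

    def choose_tag(self, word, history):
        return self.default_tag


class ToyNgramTagger:
    """Picks the most frequent tag seen for (previous n-1 tags, word)."""

    def __init__(self, n, train_sents, backoff):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            history = []
            for word, tag in sent:
                counts[self._context(word, history)][tag] += 1
                history.append(tag)
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        # A unigram tagger (n == 1) looks at the word alone.
        prev_tags = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev_tags, word)

    def choose_tag(self, word, history):
        tag = self.table.get(self._context(word, history))
        # Unseen context: defer to the next tagger in the chain.
        return tag if tag is not None else self.backoff.choose_tag(word, history)

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.choose_tag(word, history))
        return list(zip(words, history))


NGRAM_SIZE = {"u": 1, "b": 2, "t": 3}  # toy stand-in for the letter codes


def build_chain(letters, train_sents, default_tag="NN"):
    """Wrap each new tagger around the previous one, left to right."""
    tagger = ToyDefaultTagger(default_tag)
    for letter in letters:
        tagger = ToyNgramTagger(NGRAM_SIZE[letter], train_sents, backoff=tagger)
    return tagger


train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
tagger = build_chain("ubt", train)
result = tagger.tag(["the", "cat", "barks"])
```

With ubt, the trigram tagger sits on top: it answers when it has seen the context and otherwise hands the word down to the bigram tagger, then the unigram tagger, then the default.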
Training Affix Taggers
The --sequential argument also recognizes the letter a, which inserts an AffixTagger into the backoff chain. If you do not specify the --affix argument, it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3-character suffix and another that uses a 2-character prefix, specify the --affix argument twice:
python train_tagger.py treebank --sequential aubt --affix -3 --affix 2
The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
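The sign convention for N can be illustrated with a small helper. This is a sketch modeled on how NLTK's AffixTagger interprets its affix_length and min_stem_length parameters, as far as I understand them; the function name affix_of is hypothetical and not part of NLTK or nltk-trainer.

```python
def affix_of(word, affix_length, min_stem_length=2):
    """Return the prefix (positive length) or suffix (negative length) of a word.

    Words too short to leave a stem of min_stem_length characters yield None,
    which in a backoff chain means "let the next tagger decide".
    """
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    if affix_length > 0:
        return word[:affix_length]   # prefix, e.g. --affix 2
    return word[affix_length:]       # suffix, e.g. --affix -3


suffix = affix_of("running", -3)
prefix = affix_of("running", 2)
too_short = affix_of("cat", -3)
```

Here --affix -3 corresponds to looking at "ing" in "running", while --affix 2 corresponds to looking at "ru".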
Training Brill Taggers
To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 – Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:
python train_tagger.py treebank --sequential aubt --brill
The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.
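The way the two limits interact can be sketched as follows. This is a deliberate simplification: a real Brill trainer rescores candidate rules after each one is applied, whereas this hypothetical select_rules function works on a fixed list of already scored rules. Only the two stopping conditions are the point.

```python
def select_rules(scored_rules, max_rules=200, min_score=2):
    """Keep the best-scoring rules until either limit is hit."""
    chosen = []
    for rule, score in sorted(scored_rules, key=lambda rs: rs[1], reverse=True):
        # Stop once enough rules are kept or the best remaining rule is too weak.
        if len(chosen) >= max_rules or score < min_score:
            break
        chosen.append(rule)
    return chosen


candidates = [("rule_a", 5), ("rule_b", 1), ("rule_c", 3)]  # made-up scores
with_defaults = select_rules(candidates)
only_one = select_rules(candidates, max_rules=1)
```

Raising --min_score gives fewer, more reliable rules, while raising --max_rules lets training continue longer when many useful rules exist.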
Training Classifier Based Taggers
Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don’t want this tagger to back off to a sequential backoff tagger, be sure to specify --sequential ''. Here’s an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
If you do want to back off to a sequential tagger, be sure to specify a cutoff probability, like so:
python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
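The role of the cutoff probability can be sketched in a few lines. This assumes the behavior of NLTK's classifier-based tagger as I understand it: the classifier's best tag is accepted only if its probability reaches the cutoff, and otherwise the backoff tagger decides. The function choose_tag and the probability values are hypothetical.

```python
def choose_tag(label_probs, cutoff_prob, backoff_tag):
    """Accept the classifier's best tag only if it is confident enough."""
    best = max(label_probs, key=label_probs.get)
    if label_probs[best] >= cutoff_prob:
        return best
    # Not confident enough: fall through to whatever the backoff chain says.
    return backoff_tag


confident = choose_tag({"NN": 0.7, "VB": 0.3}, cutoff_prob=0.4, backoff_tag="JJ")
unsure = choose_tag({"NN": 0.35, "VB": 0.33, "JJ": 0.32}, cutoff_prob=0.4, backoff_tag="VB")
```

Without a cutoff the classifier always answers, so the sequential backoff tagger would never be consulted.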
Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.
Phonetic Feature Options
You can also include phonetic algorithm features using the following arguments:
--metaphone : Use metaphone feature
--double-metaphone : Use double metaphone feature
--soundex : Use soundex feature
--nysiis : Use NYSIIS feature
--caverphone : Use caverphone feature
These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).
I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
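As an illustration of how different spellings can collapse to one phonetic code, here is a standalone textbook implementation of American Soundex. It is not the advas code that nltk-trainer uses, whose output may differ in detail, and the function name soundex is just a local helper.

```python
# Consonant groups and their Soundex digits; vowels, h, w, and y get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code for a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        # Adjacent letters with the same digit are coded only once.
        if digit and digit != prev:
            code += digit
        # h and w do not separate repeated digits; vowels do.
        if ch not in "hw":
            prev = digit
    return (code + "000")[:4]


american = soundex("color")
british = soundex("colour")
```

Both spellings produce the same code, so a classifier that sees the code as a feature treats them alike.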
A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.