Part of speech tagging is the process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn't actually tell you what methods work best, so I decided to find out for myself.
Training and Test Sentences
NLTK has a data package that includes 3 part-of-speech tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on POS tagging accuracy, so for real-world usage you need to train on a corpus that is representative of the text you actually want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose the reviews, lore, and romance categories primarily because they have a higher occurrence of the word food than other categories.
import itertools
import nltk.corpus
import nltk.tag

brown_review_sents = nltk.corpus.brown.tagged_sents(categories=['reviews'])
brown_lore_sents = nltk.corpus.brown.tagged_sents(categories=['lore'])
brown_romance_sents = nltk.corpus.brown.tagged_sents(categories=['romance'])

brown_train = list(itertools.chain(brown_review_sents[:1000], brown_lore_sents[:1000], brown_romance_sents[:1000]))
brown_test = list(itertools.chain(brown_review_sents[1000:2000], brown_lore_sents[1000:2000], brown_romance_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])
(Updated 4/15/2010 for new brown categories. Also note that the best way to use conll2000 is with conll2000.tagged_sents('test.txt'), but changing that above may change the accuracy.)
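The slicing pattern used above can be wrapped in a small helper. This is only a sketch: split_train_test and the toy sentences are hypothetical names made up for illustration, not part of NLTK, and the sizes are arbitrary.

```python
def split_train_test(sents, n_train, n_test):
    """Return (train, test) taken from consecutive, non-overlapping slices.

    `sents` can be any sliceable sequence of tagged sentences, such as a list.
    """
    train = list(sents[:n_train])
    test = list(sents[n_train:n_train + n_test])
    return train, test


# Toy stand-in for a tagged corpus: each sentence is a list of (word, tag) pairs.
toy_sents = [[("word%d" % i, "NN")] for i in range(10)]

train, test = split_train_test(toy_sents, n_train=6, n_test=3)
print(len(train), len(test))
```

Keeping the two slices adjacent and non-overlapping means the tagger is never scored on a sentence it was trained on.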
NLTK Ngram Taggers
I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these POS taggers, I created a utility function for creating a chain of backoff taggers.
def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
    # With no initial backoff, train the first class on its own and chain the rest onto it.
    if backoff is None:
        backoff = tagger_classes[0](tagged_sents)
        tagger_classes = tagger_classes[1:]

    # Each new tagger backs off to the one built just before it.
    for cls in tagger_classes:
        backoff = cls(tagged_sents, backoff=backoff)

    return backoff

# train_sents stands for whichever training set is in use (brown_train, conll_train, or treebank_train).
ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])
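To make the backoff idea concrete, here is a pure-Python toy that needs no NLTK install. ToyTagger and the three training sentences are invented for illustration. It is a simplified sketch of the idea, not how NLTK's taggers are actually implemented: a tagger looks up its context, and only when that context was never seen in training does it ask its backoff.

```python
from collections import Counter, defaultdict


class ToyTagger:
    """Toy context tagger (hypothetical, not NLTK's implementation).

    With use_prev_tag=False the context is just the word (unigram-style);
    with use_prev_tag=True it is (previous tag, word) (bigram-style).
    """

    def __init__(self, tagged_sents, use_prev_tag, backoff=None):
        self.use_prev_tag = use_prev_tag
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            prev_tag = None
            for word, tag in sent:
                counts[self._context(word, prev_tag)][tag] += 1
                prev_tag = tag
        # Keep only the most frequent tag for each context seen in training.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, prev_tag):
        return (prev_tag, word) if self.use_prev_tag else word

    def choose_tag(self, word, prev_tag):
        tag = self.table.get(self._context(word, prev_tag))
        if tag is None and self.backoff is not None:
            # Unseen context: defer to the next tagger in the chain.
            tag = self.backoff.choose_tag(word, prev_tag)
        return tag

    def tag(self, words):
        tagged, prev_tag = [], None
        for word in words:
            chosen = self.choose_tag(word, prev_tag)
            tagged.append((word, chosen))
            prev_tag = chosen
        return tagged


toy_train = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("count", "VBP")],
    [("she", "PRP"), ("runs", "VBZ")],
]

unigram = ToyTagger(toy_train, use_prev_tag=False)
bigram = ToyTagger(toy_train, use_prev_tag=True, backoff=unigram)

print(unigram.tag(["the", "runs"]))  # word alone: "runs" gets its most common tag
print(bigram.tag(["the", "runs"]))   # context (DT, runs) was seen in training
print(bigram.tag(["dog", "runs"]))   # (None, dog) is unseen, so "dog" comes from the backoff
```

The tagger constructed last is the one consulted first, which is why the order of classes passed to a chain-building helper matters.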
Tagger Accuracy Testing
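To test the accuracy of a POS tagger, we can compare its output to the test sentences using the nltk.tag.accuracy function. (Later NLTK releases expose this as a method on the tagger itself: tagger.evaluate(test_sents), or tagger.accuracy(test_sents) in the most recent versions.)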
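For readers curious what such an accuracy score measures, the following is a minimal pure-Python sketch. It assumes accuracy means the fraction of tokens whose predicted tag matches the gold tag. tagging_accuracy and always_noun are hypothetical names made up for this example and are not NLTK functions.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) pairs.
    """
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


def always_noun(words):
    # Deliberately naive baseline: call every word a noun.
    return [(word, "NN") for word in words]


gold = [
    [("the", "DT"), ("dog", "NN")],
    [("dogs", "NNS"), ("bark", "VBP"), ("food", "NN")],
]

score = tagging_accuracy(always_noun, gold)
print(score)  # 2 of the 5 tokens are nouns, so 0.4
```

A trivial baseline like this is a useful sanity check: any trained tagger should score well above it on the same test sentences.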
Ngram Tagging Accuracy
The ubt_tagger and utb_tagger are extremely close to each other, but the ubt_tagger is the slight favorite (note that the backoff sequence is in reverse order, so for the ubt_tagger, the trigram tagger backs off to the bigram tagger, which backs off to the unigram tagger). In Part of Speech Tagging with NLTK - Regexp and Affix Tagging, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.