Part of Speech Tagging with NLTK – Part 1
An important part of weotta's tag extraction is part of speech tagging, a process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn't actually tell you what methods work best, so I decided to find out for myself.
Training and Test Sentences
NLTK has a data package that includes 3 tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on the tagging accuracy, so for real world usage, you need to train on a corpus that is representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose the reviews, lore, and romance categories primarily because they have a higher occurrence of the word food than other categories.
import itertools
import nltk.corpus
import nltk.tag

# Brown: sentences 0-1000 of each category for training, 1000-2000 for testing.
brown_review_sents = nltk.corpus.brown.tagged_sents(categories=['reviews'])
brown_lore_sents = nltk.corpus.brown.tagged_sents(categories=['lore'])
brown_romance_sents = nltk.corpus.brown.tagged_sents(categories=['romance'])

brown_train = list(itertools.chain(brown_review_sents[:1000], brown_lore_sents[:1000], brown_romance_sents[:1000]))
brown_test = list(itertools.chain(brown_review_sents[1000:2000], brown_lore_sents[1000:2000], brown_romance_sents[1000:2000]))

# conll2000: sentences 0-4000 for training, 4000-8000 for testing.
conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

# treebank: sentences 0-1500 for training, 1500-3000 for testing.
treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])
(Updated 4/15/2010 for new brown categories. Also note that the best way to use conll2000 is with conll2000.tagged_sents('train.txt') and conll2000.tagged_sents('test.txt'), but changing that above may change the accuracy.)
Ngram Tagging
I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these taggers, I created a utility method for creating a chain of SequentialBackoffTaggers.
def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
    # Train each tagger class in turn, passing the previously built tagger
    # as its backoff, and return the last tagger in the chain. The caller's
    # list of classes is left unmodified.
    for cls in tagger_classes:
        backoff = cls(tagged_sents, backoff=backoff)
    return backoff

# train_sents is one of the training sets above: brown_train, conll_train, or treebank_train.
ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])
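To see why chaining helps, here is a minimal pure-Python sketch of the backoff idea. It is not NLTK's implementation: ToyNgramTagger and the two training sentences are made up for illustration, and I'm assuming a context of "previous n-1 tags plus the current word", which is roughly how the NgramTaggers look things up.

```python
from collections import Counter, defaultdict


class ToyNgramTagger:
    """Hypothetical toy tagger, not NLTK code.

    The lookup context is the previous n-1 tags plus the current word.
    """

    def __init__(self, n, tagged_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            history = []
            for word, tag in sent:
                counts[self._context(history, word)][tag] += 1
                history.append(tag)
        # Remember only the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, history, word):
        prev_tags = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev_tags, word)

    def tag_one(self, history, word):
        tag = self.table.get(self._context(history, word))
        if tag is None and self.backoff is not None:
            # Unseen context: defer to the next tagger in the chain.
            tag = self.backoff.tag_one(history, word)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_one(history, word))
        return list(zip(words, history))


train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
unigram = ToyNgramTagger(1, train)
bigram_alone = ToyNgramTagger(2, train)
bigram_chain = ToyNgramTagger(2, train, backoff=unigram)

print(bigram_alone.tag(["dog", "barks"]))
print(bigram_chain.tag(["dog", "barks"]))
```

The bigram tagger on its own never saw "dog" at the start of a sentence, so it returns None for it, and that miss then poisons the context for "barks" as well. With the unigram tagger as backoff, "dog" gets NN from the unigram table, and the bigram tagger can then tag "barks" as VBZ from the restored context.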
Accuracy Testing
To test the accuracy of a tagger, we can compare its output to the tagged test sentences using the nltk.tag.accuracy function.
nltk.tag.accuracy(tagger, test_sents)
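For reference, the measurement itself is simple: strip the tags from the test sentences, re-tag the words, and count how many tokens get the same tag back. The sketch below is my own pure-Python version of that idea, not NLTK's code; tagging_accuracy and always_noun are hypothetical names, and tag_fn stands in for a tagger's tag method. As far as I know, newer NLTK releases expose the same measurement as a method on the tagger (tagger.evaluate(test_sents), later renamed tagger.accuracy(test_sents)), so check which one your version provides.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, guess) in zip(gold, predicted):
            correct += int(gold_tag == guess)
            total += 1
    # Avoid dividing by zero when there is nothing to score.
    return correct / total if total else 0.0


def always_noun(words):
    """Hypothetical baseline that tags every word as NN."""
    return [(word, "NN") for word in words]


gold_sents = [
    [("the", "DT"), ("food", "NN")],
    [("good", "JJ"), ("food", "NN")],
]
print(tagging_accuracy(always_noun, gold_sents))  # 0.5
```

Accuracy here is per token, not per sentence, so a tagger that gets one word wrong in a long sentence is only penalized for that one word.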
Ngram Tagging Accuracy
[Chart: Ngram Tagging Accuracy]
Conclusion
The ubt_tagger and utb_tagger are extremely close to each other, but the ubt_tagger is the slight favorite. Note that the backoff sequence is in reverse order: for the ubt_tagger, the TrigramTagger backs off to the BigramTagger, which backs off to the UnigramTagger.
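If the naming is confusing, this small helper spells out the order in which each chain consults its taggers at tagging time. The letters give the order the taggers were trained and chained in, so the lookup order is the name read backwards. lookup_order and NAMES are hypothetical names for this illustration only.

```python
NAMES = {"u": "UnigramTagger", "b": "BigramTagger", "t": "TrigramTagger"}


def lookup_order(abbrev):
    """Return tagger names in the order they are consulted when tagging."""
    # The last tagger built is the first one asked, so read the name backwards.
    return [NAMES[letter] for letter in reversed(abbrev)]


print(lookup_order("ubt"))  # ['TrigramTagger', 'BigramTagger', 'UnigramTagger']
print(lookup_order("utb"))  # ['BigramTagger', 'TrigramTagger', 'UnigramTagger']
```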
In Part of Speech Tagging with NLTK - Part 2, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.








