Part of Speech Tagging with NLTK – Part 1
An important part of weotta’s tag extraction is part of speech tagging, a process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.
Training and Test Sentences
NLTK has a data package that includes 3 tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on tagging accuracy, so for real-world usage, you need to train on a corpus that is representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose the three categories below (reviews, popular lore, and romance fiction) primarily because they have a higher occurrence of the word “food” than the other categories.
import nltk.corpus, nltk.tag, itertools
from nltk.tag import brill

# PRESS: REVIEWS (category "c" in older NLTK releases)
brownc_sents = nltk.corpus.brown.tagged_sents(categories=["reviews"])
# POPULAR LORE (category "f" in older NLTK releases)
brownf_sents = nltk.corpus.brown.tagged_sents(categories=["lore"])
# FICTION: ROMANCE (category "p" in older NLTK releases)
brownp_sents = nltk.corpus.brown.tagged_sents(categories=["romance"])

# First 1000 sentences of each category for training, the next 1000 for testing
brown_train = list(itertools.chain(brownc_sents[:1000], brownf_sents[:1000], brownp_sents[:1000]))
brown_test = list(itertools.chain(brownc_sents[1000:2000], brownf_sents[1000:2000], brownp_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])
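The category choice above came down to how often the word food shows up. Here is a rough sketch of that kind of check, using toy word lists in place of the real corpus. The helper names relative_frequency and rank_categories are my own for illustration and are not part of NLTK.

```python
from collections import Counter


def relative_frequency(words, target):
    """Return the share of tokens in `words` equal to `target` (case-insensitive)."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    return counts[target.lower()] / total if total else 0.0


def rank_categories(words_by_category, target):
    """Sort category names by how often `target` occurs, most frequent first."""
    return sorted(
        words_by_category,
        key=lambda cat: relative_frequency(words_by_category[cat], target),
        reverse=True,
    )


# Toy stand-in data. With NLTK installed, the mapping could instead be built
# from nltk.corpus.brown.words(categories=name) for each category name.
sample = {
    "reviews": ["The", "food", "was", "good", "food"],
    "news": ["The", "vote", "was", "close"],
}
ranking = rank_categories(sample, "food")
```

Comparing relative rather than absolute counts keeps a large category from winning just because it has more text.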
Ngram Tagging
I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these taggers, I created a utility function that builds a chain of SequentialBackoffTaggers.
def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
    # Train each tagger class in order, passing the previous tagger as its backoff.
    if not backoff:
        backoff = tagger_classes[0](tagged_sents)
        # Slice instead of del so the caller's list is left unmodified.
        tagger_classes = tagger_classes[1:]
    for cls in tagger_classes:
        tagger = cls(tagged_sents, backoff=backoff)
        backoff = tagger
    return backoff

# train_sents is one of the training sets defined above (brown_train, conll_train, or treebank_train)
ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])
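To make the backoff idea concrete without needing any corpus data, here is a minimal pure-Python sketch. ToyContextTagger is a hypothetical class of my own, a simplification of what an ngram tagger with a backoff does and not NLTK's implementation: it looks up the previous n - 1 tags plus the current word, and asks its backoff when it has never seen that context.

```python
from collections import Counter, defaultdict


class ToyContextTagger:
    """Tags a word from the previous n - 1 tags plus the word itself."""

    def __init__(self, n, tagged_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self._context(word, tags[:i])][tag] += 1
        # Keep only the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        prev = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev, word)

    def tag_one(self, word, history):
        tag = self.table.get(self._context(word, history))
        if tag is None and self.backoff is not None:
            # Unknown context: defer to the next tagger in the chain.
            tag = self.backoff.tag_one(word, history)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_one(word, history))
        return list(zip(words, history))


train = [
    [("the", "DT"), ("food", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("they", "PRP"), ("cook", "VBP"), ("food", "NN")],
]
unigram = ToyContextTagger(1, train)
bigram = ToyContextTagger(2, train, backoff=unigram)

# "food" never follows a PRP tag in training, so the bigram lookup misses
# and the unigram backoff supplies the tag.
with_backoff = bigram.tag(["they", "food"])
without_backoff = ToyContextTagger(2, train).tag(["they", "food"])
```

With the backoff in place, food is tagged NN even though the bigram context is new. Without it, the same word comes back with a tag of None, which is why chaining tends to help accuracy.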
Accuracy Testing
To test the accuracy of a tagger, we can compare it to the test sentences using the nltk.tag.accuracy function.
nltk.tag.accuracy(tagger, test_sents)
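For reference, the accuracy being measured is just the fraction of test tokens that receive the same tag as in the gold sentences. Below is a small sketch of that calculation under my own assumptions: tag_accuracy is a hypothetical helper, and tag_words stands in for any callable that turns a list of words into (word, tag) pairs. As far as I know, more recent NLTK releases expose this check as a method on the tagger object (tagger.evaluate(test_sents), and later tagger.accuracy(test_sents)) rather than as a module-level function, so check that against your installed version.

```python
def tag_accuracy(tag_words, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_words(words)
        for (_, guess), (_, gold) in zip(predicted, sent):
            correct += guess == gold
            total += 1
    return correct / total if total else 0.0


def always_nn(words):
    """Stand-in tagger that labels every word as a noun."""
    return [(word, "NN") for word in words]


gold = [
    [("the", "DT"), ("food", "NN")],
    [("good", "JJ"), ("food", "NN")],
]
score = tag_accuracy(always_nn, gold)  # 2 of the 4 tokens are NN
```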
Ngram Tagging Accuracy
[Chart: Ngram Tagging Accuracy]
Conclusion
The ubt_tagger and utb_tagger are extremely close to each other, but the ubt_tagger is the slight favorite. Note that the backoff sequence is in reverse order: for the ubt_tagger, the TrigramTagger backs off to the BigramTagger, which backs off to the UnigramTagger.
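The reversed order is easy to verify with stand-in classes. In this sketch, StubTagger and its subclasses are hypothetical placeholders that only remember their backoff, and build_chain is a simplified version of the chaining loop that assumes every class accepts backoff=None.

```python
class StubTagger:
    """Hypothetical stand-in that only remembers its backoff tagger."""

    def __init__(self, tagged_sents, backoff=None):
        self.backoff = backoff


class Unigram(StubTagger):
    pass


class Bigram(StubTagger):
    pass


class Trigram(StubTagger):
    pass


def build_chain(tagged_sents, tagger_classes):
    """Each class in the list becomes the backoff of the class after it."""
    backoff = None
    for cls in tagger_classes:
        backoff = cls(tagged_sents, backoff=backoff)
    return backoff


def lookup_order(tagger):
    """Walk the chain from the outermost tagger down to the last backoff."""
    names = []
    while tagger is not None:
        names.append(type(tagger).__name__)
        tagger = tagger.backoff
    return names


order = lookup_order(build_chain([], [Unigram, Bigram, Trigram]))
```

Here order comes out as Trigram, Bigram, Unigram: the last class in the list is consulted first, and the first class is the final fallback.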
Update: in Part of Speech Tagging with NLTK – Part 2, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.
March 25, 2009 - 1:38 am
for nltk version 0.9.9b1 the call to tagged_sents in
nltk.corpus.brown.tagged_sents(categories="c")
throws the error:
Traceback (most recent call last):
  File "", line 2, in ?
  File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/tagged.py", line 211, in tagged_sents
    return TaggedCorpusReader.tagged_sents(
  File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/tagged.py", line 148, in tagged_sents
    tag_mapping_function)
  File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/util.py", line 409, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
March 25, 2009 - 7:04 am
The NLTK corpus API has changed since I wrote this. Try with categories=['reviews']
March 25, 2009 - 11:23 am
Thank you!
October 7, 2009 - 7:07 pm
Would you mind telling me what editor you were using for the code above? It’s pretty cool.
October 7, 2009 - 7:47 pm
Moses: The wordpress plugin is SyntaxHighlighter Evolved http://www.viper007bond.com/wordpress-plugins/syntaxhighlighter/