Part of Speech Tagging with NLTK Part 1 – Ngram Taggers

Part of speech tagging is the process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.

Training and Test Sentences

NLTK has a data package that includes 3 part of speech tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets: a training set and a testing set. The choice and size of your training set can have a significant effect on pos tagging accuracy, so for real world usage, you need to train on a corpus that is very representative of the actual text you want to tag. The brown corpus in particular has a number of different categories, so choose your categories wisely. I chose these categories primarily because they have a higher occurrence of the word food than other categories.

import nltk.corpus, nltk.tag, itertools
brown_review_sents = nltk.corpus.brown.tagged_sents(categories=['reviews'])
brown_lore_sents = nltk.corpus.brown.tagged_sents(categories=['lore'])
brown_romance_sents = nltk.corpus.brown.tagged_sents(categories=['romance'])

brown_train = list(itertools.chain(brown_review_sents[:1000], brown_lore_sents[:1000], brown_romance_sents[:1000]))
brown_test = list(itertools.chain(brown_review_sents[1000:2000], brown_lore_sents[1000:2000], brown_romance_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])

(Updated 4/15/2010 for new brown categories. Also note that the best way to use conll2000 is with conll2000.tagged_sents('train.txt') and conll2000.tagged_sents('test.txt'), but changing that above may change the accuracy.)
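
For reference, using conll2000's own train/test files as mentioned in the note above would look roughly like this (this is not the split used for the results in this post):

# conll2000 ships with separate train and test files,
# so you can use its own split instead of slicing by index
conll_train = nltk.corpus.conll2000.tagged_sents('train.txt')
conll_test = nltk.corpus.conll2000.tagged_sents('test.txt')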

NLTK Ngram Taggers

I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these pos taggers, I created a utility method for creating a chain of backoff taggers.

def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
	# if no initial backoff is given, train the first tagger class on its own
	if not backoff:
		backoff = tagger_classes[0](tagged_sents)
		del tagger_classes[0]

	# each subsequent tagger is trained with the previous tagger as its backoff
	for cls in tagger_classes:
		tagger = cls(tagged_sents, backoff=backoff)
		backoff = tagger

	return backoff

# train_sents is one of the training sets defined above: brown_train, conll_train, or treebank_train
ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])
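
Once trained, any of these taggers can tag a tokenized sentence. A quick illustration (the example sentence is just made up):

# tag() expects a list of tokens, not a raw string
tokens = nltk.word_tokenize('The food at this restaurant was excellent.')
ubt_tagger.tag(tokens)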

Tagger Accuracy Testing

To test the accuracy of a pos tagger, we can compare it to the test sentences using the nltk.tag.accuracy function.

nltk.tag.accuracy(tagger, test_sents)
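
For example, to score every combination against one of the test sets (conll_test here; the same applies to brown_test and treebank_test), something like this works. Note that in newer NLTK versions the same thing is done with tagger.evaluate(test_sents), as the comments below point out.

taggers = {'ubt': ubt_tagger, 'utb': utb_tagger, 'but': but_tagger,
	'btu': btu_tagger, 'tub': tub_tagger, 'tbu': tbu_tagger}

# accuracy of each backoff chain on the conll2000 test sentences
scores = {}
for name, tagger in taggers.items():
	scores[name] = nltk.tag.accuracy(tagger, conll_test)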

Ngram Tagging Accuracy

[Chart: accuracy of each tagger combination on the brown, conll2000, and treebank test sets]

Conclusion

The ubt_tagger and utb_tagger are extremely close to each other, but the ubt_tagger is the slight favorite (note that the backoff sequence is in reverse order, so for the ubt_tagger, the trigram tagger backs off to the bigram tagger, which backs off to the unigram tagger). In Part of Speech Tagging with NLTK – Regexp and Affix Tagging, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.
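
To make the reverse ordering concrete, you can walk the trained ubt_tagger's backoff chain; a quick sketch, assuming the backoff property exposed by SequentialBackoffTagger:

# ubt_tagger is the last class trained by backoff_tagger, so it is a
# TrigramTagger backing off to a BigramTagger, which backs off to a UnigramTagger
tagger = ubt_tagger
while tagger:
	print(tagger.__class__.__name__)
	tagger = tagger.backoff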


  • Andrew Lee

    For nltk version 0.9.9b1, the call to tagged_sents in

    nltk.corpus.brown.tagged_sents(categories="c")

    throws the error:

    Traceback (most recent call last):
    File "", line 2, in ?
    File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/tagged.py", line 211, in tagged_sents
    return TaggedCorpusReader.tagged_sents(
    File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/tagged.py", line 148, in tagged_sents
    tag_mapping_function)
    File "/usr/lib/python2.4/site-packages/nltk/corpus/reader/util.py", line 409, in concat
    raise ValueError('concat() expects at least one object!')
    ValueError: concat() expects at least one object!

  • Jacob

    The NLTK corpus API has changed since I wrote this. Try with categories=['reviews']

  • Andrew Lee

    Thank you!

  • Moses

    Would you mind telling me what editor you were using for the code above? It's pretty cool.

  • Jacob

    Moses: The wordpress plugin is SyntaxHighlighter Evolved http://www.viper007bond.com/wordpress-plugins/syntaxhighlighter/


  • M D Sykora

    Excellent post on tagging, and I love the entire blog; it's a good read and you write on interesting topics.

    One question though: what method/algorithm does nltk.pos_tag(..) use in the background? I've used it out of the box like this:

    text = nltk.word_tokenize("Some normal and common English language text, that needs to be part of speech tagged.")

    nltk.pos_tag(text)

    Thanks,
    Martin

  • Jacob Perkins

    Thanks Martin.

    nltk.pos_tag uses a ClassifierBasedTagger trained on treebank. Presumably it uses at least the same features as ClassifierBasedPOSTagger, but when I tested my own ClassifierBasedPOSTagger against nltk.pos_tag, nltk.pos_tag was a bit more accurate on treebank.
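
    Training a ClassifierBasedPOSTagger for that kind of comparison looks roughly like this (a sketch, using its default feature detector and classifier, and the treebank split from the post):

    from nltk.tag.sequential import ClassifierBasedPOSTagger

    # train on the treebank training sentences defined earlier
    classifier_tagger = ClassifierBasedPOSTagger(train=treebank_train)
    # score it the same way as the ngram taggers
    nltk.tag.accuracy(classifier_tagger, treebank_test)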

  • Xiejuncs

    I also have the same question about which algorithm the pos_tag method employs. Thank you very much…

  • VasilisG

    Hi Jacob, when I try to create the ubt I receive the following error:
     
    Traceback (most recent call last):
      File "", line 1, in
        ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
      File "", line 3, in backoff_tagger
        backoff = tagger_classes[0](tagged_sents)
      File "/usr/local/lib/python2.6/dist-packages/nltk/tag/sequential.py", line 317, in __init__
        backoff, cutoff, verbose)
      File "/usr/local/lib/python2.6/dist-packages/nltk/tag/sequential.py", line 274, in __init__
        self._train(train, cutoff, verbose)
      File "/usr/local/lib/python2.6/dist-packages/nltk/tag/sequential.py", line 177, in _train
        tokens, tags = zip(*sentence)
    ValueError: need more than 1 value to unpack

    Is it because of some change in NLTK's API?

    Thanks,
    Vasilis

  • Fred

    This series of posts is quite useful, thanks. For measuring performance, it looks like “accuracy” has moved up to nltk.accuracy and is no longer under nltk.tag.accuracy.

    Also, people often check precision and recall…just pointing out here that, unlike accuracy() which takes 2 lists as inputs, those functions take sets.

  • Fred

    EDIT: accuracy checking also exists as a method on the tagger classes (evaluate()), so you can eval with, e.g.

    acc = ubt_tagger.evaluate(conll_test)

  • Jacob Perkins

    Thanks Fred. The NLTK API has indeed moved around a bit since I wrote this. I also covered precision & recall in the context of sentiment classification, but it's quite a bit more involved for tagging because there are many possible tags. You have to build reference & test sets for each tag, and I've done this for analyze_tagger_coverage.py in https://github.com/japerk/nltk-trainer
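
    A bare-bones sketch of that, using nltk.metrics.precision and nltk.metrics.recall (which take sets, as Fred pointed out) and indexing each token by its sentence and word position:

    import collections
    import nltk.metrics
    from nltk.tag import untag

    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, sent in enumerate(conll_test):
        # re-tag the untagged sentence, then record where each tag occurs
        tagged = ubt_tagger.tag(untag(sent))
        for j, (word, tag) in enumerate(sent):
            refsets[tag].add((i, j))
        for j, (word, tag) in enumerate(tagged):
            testsets[tag].add((i, j))

    # per-tag scores, e.g. for the NN tag
    nn_precision = nltk.metrics.precision(refsets['NN'], testsets['NN'])
    nn_recall = nltk.metrics.recall(refsets['NN'], testsets['NN'])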

  • Henryk

    Thanks for the great blog. Let me add the following comment on the syntax:

    nltk.tag.accuracy(ubt_tagger, test_sents)

    seems to fail in NLTK 2.0; instead one can check the accuracy with

    ubt_tagger.evaluate(test_sents)

  • Henryk

    Sorry, Fred wrote about the issue a year ago, I missed this post.