How to Train a NLTK Chunker

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.

Training a Chunker

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]

Chunker Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
Accuracy for Trained Chunker
Accuracy for Trained Chunker

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.

Conclusion

Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.