In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it's very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.
Training a Chunker
The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the the conll2000 corpus. Here's a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.
import nltk.chunk def conll_tag_chunks(chunk_sents): tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents] return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
So how accurate is the trained chunker? Here's the rest of the code, followed by a chart of the accuracy results. Note that I'm only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.
import nltk.corpus, nltk.tag def ubt_conll_chunk_accuracy(train_sents, test_sents): train_chunks = conll_tag_chunks(train_sents) test_chunks = conll_tag_chunks(test_sents) u_chunker = nltk.tag.UnigramTagger(train_chunks) print 'u:', nltk.tag.accuracy(u_chunker, test_chunks) ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker) print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks) ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker) print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks) ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker) print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks) utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker) print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks) # conll chunking accuracy test conll_train = nltk.corpus.conll2000.chunked_sents('train.txt') conll_test = nltk.corpus.conll2000.chunked_sents('test.txt') ubt_conll_chunk_accuracy(conll_train, conll_test) # treebank chunking accuracy test treebank_sents = nltk.corpus.treebank_chunk.chunked_sents() ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.
Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.