
Announcing Text Processing APIs

If you liked the NLTK demos, then you’ll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you’d like to do more, please fill out this survey to let me know what your needs are.

Announcing Python NLTK Demos

If you want to see what NLTK can do, but don’t want to go thru the effort of installation and learning how to use it, then check out my Python NLTK demos.

It currently demonstrates the following functionality:

If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.

Other Natural Language Processing Demos

Here’s a list of similar resources on the web:

NLTK Classifier Based Chunker Accuracy

The NLTK Book has been updated with an explanation of how to train a classifier-based chunker, and I wanted to compare its accuracy against my previous tagger-based chunker.

Tag Chunker

I already covered how to train a tagger-based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I’ll use this Unigram-Bigram Chunker as a baseline for comparison below.
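
For reference, here’s a minimal sketch of that baseline (the conll_tag_chunks helper is the one defined in the "How to Train a NLTK Chunker" post further down this page):

import nltk.corpus, nltk.tag

# baseline chunk tagger: a UnigramTagger with a BigramTagger layered on top,
# trained on (pos-tag, iob-tag) sequences from conll2000
train_chunks = conll_tag_chunks(nltk.corpus.conll2000.chunked_sents('train.txt'))
u_chunker = nltk.tag.UnigramTagger(train_chunks)
ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)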

Classifier Chunker

A classifier-based chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It’s very similar to the TagChunker in that the chunker class is really a wrapper around a classifier-based part-of-speech tagger, and both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.

Classifier Tagger

The ClassifierTagger below is an abstracted version of what’s described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed on to the classifier’s train method.

from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
	'''Abstracted from "Training Classifier-Based Chunkers" section of
	http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
	'''
	def __init__(self, feature_extractor, classifier):
		self.feature_extractor = feature_extractor
		self.classifier = classifier

	def tag(self, sent):
		history = []

		for i, word in enumerate(sent):
			featureset = self.feature_extractor(sent, i, history)
			tag = self.classifier.classify(featureset)
			history.append(tag)

		return zip(sent, history)

	@classmethod
	def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
		train_set = []

		for tagged_sent in train_sents:
			untagged_sent = untag(tagged_sent)
			history = []

			for i, (word, tag) in enumerate(tagged_sent):
				featureset = feature_extractor(untagged_sent, i, history)
				train_set.append((featureset, tag))
				history.append(tag)

		classifier = classifier_cls.train(train_set, **kwargs)
		return cls(feature_extractor, classifier)

Classifier Chunker

The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. The args and kwargs given to __init__ are passed on to ClassifierTagger.train().

from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

class ClassifierChunker(ChunkParserI):
	def __init__(self, train_sents, *args, **kwargs):
		tag_sents = [tree2conlltags(sent) for sent in train_sents]
		train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
		self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

	def parse(self, tagged_sent):
		if not tagged_sent: return None
		chunks = self.tagger.tag(tagged_sent)
		return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])

Feature Extractors

Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.

def pos(sent, i, history):
	word, pos = sent[i]
	return {'pos': pos}

def pos_word(sent, i, history):
	word, pos = sent[i]
	return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
	word, pos = sent[i]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}

Training

Now that we have all the pieces, we can put them together with training.

NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I’m guessing this is a divide-by-zero error somewhere). If that happens, you can hit Ctrl-C once to stop the training early and continue with the classifier trained so far.

from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
	min_lldelta=0.01, max_iter=10)
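
Once the chunker is trained, you can use it like any other NLTK chunker. Here’s a quick usage sketch; nltk.pos_tag stands in for whatever part-of-speech tagger you’d actually use:

import nltk

# tag a tokenized sentence, then chunk the tagged words into a parse tree
tagged = nltk.pos_tag(['The', 'little', 'yellow', 'dog', 'barked', '.'])
print chunker.parse(tagged)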

Accuracy

I ran the above training code for each feature extractor defined above, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-Axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).

[Chart: conll2000 chunk training accuracy]
[Chart: treebank chunk training accuracy]
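
If you want to score a trained chunker yourself, here’s a minimal sketch using the ChunkScore returned by ChunkParserI.evaluate() (assuming chunker is a ClassifierChunker trained as above):

from nltk.corpus import conll2000

# score the trained chunker against the conll2000 test data
test_sents = conll2000.chunked_sents('test.txt')
score = chunker.evaluate(test_sents)
print 'accuracy:', score.accuracy()
print 'precision:', score.precision()
print 'recall:', score.recall()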

One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.

Custom Training Data

Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable through manual annotation of the training data.

Chunk Extraction with NLTK

Chunk extraction is a useful preliminary step to information extraction that uses a chunker to create parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3 step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train an NLTK part-of-speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your pos tagger and IOB chunker to do something useful.

IOB Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
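
Here’s a quick usage sketch of the wrapper, assuming ub_chunker is the unigram + bigram chunk tagger trained as described in the "How to Train a NLTK Chunker" post below:

# wrap the trained chunk tagger so it exposes the standard ChunkParserI interface
chunker = TagChunker(ub_chunker)
# chunker.parse() takes a part-of-speech tagged sentence and returns a parse tree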

Chunk Extraction

Now that we have a proper NLTK chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. For example, the noun phrase "the little dog" would come back as the leaves [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN')]. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t be getting many chunks.

How to Train a NLTK Chunker

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.
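
For a concrete picture of the rule-based approach mentioned in the next section, here’s a minimal sketch using NLTK’s RegexpParser with the simple noun phrase grammar from the NLTK Book (shown only for contrast with the trained chunkers that follow):

import nltk
from nltk.chunk import RegexpParser

# a hand-written chunk grammar: an optional determiner, any number of
# adjectives, then a noun, all grouped together as an NP chunk
grammar = 'NP: {<DT>?<JJ>*<NN>}'
np_chunker = RegexpParser(grammar)

tagged = nltk.pos_tag(['The', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat'])
print np_chunker.parse(tagged)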

Training a Chunker

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
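
To see what the transformed training data looks like, here’s a small sketch; each chunked sentence becomes a list of (pos-tag, iob-tag) pairs such as ('NN', 'B-NP'):

import nltk.corpus

# inspect the first few (pos, iob) pairs of the first conll2000 training sentence
train_chunks = conll_tag_chunks(nltk.corpus.conll2000.chunked_sents('train.txt'))
print train_chunks[0][:5]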

Chunker Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])

[Chart: Accuracy for Trained Chunker]

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.

Conclusion

Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.