NLTK Classifier Based Chunker Accuracy

The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare it’s accuracy versus my previous tagger based chunker.

Tag Chunker

I already covered how to train a tagger based chunker, with the the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I’ll use this Unigram-Bigram Chunker as a baseline for comparison below.

Classifier Chunker

A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It’s very similar to the TagChunker in that the Chunker class is really a wrapper around a Classifier based part-of-speech tagger. And both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.

Classifier Tagger

The ClassifierTagger below is an abstracted version of what’s described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed to the classifier constructor.

from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
	'''Abstracted from "Training Classifier-Based Chunkers" section of

http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html

	'''
	def __init__(self, feature_extractor, classifier):
		self.feature_extractor = feature_extractor
		self.classifier = classifier

	def tag(self, sent):
		history = []

		for i, word in enumerate(sent):
			featureset = self.feature_extractor(sent, i, history)
			tag = self.classifier.classify(featureset)
			history.append(tag)

		return zip(sent, history)

	@classmethod
	def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
		train_set = []

		for tagged_sent in train_sents:
			untagged_sent = untag(tagged_sent)
			history = []

			for i, (word, tag) in enumerate(tagged_sent):
				featureset = feature_extractor(untagged_sent, i, history)
				train_set.append((featureset, tag))
				history.append(tag)

		classifier = classifier_cls.train(train_set, **kwargs)
		return cls(feature_extractor, classifier)

Classifier Chunker

The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed in to ClassifierTagger.train().

from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

class ClassifierChunker(nltk.chunk.ChunkParserI):
	def __init__(self, train_sents, *args, **kwargs):
		tag_sents = [tree2conlltags(sent) for sent in train_sents]
		train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
		self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

	def parse(self, tagged_sent):
		if not tagged_sent: return None
		chunks = self.tagger.tag(tagged_sent)
		return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])

Feature Extractors

Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.

def pos(sent, i, history):
	word, pos = sent[i]
	return {'pos': pos}

def pos_word(sent, i, history):
	word, pos = sent[i]
	return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
	word, pos = sent[i]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}

Training

Now that we have all the pieces, we can put them together with training.

NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I’m guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, you can stop the training and continue.

from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
	min_lldelta=0.01, max_iter=10)

Accuracy

I ran the above training code for each feature extractor defined above, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-Axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).

conll2000 chunk training accuracy
treebank chunk training accuracy

One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.

Custom Training Data

Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable thru manualation.

  • http://www.smithware.co.uk/ James Smith

    I don't suppose you'd be willing to upload the trained chunker as a serialized object would you?

  • http://streamhacker.com/ Jacob Perkins

    What corpus would you want it trained on, treebank, conll2000, or both? Any suggestions on where to put the file? (it's a couple megabytes)

  • http://www.smithware.co.uk/ James Smith

    I've got my computer finally setup so I'm training the stuff myself now.

    I'm training a few different ones to evaluate but I suspect I may be doing something wrong as the iterations go 1, 2, Final. Any ideas?

    Unlike yourself, I'm also interested in the named entity side of this. Unfortunately the NLTK default chunker is pretty bad at recognising the entity types so I'm using conll2002s chunked_sents as a training corpus. I'd also like to use the treebank NE material but suspect I would have to change the tags to the same names as those used in conll materials.

  • http://streamhacker.com/ Jacob Perkins

    Sorry for the late reply, somehow the notification got caught in my spam filter. Anyway…

    If you think the iterations are cutting off too quickly, you might want to try tweaking some of the cutoffs described here: http://nltk.googlecode.com/svn/trunk/doc/api/nl

    Wish I could help with the NE. All I can think of is to suggest checking out the names corpus: http://nltk.googlecode.com/svn/trunk/doc/api/nl
    Maybe you can train on that somehow.

  • http://streamhacker.com/ Jacob Perkins

    Sorry for the late reply, somehow the notification got caught in my spam filter. Anyway…

    If you think the iterations are cutting off too quickly, you might want to try tweaking some of the cutoffs described here: http://nltk.googlecode.com/svn/trunk/doc/api/nl

    Wish I could help with the NE. All I can think of is to suggest checking out the names corpus: http://nltk.googlecode.com/svn/trunk/doc/api/nl
    Maybe you can train on that somehow.

  • Pingback: Learning to do natural language processing with NLTK | JetLlib Journal

  • Pingback: Extracting binary relationships from English sentences with Python | The World of Yesterday