NLTK Classifier Based Chunker Accuracy

The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare its accuracy against my previous tagger based chunker.

Tag Chunker

I already covered how to train a tagger based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I'll use this Unigram-Bigram TagChunker as the baseline for comparison below.

Classifier Chunker

A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to assign. It's very similar to the TagChunker in that the chunker class is really a wrapper around a classifier based part-of-speech tagger, and both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.
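For reference, IOB chunk tags mark whether a word begins a chunk (B-), is inside a chunk (I-), or is outside any chunk (O). For a made-up sentence, the word, part-of-speech tag, and IOB chunk tag columns might look like this:

the DT B-NP
little JJ I-NP
dog NN I-NP
barked VBD B-VP
at IN B-PP
the DT B-NP
cat NN I-NP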

Classifier Tagger

The ClassifierTagger below is an abstracted version of what's described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed to the classifier's train method.

from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
	'''Abstracted from "Training Classifier-Based Chunkers" section of
	http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
	'''
	def __init__(self, feature_extractor, classifier):
		self.feature_extractor = feature_extractor
		self.classifier = classifier

	def tag(self, sent):
		# history holds the tags predicted so far in this sentence, so the
		# feature extractor can refer back to previous decisions
		history = []

		for i, word in enumerate(sent):
			featureset = self.feature_extractor(sent, i, history)
			tag = self.classifier.classify(featureset)
			history.append(tag)

		return zip(sent, history)

	@classmethod
	def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
		train_set = []

		for tagged_sent in train_sents:
			untagged_sent = untag(tagged_sent)
			# use the gold standard tags as history while building training featuresets
			history = []

			for i, (word, tag) in enumerate(tagged_sent):
				featureset = feature_extractor(untagged_sent, i, history)
				train_set.append((featureset, tag))
				history.append(tag)

		classifier = classifier_cls.train(train_set, **kwargs)
		return cls(feature_extractor, classifier)

Classifier Chunker

The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. The args and kwargs given to __init__ are passed through to ClassifierTagger.train().

from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

class ClassifierChunker(ChunkParserI):
	def __init__(self, train_sents, *args, **kwargs):
		tag_sents = [tree2conlltags(sent) for sent in train_sents]
		train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
		self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

	def parse(self, tagged_sent):
		if not tagged_sent: return None
		chunks = self.tagger.tag(tagged_sent)
		return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])

Feature Extractors

Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.

def pos(sent, i, history):
	word, pos = sent[i]
	return {'pos': pos}

def pos_word(sent, i, history):
	word, pos = sent[i]
	return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
	word, pos = sent[i]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
	word, pos = sent[i]

	if i == 0:
		prevword, prevpos = '<START>', '<START>'
	else:
		prevword, prevpos = sent[i-1]

	if i == len(sent) - 1:
		nextword, nextpos = '<END>', '<END>'
	else:
		nextword, nextpos = sent[i+1]

	return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
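
To make the featuresets concrete, here's what prev_next_pos_word() would return for one position in a made-up part-of-speech tagged sentence. Note that none of these extractors actually use the history argument (the tags assigned so far), but the ClassifierTagger passes it in case your own extractor does.

sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
print prev_next_pos_word(sent, 2, ['B-NP', 'I-NP'])
# {'pos': 'NN', 'prevpos': 'JJ', 'nextpos': 'VBD', 'word': 'dog'} (key order may vary)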

Training

Now that we have all the pieces, we can put them together with training.

NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I'm guessing this is a divide-by-zero error somewhere). If that happens, you can hit Ctrl-C once to stop the training at that point and continue with the classifier trained so far.

from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
	min_lldelta=0.01, max_iter=10)
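
One way to score the trained chunker is the evaluate() method provided by ChunkParserI, which compares the chunker's output against gold standard chunked sentences. A minimal sketch using the conll2000 test set (treebank chunked sentences can be scored the same way):

test_sents = conll2000.chunked_sents('test.txt')
score = chunker.evaluate(test_sents)
# the resulting ChunkScore reports accuracy, precision, recall and f-measure
print score.accuracy()
print score.f_measure()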

Accuracy

I ran the above training code with each feature extractor defined above, and generated the charts below. ub still refers to the unigram-bigram TagChunker, included as the comparison baseline. All the other labels on the x-axis refer to a classifier trained with one of the feature extraction functions above, abbreviated using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).

[Chart: conll2000 chunk training accuracy]
[Chart: treebank chunk training accuracy]

One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.

Custom Training Data

Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable through manual annotation of the training data.

Chunk Extraction with NLTK

Chunk extraction is a useful preliminary step for information extraction: it uses a chunker to create parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3 step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train an NLTK part-of-speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your part-of-speech tagger and IOB chunker to do something useful.

IOB Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))

Chunk Extraction

Now that we have a proper NLTK chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.
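The example assumes tagger is your trained part-of-speech tagger and chunker is a TagChunker wrapping your trained chunk tagger. A hypothetical way to set those up, if you pickled the trained objects after training (the filenames are just placeholders):

import pickle

# hypothetical filenames -- use wherever you saved your trained taggers
tagger = pickle.load(open('pos_tagger.pickle', 'rb'))
chunk_tagger = pickle.load(open('chunk_tagger.pickle', 'rb'))
chunker = TagChunker(chunk_tagger)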

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t get many chunks.
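
If you want each noun phrase as a plain string instead of a list of tagged words, a small helper can join the words back together. A minimal sketch (the function name is just illustrative):

def extract_noun_phrases(tree):
    # collect each NP chunk as a plain string, dropping the part-of-speech tags
    return [' '.join(word for (word, tag) in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.node == 'NP')]

print extract_noun_phrases(tree)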