The NLTK Book has been updated with an explanation of how to train a classifier-based chunker, and I wanted to compare its accuracy against my previous tagger-based chunker.
Tag Chunker
I already covered how to train a tagger-based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I'll use this Unigram-Bigram Chunker as a baseline for comparison below.
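For reference, here is a minimal sketch of that baseline. It assumes the TagChunker is a BigramTagger backed off to a UnigramTagger, trained to map part-of-speech tag sequences to IOB chunk tags; the class from the earlier post may differ in its details.

from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger, BigramTagger

class TagChunker(ChunkParserI):
    '''Sketch of a Unigram-Bigram chunker: tags POS sequences with IOB chunk tags.'''
    def __init__(self, train_sents):
        # train on (pos, iob) pairs extracted from the chunked training trees
        train_data = [[(t, c) for (w, t, c) in tree2conlltags(sent)] for sent in train_sents]
        self.tagger = BigramTagger(train_data, backoff=UnigramTagger(train_data))

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        (words, tags) = zip(*tagged_sent)
        # fall back to 'O' for POS tags the tagger has never seen
        chunk_tags = [c if c else 'O' for (t, c) in self.tagger.tag(tags)]
        return conlltags2tree(zip(words, tags, chunk_tags))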
Classifier Chunker
A classifier-based chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It's very similar to the TagChunker in that the chunker class is really a wrapper around a classifier-based part-of-speech tagger, and both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.
Classifier Tagger
The ClassifierTagger below is an abstracted version of what's described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed on to the classifier's train method.
from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
    '''Abstracted from the "Training Classifier-Based Chunkers" section of
    http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
    '''
    def __init__(self, feature_extractor, classifier):
        self.feature_extractor = feature_extractor
        self.classifier = classifier

    def tag(self, sent):
        # classify each token in turn, feeding previous decisions back in as history
        history = []

        for i, word in enumerate(sent):
            featureset = self.feature_extractor(sent, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)

        return zip(sent, history)

    @classmethod
    def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
        # build one (featureset, tag) training instance per token
        train_set = []

        for tagged_sent in train_sents:
            untagged_sent = untag(tagged_sent)
            history = []

            for i, (word, tag) in enumerate(tagged_sent):
                featureset = feature_extractor(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)

        # kwargs are passed through to the classifier's train method
        classifier = classifier_cls.train(train_set, **kwargs)
        return cls(feature_extractor, classifier)
Classifier Chunker
The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed on to ClassifierTagger.train().
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, *args, **kwargs):
        # convert each chunk tree to [((word, pos), iob), ...] for tagger training
        tag_sents = [tree2conlltags(sent) for sent in train_sents]
        train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
        self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        # tag the (word, pos) tuples with IOB chunk tags, then rebuild a chunk tree
        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
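As a hypothetical usage example (the sentence and the chunks shown are made up; real output depends on the training data and feature extractor), parse() takes a POS-tagged sentence and returns a chunk tree:

# assumes `chunker` has already been trained, as shown in the Training section below
tagged_sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
print(chunker.parse(tagged_sent))
# something like: (S (NP the/DT little/JJ dog/NN) (VP barked/VBD))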
Feature Extractors
Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.
def pos(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos}

def pos_word(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
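To make the featuresets concrete, here is what prev_next_pos_word() returns for the middle word of a made-up tagged sentence (dict key order may vary):

sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
print(prev_next_pos_word(sent, 1, []))
# {'pos': 'JJ', 'prevpos': 'DT', 'nextpos': 'NN', 'word': 'little'}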
Training
Now that we have all the pieces, we can put them together by training a chunker.
NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I'm guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, you can stop the training and continue.
from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier, min_lldelta=0.01, max_iter=10)
Accuracy
I ran the training code above for each feature extractor and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the x-axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc.).
One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.
Custom Training Data
Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and it means the behavior of the ClassifierChunker is much more controllable through manual annotation.