Text Processing APIs
If you liked the NLTK demos, then you’ll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage and demand. If you’d like to do more, please fill out this survey to let me know what your needs are.
Announcing Python NLTK Demos
If you want to see what NLTK can do, but don’t want to go thru the effort of installing it and learning how to use it, then check out my Python NLTK demos.
It currently demonstrates the following functionality:
- part-of-speech tagging with the default NLTK pos tagger
- chunking and named entity recognition with the default NLTK chunker
- sentiment analysis with a combination of a naive bayes classifier and a maximum entropy classifier, both trained on the movie reviews corpus
If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.
Other Natural Language Processing Demos
Here’s a list of similar resources on the web:
- A demo of the Stanford Parser with a javascript API: Natural-language Parsing For The Web
- A demo of the FreeLing language analysis suite: FreeLing Demo
- Emotional identification from text: EmoLib
NLTK Classifier Based Chunker Accuracy
The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare its accuracy versus my previous tagger based chunker.
Tag Chunker
I already covered how to train a tagger based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I’ll use this Unigram-Bigram TagChunker as a baseline for comparison below.
Classifier Chunker
A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It’s very similar to the TagChunker in that the Chunker class is really a wrapper around a Classifier based part-of-speech tagger. And both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.
Classifier Tagger
The ClassifierTagger below is an abstracted version of what’s described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed to the classifier constructor.
from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
    '''Abstracted from "Training Classifier-Based Chunkers" section of
    http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
    '''
    def __init__(self, feature_extractor, classifier):
        self.feature_extractor = feature_extractor
        self.classifier = classifier

    def tag(self, sent):
        history = []

        for i, word in enumerate(sent):
            featureset = self.feature_extractor(sent, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)

        return zip(sent, history)

    @classmethod
    def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
        train_set = []

        for tagged_sent in train_sents:
            untagged_sent = untag(tagged_sent)
            history = []

            for i, (word, tag) in enumerate(tagged_sent):
                featureset = feature_extractor(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)

        classifier = classifier_cls.train(train_set, **kwargs)
        return cls(feature_extractor, classifier)
Classifier Chunker
The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed in to ClassifierTagger.train().
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, *args, **kwargs):
        tag_sents = [tree2conlltags(sent) for sent in train_sents]
        train_chunks = [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]
        self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None

        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w, t, c) for ((w, t), c) in chunks])
Feature Extractors
Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.
def pos(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos}

def pos_word(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
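To make the featureset shapes concrete, here’s what a couple of the extractors return for a hypothetical tagged sentence (the sentence and tags below are just made-up examples, and dict key order may vary when printed):

# a part-of-speech tagged sentence, as produced by an NLTK pos tagger
sent = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]

# the history argument would normally hold the IOB tags predicted so far,
# though none of the extractors above actually use it
print prev_pos(sent, 2, [])
# {'pos': 'NN', 'prevpos': 'JJ'}

print prev_next_pos_word(sent, 0, [])
# {'pos': 'DT', 'nextpos': 'JJ', 'word': 'the', 'prevpos': '<START>'}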
Training
Now that we have all the pieces, we can put them together with training.
NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce NaN for the log likelihood (I’m guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, you can stop the training and continue.
from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
    min_lldelta=0.01, max_iter=10)
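To sanity check the result, you can score the trained chunker against the conll2000 test set with the evaluate() method it inherits from ChunkParserI. A minimal sketch (the exact scores will depend on which feature extractor you used):

test_sents = conll2000.chunked_sents('test.txt')
# evaluate() parses the leaves of each gold tree and compares the results,
# returning a ChunkScore whose summary includes accuracy, precision & recall
print chunker.evaluate(test_sents)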
Accuracy
I ran the above training code with each feature extractor, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-axis refer to a classifier trained with one of the above feature extraction functions, named using the first letter of each part of the function name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).
One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.
Custom Training Data
Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable thru manual annotation.
Chunk Extraction with NLTK
Chunk extraction is a useful preliminary step to information extraction: a chunker creates parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.
Chunking is basically a 3 step process:
- Tag a sentence
- Chunk the tagged sentence
- Analyze the parse tree to extract information
I’ve already written about how to train an NLTK part-of-speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your pos tagger and IOB chunker to do something useful.
IOB Tag Chunker
The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.
import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
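If you don’t already have a trained chunk tagger to pass in, here’s a sketch of how you might create one, reusing the conll_tag_chunks() function and the ngram tagger training covered in the “How to Train a NLTK Chunker” post below:

import nltk.corpus, nltk.tag

# (tag, iob) training sequences, via conll_tag_chunks() from the training post
train_chunks = conll_tag_chunks(nltk.corpus.conll2000.chunked_sents('train.txt'))

# bigram tagger backed off by a unigram tagger, then wrapped for parsing
u_chunker = nltk.tag.UnigramTagger(train_chunks)
ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
chunker = TagChunker(ub_chunker)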
Chunk Extraction
Now that we have a proper NLTK chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.
# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)

# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t be getting many chunks.
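If you’d rather have each noun phrase as a plain string instead of a list of tagged words, you can drop the tags and join the words of each sub tree, e.g.:

for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # discard the part-of-speech tags and join the words into a phrase
    print ' '.join(word for (word, tag) in subtree.leaves())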
How to Train an NLTK Chunker
In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.
Training a Chunker
The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank) and return a list of (tag, iob) sequences.
import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
Chunker Accuracy
So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.
import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.
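Since the training only has to happen once, it’s worth pickling the trained chunk tagger so you can load it later without retraining. A minimal sketch (the filename here is arbitrary):

import pickle

# save the trained chunk tagger
with open('ub_chunker.pickle', 'wb') as f:
    pickle.dump(ub_chunker, f)

# later: load it and wrap it with the TagChunker class from the
# "Chunk Extraction with NLTK" post above
with open('ub_chunker.pickle', 'rb') as f:
    chunk_tagger = pickle.load(f)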
Conclusion
Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.