Announcing Text Processing APIs
If you liked the NLTK demos, then you'll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you'd like to do more, please fill out this survey to let me know what your needs are.
Announcing Python NLTK Demos
If you want to see what NLTK can do, but don't want to go thru the effort of installation and learning how to use it, then check out my Python NLTK demos.
It currently demonstrates the following functionality:
- part-of-speech tagging with the default NLTK pos tagger
- chunking and named entity recognition with the default NLTK chunker
- sentiment analysis with a combination of a naive bayes classifier and a maximum entropy classifier, both trained on the movie reviews corpus
If you like it, please share it. If you want to see more, leave a comment below. And if you are interested in a service that could apply these processes to your own data, please fill out this NLTK services survey.
Other Natural Language Processing Demos
Here's a list of similar resources on the web:
- A demo of the Stanford Parser with a javascript API: Natural-language Parsing For The Web
- A demo of the FreeLing language analysis suite: FreeLing Demo
- Emotional identification from text: EmoLib
Text Classification for Sentiment Analysis – Stopwords and Collocations
Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I'll be evaluating two modifications of the word_feats feature extraction method:
- filter out stopwords
- include bigram collocations
To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a Naive Bayes Classifier.
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
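If the refsets and testsets bookkeeping looks opaque, here's a rough pure-Python sketch of the same idea. The gold and predicted label lists are made-up illustration data, and the precision and recall helpers are simplified stand-ins that I wrote for this example, not NLTK's own code.

```python
import collections

def precision(reference, test):
    # Of everything we labeled this way, how much really has that label?
    return len(reference & test) / len(test) if test else None

def recall(reference, test):
    # Of everything that really has that label, how much did we find?
    return len(reference & test) / len(reference) if reference else None

# Hypothetical gold labels and classifier output for 5 test instances.
gold = ['pos', 'pos', 'pos', 'neg', 'neg']
predicted = ['pos', 'pos', 'neg', 'pos', 'neg']

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)       # indexes that truly carry each label
    testsets[observed].add(i)   # indexes the classifier assigned to each label

pos_precision = precision(refsets['pos'], testsets['pos'])
pos_recall = recall(refsets['pos'], testsets['pos'])
neg_precision = precision(refsets['neg'], testsets['neg'])
neg_recall = recall(refsets['neg'], testsets['neg'])
```

Each label gets its own pair of index sets, so precision and recall reduce to the size of a set intersection divided by the size of one of the two sets.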
Baseline Bag of Words Feature Extraction
Here's the baseline feature extractor for bag of words feature selection.
def word_feats(words):
    return dict([(word, True) for word in words])

evaluate_classifier(word_feats)
The results are the same as in the previous articles, but I've included them here for reference:
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
Stopword Filtering
Stopwords are words that are generally considered useless. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let's see what happens when we filter out these words.
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)
And the results for a stopword filtered bag of words are:
accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472
Accuracy went down 0.2%, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.
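One possible reason, sketched below with a tiny hand-picked stopword set: negation words are typically treated as stopwords, so filtering can throw away exactly the word that flips the sentiment. The stopset here is a hypothetical subset chosen for illustration, not NLTK's actual list, and the sentence is made up.

```python
# Hypothetical, hand-picked stopwords for illustration only.
stopset = {'the', 'is', 'a', 'this', 'was', 'not'}

def stopword_filtered_word_feats(words):
    # Same shape as the extractor above, written as a dict comprehension.
    return {word: True for word in words if word not in stopset}

words = 'this movie was not great'.split()
feats = stopword_filtered_word_feats(words)
# 'not' is gone, so the remaining features look like a positive review.
```

This is only an illustration of the mechanism; I haven't measured how much of the accuracy drop it actually explains.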
Bigram Collocations
As mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", which is a negative expression that the bag of words model could interpret as positive since it sees "great" as a separate word.
To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies, another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of 2 words, basically whether the bigram occurs about as frequently as each individual word.
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)
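To give a feel for what a chi-square score rewards, here's a rough pure-Python sketch that scores a bigram from a 2x2 contingency table of bigram counts. The chi_sq_bigram function and the token list are my own illustration; NLTK builds its contingency table a bit differently, so don't expect identical numbers.

```python
from collections import Counter

def chi_sq_bigram(tokens, w1, w2):
    # Textbook 2x2 chi-square over adjacent word pairs (illustrative only).
    bigrams = list(zip(tokens, tokens[1:]))
    total = len(bigrams)
    counts = Counter(bigrams)

    both = counts[(w1, w2)]
    first_only = sum(c for (a, b), c in counts.items() if a == w1 and b != w2)
    second_only = sum(c for (a, b), c in counts.items() if a != w1 and b == w2)
    neither = total - both - first_only - second_only

    denom = ((both + first_only) * (second_only + neither) *
             (both + second_only) * (first_only + neither))
    if denom == 0:
        return 0.0
    return total * (both * neither - first_only * second_only) ** 2 / denom

tokens = ('matt damon is great . matt damon was good . '
          'the film is not great . the plot was not good').split()

strong = chi_sq_bigram(tokens, 'matt', 'damon')  # words always appear together
weak = chi_sq_bigram(tokens, 'is', 'great')      # each word also appears elsewhere
```

A pair that only ever shows up together scores much higher than a pair whose words also appear with other neighbors, which is why names like ('matt', 'damon') float to the top.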
After some experimentation, I found that using the 200 best bigrams from each file produced great results:
accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
('matt', 'damon') = True pos : neg = 12.3 : 1.0
('give', 'us') = True neg : pos = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
avoids = True pos : neg = 11.7 : 1.0
('absolutely', 'no') = True neg : pos = 10.6 : 1.0
Yes, you read that right, Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:
- accuracy is up almost 9%
- pos precision has increased over 10% with only 4% drop in recall
- neg recall has increased over 21% with just under 4% drop in precision
So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it's significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.
Text Classification for Sentiment Analysis – Naive Bayes Classifier
Sentiment analysis is becoming a popular area of research and social media analysis, especially around user reviews and tweets. It is a special case of text mining generally focused on identifying opinion polarity, and while it's often not very accurate, it can still be useful. For simplicity (and because the training data is easily accessible) I'll focus on 2 possible sentiment classifications: positive and negative.
NLTK Naive Bayes Classification
NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We'll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.
Bag of Words Feature Extraction
All of the NLTK classifiers work with featstructs, which can be simple dictionaries mapping a feature name to a feature value. For text, we'll use a simplified bag of words model where every word is a feature name with a value of True. Here's the feature extraction method:
def word_feats(words): return dict([(word, True) for word in words])
Training Set vs Test Set and Accuracy
The movie reviews corpus has 1000 positive files and 1000 negative files. We'll use 3/4 of them as the training set, and the rest as the test set. This gives us 1500 training instances and 500 test instances. The classifier training method expects to be given a list of tokens in the form of [(feats, label)] where feats is a feature dictionary and label is the classification label. In our case, feats will be of the form {word: True} and label will be one of 'pos' or 'neg'. For accuracy evaluation, we can use nltk.classify.util.accuracy with the test set as the gold standard.
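Here's a small sketch of that split and of the (feats, label) shape, using made-up stand-in documents instead of the real corpus so it runs without any NLTK data. The neg_docs and pos_docs lists are hypothetical placeholders, and I've used // so the cutoffs stay integers on Python 3 as well.

```python
def word_feats(words):
    return {word: True for word in words}

# Stand-ins for the 1000 negative and 1000 positive review files.
neg_docs = [['bad', 'plot']] * 1000
pos_docs = [['great', 'cast']] * 1000

negfeats = [(word_feats(doc), 'neg') for doc in neg_docs]
posfeats = [(word_feats(doc), 'pos') for doc in pos_docs]

# 3/4 of each category for training, the remaining 1/4 for testing.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
```

Splitting each category separately keeps the training and test sets balanced between pos and neg.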
Training and Testing the Naive Bayes Classifier
Here's the complete python code for training and testing a Naive Bayes Classifier on the movie review corpus.
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
And the output is:
train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
magnificent = True pos : neg = 15.0 : 1.0
outstanding = True pos : neg = 13.6 : 1.0
insulting = True neg : pos = 13.0 : 1.0
vulnerable = True pos : neg = 12.3 : 1.0
ludicrous = True neg : pos = 11.8 : 1.0
avoids = True pos : neg = 11.7 : 1.0
uninvolving = True neg : pos = 11.7 : 1.0
astounding = True pos : neg = 10.3 : 1.0
fascination = True pos : neg = 10.3 : 1.0
idiotic = True neg : pos = 9.8 : 1.0
As you can see, the 10 most informative features are, for the most part, highly descriptive adjectives. The only 2 words that seem a bit odd are "vulnerable" and "avoids". Perhaps these words refer to important plot points or character development that signify a good movie. Whatever the case, with simple assumptions and very little code we're able to get almost 73% accuracy. This is somewhat near human accuracy, as apparently people agree on sentiment only around 80% of the time. Future articles in this series will cover precision & recall metrics, alternative classifiers, and techniques for improving accuracy.
Linguistic and Natural Language Processing Links
A number of links related to natural language processing and linguistics:
- What’s the Difference Between Stemming and Lemmatization? - Ask Dr. Search
- A List of Social Tagging Datasets Made Available for Research
- Social Signaling and Language Use
- Lexical Growth in the Blogosphere
- Spelling correction using the Python Natural Language Toolkit (nltk)
- OPUS - an open source parallel corpus
- Evaluating POS Taggers: The Contenders
- Text Analytics Wiki
Part of Speech Tagging with NLTK Part 4 – Brill Tagger vs Classifier Taggers
In previous installments on part-of-speech tagging, we saw that a Brill Tagger provides significant accuracy improvements over the Ngram Taggers combined with Regex and Affix Tagging.
With the latest 2.0 beta releases (2.0b8 as of this writing), NLTK has included a ClassifierBasedTagger as well as a pre-trained tagger used by the nltk.tag.pos_tag method. Based on the name, the pre-trained tagger appears to be a ClassifierBasedTagger trained on the treebank corpus using a MaxentClassifier. So let's see how a classifier tagger compares to the brill tagger.
Training Sets
For the brown corpus, I trained on 2/3 of the reviews, lore, and romance categories, and tested against the remaining 1/3. For conll2000, I used the standard train.txt vs test.txt. And for treebank, I again used a 2/3 vs 1/3 split.
import itertools
from nltk.corpus import brown, conll2000, treebank
brown_reviews = brown.tagged_sents(categories=['reviews'])
brown_reviews_cutoff = len(brown_reviews) * 2 / 3
brown_lore = brown.tagged_sents(categories=['lore'])
brown_lore_cutoff = len(brown_lore) * 2 / 3
brown_romance = brown.tagged_sents(categories=['romance'])
brown_romance_cutoff = len(brown_romance) * 2 / 3
brown_train = list(itertools.chain(brown_reviews[:brown_reviews_cutoff],
brown_lore[:brown_lore_cutoff], brown_romance[:brown_romance_cutoff]))
brown_test = list(itertools.chain(brown_reviews[brown_reviews_cutoff:],
brown_lore[brown_lore_cutoff:], brown_romance[brown_romance_cutoff:]))
conll_train = conll2000.tagged_sents('train.txt')
conll_test = conll2000.tagged_sents('test.txt')
treebank_cutoff = len(treebank.tagged_sents()) * 2 / 3
treebank_train = treebank.tagged_sents()[:treebank_cutoff]
treebank_test = treebank.tagged_sents()[treebank_cutoff:]
Classifier Taggers
There are 3 new taggers referenced below:
- cpos is an instance of ClassifierBasedPOSTagger using the default NaiveBayesClassifier. It was trained by doing ClassifierBasedPOSTagger(train=train_sents)
- craubt is like cpos, but has the raubt tagger from part 2 as a backoff tagger by doing ClassifierBasedPOSTagger(train=train_sents, backoff=raubt)
- bcpos is a BrillTagger using cpos as its initial tagger instead of raubt.
The raubt tagger is the same as from part 2, and braubt is from part 3.
postag is NLTK's pre-trained tagger used by the pos_tag function. It can be loaded using nltk.data.load(nltk.tag._POS_TAGGER).
Accuracy Evaluation
Tagger accuracy was determined by calling the evaluate method with the test set on each trained tagger. Here are the results:
Conclusions
The above results are quite interesting, and lead to a few conclusions:
- Training data is hugely significant when it comes to accuracy. This is why postag takes a huge nose dive on brown, while at the same time can get near 100% accuracy on treebank.
- A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.
- The ClassifierBasedPOSTagger is not necessarily more accurate than the bcraubt tagger from part 3 (at least with the default feature detector). It also takes much longer to train and tag (more details below) and so may not be worth the tradeoff in efficiency.
- Using BrillTagger will nearly always increase the accuracy of your initial tagger, but not by much.
I was also surprised at how much more accurate postag was compared to cpos. Thinking that postag was probably trained on the full treebank corpus, I did the same, and re-evaluated:
cpos = ClassifierBasedPOSTagger(train=treebank.tagged_sents())
cpos.evaluate(treebank_test)
The result was 98.08% accuracy. So the remaining 2% difference must be due to the MaxentClassifier being more accurate than NaiveBayesClassifier, and/or the use of a different feature detector. I tried again with classifier_builder=MaxentClassifier.train and only got to 98.4% accuracy. So I can only conclude that a different feature detector is used. Hopefully the NLTK leaders will publish the training method so we can all know for sure.
Efficiency
On the nltk-users list, there was a question about which tagger is the most computationally economic. I can't tell you the right answer, but I can definitely say that ClassifierBasedPOSTagger is the wrong answer. During accuracy evaluation, I noticed that the cpos tagger took a lot longer than raubt or braubt. So I ran timeit on the tag method of each tagger, and got the following results:
| Tagger | secs/pass |
|---|---|
| raubt | 0.00005 |
| braubt | 0.00009 |
| cpos | 0.02219 |
| bcpos | 0.02259 |
| postag | 0.01241 |
This was run with python 2.6.4 on an Athlon 64 Dual Core 4600+ with 3G RAM, but the important thing is the relative times. braubt is over 246 times faster than cpos! To put it another way, braubt can process over 66666 words/sec, where cpos can only do 270 words/sec and postag only 483 words/sec. So the lesson is: do not use a classifier based tagger if speed is an issue.
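The words/sec figures follow directly from the table, as this small sketch shows. It assumes the timed sentence tokenizes to 6 words, and uses a plain split() as a stand-in for nltk.word_tokenize.

```python
# secs/pass values copied from the timing table above
timings = {
    'raubt': 0.00005,
    'braubt': 0.00009,
    'cpos': 0.02219,
    'bcpos': 0.02259,
    'postag': 0.01241,
}

# Assumption: one pass tags this 6 word sentence.
words_per_pass = len('And now for something completely different'.split())

# Whole words tagged per second, rounded down.
words_per_sec = {name: int(words_per_pass / secs) for name, secs in timings.items()}

# How many times slower cpos is than braubt per pass.
speedup = timings['cpos'] / timings['braubt']
```

So the relative gap holds no matter what hardware you run on, even though the absolute numbers will differ.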
Here's the code for timing postag. You can do the same thing for any other pickled tagger by replacing nltk.tag._POS_TAGGER with a nltk.data accessible path with a .pickle suffix for the load method.
import nltk, timeit
text = nltk.word_tokenize('And now for something completely different')
setup = 'import nltk.data, nltk.tag; tagger = nltk.data.load(nltk.tag._POS_TAGGER)'
t = timeit.Timer('tagger.tag(%s)' % text, setup)
print 'timing postag 1000 times'
spent = t.timeit(number=1000)
print 'took %.5f secs/pass' % (spent / 1000)
File Size
There's also a significant difference in the file size of the pickled taggers (trained on treebank):
| Tagger | Size |
|---|---|
| raubt | 272K |
| braubt | 273K |
| cpos | 3.8M |
| bcpos | 3.8M |
| postag | 8.2M |
Fin
I think there's a lot of room for experimentation with classifier based taggers and their feature detectors. But if speed is an issue for you, don't even bother. In that case, stick with a simpler tagger that's nearly as accurate and orders of magnitude faster.
NLTK Classifier Based Chunker Accuracy
The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare its accuracy versus my previous tagger based chunker.
Tag Chunker
I already covered how to train a tagger based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I'll use this Unigram-Bigram Chunker as a baseline for comparison below.
Classifier Chunker
A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It's very similar to the TagChunker in that the Chunker class is really a wrapper around a Classifier based part-of-speech tagger. And both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.
Classifier Tagger
The ClassifierTagger below is an abstracted version of what's described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed to the classifier constructor.
from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
    '''Abstracted from "Training Classifier-Based Chunkers" section of
    http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
    '''
    def __init__(self, feature_extractor, classifier):
        self.feature_extractor = feature_extractor
        self.classifier = classifier

    def tag(self, sent):
        history = []

        for i, word in enumerate(sent):
            featureset = self.feature_extractor(sent, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)

        return zip(sent, history)

    @classmethod
    def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
        train_set = []

        for tagged_sent in train_sents:
            untagged_sent = untag(tagged_sent)
            history = []

            for i, (word, tag) in enumerate(tagged_sent):
                featureset = feature_extractor(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)

        classifier = classifier_cls.train(train_set, **kwargs)
        return cls(feature_extractor, classifier)
Classifier Chunker
The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed in to ClassifierTagger.train().
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

# ChunkParserI is imported by name above, so subclass it directly
class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, *args, **kwargs):
        tag_sents = [tree2conlltags(sent) for sent in train_sents]
        train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
        self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent: return None
        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
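The two list comprehensions in the chunker are just reshaping tuples, which is easier to see on a toy sentence. The conll_sent data below is made up for illustration; in the real chunker the triples come from tree2conlltags.

```python
# One sentence as (word, pos tag, IOB chunk tag) triples.
conll_sent = [('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('sat', 'VBD', 'O')]

# Training shape: the "token" is the (word, pos) pair, the "tag" is the chunk tag.
train_pairs = [((w, t), c) for (w, t, c) in conll_sent]

# What gets passed to tag() at parse time: just the (word, pos) pairs.
tagged_sent = [wt for (wt, c) in train_pairs]

# And back to triples, the shape conlltags2tree expects.
restored = [(w, t, c) for ((w, t), c) in train_pairs]
```

In other words, the chunker treats part-of-speech tagged words as the things to be tagged, and IOB chunk tags as the labels.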
Feature Extractors
Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.
def pos(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos}

def pos_word(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
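To see what a featureset actually looks like, here's a condensed restatement of prev_next_pos_word run over a toy tagged sentence. The sentence is made up, and the compact conditional expressions are my own shorthand for the if/else blocks above.

```python
def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]
    # Sentence boundaries get placeholder tags instead of a neighbor's tag.
    prevpos = sent[i-1][1] if i > 0 else '<START>'
    nextpos = sent[i+1][1] if i < len(sent) - 1 else '<END>'
    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}

# Hypothetical part-of-speech tagged sentence.
sent = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]

# history is unused by these extractors, so an empty list is fine here.
featuresets = [prev_next_pos_word(sent, i, []) for i in range(len(sent))]
```

The first word gets '<START>' as its previous tag and the last word gets '<END>' as its next tag, so the classifier can also learn from sentence position.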
Training
Now that we have all the pieces, we can put them together with training.
NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I'm guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, you can stop the training and continue.
from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier
train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
min_lldelta=0.01, max_iter=10)
Accuracy
I ran the above training code for each feature extractor defined above, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-Axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).
One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.
Custom Training Data
Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable thru manual annotation.
Distributed NLTK with execnet
Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don't know how to distribute NLTK across many cores?
Well, here's how you can use execnet to do distributed part of speech tagging with NLTK.
execnet
execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in python. With it, you can start python shells over ssh, send code and/or data, then receive results. Below are 2 scripts that will test the accuracy of NLTK's recommended part of speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.
Runner
The runner does the following:
- Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
- Loads and pickles the default NLTK part of speech tagger.
- Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
- Sends the pickled tagger and the name of a corpus (brown) thru the channel.
- Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
- Finally, it creates a receive queue and prints the accuracy response from each channel.
run_tag_files.py
import execnet
import nltk.tag, nltk.data
import cPickle as pickle
import tag_files
HOSTS = {
    'localhost': 2
}

NICE = 20

channels = []

tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

for host, count in HOSTS.items():
    print 'opening %d gateways at %s' % (count, host)

    for i in range(count):
        gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
        channel = gw.remote_exec(tag_files)
        channels.append(channel)
        channel.send(tagger)
        channel.send('brown')

count = 0
chan = 0

for fileid in nltk.corpus.brown.fileids():
    print 'sending %s to channel %d' % (fileid, chan)
    channels[chan].send(fileid)
    count += 1
    # alternate channels
    chan += 1
    if chan >= len(channels): chan = 0

multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()

for i in range(count):
    channel, response = queue.get()
    print response
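The alternating-channel logic is easy to check on its own, without execnet. Here's a rough sketch where plain lists stand in for channels; the distribute function and the short fileid list are hypothetical, written only to show the round-robin behavior.

```python
def distribute(fileids, num_channels):
    # One list per "channel"; real channels would get channel.send(fileid).
    assignments = [[] for _ in range(num_channels)]
    chan = 0

    for fileid in fileids:
        assignments[chan].append(fileid)
        # alternate channels, wrapping back to the first one
        chan += 1
        if chan >= num_channels:
            chan = 0

    return assignments

# A few sample fileids split across 2 gateways.
assignments = distribute(['ca01', 'ca02', 'ca03', 'ca04', 'ca05'], 2)
```

Round-robin keeps the file counts even, but it doesn't account for file size, so a gateway that draws the larger files will finish later.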
Remote Module
The remote module is much simpler.
- Receives and unpickles the tagger.
- Receives the corpus name and loads it.
- For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.
tag_files.py
import nltk.corpus
import cPickle as pickle
if __name__ == '__channelexec__':
    tagger = pickle.loads(channel.receive())
    corpus_name = channel.receive()
    corpus = getattr(nltk.corpus, corpus_name)

    for fileid in channel:
        accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
        channel.send('%s: %f' % (fileid, accuracy))
Putting it all together
Make sure you have NLTK and the corpus data installed on every host. You must also have passwordless ssh access to each host from the master host (the machine you run run_tag_files.py on).
run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways followed by a bunch of send messages. Then, just wait and watch the accuracy responses to see how accurate the built in part of speech tagger is on the brown corpus.
If you'd like to test the accuracy of a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus.
If you want to test your own tagger, pickle it to a file, then load and send it instead of NLTK's tagger. Or you can train it on the master first, then send it once training is complete.
Distributed File Processing
In practice, it's often a PITA to make sure every host has every file you want to process, and you'll want to process files outside of NLTK's builtin corpora. My recommendation is to set up a GlusterFS storage cluster so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.
Chunk Extraction with NLTK
Chunk extraction is a useful preliminary step to information extraction, that creates parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.
Chunking is basically a 3 step process:
- Tag a sentence
- Chunk the tagged sentence
- Analyze the parse tree to extract information
I've already written about how to train a part of speech tagger and a chunker, so I'll assume you've already done the training, and now you want to use your tagger and chunker to do something useful.
Tag Chunker
The previously trained chunker is actually a chunk tagger. It's a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I've created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.
import nltk.chunk
import itertools
class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
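If IOB tags are new to you, this rough pure-Python sketch shows the grouping idea behind turning them into chunks. The iob_to_chunks function and the sample sentence are my own illustration; it returns flat (label, words) pairs rather than a parse tree, and it is not how nltk.chunk.conllstr2tree is implemented.

```python
def iob_to_chunks(wtc):
    # wtc is a list of (word, pos tag, IOB chunk tag) triples
    chunks = []
    current = None

    for word, pos, iob in wtc:
        if iob == 'O':
            # outside any chunk, so close whatever was open
            current = None
            continue

        prefix, label = iob.split('-', 1)

        # B- begins a chunk; a stray or mismatched I- is treated as a new chunk too
        if prefix == 'B' or current is None or current[0] != label:
            current = (label, [])
            chunks.append(current)

        current[1].append((word, pos))

    return chunks

wtc = [('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('sat', 'VBD', 'O'),
       ('on', 'IN', 'O'), ('the', 'DT', 'B-NP'), ('mat', 'NN', 'I-NP')]
chunks = iob_to_chunks(wtc)
```

B- begins a chunk, I- continues it, and O marks words outside any chunk, which is all the information needed to rebuild the phrase boundaries.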
Chunk Extraction
Now that we have a proper chunker, we can use it to extract chunks. Here's a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.
# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we're training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn't similar to your training data, then you probably won't be getting many chunks.
How to Train a NLTK Chunker
In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it's very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.
Training
The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the conll2000 corpus. Here's a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.
import nltk.chunk
def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
Accuracy
So how accurate is the trained chunker? Here's the rest of the code, followed by a chart of the accuracy results. Note that I'm only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.
import nltk.corpus, nltk.tag
def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)
# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
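To show why a bigram chunker with unigram backoff can beat a unigram chunker alone, here's a toy pure-Python sketch of the idea. It only loosely mirrors what the NLTK ngram taggers do; the train_models and chunk_tags functions and the two training sentences are hypothetical.

```python
from collections import Counter, defaultdict

def train_models(train_chunks):
    # Count which IOB tag follows each context in (tag, iob) training data.
    uni = defaultdict(Counter)   # context: current pos tag
    bi = defaultdict(Counter)    # context: (previous IOB tag, current pos tag)

    for sent in train_chunks:
        prev_iob = None
        for tag, iob in sent:
            uni[tag][iob] += 1
            bi[(prev_iob, tag)][iob] += 1
            prev_iob = iob

    def best(table):
        # keep only the most frequent IOB tag per context
        return {ctx: counts.most_common(1)[0][0] for ctx, counts in table.items()}

    return best(uni), best(bi)

def chunk_tags(tags, unigram, bigram):
    result = []
    prev_iob = None
    for tag in tags:
        # try the bigram context first, then back off to the unigram model
        iob = bigram.get((prev_iob, tag), unigram.get(tag))
        result.append(iob)
        prev_iob = iob
    return result

train_chunks = [
    [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'O'), ('NN', 'B-NP')],
    [('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP')],
]
unigram, bigram = train_models(train_chunks)

tags = ['DT', 'NN', 'VBD', 'NN']
with_backoff = chunk_tags(tags, unigram, bigram)
unigram_only = [unigram.get(tag) for tag in tags]
```

The unigram model always gives NN its most common chunk tag, while the bigram context lets the final NN start a new noun phrase after a verb. The backoff only kicks in for contexts the bigram model never saw.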
Accuracy for Trained Chunker
The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.
Conclusion
Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.




