StreamHacker Weotta be Hacking

15Nov/102

The Beginning of Python Text Processing with NLTK Cookbook

It all started with an email to the baypiggies mailing list. An acquisition editor for Packt was looking for authors to expand their line of python cookbooks. For some reason I can't remember, I thought they wanted to put together a multi-author cookbook, where each author contributes a few recipes. That sounded doable, because I'd already written a number of articles that could serve as the basis for a few recipes. So I replied with links to the following articles:

The reply back was:

The next step is to come up with around 8-14 topics/chapters and around 80-100 recipes for the book as a whole.

My first reaction was "WTF?? No way!" But luckily, I didn't send that email. Instead, I took a couple days to think it over, and realized that maybe I could come up with that many recipes, if I broke my knowledge down into small pieces. I also decided to choose recipes that I didn't already know how to write, and use them as motivation for learning & research. So I replied back with a list of 92 recipes, and got to work. Not surprisingly, the original list of 92 changed significantly while writing the book, and I believe the final recipe count is 81.

I was keenly aware that there'd be some necessary overlap with the original NLTK book, Natural Language Processing with Python. But I did my best to minimize that overlap, and to present a different take on similar content. And there's a number of recipes that (as far as I know) you can't find anywhere else, the largest group of which can be found in Chapter 6, Transforming Chunks and Trees. I'm very pleased with the result, and I hope everyone who buys the book is too. I'd like to think that Python Text Processing with NLTK 2.0 Cookbook is the practical companion to the more teaching oriented Natural Language Processing with Python.

If you'd like a taste of the book, checkout the online sample chapter (pdf) Chapter 3, Custom Corpora, which details how many of the included corpus readers work, how to use them, and how to create your own corpus readers. The last recipe shows you how to create a corpus reader on top of MongoDB, and it should be fairly easy to modify for use with any other database.

Packt has also published two excerpts from Chapter 8, Distributed Processing and Handling Large Datasets, which are partially based on those original 2 articles:

14Dec/097

Execnet vs Disco for Distributed NLTK

There's a number of options for distributed processing and mapreduce in python. Before execnet surfaced, I'd been using Disco to do distributed NLTK. Now that I've happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.

Disco Overhead

Disco is a mapreduce framework for python, with an erlang core. This is very cool, but unfortunately introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, your mapreduce task can take orders of magnitude longer to complete.

As an example, let's say you need to do 6000 map calls, at 1 second of pure computation each. That's 100 minutes, not counting overhead. Now add in the 2s fixed overhead on each call, and you're at 300 minutes. What should be just over 1.6 hours of computation has jumped to 5 hours.

Execnet FTW

execnet provides a very different computational model: start some gateways and communicate thru message channels. In my case, all the fixed overhead can be done up-front, loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to change my old Disco based code to work with execnet, but I actually ended up with less code that's easier to understand.

Conclusion

If you're just doing pure mapreduce computations, then consider using Disco. After the one time setup (which can be non-trivial), writing the functions will be relatively easy, and you'll get a nice web UI for configuration and monitoring. But if you're doing any dirty operations that need expensive initialization procedures, or can't quite fit what you need into a pure mapreduce framework, then execnet is for you.

29Nov/094

Distributed NLTK with execnet

(This page has been translated into Spanish by Maria Ramos, and has also been translated into Belorussian)

Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don't know how to distribute NLTK across many cores?

Well, here's how you can use execnet to do distributed part of speech tagging with NLTK.

execnet

execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in python. With it, you can start python shells over ssh, send code and/or data, then receive results. Below are 2 scripts that will test the accuracy of NLTK's recommended part of speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.

Runner

The runner does the following:

  1. Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
  2. Loads and pickles the default NLTK part of speech tagger.
  3. Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
  4. Sends the pickled tagger and the name of a corpus (brown) thru the channel.
  5. Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
  6. Finally, it creates a receive queue and prints the accuracy response from each channel.

run_tag_files.py

import execnet
import nltk.tag, nltk.data
import cPickle as pickle
import tag_files

HOSTS = {
	'localhost': 2
}

NICE = 20

channels = []

tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

for host, count in HOSTS.items():
	print 'opening %d gateways at %s' % (count, host)

	for i in range(count):
		gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
		channel = gw.remote_exec(tag_files)
		channels.append(channel)
		channel.send(tagger)
		channel.send('brown')

count = 0
chan = 0

for fileid in nltk.corpus.brown.fileids():
	print 'sending %s to channel %d' % (fileid, chan)
	channels[chan].send(fileid)
	count += 1
	# alternate channels
	chan += 1
	if chan >= len(channels): chan = 0

multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()

for i in range(count):
	channel, response = queue.get()
	print response

Remote Module

The remote module is much simpler.

  1. Receives and unpickles the tagger.
  2. Receives the corpus name and loads it.
  3. For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.

tag_files.py

import nltk.corpus
import cPickle as pickle

if __name__ == '__channelexec__':
	tagger = pickle.loads(channel.receive())
	corpus_name = channel.receive()
	corpus = getattr(nltk.corpus, corpus_name)

	for fileid in channel:
		accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
		channel.send('%s: %f' % (fileid, accuracy))

Putting it all together

Make sure you have NLTK and the corpus data installed on every host. You must also have passwordless ssh access to each host from the master host (the machine you run run_tag_files.py on).

run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways followed by a bunch of send messages. Then, just wait and watch the accuracy responses to see how accurate the built in part of speech tagger is on the brown corpus.

If you'd like test the accuracy of a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus.

If you want to test your own tagger, pickle it to a file, then load and send it instead of NLTK's tagger. Or you can train it on the master first, then send it once training is complete.

Distributed File Processing

In practice, it's often a PITA to make sure every host has every file you want to process, and you'll want to process files outside of NLTK's builtin corpora. My recommendation is to setup a GlusterFS storage cluster so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.

   
%d bloggers like this: