StreamHacker Weotta be Hacking

14 Dec 2009

Execnet vs Disco for Distributed NLTK

There are a number of options for distributed processing and mapreduce in Python. Before execnet surfaced, I'd been using Disco to do distributed NLTK. Now that I've happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.

Disco Overhead

Disco is a mapreduce framework for Python, with an Erlang core. This is very cool, but unfortunately it introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, your mapreduce task can take several times longer to complete.
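To make the inline loading cost concrete, here's a minimal sketch of the pattern using only the standard library. It is not Disco's API: `TOY_MODEL`, `tag`, and `map_with_inline_load` are hypothetical names, and the tiny dict stands in for a real pickled NLTK tagger, which would be far larger and slower to load.

```python
import pickle

# Hypothetical stand-in for a trained tagger's model data.
# A real pickled NLTK tagger would be much bigger (500K or more).
TOY_MODEL = {"the": "DT", "dog": "NN", "barks": "VBZ"}
PICKLED_MODEL = pickle.dumps(TOY_MODEL)


def tag(model, tokens):
    """Look up each token in the model, defaulting unknown words to NN."""
    return [(tok, model.get(tok, "NN")) for tok in tokens]


def map_with_inline_load(tokens):
    """A map-style function that must rebuild its own state on every call."""
    model = pickle.loads(PICKLED_MODEL)  # fixed overhead, paid per call
    return tag(model, tokens)


print(map_with_inline_load(["the", "dog", "barks"]))
```

With a toy dict the `pickle.loads` call is nearly free, but the shape is the point: nothing loaded in one call survives to the next.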

As an example, let's say you need to do 6000 map calls, at 1 second of pure computation each. That's 100 minutes, not counting overhead. Now add in the 2 second fixed overhead on each call, and you're at 300 minutes. What should be about 1.7 hours of computation has jumped to 5 hours.
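The arithmetic above is easy to check with a few lines of Python. The helper name `total_minutes` is just something I made up for illustration; the numbers are the ones from the example.

```python
def total_minutes(calls, compute_seconds, overhead_seconds=0.0):
    """Total sequential run time in minutes for a batch of map calls."""
    return calls * (compute_seconds + overhead_seconds) / 60


pure = total_minutes(6000, 1.0)                # computation only
with_overhead = total_minutes(6000, 1.0, 2.0)  # plus 2s of loading per call

print(pure, with_overhead, with_overhead / pure)
```

The ratio depends only on overhead versus compute time per call, so the shorter each map call is, the worse the fixed overhead hurts.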

Execnet FTW

execnet provides a very different computational model: start some gateways and communicate through message channels. In my case, all the fixed overhead can be paid up front by loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to change my old Disco-based code to work with execnet, but I actually ended up with less code that's easier to understand.
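Here's a rough sketch of the load-once-per-gateway idea, not my actual code. It assumes execnet is installed and uses a single local gateway. `load_tagger` is a hypothetical stand-in that fakes the slow unpickling step and tags every token as NN, and `tag_with_execnet` is a made-up helper name.

```python
# Source that runs on the gateway. execnet makes `channel` available
# as a global in remotely executed code.
REMOTE_TAGGER_SOURCE = """
import time

def load_tagger():
    # Hypothetical stand-in for unpickling a real NLTK tagger (the slow step).
    time.sleep(0.01)
    return lambda tokens: [(tok, "NN") for tok in tokens]

tagger = load_tagger()        # paid once per gateway, not once per item
for tokens in channel:        # loop until the channel is closed
    channel.send(tagger(tokens))
"""


def tag_with_execnet(sentences):
    """Send tokenized sentences to one gateway and collect tagged results."""
    import execnet  # third-party package, assumed to be installed

    gateway = execnet.makegateway()  # default: a local subprocess
    channel = gateway.remote_exec(REMOTE_TAGGER_SOURCE)
    results = []
    for tokens in sentences:
        channel.send(tokens)
        results.append(channel.receive())
    gateway.exit()
    return results
```

Calling `tag_with_execnet([["the", "dog"], ["it", "barks"]])` starts the gateway, loads the fake tagger once, and then tags each sentence over the same channel. For real distribution you'd open several gateways and spread the sentences across their channels.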

Conclusion

If you're just doing pure mapreduce computations, then consider using Disco. After the one-time setup (which can be non-trivial), writing the functions will be relatively easy, and you'll get a nice web UI for configuration and monitoring. But if you're doing any dirty operations that need expensive initialization procedures, or you can't quite fit what you need into a pure mapreduce framework, then execnet is for you.

1 Sep 2009

Cloud Computing Links

Amazon Web Services:
Python Libraries:
GlusterFS:
   