Execnet vs Disco for Distributed NLTK

There are a number of options for distributed processing and mapreduce in Python. Before execnet surfaced, I’d been using Disco to do distributed NLTK. Now that I’ve happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.

Disco Overhead

Disco is a mapreduce framework for Python, with an Erlang core. This is very cool, but it unfortunately introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, a mapreduce task can take orders of magnitude longer to complete.
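To make that unpickling cost concrete, here’s a minimal sketch. The dict below is a toy stand-in for a trained tagger, not an actual NLTK object; any sufficiently large pickle pays the same kind of fixed cost on every load:

```python
import pickle
import time

# Toy stand-in for a pickled tagger: a large word -> tag mapping.
model = dict(("word%d" % i, "NN") for i in range(200000))
blob = pickle.dumps(model)

start = time.time()
restored = pickle.loads(blob)  # Disco pays this cost on *every* map call
elapsed = time.time() - start
```

The point is that `elapsed` is pure overhead: the tagger is identical after every load, but each map call repeats the work.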

As an example, let’s say you need to do 6000 map calls, at 1 second of pure computation each. That’s 100 minutes, not counting overhead. Now add in the 2 second fixed overhead on each call, and you’re at 300 minutes. What should be about an hour and 40 minutes of computation has tripled to 5 hours.
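The arithmetic above works out like this:

```python
calls = 6000
compute_s = 1.0   # pure computation per map call
overhead_s = 2.0  # fixed unpickling overhead per call

pure_minutes = calls * compute_s / 60                  # 100 minutes
total_minutes = calls * (compute_s + overhead_s) / 60  # 300 minutes
```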

Execnet FTW

execnet provides a very different computational model: start some gateways and communicate through message channels. In my case, all the fixed overhead can be paid up-front by loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to rewrite my old Disco-based code to work with execnet, but I actually ended up with less code that’s easier to understand.

Conclusion

If you’re just doing pure mapreduce computations, then consider using Disco. After the one-time setup (which can be non-trivial), writing the functions will be relatively easy, and you’ll get a nice web UI for configuration and monitoring. But if you’re doing any dirty operations that need expensive initialization procedures, or can’t quite fit what you need into a pure mapreduce framework, then execnet is for you.

  • http://streamhacker.com/ Jacob Perkins

    That’s what I thought at first too, but it turns out the Params object (and objects attached to it) are unpickled before every map call. And when unpickling takes 2 seconds, it’s prohibitively expensive.

  • Eric Gaumer

Great blog for NLTK users. Lots of helpful insight. We do a lot of NLP work as a precursor to indexing content for search. It’s pretty typical for us to process tens of millions of documents, so we’ve designed a highly scalable pipeline framework that allows you to build document processing clusters.

    In terms of the problem you’ve described here, we’re using Stackless Python to model pipeline stages (referred to as Components). Since tasklets are true co-routines, they’re instantiated once and run forever, making them ideal for loading data models and then using those models across huge volumes of data.

    Check it out, it might fit into your needs at some point.

    http://www.pypes.org/

    A simple example of writing a component.

    http://bitbucket.org/diji/pypes/wiki/Reverse_Field

    We also provide a visual designer that allows you to graphically design data flow graphs from the components you write.

    http://bitbucket.org/diji/pypes/wiki/Screenshots

  • http://streamhacker.com/ Jacob Perkins

    Thanks for the comment. I’m not really a fan of visual programming tools, but I have been meaning to check out Stackless. Could be a good combination with execnet, where execnet handles the distribution, and Stackless handles processing on each node.
