Tag Archives: erlang

Interview and Article about NLTK and Text-Processing

I recently did an interview with Zoltan Varju (@zoltanvarju) about Python, NLTK, and my demos & APIs at text-processing.com, which you can read here. There’s even a bit about Erlang & functional programming, as well as some insight into what I’ve been working on at Weotta. And last week, the text-processing.com API got a write up (and a nice traffic boost) from Garrett Wilkin (@garrettwilkin) on programmableweb.com.

Mnesia Records to MongoDB Documents

I recently migrated about 50k records from mnesia to MongoDB using my fork of emongo, which adds supervisors with transparent connection restarting, for reasons I’ll explain below.

Why Mongo instead of Mnesia

mnesia is great for a number of reasons, but here’s why I decided to move weotta’s place data into MongoDB:

Converting Records to Docs and vice versa

First, I needed to convert records to documents. In erlang, mongo documents are basically proplists. Keys going into emongo can be atoms, strings, or binaries, but keys coming out will always by binaries. Here’s a simple example of record to document conversion:

record_to_doc(Record, Attrs) ->
    % tl will drop record name
    lists:zip(Attrs, tl(tuple_to_list(Record))).

This would be called like record_to_doc(MyRecord, record_info(fields, my_record)). If you have nested dicts then you’ll have to flatten them using dict:to_list. Also note that list values are coming out of emongo are treated like yaws JSON arrays, i.e. [{key, {array, [val]}}]. For more examples, check out the emongo docs.

Heavy Write Load

To do the migration, I used etable:foreach to insert each document. Bulk insertion would probably be more efficient, but etable makes single record iteration very easy.

I started using the original emongo with a pool size of 10, but it was crashy when I dumped records as fast as possible. So initially I slowed it down with timer:sleep(200), but after adding supervised connections, I was able to dump with no delay. I’m not exactly sure what I fixed in this case, but I think the lesson is that using supervised gen_servers will give you reliability with little effort.

Read Performance

Now that I had data in mongo to play with, I compared the read performance to mnesia. Using timer:tc, I found that mnesia:dirty_read takes about 21 microseconds, whereas emongo:find_one can take anywhere from 600 to 1200 microseconds, querying on an indexed field. Without an index, read performance ranged from 900 to 2000 microseconds. I also tested only requesting specific fields, as recommended on the MongoDB Optimiziation page, but with small documents (<10 fields) that did not seem to have any effect. So while mongodb queries are pretty fast at 1ms, mnesia is about 50 times faster. Further inspection with fprof showed that nearly half of the cpu time of emongo:find is taken by BSON decoding.

Heavy Read Load

Under heavy read load (thousands of find_one calls in less than second), emongo_conn would get into a locked state. Somehow the process had accumulated unparsable data and wouldn’t reply. This problem went away when I increased the size of the pool size to 100, but that’s a ridiculous number of connections to keep open permanently. So instead I added some code to kill the connection on timeout and retry the find call. This was the main reason I added supervision. Now, every pool is locally registered as a simple_one_for_one supervisor that supervises every emongo_server connection. This pool is in turn supervised by emongo_sup, with dynamically added child specs. All this supervision allowed me to lower the pool size back to 10, and made it easy to kill and restart emongo_server connections as needed.

Why you may want to stick with Mnesia

Now that I have experience with both MongoDB and mnesia, here’s some reasons you may want to stick with mnesia:

Despite all that, I’m very happy with MongoDB. Installation and setup were a breeze, and schema-less data storage is very nice when you have variable fields and a high probability of adding and/or removing fields in the future. It’s simple, scalable, and as mentioned above, it’s very easy to access from many different languages. emongo isn’t perfect, but it’s open source and will hopefully benefit from more exposure.

erldis – an Erlang Redis Client

Since it’s now featured on the redis homepage, I figure I should tell people about my fork of erldis, an erlang redis client focused on synchronous operations.

Synchronicity

The original client, which still exists as erldis_client.erl, implements asynchronous pipelining. This means you send a bunch of redis commands, then collect all the results at the end. This didn’t work for me, as I needed a client that could handle parallel synchronous requests from multiple concurrent processes. So I copied erldis_client.erl to erldis_sync_client.erl and modified it to send replies back as soon as they are received from redis (in FIFO order). Many thanks to dialtone_ for writing the original erldis app as I’m not sure I would’ve created the synchronous client without it. And thanks to cstar for patches, such as making erldis_sync_client the default client for all functions in erldis.erl.

Extras

In addition to the synchronous client, I’ve added some extra functions and modules to make interfacing with redis more erlangy. Here’s a brief overview…

erldis_sync_client:transact

erldis_sync_client:transact is analagous to mnesia:transaction in that it does a unit of work against a redis database, like so:

  1. starts erldis_sync_client
  2. calls your function with the client PID as the argument
  3. stops the client
  4. returns the result of your function

The goal being to reduce boilerplate start/stop code.

erldis_dict module

erldis_dict provides similar semantics as the dict module in stdlib, using redis key-value commands.

erldis_list module

erldis_list provides a number of functions operating on redis lists, inspired by the array, lists, and queue modules in stdlib. You must pass in both the client PID and a redis list key.

erldis_sets module

erldis_sets works like the sets module, but you have to provide both the client PID and a redis set key.

Usage

Despite the low version numbers, I’ve been successfully using erldis as a component in parallel/distributed information retrieval (in conjunction with plists), and for accessing data shared with python / django apps. It’s a fully compliant erlang application that you can include in your target system release structure.

If also you’re using erldis for your redis needs, I’d love to hear about it.

Execnet vs Disco for Distributed NLTK

There’s a number of options for distributed processing and mapreduce in python. Before execnet surfaced, I’d been using Disco to do distributed NLTK. Now that I’ve happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.

Disco Overhead

Disco is a mapreduce framework for python, with an erlang core. This is very cool, but unfortunately introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, your mapreduce task can take orders of magnitude longer to complete.

As an example, let’s say you need to do 6000 map calls, at 1 second of pure computation each. That’s 100 minutes, not counting overhead. Now add in the 2s fixed overhead on each call, and you’re at 300 minutes. What should be just over 1.6 hours of computation has jumped to 5 hours.

Execnet FTW

execnet provides a very different computational model: start some gateways and communicate thru message channels. In my case, all the fixed overhead can be done up-front, loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to change my old Disco based code to work with execnet, but I actually ended up with less code that’s easier to understand.

Conclusion

If you’re just doing pure mapreduce computations, then consider using Disco. After the one time setup (which can be non-trivial), writing the functions will be relatively easy, and you’ll get a nice web UI for configuration and monitoring. But if you’re doing any dirty operations that need expensive initialization procedures, or can’t quite fit what you need into a pure mapreduce framework, then execnet is for you.