Tag Archives: json

Mnesia Records to MongoDB Documents

I recently migrated about 50k records from mnesia to MongoDB using my fork of emongo, which adds supervisors with transparent connection restarting, for reasons I’ll explain below.

Why Mongo instead of Mnesia

mnesia is great for a number of reasons, but here’s why I decided to move weotta’s place data into MongoDB:

Converting Records to Docs and vice versa

First, I needed to convert records to documents. In erlang, mongo documents are basically proplists. Keys going into emongo can be atoms, strings, or binaries, but keys coming out will always by binaries. Here’s a simple example of record to document conversion:

record_to_doc(Record, Attrs) ->
    % tl will drop record name
    lists:zip(Attrs, tl(tuple_to_list(Record))).

This would be called like record_to_doc(MyRecord, record_info(fields, my_record)). If you have nested dicts then you’ll have to flatten them using dict:to_list. Also note that list values are coming out of emongo are treated like yaws JSON arrays, i.e. [{key, {array, [val]}}]. For more examples, check out the emongo docs.

Heavy Write Load

To do the migration, I used etable:foreach to insert each document. Bulk insertion would probably be more efficient, but etable makes single record iteration very easy.

I started using the original emongo with a pool size of 10, but it was crashy when I dumped records as fast as possible. So initially I slowed it down with timer:sleep(200), but after adding supervised connections, I was able to dump with no delay. I’m not exactly sure what I fixed in this case, but I think the lesson is that using supervised gen_servers will give you reliability with little effort.

Read Performance

Now that I had data in mongo to play with, I compared the read performance to mnesia. Using timer:tc, I found that mnesia:dirty_read takes about 21 microseconds, whereas emongo:find_one can take anywhere from 600 to 1200 microseconds, querying on an indexed field. Without an index, read performance ranged from 900 to 2000 microseconds. I also tested only requesting specific fields, as recommended on the MongoDB Optimiziation page, but with small documents (<10 fields) that did not seem to have any effect. So while mongodb queries are pretty fast at 1ms, mnesia is about 50 times faster. Further inspection with fprof showed that nearly half of the cpu time of emongo:find is taken by BSON decoding.

Heavy Read Load

Under heavy read load (thousands of find_one calls in less than second), emongo_conn would get into a locked state. Somehow the process had accumulated unparsable data and wouldn’t reply. This problem went away when I increased the size of the pool size to 100, but that’s a ridiculous number of connections to keep open permanently. So instead I added some code to kill the connection on timeout and retry the find call. This was the main reason I added supervision. Now, every pool is locally registered as a simple_one_for_one supervisor that supervises every emongo_server connection. This pool is in turn supervised by emongo_sup, with dynamically added child specs. All this supervision allowed me to lower the pool size back to 10, and made it easy to kill and restart emongo_server connections as needed.

Why you may want to stick with Mnesia

Now that I have experience with both MongoDB and mnesia, here’s some reasons you may want to stick with mnesia:

Despite all that, I’m very happy with MongoDB. Installation and setup were a breeze, and schema-less data storage is very nice when you have variable fields and a high probability of adding and/or removing fields in the future. It’s simple, scalable, and as mentioned above, it’s very easy to access from many different languages. emongo isn’t perfect, but it’s open source and will hopefully benefit from more exposure.

Django Datetime Snippets

I’ve started posting over at Django snippets, which is a great resource for finding useful bits of functionality. My first set of snippets is focused on datetime conversions.

The Snippets

FuzzyDateTimeField is a drop in replacement for the standard DateTimeField that uses dateutil.parser with fuzzy=True to clean the value, allowing the parser to be more liberal with the input formats it accepts.

The isoutc template filter produces an ISO format UTC datetime string from a timezone aware datetime object.

The timeto template filter is a more compact version of django’s timeuntil filter that only shows hours & minutes, such as “1hr 30min”.

JSON encode ISO UTC datetime is a way to encode datetime objects as ISO strings just like the isoutc template filter.

JSON decode datetime is a simplejson object hook for converting the datetime attribute of a JSON object to a python datetime object. This is especially useful if you have a list of objects that all have datetime attributes that need to be decoded.

Use Case

Imagine you’re making a time based search engine for movies and/or events. Because your data will span many timezones, you decide that all dates & times should be stored on the server as UTC. This pushes local timezone conversion to the client side, where it belongs, simplifying the server side data structures and search operations. You want your search engine to be AJAX enabled, but you don’t like XML because it’s so verbose, so you go with JSON for serialization. You also want users to be able to input their own range based queries without being forced to use specific datetime formats. Leaving out all the hard stuff, the above snippets can be used for communication between a django webapp and a time based search engine.