Mnesia Records to MongoDB Documents
Feb 1st
I recently migrated about 50k records from mnesia to MongoDB using my fork of emongo, which adds supervisors with transparent connection restarting, for reasons I’ll explain below.
Why Mongo instead of Mnesia
mnesia is great for a number of reasons, but here’s why I decided to move weotta’s place data into MongoDB:
- easy to access from python and other languages
- schema-less data, so you’re not constrained to records, and will never have to do mnesia:transform_table ever again
- don’t have to keep everything in memory (or only on disk as the case may be)
- simple & flexible indexing & querying
Converting Records to Docs and vice versa
First, I needed to convert records to documents. In erlang, mongo documents are basically proplists. Keys going into emongo can be atoms, strings, or binaries, but keys coming out will always be binaries. Here’s a simple example of record to document conversion:
record_to_doc(Record, Attrs) ->
    % tl will drop record name
    lists:zip(Attrs, tl(tuple_to_list(Record))).
This would be called like record_to_doc(MyRecord, record_info(fields, my_record)). If you have nested dicts then you’ll have to flatten them using dict:to_list. Also note that list values coming out of emongo are treated like yaws JSON arrays, i.e. [{key, {array, [val]}}]. For more examples, check out the emongo docs.
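For anyone reading the migrated data from the Python side, the same zip-based idea can be sketched in Python. This is only an illustration and not part of emongo; the record tuple and field names below are made up.

```python
def record_to_doc(record, attrs):
    """Pair each field name with its value, dropping the leading record name."""
    # An Erlang record is a tuple whose first element is the record name,
    # so record[1:] plays the role of tl(tuple_to_list(Record)).
    return list(zip(attrs, record[1:]))


# Hypothetical record {place, 42, "Cafe"} with fields [id, name]
doc = record_to_doc(("place", 42, "Cafe"), ["id", "name"])
print(doc)
```

As in the erlang version, the field list has to be in the same order as the record definition, since the pairing is purely positional.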
Heavy Write Load
To do the migration, I used etable:foreach to insert each document. Bulk insertion would probably be more efficient, but etable makes single record iteration very easy.
I started using the original emongo with a pool size of 10, but it was crashy when I dumped records as fast as possible. So initially I slowed it down with timer:sleep(200), but after adding supervised connections, I was able to dump with no delay. I’m not exactly sure what I fixed in this case, but I think the lesson is that using supervised gen_servers will give you reliability with little effort.
Read Performance
Now that I had data in mongo to play with, I compared the read performance to mnesia. Using timer:tc, I found that mnesia:dirty_read takes about 21 microseconds, whereas emongo:find_one can take anywhere from 600 to 1200 microseconds, querying on an indexed field. Without an index, read performance ranged from 900 to 2000 microseconds. I also tested only requesting specific fields, as recommended on the MongoDB Optimization page, but with small documents (<10 fields) that did not seem to have any effect. So while mongodb queries are pretty fast at 1ms, mnesia is about 50 times faster. Further inspection with fprof showed that nearly half of the cpu time of emongo:find is taken by BSON decoding.
Heavy Read Load
Under heavy read load (thousands of find_one calls in less than a second), emongo_conn would get into a locked state. Somehow the process had accumulated unparsable data and wouldn’t reply. This problem went away when I increased the pool size to 100, but that’s a ridiculous number of connections to keep open permanently. So instead I added some code to kill the connection on timeout and retry the find call. This was the main reason I added supervision. Now, every pool is locally registered as a simple_one_for_one supervisor that supervises every emongo_server connection. This pool is in turn supervised by emongo_sup, with dynamically added child specs. All this supervision allowed me to lower the pool size back to 10, and made it easy to kill and restart emongo_server connections as needed.
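The kill-on-timeout-and-retry idea is not specific to erlang. Here is a minimal Python sketch of the pattern, assuming a pool that can hand out a connection and discard a stuck one. ConnectionTimeout, find_with_retry, and the fake connections are all hypothetical names for illustration, not emongo's API.

```python
class ConnectionTimeout(Exception):
    """Raised by the hypothetical connection when a reply never arrives."""


def find_with_retry(get_conn, kill_conn, query, retries=1):
    """Run query on a pooled connection; on timeout, kill it and try again."""
    attempt = 0
    while True:
        conn = get_conn()
        try:
            return conn(query)
        except ConnectionTimeout:
            # The connection is presumed stuck, so discard it and let the
            # pool (the supervisor, in emongo's case) provide a fresh one.
            kill_conn(conn)
            if attempt >= retries:
                raise
            attempt += 1


# Tiny demonstration with fake connections: the first one is stuck.
def stuck(query):
    raise ConnectionTimeout(query)


def healthy(query):
    return {"query": query, "ok": True}


pool = [stuck, healthy]
killed = []


def kill(conn):
    killed.append(conn)
    pool.remove(conn)


result = find_with_retry(lambda: pool[0], kill, {"name": "Cafe"})
print(result)
```

With the default retries=1, a second timeout in a row is re-raised to the caller rather than retried forever.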
Why you may want to stick with Mnesia
Now that I have experience with both MongoDB and mnesia, here are some reasons you may want to stick with mnesia:
- very fast in-memory reads
- transactional
- simple master-master replication
- great for distributed read-heavy applications
Despite all that, I’m very happy with MongoDB. Installation and setup were a breeze, and schema-less data storage is very nice when you have variable fields and a high probability of adding and/or removing fields in the future. It’s simple, scalable, and as mentioned above, it’s very easy to access from many different languages. emongo isn’t perfect, but it’s open source and will hopefully benefit from more exposure.
A/B Testing Links
Jan 27th
Proxy Links
Jan 18th
Programming Philosophy Links
Jan 13th
- Software Is Hard
- Maker’s Schedule, Manager’s Schedule
- programming | Quotes Archive
- How to be a Programmer: A Short, Comprehensive, and Personal Summary
- The “free electron” programmer
- Software Carpentry
- Edited Contributions – Programmer 97-things
- It’s OK Not to Write Unit Tests
- Psychology and Security Resource Page
- How To Make Life Suck Less (While Making Scalable Systems)
- The myth of “undesigned”
- The Problem with Design and Implementation
Far Future Expires Header with django-storages S3Storage
Jan 10th
One way to decrease your site’s load time is to set a far future Expires header on all your static content. This doesn’t help first-time visitors, but can greatly improve the experience of returning visitors. And you get to decrease your bandwidth needs at the same time, because all your static content will be cached by their browser.
S3
weotta puts all of its awesome plan images in Amazon’s S3 using the django-storages S3Storage backend, which by default does not set any Expires header. To remedy this, I set AWS_HEADERS in settings.py like so:
from datetime import date, timedelta
tenyrs = date.today() + timedelta(days=365*10)
# Expires 10 years in the future at 8PM GMT
AWS_HEADERS = {
    'Expires': tenyrs.strftime('%a, %d %b %Y 20:00:00 GMT')
}
Now every uploaded file gets an Expires header set to 10 years in the future.
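To see what that header value looks like, here is a small standalone sketch of the same date arithmetic with a fixed start date. The function name far_future_expires is made up for illustration. Note that %a and %b follow the process locale, so this assumes the default C/English locale that HTTP dates require.

```python
from datetime import date, timedelta


def far_future_expires(today, years=10):
    """Return an HTTP-date style string `years` from `today`, fixed at 20:00 GMT."""
    # 365 * years ignores leap days, so the result lands a day or two
    # short of the exact anniversary, which does not matter here.
    future = today + timedelta(days=365 * years)
    return future.strftime('%a, %d %b %Y 20:00:00 GMT')


print(far_future_expires(date(2010, 1, 10)))  # Wed, 08 Jan 2020 20:00:00 GMT
```

Because settings.py is evaluated when the process starts, the date is computed once per start rather than per upload.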
upload_to
One potential drawback to using a far future Expires header is that if you change the file content without also changing the file name, no one will notice because they’ll keep using the old cached version of the file. Luckily, Django makes it easy to create (mostly) unique new file names by letting you include strftime formatting codes in a FileField or ImageField upload_to path, such as upload_to='images/%Y/%m/%d'. This way, every uploaded file automatically gets stored by date, which means it would take some deliberate effort to change the contents of a file without also changing the file name.
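As a rough illustration of how such an upload_to value expands, the following sketch applies strftime to the pattern and appends the file name. It only mimics the expansion with the standard library; the dated_path helper and the file name are assumptions, and Django's real implementation does more (such as handling name collisions).

```python
from datetime import datetime

UPLOAD_TO = 'images/%Y/%m/%d'


def dated_path(filename, now):
    """Approximate how a strftime-style upload_to expands for one upload."""
    return now.strftime(UPLOAD_TO) + '/' + filename


print(dated_path('plan.jpg', datetime(2010, 1, 10)))  # images/2010/01/10/plan.jpg
```

The same file name uploaded on a different day ends up under a different path, which is what keeps stale cached copies from being reused.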
erldis – an Erlang Redis Client
Dec 21st
Since it’s now featured on the redis homepage, I figure I should tell people about my fork of erldis, an erlang redis client focused on synchronous operations.
Synchronicity
The original client, which still exists as erldis_client.erl, implements asynchronous pipelining. This means you send a bunch of redis commands, then collect all the results at the end. This didn’t work for me, as I needed a client that could handle parallel synchronous requests from multiple concurrent processes. So I copied erldis_client.erl to erldis_sync_client.erl and modified it to send replies back as soon as they are received from redis (in FIFO order). Many thanks to dialtone_ for writing the original erldis app as I’m not sure I would’ve created the synchronous client without it. And thanks to cstar for patches, such as making erldis_sync_client the default client for all functions in erldis.erl.
Extras
In addition to the synchronous client, I’ve added some extra functions and modules to make interfacing with redis more erlangy. Here’s a brief overview…
erldis_sync_client:transact
erldis_sync_client:transact is analogous to mnesia:transaction in that it does a unit of work against a redis database, like so:
- starts erldis_sync_client
- calls your function with the client PID as the argument
- stops the client
- returns the result of your function
The goal being to reduce boilerplate start/stop code.
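The start/call/stop/return shape of such a helper can be sketched in Python as follows. This is not erldis code: FakeClient and this transact function are hypothetical stand-ins, and stopping the client in a finally block is a choice made for this sketch, not something the erlang implementation is known to do.

```python
class FakeClient:
    """Stand-in for a redis client; only tracks whether it is running."""

    def __init__(self):
        self.running = True
        self.data = {}

    def stop(self):
        self.running = False


def transact(fun, start=FakeClient):
    """Start a client, call fun(client), stop the client, return fun's result."""
    client = start()
    try:
        return fun(client)
    finally:
        # Stop the client even if fun raises (an assumption of this sketch).
        client.stop()


seen = []


def work(client):
    seen.append(client)
    client.data['k'] = 'v'
    return client.data['k']


result = transact(work)
print(result, seen[0].running)
```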
erldis_dict module
erldis_dict provides semantics similar to the dict module in stdlib, using redis key-value commands.
erldis_list module
erldis_list provides a number of functions operating on redis lists, inspired by the array, lists, and queue modules in stdlib. You must pass in both the client PID and a redis list key.
erldis_sets module
erldis_sets works like the sets module, but you have to provide both the client PID and a redis set key.
Usage
Despite the low version numbers, I’ve been successfully using erldis as a component in parallel/distributed information retrieval (in conjunction with plists), and for accessing data shared with python / django apps. It’s a fully compliant erlang application that you can include in your target system release structure.
If you’re also using erldis for your redis needs, I’d love to hear about it.
Execnet vs Disco for Distributed NLTK
Dec 14th
There are a number of options for distributed processing and mapreduce in python. Before execnet surfaced, I’d been using Disco to do distributed NLTK. Now that I’ve happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.
Disco Overhead
Disco is a mapreduce framework for python, with an erlang core. This is very cool, but unfortunately introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, your mapreduce task can take orders of magnitude longer to complete.
As an example, let’s say you need to do 6000 map calls, at 1 second of pure computation each. That’s 100 minutes, not counting overhead. Now add in the 2s fixed overhead on each call, and you’re at 300 minutes. What should be just over 1.6 hours of computation has jumped to 5 hours.
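The arithmetic in that example can be checked with a few lines of Python. The total_minutes helper is just a name chosen for this sketch, and it models sequential execution only.

```python
def total_minutes(calls, compute_s, overhead_s=0):
    """Total sequential run time in minutes for `calls` map calls."""
    return calls * (compute_s + overhead_s) / 60


pure = total_minutes(6000, 1)              # computation only
with_overhead = total_minutes(6000, 1, 2)  # plus 2s of unpickling per call
print(pure, with_overhead)  # 100.0 300.0
```

So the per-call overhead triples the total here, and the ratio gets worse the cheaper each map call is relative to the unpickling cost.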
Execnet FTW
execnet provides a very different computational model: start some gateways and communicate thru message channels. In my case, all the fixed overhead can be done up-front, loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to change my old Disco based code to work with execnet, but I actually ended up with less code that’s easier to understand.
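The difference comes down to paying the expensive setup once per worker instead of once per item. A minimal sketch of that shape in plain Python, with a made-up load_tagger standing in for the expensive unpickle and str.upper standing in for the tagger:

```python
load_count = 0


def load_tagger():
    """Stand-in for an expensive unpickle; counts how often it runs."""
    global load_count
    load_count += 1
    return str.upper  # hypothetical "tagger"


def worker(items):
    """Load the tagger once, then process every item that arrives."""
    tagger = load_tagger()  # fixed cost, paid once per worker
    for item in items:
        yield tagger(item)


results = list(worker(['a', 'b', 'c']))
print(results, load_count)  # ['A', 'B', 'C'] 1
```

This mirrors what a remote module does on each gateway: receive the tagger first, then loop over the incoming work.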
Conclusion
If you’re just doing pure mapreduce computations, then consider using Disco. After the one time setup (which can be non-trivial), writing the functions will be relatively easy, and you’ll get a nice web UI for configuration and monitoring. But if you’re doing any dirty operations that need expensive initialization procedures, or can’t quite fit what you need into a pure mapreduce framework, then execnet is for you.
Distributed NLTK with execnet
Nov 29th
Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don’t know how to distribute NLTK across many cores?
Well, here’s how you can use execnet to do distributed part of speech tagging with NLTK.
execnet
execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in python. With it, you can start python shells over ssh, send code and/or data, then receive results. Below are 2 scripts that will test the accuracy of NLTK’s recommended part of speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.
Runner
The runner does the following:
- Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
- Loads and pickles the default NLTK part of speech tagger.
- Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
- Sends the pickled tagger and the name of a corpus (brown) thru the channel.
- Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
- Finally, it creates a receive queue and prints the accuracy response from each channel.
run_tag_files.py
import execnet
import nltk.tag, nltk.data
import cPickle as pickle
import tag_files

HOSTS = {
    'localhost': 2
}
NICE = 20
channels = []
tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

for host, count in HOSTS.items():
    print 'opening %d gateways at %s' % (count, host)

    for i in range(count):
        gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
        channel = gw.remote_exec(tag_files)
        channels.append(channel)
        channel.send(tagger)
        channel.send('brown')

count = 0
chan = 0

for fileid in nltk.corpus.brown.fileids():
    print 'sending %s to channel %d' % (fileid, chan)
    channels[chan].send(fileid)
    count += 1
    # alternate channels
    chan += 1
    if chan >= len(channels): chan = 0

multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()

for i in range(count):
    channel, response = queue.get()
    print response
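The alternating-channel bookkeeping in the runner could also be expressed with itertools.cycle. The sketch below uses plain lists in place of execnet channels so it can run on its own; the distribute function is a hypothetical helper, and with real channels the append call would be channel.send(item).

```python
from itertools import cycle


def distribute(items, channels):
    """Hand items to channels in round-robin order; return how many were sent."""
    sent = 0
    for item, channel in zip(items, cycle(channels)):
        channel.append(item)  # a real execnet channel would use channel.send(item)
        sent += 1
    return sent


a, b = [], []
count = distribute(['ca01', 'ca02', 'ca03', 'ca04', 'ca05'], [a, b])
print(count, a, b)
```

The returned count serves the same purpose as the count variable in the runner: it says how many responses to wait for on the receive queue.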
Remote Module
The remote module is much simpler.
- Receives and unpickles the tagger.
- Receives the corpus name and loads it.
- For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.
tag_files.py
import nltk.corpus
import cPickle as pickle

if __name__ == '__channelexec__':
    tagger = pickle.loads(channel.receive())
    corpus_name = channel.receive()
    corpus = getattr(nltk.corpus, corpus_name)

    for fileid in channel:
        accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
        channel.send('%s: %f' % (fileid, accuracy))
Putting it all together
Make sure you have NLTK and the corpus data installed on every host. You must also have passwordless ssh access to each host from the master host (the machine you run run_tag_files.py on).
run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways followed by a bunch of send messages. Then, just wait and watch the accuracy responses to see how accurate the built in part of speech tagger is on the brown corpus.
If you’d like to test the accuracy of a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus.
If you want to test your own tagger, pickle it to a file, then load and send it instead of NLTK’s tagger. Or you can train it on the master first, then send it once training is complete.
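Pickling a tagger to a file and reading the raw bytes back for sending might look roughly like this. The sketch is written for Python 3 with the standard pickle module, whereas the scripts above use Python 2's cPickle. The dict is a stand-in for a trained tagger, and save_tagger, load_pickled_bytes, and the file name are made-up names.

```python
import os
import pickle
import tempfile


def save_tagger(tagger, path):
    """Pickle a trained tagger to a file."""
    with open(path, 'wb') as f:
        pickle.dump(tagger, f, pickle.HIGHEST_PROTOCOL)


def load_pickled_bytes(path):
    """Read the raw pickle so it can be passed straight to channel.send()."""
    with open(path, 'rb') as f:
        return f.read()


tagger = {'the': 'DT', 'dog': 'NN'}  # stand-in for a trained tagger object
path = os.path.join(tempfile.mkdtemp(), 'my_tagger.pickle')
save_tagger(tagger, path)

payload = load_pickled_bytes(path)  # what the runner would send
restored = pickle.loads(payload)    # what the remote module would do
print(restored == tagger)
```

Every host needs a Python version that can read the pickle protocol used, and the tagger's class must be importable on the remote side for unpickling to succeed.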
Distributed File Processing
In practice, it’s often a PITA to make sure every host has every file you want to process, and you’ll want to process files outside of NLTK’s builtin corpora. My recommendation is to set up a GlusterFS storage cluster so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.
Django Tools and Links
Nov 9th
Using Django
- Top 10 tips to a new django developer : Dpeepul Blog
- Henrique C. Alves – Keeping simple with Django
- Django Dose – Handling Development, Staging, and Production Environments
- Django signals | Mercurytide
Social Apps
- uswaretech’s Django-Socialauth at master – GitHub
- Django-SocialAuth – Login via twitter, facebook, openid, yahoo, google using a single app. — The Uswaretech Blog – Django Web Development
Forms
- django-simple-captcha – Project Hosting on Google Code
- Marco Fucci – Integrating reCAPTCHA with Django
- simonw’s django-safeform at master – GitHub
Notifications
Geolocation
- geodjango-basic-apps – Project Hosting on Google Code
- GeoDjango and the UK postcode database « lamby
Misc
- taras’s Django-Scraper at master – GitHub
- Semantic Django – Tools for semantic stuff in Django
- sorl-thumbnail – Project Hosting on Google Code
- The apps that power Django-Mingus | Monty Lounge Blog
- fullhistory – Project Hosting on Google Code
- jaycee’s weaver at master – GitHub
- zain’s jogging at master – GitHub








