All posts by Jacob

Building an NLTK FreqDist on Redis

Say you want to build a frequency distribution of many thousands of samples with the following characteristics:

  • fast to build
  • persistent data
  • network accessible (with no locking requirements)
  • can store large sliceable index lists

The only solution I know that meets those requirements is Redis. NLTK’s FreqDist is not persistent, shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there’s no other key-value store that makes sliceable lists really easy to create & access. So far I’ve been quite pleased with Redis, especially given how new it is. It’s quite fast, is network accessible, has atomic operations that make locking unnecessary, supports sortable and sliceable list structures, and is very easy to configure.

Why build an NLTK FreqDist on Redis

Building an NLTK FreqDist on top of Redis allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging through your samples.
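
For example, a ProbDist can be built directly on top of the persistent frequency distribution. Here’s a minimal sketch, assuming ELEProbDist as the smoothing class and the same connection arguments used by the RedisFreqDist examples below:

from nltk.probability import ELEProbDist

# a persistent frequency distribution (see the RedisFreqDist section below)
freqs = RedisFreqDist(host='localhost', port=6379, db=0)
# wrap it in a smoothed probability distribution for classification
probs = ELEProbDist(freqs)
print probs.prob('some_sample')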

There are plenty of other concrete use cases for persistent frequency distributions as well.

RedisFreqDist

I put the code I’ve been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains RedisFreqDist, which works just like the NLTK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here’s some example code for using it. For more info, check out the wiki, or read the code.

def make_freq_dist(samples, host='localhost', port=6379, db=0):
	freqs = RedisFreqDist(host=host, port=port, db=db)

	for sample in samples:
		freqs.inc(sample)

	# the counts are persisted in Redis, but return the dist for immediate use
	return freqs

Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. probablity.py also includes ConditionalRedisFreqDist for creating ConditionalProbDists.
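
As a rough sketch of how that might be used (assuming ConditionalRedisFreqDist mirrors NLTK’s ConditionalFreqDist API and takes the same connection arguments as RedisFreqDist, which you should verify against the code):

# hypothetical usage; check probablity.py for the actual constructor arguments
cfd = ConditionalRedisFreqDist(host='localhost', port=6379, db=0)

# tagged_words is a list of (word, tag) pairs
for (word, tag) in tagged_words:
	cfd[tag].inc(word)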

Lists

How you create lists of samples very much depends on your use case, but here’s some example code for doing so. r is a redis object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.

def index_samples(r, key, samples):
	r.delete(key)

	for sample in samples:
		r.push(key, sample, tail=True)

def get_samples(r, key, start, end):
	return r.lrange(key, start, end)
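
For instance, a quick usage sketch (assuming the redis-py client of that era and a RedisFreqDist named freqs) might look like this:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
# index the samples in sorted order under a known key
index_samples(r, 'samples:sorted', sorted(freqs.samples()))
# then page through them one slice at a time
first_page = get_samples(r, 'samples:sorted', 0, 49)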

Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB, and I highly recommend it for your non-critical computational needs 🙂 (Update: Redis has been quite stable for a while now, and many sites are using it successfully in production.)

Deploying Django with Mercurial, Fab and Nginx

Writing web apps with Django can be a lot of fun, but deploying them can be a chore, even if you’re using Apache. Here’s a setup I’ve been using that makes deployment fast and easy. This all assumes you’ve got sudo access on a remote server running Ubuntu or something similar.

Mercurial

This setup assumes you’ve got 2 mercurial repositories: 1 on your local machine, and 1 on the remote server you’re deploying to. In the remote repository, add the following to .hg/hgrc:

[hooks]
changegroup = hg up

This makes mercurial run hg up whenever you push new code. Then in your local repo’s .hg/hgrc, make sure the default path points to your remote repo. Here’s an example:

[paths]
default = ssh://user@domain.com/repo

Now when you run hg push, you don’t need to include the path to the repo, and your code will be updated immediately.

Django FastCGI Deployment

Since I’m using nginx instead of Apache, we’ll be deploying Django with FastCGI. Here’s an example script you can use to start and restart your Django FastCGI server. Add this script to your mercurial repo as run_fcgi.sh.

#!/bin/bash
PIDFILE="/tmp/django.pid"
SOCKET="/tmp/django.sock"

# kill current fcgi process if it exists
if [ -f $PIDFILE ]; then
    kill `cat -- $PIDFILE`
    rm -f -- $PIDFILE
fi

python manage.py runfcgi socket=$SOCKET pidfile=$PIDFILE method=prefork

Important note: the FastCGI socket file will need to be readable & writable by nginx worker processes, which run as the www-data user in Ubuntu. This will be handled by the fab restart command below, or you could add chmod a+w $SOCKET to the end of the above script.

Nginx FastCGI Proxy

Nginx is a great high performance web server with simple configuration. Here’s a simple example server config for proxying to your Django FastCGI process. Add this config to your mercurial repo as django.nginx.

server {
    listen 80;
    # change to your FQDN
    server_name YOUR.DOMAIN.COM;

    location / {
        # must be the same socket file as in the above fcgi script
        fastcgi_pass unix:/tmp/django.sock;
    }
}

On the remote server, make sure the following lines are in the http section of /etc/nginx/nginx.conf:

include /etc/nginx/sites-enabled/*;
# fastcgi_params should contain a lot of fastcgi_param variables
include /etc/nginx/fastcgi_params;

You must also make sure there is a link in /etc/nginx/sites-enabled to your django.nginx config. Don’t worry if django.nginx doesn’t exist yet; it will once you run fab nginx for the first time.

you@remote.ubuntu$ cd /etc/nginx/sites-enabled
you@remote.ubuntu$ sudo ln -s ../sites-available/django.nginx django.nginx

Python Fabric

Fab, or properly Fabric, is my favorite new tool. It’s designed specifically for making remote deployment simple and easy. You create a fabfile where each function is a fab command that can run remote and sudo commands on one or more remote hosts. So let’s deploy Django using fab. Here’s an example fabfile with 2 commands: restart and nginx. These commands should only be run after you’ve done a hg push.

config.fab_hosts = ['YOUR.DOMAIN.COM']
config.projdir = '/PATH/TO/YOUR/REMOTE/HG/REPO'

def restart():
    # run the fcgi script as www-data so nginx can read & write the socket
    sudo('cd %(projdir)s; ./run_fcgi.sh', user='www-data', fail='abort')

def nginx():
    sudo('cp %(projdir)s/django.nginx /etc/nginx/sites-available/', fail='abort')
    sudo('killall -HUP nginx', fail='abort')

fab restart

You only need to run fab restart if you’ve changed the actual Django python code. Changes to templates or static files don’t require a restart and will be used automatically (because of the hg up changegroup hook). Executing run_fcgi.sh as the www-data user ensures that nginx can read & write the socket.

fab nginx

If you’ve changed your nginx server config, you can run fab nginx to install and reload the new server config without restarting the nginx server.

Wrap Up

Now that everything is set up, the next time you want to deploy some new code, it’s as simple as hg push && fab restart. And if you’ve only changed templates, all you need to do is hg push. I hope this helps make your Django development life easier. It has certainly done so for me 🙂

Django Datetime Snippets

I’ve started posting over at Django snippets, which is a great resource for finding useful bits of functionality. My first set of snippets is focused on datetime conversions.

The Snippets

FuzzyDateTimeField is a drop in replacement for the standard DateTimeField that uses dateutil.parser with fuzzy=True to clean the value, allowing the parser to be more liberal with the input formats it accepts.
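
The idea is roughly this (a simplified sketch rather than the exact snippet, assuming a Django 1.0-era forms API):

from django import forms
from dateutil import parser

class FuzzyDateTimeField(forms.DateTimeField):
    def clean(self, value):
        try:
            # let dateutil be liberal about the input format
            return parser.parse(value, fuzzy=True)
        except (ValueError, TypeError):
            raise forms.ValidationError(self.error_messages['invalid'])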

The isoutc template filter produces an ISO format UTC datetime string from a timezone aware datetime object.
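
A minimal sketch of such a filter (hypothetical code, not the snippet itself) could look like this:

from django import template
from dateutil import tz

register = template.Library()

@register.filter
def isoutc(dt):
    # convert the aware datetime to UTC, then drop tzinfo for a clean ISO string
    return dt.astimezone(tz.tzutc()).replace(tzinfo=None).isoformat()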

The timeto template filter is a more compact version of django’s timeuntil filter that only shows hours & minutes, such as “1hr 30min”.

JSON encode ISO UTC datetime is a way to encode datetime objects as ISO strings just like the isoutc template filter.

JSON decode datetime is a simplejson object hook for converting the datetime attribute of a JSON object to a python datetime object. This is especially useful if you have a list of objects that all have datetime attributes that need to be decoded.
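
As a rough illustration of the object hook idea (my own sketch, assuming the datetimes were encoded as ISO strings):

import simplejson
from dateutil import parser

def decode_datetime(d):
    # object_hook receives each decoded JSON object as a dict
    if 'datetime' in d:
        d['datetime'] = parser.parse(d['datetime'])
    return d

json_string = '[{"name": "show", "datetime": "2009-04-02T21:00:00"}]'
objs = simplejson.loads(json_string, object_hook=decode_datetime)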

Use Case

Imagine you’re making a time based search engine for movies and/or events. Because your data will span many timezones, you decide that all dates & times should be stored on the server as UTC. This pushes local timezone conversion to the client side, where it belongs, simplifying the server side data structures and search operations. You want your search engine to be AJAX enabled, but you don’t like XML because it’s so verbose, so you go with JSON for serialization. You also want users to be able to input their own range based queries without being forced to use specific datetime formats. Leaving out all the hard stuff, the above snippets can be used for communication between a django webapp and a time based search engine.

Dates and Times in Python and Javascript

If you are dealing with dates & times in python and/or javascript, there are two must have libraries.

  1. Datejs
  2. python-dateutil

Datejs

Datejs, being javascript, is designed for parsing and creating human readable dates & times. Its powerful parse() function can handle all the dates & times you’d expect, plus fuzzier human readable date words. Here are some examples from their site.

Date.parse("February 20th 1973");
Date.parse("Thu, 1 July 2004 22:30:00");
Date.parse("today");
Date.parse("next thursday");

And if you are programmatically creating Date objects, here are a few functions I find myself using frequently.

// get a new Date object set to local date
var dt = Date.today();
// get that same Date object set to current time
var dt = Date.today().setTimeToNow();

// set the local time to 10:30 AM
var dt = Date.today().set({hour: 10, minute: 30});
// produce an ISO formatted datetime string converted to UTC
dt.toISOString();

There’s plenty more in the documentation; pretty much everything you need for manipulation, comparison, and string conversion. Datejs cleanly extends the default Date object, has been integrated into a couple date pickers, and supports culture specific parsing for i18n.

python-dateutil

Like Datejs, dateutil also has a powerful parse() function. While it can’t handle words like “today” or “tomorrow”, it can handle nearly every (American) date format that exists. Here are a few examples.

>>> from dateutil import parser
>>> parser.parse("Thu, 4/2/09 09:00 PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04/02/09 9:00PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04-02-09 9pm")
datetime.datetime(2009, 4, 2, 21, 0)

An option that comes in especially handy is fuzzy=True: it tells parse() to ignore unknown tokens while parsing. This next example would raise a ValueError without fuzzy=True.

>>> parser.parse("Thurs, 4/2/09 09:00 PM", fuzzy=True)

I don’t know how well it works for international date formats, but parse() does have options for reading days first and years first, so I’m guessing it can be made to work.
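
For example, here’s a quick check of the dayfirst option:

>>> parser.parse("02/04/2009", dayfirst=True)
datetime.datetime(2009, 4, 2, 0, 0)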

dateutil also provides some great timezone support. I’ve always been surprised at python’s lack of concrete tzinfo classes, but dateutil.tz more than makes up for it (there’s also pytz, but I haven’t figured out why I need it instead of or in addition to dateutil.tz). Here’s a function for parsing a string and returning a UTC datetime object.

from dateutil import parser, tz

def parse_to_utc(s):
    # parse the string, ignoring any unknown tokens
    dt = parser.parse(s, fuzzy=True)
    # treat the naive result as local time
    dt = dt.replace(tzinfo=tz.tzlocal())
    # then convert to UTC
    return dt.astimezone(tz.tzutc())
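
For example, on a machine whose local timezone is US/Pacific (daylight time, UTC-7), you’d get something like:

>>> parse_to_utc("Thu, 4/2/09 09:00 PM")
datetime.datetime(2009, 4, 3, 4, 0, tzinfo=tzutc())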

dateutil does a lot more than provide tzinfo objects and parse datetimes; it can also calculate relative deltas and handle iCal recurrence rules. I’m sure a whole calendar application could be built based on dateutil, but my interest is in parsing and converting datetimes to and from UTC, and in that respect dateutil excels.
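
For instance, relative deltas make calendar-aware arithmetic easy:

>>> from dateutil.relativedelta import relativedelta
>>> from datetime import datetime
>>> datetime(2009, 4, 2) + relativedelta(months=+1, weeks=+1)
datetime.datetime(2009, 5, 9, 0, 0)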

mapfilter

Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter?

mapfilter?

mapfilter is a function that combines the traditional map & filter of functional programming by using the following logic:

  1. if your function returns false, then the element is discarded
  2. any other return value is mapped into the list

Why?

Doing a map and then a filter takes two passes over the list, whereas mapfilter takes only one. That’s twice as fast! If you are dealing with large lists, this can be a huge time saver. And for the case where a large list contains small IDs for looking up a larger data structure, using mapfilter can result in half the number of database lookups.

Obviously, mapfilter won’t work if you want to produce a list of boolean values, as it would filter out all the false values. But why would you want to map to a list of booleans?

Erlang Code

Here’s some erlang code I’ve been using for a while:

mapfilter(F, List) -> lists:reverse(mapfilter(F, List, [])).

mapfilter(_, [], Results) ->
        Results;
mapfilter(F, [Item | Rest], Results) ->
        case F(Item) of
                false -> mapfilter(F, Rest, Results);
                Term -> mapfilter(F, Rest, [Term | Results])
        end.
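
And for comparison, here’s roughly the same thing in Python (a quick sketch of mine, not something I’ve benchmarked):

def mapfilter(func, items):
    results = []
    for item in items:
        value = func(item)
        # discard the item if func returns False, otherwise keep the mapped value
        if value is not False:
            results.append(value)
    return results

# e.g. keep only the even numbers, squared
print mapfilter(lambda n: n * n if n % 2 == 0 else False, [1, 2, 3, 4])  # [4, 16]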

Has anyone else done this for themselves? Does mapfilter exist in any programming language? If so, please leave a comment. I think mapfilter is a very simple & useful concept that should be included in the standard library of every (functional) programming language. Erlang already has mapfoldl (map-reduce in one pass), so why not also have mapfilter?

Chunk Extraction with NLTK

Chunk extraction is a useful preliminary step to information extraction that creates parse trees from unstructured text with a chunker. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3 step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train an NLTK part of speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your pos tagger and iob chunker to do something useful.

IOB Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))

Chunk Extraction

Now that we have a proper NLTK chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t be getting many chunks.
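
For example, here’s a small follow-on sketch that joins the words of each noun phrase into a plain string:

def extract_noun_phrases(tree):
    # each NP subtree's leaves are (word, tag) pairs
    for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
        yield ' '.join(word for (word, tag) in subtree.leaves())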

Test Driven Development in Python

One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here’s a quick example of a fibonacci function.

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        >>> fib(1)
        1
        >>> fib(2)
        1
        >>> fib(3)
        2
        >>> fib(4)
        3
        '''
        if n == 0:
                return 0
        elif n == 1:
                return 1
        else:
                return fib(n - 1) + fib(n - 2)

If you want to run your doctests, just add the following three lines to the bottom of your module.

if __name__ == '__main__':
        import doctest
        doctest.testmod()

Now you can run your module to run the doctests, like python fib.py.

So how well does this fit in with the TDD philosophy? Here are the basic TDD practices.

  1. Think about what you want to test
  2. Write a small test
  3. Write just enough code to fail the test
  4. Run the test and watch it fail
  5. Write just enough code to pass the test
  6. Run the test and watch it pass (if it fails, go back to step 4)
  7. Go back to step 1 and repeat until done

And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.

1. Define a new empty method

def fib(n):
        '''Return the nth fibonacci number.'''
        pass

if __name__ == '__main__':
        import doctest
        doctest.testmod()

2. Write a doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        pass

3. Run the module and watch the doctest fail

python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.

4. Write just enough code to pass the failing doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        return 0

5. Run the module and watch the doctest pass

python fib.py

6. Go back to step 2 and repeat

Now you can start filling in the rest of the function, one test at a time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.

Unit Tests

Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you’ll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.
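
For example, here’s what a minimal unittest version of the fib tests might look like, assuming the function lives in fib.py:

import unittest
from fib import fib

class TestFib(unittest.TestCase):
    def test_known_values(self):
        # same expectations as the doctests above
        self.assertEqual(fib(0), 0)
        self.assertEqual(fib(1), 1)
        self.assertEqual(fib(5), 5)

if __name__ == '__main__':
    unittest.main()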

Running Tests

For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg:

[nosetests]
verbosity=3
with-doctest=1

Then in my Makefile, I add a test command so I can run make test.

test:
        @nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2

PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.

And finally, if you’re looking for a continuous integration server, try Buildbot.

Static Analysis of Erlang Code with Dialyzer

Dialyzer is a tool that does static analysis of your erlang code. It’s great for identifying type errors and unreachable code. Here’s how to use it from the command line.

dialyzer -r PATH/TO/APP -I PATH/TO/INCLUDE

Pretty simple! PATH/TO/APP should be an erlang application directory containing your ebin/ and/or src/ directories. PATH/TO/INCLUDE should be a path to a directory that contains any .hrl files that need to be included. The -I is optional if you have no include files. You can have as many -r and -I options as you need. If you add -q, then dialyzer runs more quietly, succeeding silently or reporting any errors found.

If you have a test/ directory with Common Test suites, then you’ll want to add “-I /usr/lib/erlang/lib/test_server*/include/” and “-I /usr/lib/erlang/lib/common_test*/include/”. I’ve actually set this up in my Makefile to run as make check. It’s been great for catching bad return types, misspellings, and wrong function parameters.