I’ve finally set up a top-level domain and migrated the blog to streamhacker.com. Expect the same esoteric programming articles, but with a brand new look & feel, courtesy of LightWord.
Building a NLTK FreqDist on Redis
Say you want to build a frequency distribution of many thousands of samples with the following characteristics:
- fast to build
- persistent data
- network accessible (with no locking requirements)
- can store large sliceable index lists
The only solution I know that meets those requirements is Redis. NLTK’s FreqDist is not persistent, shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there’s no other key-value store that makes sliceable lists really easy to create & access. So far I’ve been quite pleased with Redis, especially given how new it is. It’s quite fast and network accessible, its atomic operations make locking unnecessary, it supports sortable and sliceable list structures, and it’s very easy to configure.
Why build a NLTK FreqDist on Redis
Building a NLTK FreqDist on top of Redis allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging thru your samples.
Here are some more concrete use cases for persistent frequency distributions:
RedisFreqDist
I put the code I’ve been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains RedisFreqDist, which works just like the NLTK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here’s some example code for using it. For more info, check out the wiki, or read the code.
def make_freq_dist(samples, host='localhost', port=6379, db=0):
    freqs = RedisFreqDist(host=host, port=port, db=db)

    for sample in samples:
        freqs.inc(sample)
Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. probablity.py also includes ConditionalRedisFreqDist for creating ConditionalProbDists.
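As a rough illustration (my sketch, not from the original post; ELEProbDist is just one of NLTK’s ProbDist classes, and the exact constructor arguments may vary between NLTK versions), here’s how a Redis-backed frequency distribution might feed into a probability distribution:

from nltk.probability import ELEProbDist

# build a Redis-backed frequency distribution, then wrap it in a
# probability distribution that can be used for classification
freqs = RedisFreqDist(host='localhost', port=6379, db=0)

for sample in samples:
    freqs.inc(sample)

prob_dist = ELEProbDist(freqs)
# probability of a (hypothetical) sample
print prob_dist.prob('some_sample')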
Lists
How you create lists of samples very much depends on your use case, but here’s some example code for doing so. r is a Redis client object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.
def index_samples(r, key, samples):
    r.delete(key)

    for sample in samples:
        r.push(key, sample, tail=True)

def get_samples(r, key, start, end):
    return r.lrange(key, start, end)
Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB. I highly recommend it for your non-critical computational needs 🙂 (Update: Redis has been quite stable for a while now, and many sites are using it successfully in production.)
Deploying Django with Mercurial, Fab and Nginx
Writing web apps with Django can be a lot of fun, but deploying them can be a chore, even if you’re using Apache. Here’s a setup I’ve been using that makes deployment fast and easy. This all assumes you’ve got sudo access on a remote server running Ubuntu or something similar.
Mercurial
This setup assumes you’ve got 2 mercurial repositories: 1 on your local machine, and 1 on the remote server you’re deploying to. In the remote repository, add the following to .hg/hgrc:
[hooks]
changegroup = hg up
This makes mercurial run hg up whenever you push new code. Then in your local repo’s .hg/hgrc, make sure the default path is to your remote repo. Here’s an example:
[paths]
default = ssh://user@domain.com/repo
Now when you run hg push, you don’t need to include the path to the repo, and your code will be updated immediately.
Django FastCGI Deployment
Since I’m using nginx instead of Apache, we’ll be deploying Django with FastCGI. Here’s an example script you can use to start and restart your Django FastCGI server. Add this script to your mercurial repo as run_fcgi.sh.
#!/bin/bash
PIDFILE="/tmp/django.pid"
SOCKET="/tmp/django.sock"
# kill current fcgi process if it exists
if [ -f $PIDFILE ]; then
    kill `cat -- $PIDFILE`
    rm -f -- $PIDFILE
fi
python manage.py runfcgi socket=$SOCKET pidfile=$PIDFILE method=prefork
Important note: the FastCGI socket file will need to be readable & writable by nginx worker processes, which run as the www-data user in Ubuntu. This will be handled by the fab restart command below, or you could add chmod a+w $SOCKET to the end of the above script.
Nginx FastCGI Proxy
Nginx is a great high performance web server with simple configuration. Here’s a simple example server config for proxying to your Django FastCGI process. Add this config to your mercurial repo as django.nginx.
server {
    listen 80;

    # change to your FQDN
    server_name YOUR.DOMAIN.COM;

    location / {
        # must be the same socket file as in the above fcgi script
        fastcgi_pass unix:/tmp/django.sock;
    }
}
On the remote server, make sure the following lines are in the http section of /etc/nginx/nginx.conf:
include /etc/nginx/sites-enabled/*;
# fastcgi_params should contain a lot of fastcgi_param variables
include /etc/nginx/fastcgi_params;
You must also make sure there is a link in /etc/nginx/sites-enabled to your django.nginx config. Don’t worry if django.nginx doesn’t exist yet; it will once you run fab nginx the first time.
you@remote.ubuntu$ cd /etc/nginx/sites-enabled
you@remote.ubuntu$ sudo ln -s ../sites-available/django.nginx django.nginx
Python Fabric
Fab, or properly Fabric, is my favorite new tool. It’s designed specifically for making remote deployment simple and easy. You create a fabfile where each function is a fab command that can run remote and sudo commands on one or more remote hosts. So let’s deploy Django using fab. Here’s an example fabfile with 2 commands: restart and nginx. These commands should only be run after you’ve done a hg push.
config.fab_hosts = ['YOUR.DOMAIN.COM']
config.projdir = '/PATH/TO/YOUR/REMOTE/HG/REPO'

def restart():
    sudo('cd %(projdir)s; run_fcgi.sh', user='www-data', fail='abort')

def nginx():
    sudo('cp %(projdir)s/django.nginx /etc/nginx/sites-available/', fail='abort')
    sudo('killall -HUP nginx', fail='abort')
fab restart
You only need to run fab restart if you’ve changed the actual Django python code. Changes to templates or static files don’t require a restart and will be used automatically (because of the hg up changegroup hook). Executing run_fcgi.sh as the www-data user ensures that nginx can read & write the socket.
fab nginx
If you’ve changed your nginx server config, you can run fab nginx to install and reload the new server config without restarting the nginx server.
Wrap Up
Now that everything is set up, the next time you want to deploy some new code, it’s as simple as hg push && fab restart. And if you’ve only changed templates, all you need to do is hg push. I hope this helps make your Django development life easier. It has certainly done so for me 🙂
Django Datetime Snippets
I’ve started posting over at Django snippets, which is a great resource for finding useful bits of functionality. My first set of snippets is focused on datetime conversions.
The Snippets
FuzzyDateTimeField is a drop-in replacement for the standard DateTimeField that uses dateutil.parser with fuzzy=True to clean the value, allowing the parser to be more liberal with the input formats it accepts.
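The actual code lives on Django snippets, but the core idea looks roughly like this (a sketch, not the snippet’s exact code):

from django import forms
from dateutil import parser

class FuzzyDateTimeField(forms.DateTimeField):
    def clean(self, value):
        # let dateutil parse the value instead of matching a fixed list of input formats
        try:
            return parser.parse(value, fuzzy=True)
        except (TypeError, ValueError):
            raise forms.ValidationError('Unable to parse datetime: %s' % value)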
The isoutc template filter produces an ISO format UTC datetime string from a timezone aware datetime object.
The timeto template filter is a more compact version of django’s timeuntil filter that only shows hours & minutes, such as “1hr 30min”.
JSON encode ISO UTC datetime is a way to encode datetime objects as ISO strings just like the isoutc template filter.
JSON decode datetime is a simplejson object hook for converting the datetime attribute of a JSON object to a python datetime object. This is especially useful if you have a list of objects that all have datetime attributes that need to be decoded.
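To sketch how the two JSON snippets fit together (this is an approximation, not the exact snippet code, and it assumes the datetimes are already UTC):

import datetime
import simplejson

def encode_datetime(obj):
    # encode datetime objects as ISO formatted UTC strings
    if isinstance(obj, datetime.datetime):
        return obj.strftime('%Y-%m-%dT%H:%M:%SZ')
    raise TypeError('%r is not JSON serializable' % obj)

def decode_datetime(d):
    # object hook: convert the datetime attribute back to a python datetime object
    if 'datetime' in d:
        d['datetime'] = datetime.datetime.strptime(d['datetime'], '%Y-%m-%dT%H:%M:%SZ')
    return d

s = simplejson.dumps({'datetime': datetime.datetime.utcnow()}, default=encode_datetime)
obj = simplejson.loads(s, object_hook=decode_datetime)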
Use Case
Imagine you’re making a time-based search engine for movies and/or events. Because your data will span many timezones, you decide that all dates & times should be stored on the server as UTC. This pushes local timezone conversion to the client side, where it belongs, simplifying the server-side data structures and search operations. You want your search engine to be AJAX-enabled, but you don’t like XML because it’s so verbose, so you go with JSON for serialization. You also want users to be able to input their own range-based queries without being forced to use specific datetime formats. Leaving out all the hard stuff, the above snippets can be used for communication between a django webapp and a time-based search engine.
Dates and Times in Python and Javascript
If you are dealing with dates & times in python and/or javascript, there are two must-have libraries.
Datejs
Datejs, being javascript, is designed for parsing and creating human readable dates & times. Its powerful parse() function can handle all the dates & times you’d expect, plus fuzzier human readable date words. Here are some examples from their site.
Date.parse("February 20th 1973");
Date.parse("Thu, 1 July 2004 22:30:00");
Date.parse("today");
Date.parse("next thursday");
And if you are programmatically creating Date objects, here are a few functions I find myself using frequently.
// get a new Date object set to local date
var dt = Date.today();

// get that same Date object set to current time
var dt = Date.today().setTimeToNow();

// set the local time to 10:30 AM
var dt = Date.today().set({hour: 10, minute: 30});

// produce an ISO formatted datetime string converted to UTC
dt.toISOString();
There’s plenty more in the documentation; pretty much everything you need for manipulation, comparison, and string conversion. Datejs cleanly extends the default Date object, has been integrated into a couple date pickers, and supports culture specific parsing for i18n.
python-dateutil
Like Datejs, dateutil also has a powerful parse() function. While it can’t handle words like “today” or “tomorrow”, it can handle nearly every (American) date format that exists. Here are a few examples.
>>> from dateutil import parser
>>> parser.parse("Thu, 4/2/09 09:00 PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04/02/09 9:00PM")
datetime.datetime(2009, 4, 2, 21, 0)
>>> parser.parse("04-02-09 9pm")
datetime.datetime(2009, 4, 2, 21, 0)
An option that comes in especially handy is passing fuzzy=True. This tells parse() to ignore unknown tokens while parsing. This next example would raise a ValueError without fuzzy=True.
>>> parser.parse("Thurs, 4/2/09 09:00 PM", fuzzy=True)
I don’t know how well it works for international date formats, but parse() does have options for reading days first and years first, so I’m guessing it can be made to work.
dateutil also provides some great timezone support. I’ve always been surprised at python’s lack of concrete tzinfo classes, but dateutil.tz more than makes up for it (there’s also pytz, but I haven’t figured out why I need it instead of or in addition to dateutil.tz). Here’s a function for parsing a string and returning a UTC datetime object.
from dateutil import parser, tz

def parse_to_utc(s):
    dt = parser.parse(s, fuzzy=True)
    dt = dt.replace(tzinfo=tz.tzlocal())
    return dt.astimezone(tz.tzutc())
dateutil does a lot more than provide tzinfo objects and parse datetimes; it can also calculate relative deltas and handle iCal recurrence rules. I’m sure a whole calendar application could be built based on dateutil, but my interest is in parsing and converting datetimes to and from UTC, and in that respect dateutil excels.
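For example, relative deltas and recurrence rules are only an import away (these examples are mine, not from the original post):

>>> import datetime
>>> from dateutil.relativedelta import relativedelta
>>> from dateutil.rrule import rrule, WEEKLY
>>> datetime.datetime(2009, 4, 2, 21, 0) + relativedelta(months=+1)
datetime.datetime(2009, 5, 2, 21, 0)
>>> list(rrule(WEEKLY, count=3, dtstart=datetime.datetime(2009, 4, 2)))
[datetime.datetime(2009, 4, 2, 0, 0), datetime.datetime(2009, 4, 9, 0, 0), datetime.datetime(2009, 4, 16, 0, 0)]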
mapfilter
Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter?
mapfilter?
mapfilter is a function that combines the traditional map & filter of functional programming by using the following logic:
- if your function returns false, then the element is discarded
- any other return value is mapped into the list
Why?
Doing a map and then a filter is O(2N), whereas mapfilter is O(N). That’s twice as fast! If you are dealing with large lists, this can be a huge time saver. And when a large list contains small IDs for looking up a larger data structure, using mapfilter can result in half the number of database lookups.
Obviously, mapfilter won’t work if you want to produce a list of boolean values, as it would filter out all the false values. But why would you want to map to a list of booleans?
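For comparison, here’s a rough single-pass Python equivalent (my sketch, mirroring the semantics of the Erlang version in the next section):

def mapfilter(func, iterable):
    results = []

    for item in iterable:
        result = func(item)
        # discard the element if func returns False, otherwise keep the mapped value
        if result is not False:
            results.append(result)

    return results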
Erlang Code
Here’s some erlang code I’ve been using for a while:
mapfilter(F, List) ->
    lists:reverse(mapfilter(F, List, [])).

mapfilter(_, [], Results) ->
    Results;
mapfilter(F, [Item | Rest], Results) ->
    case F(Item) of
        false -> mapfilter(F, Rest, Results);
        Term -> mapfilter(F, Rest, [Term | Results])
    end.
Has anyone else done this for themselves? Does mapfilter exist in any programming language? If so, please leave a comment. I think mapfilter is a very simple & useful concept that should be included in the standard library of every (functional) programming language. Erlang already has mapfoldl (map-reduce in one pass), so why not also have mapfilter?
Chunk Extraction with NLTK
Chunk extraction is a useful preliminary step to information extraction: it uses a chunker to create parse trees from unstructured text. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.
Chunking is basically a 3 step process:
- Tag a sentence
- Chunk the tagged sentence
- Analyze the parse tree to extract information
I’ve already written about how to train a NLTK part-of-speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your part-of-speech tagger and IOB chunker to do something useful.
IOB Tag Chunker
The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.
import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
Chunk Extraction
Now that we have a proper NLTK chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.
# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)

# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t be getting many chunks.
Test Driven Development in Python
One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here’s a quick example of a fibonacci function.
def fib(n):
    '''Return the nth fibonacci number.
    >>> fib(0)
    0
    >>> fib(1)
    1
    >>> fib(2)
    1
    >>> fib(3)
    2
    >>> fib(4)
    3
    '''
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)
If you want to run your doctests, just add the following three lines to the bottom of your module.
if __name__ == '__main__':
    import doctest
    doctest.testmod()
Now you can run your module to run the doctests, like python fib.py.
So how well does this fit in with the TDD philosophy? Here are the basic TDD practices.
- Think about what you want to test
- Write a small test
- Write just enough code to fail the test
- Run the test and watch it fail
- Write just enough code to pass the test
- Run the test and watch it pass (if it fails, go back to step 4)
- Go back to step 1 and repeat until done
And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.
1. Define a new empty method
def fib(n):
    '''Return the nth fibonacci number.'''
    pass

if __name__ == '__main__':
    import doctest
    doctest.testmod()
2. Write a doctest
def fib(n):
    '''Return the nth fibonacci number.
    >>> fib(0)
    0
    '''
    pass
3. Run the module and watch the doctest fail
python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.
4. Write just enough code to pass the failing doctest
def fib(n):
    '''Return the nth fibonacci number.
    >>> fib(0)
    0
    '''
    return 0
5. Run the module and watch the doctest pass
python fib.py
6. Go back to step 2 and repeat
Now you can start filling in the rest of the function, one test at a time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.
Unit Tests
Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you’ll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.
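For example, a traditional unittest version of the fib doctests might look like this (a minimal sketch, assuming the fib function lives in fib.py):

import unittest
from fib import fib

class FibTest(unittest.TestCase):
    def test_base_cases(self):
        self.assertEqual(fib(0), 0)
        self.assertEqual(fib(1), 1)

    def test_recursive_cases(self):
        self.assertEqual(fib(2), 1)
        self.assertEqual(fib(4), 3)

if __name__ == '__main__':
    unittest.main()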
Running Tests
For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg
[nosetests]
verbosity=3
with-doctest=1
Then in my Makefile, I add a test command so I can run make test.
test:
	@nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2
PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.
And finally, if you’re looking for a continuous integration server, try Buildbot.
Programming As Information Architecture
Code = Information. [1] Therefore, Software Architecture can be approached as Information Architecture. Information Architecture can be defined as:
- The structural design of shared information environments.
- The art and science of shaping information products.
The above definitions and much of the inspiration for this article come from the book Information Architecture for the World Wide Web. My goal is to explain some of what an Information Architect does, and to argue that software developers, especially the lead developer, should approach their code as an information system, applying the principles of Information Architecture. Why? Because it will lead to more organized, better structured, easier to understand code, which will reduce maintenance costs, decrease training time, and generally make it easier for you and your team to get things done.
So, what are the core focus areas for an Information Architect?
- Organization Systems
- Labeling Systems
- Navigation and Search Systems
- Controlled Vocabularies and Metadata
- Research
- Strategy
- Design and Documentation
Organization Systems
Organization Systems are exactly what you think they are: systems to organize information. Imagine if you had to come into your current code base completely fresh, knowing nothing about it. Does that thought horrify you? If your code isn’t organized, then it can be very hard for new developers to come in and figure out what’s going on. Think of your code repository as a shared information environment. If you are the only one that can navigate it, let alone modify it, then you’ll always be stuck maintaining it. Hopefully your goal is not job security, but to provide an environment conducive to change.
So how should you organize your code? Unfortunately, that’s not something that’s really taught anywhere. My general practice is to follow the recommendations of the language/platform. If they say all code should go in a directory called src/, then that’s where I put it. If every class is supposed to be in its own file, then that’s what I do. And if the platform documentation doesn’t specify how to do something, I’ll find a major open source project and see how they do things. The key to an organization system is to maintain logical consistency. Then, as long as you know the logic, you can figure out where things are or where something should go.
Labeling Systems
Labeling Systems are basically standard naming practices. In IA, a labeling system specifies what label goes with each element in every context. For programming, you’ll want a consistent naming scheme to make sure that all your code objects are consistently and clearly labeled. Good labels are simple, make sense in context, and hint at the details of the labeled object. The goal is to communicate information efficiently. You are not just writing code for yourself, you’re writing code for the team. The best code is not only functional, it’s readable, concise, and even beautiful. Clear labeling goes a long way towards achieving that ideal.
Navigation and Search Systems
Navigation and Search Systems are very important to information focused websites, but they don’t apply much to code by itself. However, good Navigation and Search Systems are essential for API documentation. I believe that the quality of the API documentation has a huge effect on the adoption rate of libraries and platforms. [2] Good API docs can be a great resource for quickly looking up a function and understanding how to use it. But if a developer can’t navigate and search your API documentation, then how will they figure out how that function works? Luckily for us programmers, navigation is usually provided for free with a documentation generator. And Google can handle the search for you.
Controlled Vocabularies and Metadata
While all programming languages have a controlled vocabulary, in IA this refers to domain knowledge. The principle is to use words and jargon that are common to whatever domain you are developing for.
Metadata, in this case, is information about the code, such as comments and documentation. Just as an Information Architect is in charge of the language use within a system, the lead developer should be in charge of the domain language and how to use it.
Research
The goal of IA Research is to understand what needs to be designed and built before doing the work. In programming, you are often presented with problems you’ve never solved before. Hacking, or exploratory programming, is a way to figure out and evaluate possible solutions. Hacking is Research. The goal of research oriented hacking is to figure out possible solutions, evaluate platforms and technologies, and understand the constraints that come with each technology and solution. The knowledge you gather from research is used to drive your strategic choices.
Strategy
IA Strategy is about platform, process and design. What programming language(s) will you use? What are the core design patterns and architectural choices? What version control system will the team use? How will you track progress? Your strategic decisions will set the design constraints of the implementation and drive the development process.
Design and Documentation
In software development, the code is the design, but not everyone will want to read your code to understand how things work. You may need to communicate the design in other ways, such as with diagrams, comments, and documentation. And if the code is being written by someone else, then it’s your job to communicate how their code will fit in to the rest of the system. Design documents aren’t for you, they’re for the other people on the team. You do want other team members to understand your code, right? And if your diagrams and documentation are good enough, you might even get business people to think they understand your software too 🙂
Conclusion
Information Architecture provides a top-down view of your software system. As a lead developer or software architect, IA principles and practices can help make sure that your system is well designed and that the design is communicated clearly to all team members. For further reading, I recommend Information Architecture for the World Wide Web and Documenting Software Architectures.
Notes
[1] Code = Data, Data = Information, Code = Information.
[2] I wish I had some data to back this up, but it’s certainly how I behave. Lack of clear documentation = fail.
Static Analysis of Erlang Code with Dialyzer
Dialyzer is a tool that does static analysis of your erlang code. It’s great for identifying type errors and unreachable code. Here’s how to use it from the command line.
dialyzer -r PATH/TO/APP -I PATH/TO/INCLUDE
Pretty simple! PATH/TO/APP should be an erlang application directory containing your ebin/ and/or src/ directories. PATH/TO/INCLUDE should be a path to a directory that contains any .hrl files that need to be included. The -I is optional if you have no include files. You can have as many -r and -I options as you need. If you add -q, then dialyzer runs more quietly, succeeding silently or reporting any errors found.
If you have a test/ directory with Common Test suites, then you’ll want to add “-I /usr/lib/erlang/lib/test_server*/include/” and “-I /usr/lib/erlang/lib/common_test*/include/”. I’ve actually set this up in my Makefile to run as make check. It’s been great for catching bad return types, misspellings, and wrong function parameters.