Category Archives: programming

Recent Advances in Deep Learning for Natural Language Processing

This article was original published at The New Stack under the title “How Deep Learning Supercharges Natural Language Processing“.

Voice search, intelligent assistants, and chatbots are becoming common features of modern technology. Users and customers are demanding a better, more human experience when interacting with computers. According to Tableau’s business trends report, IDC predicts that by 2019, intelligent assistants will become commonly accessible to enterprise workers, while Gartner predicts that by 2020, 50 percent of analytics queries will involve some form of natural language processing. Chatbots, intelligent assistants, natural language queries, and voice-enabled applications all involve various forms of natural language processing. To fully realize these new user experiences, we will need to build upon the latest methods, some of which I will cover here.

Let’s start with the basics: what is natural language processing? Natural language processing (NLP), is a collection of techniques for helping machines understand human language. For example, one of the essential techniques is tokenization: breaking up text into “tokens,” such as words. Given individual words in sequence, you can start to apply reason to them, and do things like sentiment analysis to determine if a piece of text is positive or negative. But even a task as simple as word identification can be quite tricky. Is the word what’s really one word or two (what + is, or what + was)? What about languages that use characters to represent multi-word concepts, like Kanjii?

Deep learning is an advanced type of machine learning using neural networks. It became popular due to the success of the techniques at solving problems such as image classification (labeling an image based on visual content) and speech recognition (converting sounds into text). Many people thought that deep learning techniques, when applied to natural language, would quickly achieve similar levels of performance. But because of all the idiosyncrasies of natural language, the field has not seen the same kind of breakthrough success with deep learning as other fields, like image processing. However, that appears to be changing. In the past few years, researchers have been applying newer deep learning methods to natural language processing, and I will share some of these recent successes.

Deep learning — through recent improvements to word embeddings, a focus on attention, mobile enablement, and its appearance in the home — is starting to capture natural language processing like it previously captured image processing. In this article, I will cover some recent deep learning-based NLP research successes that have made an impact on the field. Because of these improvements, we will see simpler and more natural user experiences, better software performance, and more powerful home and mobile applications.

Word Embeddings

Words are essential to every natural language processing system. Traditional NLP looks at words as strings, but deep learning techniques can only process numeric vectors. Word embeddings were invented as a way to transform words into vectors, enabling new kinds of mathematical feature analysis. But the vector representation of words is only as good as the text it was trained on.

The more common word embeddings are trained on Wikipedia, but Wikipedia text may not be representative of whatever text you’re processing. It’s generally written as well structured factual statements, which is nothing like text found on twitter, and both of these are different than restaurant reviews. So vectors trained on Wikipedia might be mathematically misleading if you use those vectors to analyze a different style of text. Text from the Common Crawl provides a more diverse set of text for training a word embedding model. The FastText library provides some great pre-trained English word vectors, along with tools for training your own. Training your own vectors is essential if you’re processing any language other than English.

Character level embeddings have also shown surprising results. This technique tries to learn vectors for individual characters, where words would be represented as a composition of the individual character vectors. In an effort to learn how to predict the next character in reviews, researchers discovered a sentiment neuron, which they could control to produce positive or negative review output. Using the sentiment neuron, they were able to beat the previous top accuracy score on the sentiment treebank. This is quite an impressive result for something discovered as a side effect of other research.

CNNs, RNNs, and Attention

Moving beyond vectors, deep learning requires training neural networks for various tasks. Vectors are the input and output, in between are layers of nodes connected together in a network. The nodes represent functions on the input data, with each function taking the input from the previous layer and producing output for the next layer. The structure of the network and how the nodes are connected very much determines the learning capabilities and performance.

In general, the deeper and more complicated a network, the longer it takes to train. When using large datasets, many networks can only be effectively trained using clusters of graphics processors (GPUs), because GPUs are optimized for the necessary floating point math. This puts some types of deep learning outside the reach of anyone not at large companies or institutions that can afford the expensive GPU clusters necessary for deep learning on big data.

Standard neural networks are feedforward networks, where each node in a layer is forward connected to every node in the next layer. A Recurrent Neural Network (RNN) is a network where the nodes in each layer also connect back to the previous layer. This creates a kind of memory that can be great for learning from sequences, such as words in a sentence.

A Convolutional Neural Networks (CNN) is a type feedforward network, but with more layers, and where the forward connections have been manipulated, or convoluted, to achieve certain properties. CNNs tend to be good at extracting position invariant features, meaning they do not care so much about sequence ordering. Because of this, CNNs can be trained in a more parallel manner, leading to faster training and optimization compared to RNNs.

While CNNs may win in raw speed, both types of neural networks tend to have comparable performance characteristics.  In fact, RNNs have a slight edge when it comes to sequence oriented tasks like Part-of-Speech tagging, where you are trying to identify the part of speech (such as “noun” or “verb”) for each word in a sentence. For a detailed performance comparison of CNNs and RNNs applied to NLP see: Comparative Study of CNN and RNN for Natural Language Processing.

The most successful RNN models are the LSTM (Long short-term memory) and GRU (gated recurrent unit). These use attention gates, which act as a kind of short-term memory for the network. However, a newer research paper implies that attention may be all you need. By doing away with recurrence networks and convolution, and keeping only attention mechanisms, these models can be trained in parallel like a CNN, but even faster, and have comparable better performance than RNNs on some sequence learning tasks, such machine translation.

Reducing the training cost while maintaining comparable performance means that smaller companies and individuals can throw more data at their deep learning models, and potentially compete more effectively with larger companies and institutions.

Software 2.0

One of the nice properties of neural network models is that the core algorithms and math are mostly the same. Once you have the infrastructure, model definition, and training algorithms all setup, these models are very reusable. “Software 2.0” is the idea that significant components of an application or system can be replaced by neural network models. Instead of writing code, developers:

  1. Collect training data
  2. Clean and label the data
  3. Train a model
  4. Integrate the model

While the most interesting parts are often steps three and four, most of the work happens in the data preparation steps one and two. Collecting and curating good, useful, clean data can be a significant amount of work, which is why methods like corpus bootstrapping are important for getting to good data faster. In the long run, it is often easier to make better data than it is to design better algorithms.

The past few years have demonstrated that neural networks can achieve much better performance than many alternatives, sometimes even in areas not traditionally touched by machine learning. One of the most recent and interesting advances is in learning data indexing structures. B-tree indexes are a commonly used data structure that provides an efficient way of finding data, assuming the tree is structured well. However, these newly learned indexes significantly outperformed the traditional B-tree indexes in both speed and memory usage. Such low-level data structure performance improvements could have far-reaching impacts if it can be integrated into standard development practices.

As research progresses, and the necessary infrastructure becomes cheaper and more available, deep learning models are likely to be used in more and more parts of the software stack, including mobile applications.

Mobile Machine Learning

Most deep learning requires clusters of expensive GPUs and lots of RAM. This level of compute power is only accessible to those who can afford it, usually in the cloud. But consumers are increasingly using mobile devices, and much of the world does not have reliable and affordable full-time wireless connectivity. Getting machine learning into mobile devices will enable more developers to create all sorts of new applications.

  • Apple’s CoreML framework enables a number of NLP capabilities on iOS devices, such as language identification and named entity recognition.
  • Baidu developed a CNN library for mobile deep learning that works on both iOS and Android.
  • Qualcomm created a Neural Processing Engine for its mobile processors, enabling popular deep learning frameworks to operate on mobile devices.

Expect a lot more of this in the near future, as mobile devices continue to become more powerful and ubiquitous. Marc Andreessen famously said that “software is eating the world,” and now machine learning appears to be eating software. Not only is it in our pocket, it is also in our homes.

Deep Learning in the Home

Alexa and other voice assistants became mainstream in 2017, bringing NLP into millions of homes. Mobile users are already familiar with Siri and Google Assistant, but the popularity of Alexa and Google Home shows how many people have become comfortable having conversations with voice-activated dialogue systems. How much these systems rely on deep learning is somewhat unknown, but it is fairly certain that significant parts of their dialogue systems use deep learning models for core functions such as speech to text, part of speech tagging, natural language generation, and text to speech.

As research advances and these companies collect increasing amounts of data from their users, deep learning capabilities will improve as well, and implementations of “software 2.0” will become pervasive. While a few large companies are creating powerful data moats, there is always room on the edges for highly specialized, domain-specific applications of natural languages, such as cybersecurity, IT operations, and data analytics.

Deep learning has become a core component of modern natural language processing systems.

However, many traditional natural language processing techniques are still quite effective and useful, especially in areas that lack the huge amounts of training data necessary for deep learning. I will cover these traditional statistical techniques in an upcoming article.

Monetizing the Text-Processing API with Mashape

This is a short story about the text-processing.com API, and how it became a profitable side-project, thanks to Mashape.

Text-Processing API

When I first created text-processing.com, in the summer of 2010, my initial intention was to provide an online demo of NLTK’s capabilities. I trained a bunch of models on various NLTK corpora using nltk-trainer, then started making some simple Django forms to display the results. But as I was doing this, I realized I could fairly easily create an API based on these models. Instead of rendering HTML, I could just return the results as JSON.

I wasn’t sure if anyone would actually use the API, but I knew the best way to find out was to just put it out there. So I did, initially making it completely open, with a rate limit of 1000 calls per day per IP address. I figured at the very least, I might get some PHP or Ruby users that wanted the power of NLTK without having to interface with Python. Within a month, people were regularly exceeding that limit, and I quietly increased it to 5000 calls/day, while I started searching for the simplest way to monetize the API. I didn’t like what I found.

Monetizing APIs

Before Mashape, your options for monetizing APIs were either building a custom solution for authentication, billing, and tracking, or pay thousands of dollars a month for an “enterprise” solution from Mashery or Apigee. While I have no doubt Mashery & Apigee provide quality services, they are not in the price range for most developers. And building a custom solution is far more work than I wanted to put into it. Even now, when companies like Stripe exist to make billing easier, you’d still have to do authentication & call tracking. But Stripe didn’t exist 2 years ago, and the best billing option I could find was Paypal, whose API documentation is great at inducing headaches. Lucky for me, Mashape was just opening up for beta testing, and appeared to be in the process of solving all of my problems 🙂

Mashape

Mashape was just what I needed to monetize the text-processing API, and it’s improved tremendously since I started using it. They handle all the necessary details, like integrated billing, plus a lot more, such as usage charts, latency & uptime measurements, and automatic client library generation. This last is one of my favorite features, because the client libraries are generated using your API documentation, which provides a great incentive to accurately document the ins & outs of your API. Once you’ve documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective C.

Here’s a little history for the curious: Mashape originally did authentication and tracking by exchanging tokens thru an API call. So you had to write some code to call their token API on every one of your API calls, then check the results to see if the call was valid, or if the caller had reached their limit. They didn’t have all of the nice charts they have now, and their billing solution was the CEO manually handling Paypal payments. But none of that mattered, because it worked, and from conversations with them, I knew they were focused on more important things: building up their infrastructure and positioning themselves as a kind of app-store for APIs.

Mashape has been out of beta for a while now, with automated billing, and a custom proxy server for authenticating, routing, and tracking all API calls. They’re releasing new features on a regular basis, and sponsoring events like MusicHackDay. I’m very impressed with everything they’re doing, and on top of that, they’re good hard-working people. I’ve been over to their “hacker house” in San Francisco a few times, and they’re very friendly and accomodating. And if you’re ever in the neighborhood, I’m sure they’d be open to a visit.

Profit

Once I had integrated Mashape, which was maybe 20 lines of code, the money started rolling in :). Just kidding, but using the typical definition of profit, when income exceeds costs, the text-processing API was profitable within a few months, and has remained so ever since. My only monetary cost is a single Linode server, so as long as people keep paying for the API, text-processing.com will remain online. And while it has a very nice profit margin, total monthly income barely approaches the cost of living in San Francisco. But what really matters to me is that text-processing.com has become a self-sustaining excuse for me to experiment with natural language processing techniques & data sets, test my models against the market, and provide developers with a simple way to integrate NLP into their own projects.

So if you’ve got an idea for an API, especially if it’s something you could charge money for, I encourage you to build it and put it up on Mashape. All you need is a working API, a unique image & name, and a Paypal account for receiving payments. Like other app stores, Mashape takes a 20% cut of all revenue, but I think it’s well worth it compared to the cost of replicating everything they provide. And unlike some app stores, you’re not locked in. Many of the APIs on Mashape also provide alternative usage options (including text-processing), but they’re on Mashape because of the increased exposure, distribution, and additional features, like client library generation. SaaS APIs are becoming a significant part of modern computing infrastructure, and Mashape provides a great platform for getting started.

Testing Command Line Scripts with Roundup

As nltk-trainer becomes more stable, I realized that I needed some way to test the command line scripts. My previous ad-hoc method of “test whatever script options I can remember” was becoming unwieldy and unreliable. But how do you make repeatable tests for a command line script? It doesn’t really fit into the standard unit testing model.

Enter roundup by Blake Mizerany. (NOTE: do not try to do apt-get install roundup. You will get an issue tracking system, not a script testing tool).

Roundup provides a great way to prevent shell bugs by creating simple test functions within a shell script. Here’s the first dozen lines of train_classifier.sh, which you can probably guess tests train_classifier.py:

#!/usr/bin/env roundup

describe "train_classifier.py"

it_displays_usage_when_no_arguments() {
	./train_classifier.py 2>&1 | grep -q "usage: train_classifier.py"
}

it_cannot_find_foo() {
	last_line=$(./train_classifier.py foo 2>&1 | tail -n 1)
	test "$last_line" "=" "ValueError: cannot find corpus path for foo"
}

describe is like the name of a module or test case, and all test functions begin with test_. Within the test functions, you use standard shell commands that should produce no output on success (like grep -q or the test command). You can also match multiple lines of output, as in:

it_trains_movie_reviews_paras() {
	test "$(./train_classifier.py movie_reviews --no-pickle --no-eval --fraction 0.5 --instances paras)" "=" "loading movie_reviews
2 labels: ['neg', 'pos']
1000 training feats, 1000 testing feats
training NaiveBayes classifier"
}

Once you’ve got all your test functions defined, make sure your test script is executable and roundup is installed, then run your test script. You’ll get nice output that looks like:

nltk-trainer$ tests/train_classifier.sh
train_classifier.py
  it_displays_usage_when_no_arguments:             [PASS]
  it_cannot_find_foo:                              [PASS]
  it_cannot_import_reader:                         [PASS]
  it_trains_movie_reviews_paras:                   [PASS]
  it_trains_corpora_movie_reviews_paras:           [PASS]
  it_cross_fold_validates:                         [PASS]
  it_trains_movie_reviews_sents:                   [PASS]
  it_trains_movie_reviews_maxent:                  [PASS]
  it_shows_most_informative:                       [PASS]
=========================================================
Tests:    9 | Passed:   9 | Failed:   0

So far, roundup has been a perfect tool for testing all the nltk-trainer scripts, and the only downside is the one-time manual installation. I highly recommend it for anyone writing custom commands and scripts, no matter what language you use to write them.

Hierarchical Classification

Hierarchical classification is an obscure but simple concept. The idea is that you arrange two or more classifiers in a hierarchy such that the classifiers lower in the hierarchy are only used if a higher classifier returns an appropriate result.

For example, the text-processing.com sentiment analysis demo uses hierarchical classification by combining a subjectivity classifier and a polarity classifier. The subjectivity classifier is first, and determines whether the text is objective or subjective. If the text is objective, then a label of neutral is returned, and the polarity classifier is not used. However, if the text is subjective (or polar),  then the polarity classifier is used to determine if the text is positive or negative.

Hierarchical Sentiment Classification Model

Hierarchical classification is a useful way to combine multiple binary classifiers, if you have a hierarchy of labels that can modeled as a binary tree. In this model, each branch of the tree either continues on to a new pair of branches, or stops, and at each branching you use a classifier to determine which branch to take.

The Perils of Web Crawling

I’ve written a lot of web crawlers, and these are some of the problems I’ve encountered in the process. Unfortunately, most of these problems have no solution, other than quitting in disgust. Instead of looking for a way out when you’re already deep in it, take this article as a warning of what to expect before you start down the perilous path of making a web spider. Now to be clear, I’m referring to deep web crawlers coded for a specific site, not a generic link harvester or image downloader. Generic crawlers probably have their own set of problems on top of everything mentioned below.

Most web pages suck

I have yet to encounter a website whose HTML didn’t generate at least 3 WTFs.

The only valid measure of code quality: WTFs/minute

Seriously, I now have a deep respect for web browsers that are able to render the mess of HTML that exists out there. If only browsers were better at memory management, I might actually like them again (see unified theory of browser suckage).

Inconsistent information & navigation

2 pages on the same site, with the same url pattern, and the same kind of information, may still not be the same. One will have additional information that you didn’t expect, breaking all your assumptions about what HTML is where. Another may not have essential information you expected all similar pages to have, again throwing your HTML parsing assumptions out the window.

Making things worse, the same kind of page on different sections of a site may be completely different. This generally occurs on larger sites where different sections have different teams managing them. The worst is when the differences are non-obvious, so that you think they’re the same, until you realize hours later that the underlying HTML structure is completely different, and CSS had been fooling you, making it look the same when it’s not.

90% of writing a crawler is debugging

Your crawler will break. It’s not a matter of if but when. Sometimes it breaks immediately after you write it, because it can’t deal with the malformed HTML garbage that you’re trying to parse. Sometimes it happens randomly, weeks or months later, when the site makes a minor change to their HTML that sends your crawler into a chaotic death spiral. Whatever the case, it’ll happen, it may not be your fault, but you’ll have to deal with it (I recommend cursing profusely). If you can’t deal with random sessions of angry debugging because some stranger you don’t know but want to punch in the face made a change that broke your crawler, quit now.

You can’t make their connection any faster

If your crawler is slow, it’s probably not your fault. Chances are, your incoming connection is much faster and less congested than their outgoing connection. The internet is a series of tubes, and while their tube may be a lot larger than yours, it’s filled with other people’s crap, slowing down your completely legitimate web crawl to, well, a crawl. The only solution is to request more pages in parallel, which leads directly to the next problem…

Sites will ban you

Many web sites don’t want to be crawled, at least not by you. They can’t really stop you, but they’ll try. In the end, they only win if you let them, but it can be quite annoying to deal with a site that doesn’t want you to crawl them. There’s a number of techniques for dealing with this:

  • route your crawls thru TOR, if you don’t mind crawling slower than a pentium 1 on with a 2.4 bps modem (if you’re too young to know what that means, think facebook or myspace, but 100 times slower, and you’re crying because you’re peeling a dozen onions while you wait for the page to load)
  • use anonymous proxies, which of course are always completely reliable, and never randomly go offline for no particular reason
  • slow down your crawl to the point a one armed blind man could do it faster
  • ignore robots.txt – what right do these sites have to tell you what you can & can’t crawl?!? This is America, isn’t it!?!

Conclusion

Don’t write a web crawler. Seriously, it’s the most annoying programming work in the world. Of course, chances are that if you’re actually considering writing a crawler, it’s because there’s no other option. In that case, good luck, and stock up on liquor.

Programming Philosophy Links

Machine Learning Links

How to Deploy hgwebdir.fcgi behind Nginx with Fab

If you’re managing multiple mercurial repositories, it’s nice to see them all in one place, using a simple web-based repository browser. There’s various ways to publish mercurial repositories, but hgwebdir is the only method that supports multiple repos. Since I prefer fastcgi and nginx, I decided to use hgwebdir.fcgi, which unfortunately isn’t documented on the mercurial wiki.

hgweb.config

Let’s start by creating hgweb.config, which tells hgwebdir where the repos are and what the web UI should look like.

[paths]
/REPO1NAME = /PATH/TO/REPO1
/REPO2NAME = /PATH/TO/REPO2

[web]
base =
style = monoblue

There’s a few different included themes you can choose from, I like the monoblue style. The empty base= line is apparently required to make everything work.

hgwebdir.fcgi

Next, make a copy of hgwebdir.fcgi, which in Ubuntu can be found in /usr/share/doc/mercurial/examples. Below is a simplified version with all comments removed. The one line you may want to change is the path to hgweb.config on the server, but I’ll assume you’ll want it in /etc/mercurial.

from mercurial import demandimport; demandimport.enable()
from mercurial.hgweb.hgwebdir_mod import hgwebdir
from mercurial.hgweb.request import wsgiapplication
from flup.server.fcgi import WSGIServer

def make_web_app():
    return hgwebdir("/etc/mercurial/hgweb.config")

WSGIServer(wsgiapplication(make_web_app)).run()

hg_server.conf

This is a simple nginx fastcgi config you can modify for your own purposes. It forwards all requests for hg.DOMAIN.COM to the hgwebdir.fcgi socket we’ll be starting below.

server {
	listen 80;
	server_name hg;
	server_name hg.DOMAIN.COM;

	access_log /var/log/hg_access.log;
	error_log /var/log/hg_error.log;

	location / {
		fastcgi_pass	unix:/var/run/hgwebdir.sock;
		fastcgi_param	PATH_INFO	$fastcgi_script_name;
		fastcgi_param	QUERY_STRING	$query_string;
		fastcgi_param	REQUEST_METHOD	$request_method;
		fastcgi_param	CONTENT_TYPE	$content_type;
		fastcgi_param	CONTENT_LENGTH	$content_length;
		fastcgi_param	SERVER_PROTOCOL	$server_protocol;
		fastcgi_param	SERVER_PORT	$server_port;
		fastcgi_param	SERVER_NAME	$server_name;
	}
}

The hg_server.conf file will need a link from /etc/nginx/sites-enabled to its location in /etc/nginx/sites-available, assuming that you’re using the default nginx config which includes every server conf found in /etc/nginx/sites-available.

fab hgweb restart_nginx

To make deployment easy, I use fab, so that if I make any changes to hgweb.config or hg_server.conf, I can simply run fab hgweb restart_nginx. For starting hgwebdir.fcgi, we can use spawn-fcgi, which usually comes with lighttpd, so you’ll need that installed too.

hgweb copies hgweb.config and hgwebdir.fcgi to appropriate locations on the server, then starts the fastcgi process with a socket at /var/run/hgwebdir.sock.

restart_nginx copies hg_server.conf to the server and tells nginx to reload its config.

def hgweb():
    env.runpath = '/var/run'
    put('hgweb.config', '/tmp')
    put('hgwebdir.fcgi', '/tmp')
    sudo('mv /tmp/hgwebdir.fcgi /usr/local/bin/')
    sudo('chmod +x /usr/local/bin/hgwebdir.fcgi')
    sudo('mv /tmp/hgweb.config /etc/mercurial/hgweb.config')
    sudo('kill `cat %s/hgwebdir.pid`' % env.runpath)
    sudo('spawn-fcgi -f /usr/local/bin/hgwebdir.fcgi -s %s/hgwebdir.sock -P %s/hgwebdir.pid' % (env.runpath, env.runpath), user='www-data')

def restart_nginx():
    put('hg_server.conf', '/tmp/')
    sudo('mv /tmp/hg_server.conf /etc/nginx/sites-available/')
    sudo('killall -HUP nginx')

Once you’ve got these commands in your fabfile.py, you can run fab hgweb restart_nginx to deploy.

hgrc

Now that you’ve got hgwebdir.fcgi running (you can make sure it works by going to http://hg.DOMAIN.COM), you’ll probably want to customize the info about each repo by editing .hg/hgrc.

[web]
description = All about my repo
contacts = Me

And that’s it, you should now have a fast web-based browser for multiple repos 🙂

mapfilter

Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter?

mapfilter?

mapfilter is a function that combines the traditional map & filter of functional programming by using the following logic:

  1. if your function returns false, then the element is discarded
  2. any other return value is mapped into the list

Why?

Doing a map and then a filter is O(2N), whereas mapfilter is O(N). That’s twice as a fast! If you are dealing with large lists, this can be a huge time saver. And for the case where a large list contains small IDs for looking up a larger data structure, then using mapfilter can result in half the number of database lookups.

Obviously, mapfilter won’t work if you want to produce a list of boolean values, as it would filter out all the false values. But why would you want to map to a list booleans?

Erlang Code

Here’s some erlang code I’ve been using for a while:

mapfilter(F, List) -> lists:reverse(mapfilter(F, List, [])).

mapfilter(_, [], Results) ->
        Results;
mapfilter(F, [Item | Rest], Results) ->
        case F(Item) of
                false -> mapfilter(F, Rest, Results);
                Term -> mapfilter(F, Rest, [Term | Results])
        end.

Has anyone else done this for themselves? Does mapfilter exist in any programming language? If so, please leave a comment. I think mapfilter is a very simple & useful concept that should be a included in the standard library of every (functional) programming language. Erlang already has mapfoldl (map-reduce in one pass), so why not also have mapfilter?

Test Driven Development in Python

One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here’s a quick example of a fibonacci function.

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        >>> fib(1)
        1
        >>> fib(2)
        1
        >>> fib(3)
        2
        >>> fib(4)
        3
        '''
        if n == 0:
                return 0
        elif n == 1:
                return 1
        else:
                return fib(n - 1) + fib(n - 2)

If you want to run your doctests, just add the following three lines to the bottom of your module.

if __name__ == '__main__':
        import doctest
        doctest.testmod()

Now you can run your module to run the doctests, like python fib.py.

So how well does this fit in with the TDD philosophy? Here’s the basic TDD practices.

  1. Think about what you want to test
  2. Write a small test
  3. Write just enough code to fail the test
  4. Run the test and watch it fail
  5. Write just enough code to pass the test
  6. Run the test and watch it pass (if it fails, go back to step 4)
  7. Go back to step 1 and repeat until done

And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.

1. Define a new empty method

def fib(n):
'''Return the nth fibonacci number.'''
pass

if __name__ == '__main__':
import doctest
doctest.testmod()

2. Write a doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        pass

3. Run the module and watch the doctest fail

python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.

4. Write just enough code to pass the failing doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        return 0

5. Run the module and watch the doctest pass

python fib.py

6. Go back to step 2 and repeat

Now you can start filling in the rest of function, one test at time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.

Unit Tests

Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you’ll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.

Running Tests

For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg

[nosetests]
verbosity=3
with-doctest=1

Then in my Makefile, I add a test command so I can run make test.

test:
        @nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2

PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.

And finally, if you’re looking for a continuous integration server, try Buildbot.