StreamHacker: Weotta be Hacking

27 Feb 2013

Monetizing the Text-Processing API with Mashape

This is a short story about the text-processing.com API, and how it became a profitable side-project, thanks to Mashape.

Text-Processing API

When I first created text-processing.com, in the summer of 2010, my initial intention was to provide an online demo of NLTK's capabilities. I trained a bunch of models on various NLTK corpora using nltk-trainer, then started making some simple Django forms to display the results. But as I was doing this, I realized I could fairly easily create an API based on these models. Instead of rendering HTML, I could just return the results as JSON.
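To make that concrete, here's roughly what such a view looks like. This is a minimal sketch, not the actual text-processing.com code: classify() is a hypothetical stand-in for a model trained with nltk-trainer.

import json

from django.http import HttpResponse

def classify(text):
    # hypothetical stand-in for an nltk-trainer sentiment model
    return 'pos' if 'good' in text else 'neg'

def sentiment_api(request):
    # instead of rendering the result into an HTML template,
    # return it directly as JSON
    label = classify(request.POST.get('text', ''))
    return HttpResponse(json.dumps({'label': label}),
                        content_type='application/json')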

I wasn't sure if anyone would actually use the API, but I knew the best way to find out was to just put it out there. So I did, initially making it completely open, with a rate limit of 1000 calls per day per IP address. I figured at the very least, I might get some PHP or Ruby users that wanted the power of NLTK without having to interface with Python. Within a month, people were regularly exceeding that limit, and I quietly increased it to 5000 calls/day, while I started searching for the simplest way to monetize the API. I didn't like what I found.

Monetizing APIs

Before Mashape, your options for monetizing an API were either building a custom solution for authentication, billing, and tracking, or paying thousands of dollars a month for an "enterprise" solution from Mashery or Apigee. While I have no doubt Mashery & Apigee provide quality services, they are not in the price range of most developers. And building a custom solution is far more work than I wanted to put into it. Even now, when companies like Stripe exist to make billing easier, you'd still have to do authentication & call tracking yourself. But Stripe didn't exist 2 years ago, and the best billing option I could find was Paypal, whose API documentation is great at inducing headaches. Lucky for me, Mashape was just opening up for beta testing, and appeared to be in the process of solving all of my problems :)

Mashape

Mashape was just what I needed to monetize the text-processing API, and it's improved tremendously since I started using it. They handle all the necessary details, like integrated billing, plus a lot more, such as usage charts, latency & uptime measurements, and automatic client library generation. That last one is among my favorite features, because the client libraries are generated from your API documentation, which provides a great incentive to accurately document the ins & outs of your API. Once you've documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective-C.

Here's a little history for the curious: Mashape originally did authentication and tracking by exchanging tokens through an API call. So you had to write some code to call their token API on every one of your own API calls, then check the result to see if the call was valid, or if the caller had reached their limit. They didn't have all of the nice charts they have now, and their billing solution was the CEO manually handling Paypal payments. But none of that mattered, because it worked, and from conversations with them, I knew they were focused on more important things: building up their infrastructure and positioning themselves as a kind of app store for APIs.

Mashape has been out of beta for a while now, with automated billing and a custom proxy server for authenticating, routing, and tracking all API calls. They're releasing new features on a regular basis, and sponsoring events like MusicHackDay. I'm very impressed with everything they're doing, and on top of that, they're good hard-working people. I've been over to their "hacker house" in San Francisco a few times, and they're very friendly and accommodating. And if you're ever in the neighborhood, I'm sure they'd be open to a visit.

Profit

Once I had integrated Mashape, which took maybe 20 lines of code, the money started rolling in :). Just kidding. But using the typical definition of profit, income exceeding costs, the text-processing API was profitable within a few months, and has remained so ever since. My only monetary cost is a single Linode server, so as long as people keep paying for the API, text-processing.com will remain online. And while it has a very nice profit margin, the total monthly income barely approaches the cost of living in San Francisco. But what really matters to me is that text-processing.com has become a self-sustaining excuse for me to experiment with natural language processing techniques & data sets, test my models against the market, and provide developers with a simple way to integrate NLP into their own projects.
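For the curious, an integration like this can be as simple as rejecting any request that didn't come through the provider's proxy. Here's a sketch of the general shape; the header name and middleware wiring are my assumptions for illustration, not Mashape's documented API, so check their docs for the real mechanism.

from django.http import HttpResponseForbidden

# hypothetical setting; the real secret would come from your Mashape dashboard
MASHAPE_PROXY_SECRET = 'your-proxy-secret'

class MashapeAuthMiddleware(object):

    def process_request(self, request):
        # only accept API calls routed through the Mashape proxy,
        # identified here by an assumed shared-secret header
        secret = request.META.get('HTTP_X_MASHAPE_PROXY_SECRET')
        if secret != MASHAPE_PROXY_SECRET:
            return HttpResponseForbidden('API calls must go through Mashape')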

So if you've got an idea for an API, especially if it's something you could charge money for, I encourage you to build it and put it up on Mashape. All you need is a working API, a unique image & name, and a Paypal account for receiving payments. Like other app stores, Mashape takes a 20% cut of all revenue, but I think it's well worth it compared to the cost of replicating everything they provide. And unlike some app stores, you're not locked in. Many of the APIs on Mashape also provide alternative usage options (including text-processing), but they're on Mashape because of the increased exposure, distribution, and additional features, like client library generation. SaaS APIs are becoming a significant part of modern computing infrastructure, and Mashape provides a great platform for getting started.

25 Jul 2011

Testing Command Line Scripts with Roundup

As nltk-trainer becomes more stable, I realized that I needed some way to test the command line scripts. My previous ad-hoc method of "test whatever script options I can remember" was becoming unwieldy and unreliable. But how do you make repeatable tests for a command line script? It doesn't really fit into the standard unit testing model.

Enter roundup by Blake Mizerany. (NOTE: do not try to do apt-get install roundup. You will get an issue tracking system, not a script testing tool).

Roundup provides a great way to prevent regressions by letting you write simple test functions in a plain shell script. Here's the first dozen lines of train_classifier.sh, which, as you can probably guess, tests train_classifier.py:

#!/usr/bin/env roundup

describe "train_classifier.py"

it_displays_usage_when_no_arguments() {
	./train_classifier.py 2>&1 | grep -q "usage: train_classifier.py"
}

it_cannot_find_foo() {
	last_line=$(./train_classifier.py foo 2>&1 | tail -n 1)
	test "$last_line" "=" "ValueError: cannot find corpus path for foo"
}

describe is like the name of a module or test case, and every test function's name begins with it_. Within the test functions, you use standard shell commands that should produce no output on success (like grep -q, or the test command). You can also match multiple lines of output, as in:

it_trains_movie_reviews_paras() {
	test "$(./train_classifier.py movie_reviews --no-pickle --no-eval --fraction 0.5 --instances paras)" "=" "loading movie_reviews
2 labels: ['neg', 'pos']
1000 training feats, 1000 testing feats
training NaiveBayes classifier"
}

Once you've got all your test functions defined, make sure your test script is executable and roundup is installed, then run your test script. You'll get nice output that looks like:

nltk-trainer$ tests/train_classifier.sh
train_classifier.py
  it_displays_usage_when_no_arguments:             [PASS]
  it_cannot_find_foo:                              [PASS]
  it_cannot_import_reader:                         [PASS]
  it_trains_movie_reviews_paras:                   [PASS]
  it_trains_corpora_movie_reviews_paras:           [PASS]
  it_cross_fold_validates:                         [PASS]
  it_trains_movie_reviews_sents:                   [PASS]
  it_trains_movie_reviews_maxent:                  [PASS]
  it_shows_most_informative:                       [PASS]
=========================================================
Tests:    9 | Passed:   9 | Failed:   0

So far, roundup has been a perfect tool for testing all the nltk-trainer scripts, and the only downside is the one-time manual installation. I highly recommend it for anyone writing custom commands and scripts, no matter what language you use to write them.

5 Jan 2011

Hierarchical Classification

Hierarchical classification is an obscure but simple concept. The idea is that you arrange two or more classifiers in a hierarchy such that the classifiers lower in the hierarchy are only used if a higher classifier returns an appropriate result.

For example, the text-processing.com sentiment analysis demo uses hierarchical classification by combining a subjectivity classifier and a polarity classifier. The subjectivity classifier is first, and determines whether the text is objective or subjective. If the text is objective, then a label of neutral is returned, and the polarity classifier is not used. However, if the text is subjective (or polar),  then the polarity classifier is used to determine if the text is positive or negative.

(Figure: Hierarchical Sentiment Classification Model)

Hierarchical classification is a useful way to combine multiple binary classifiers, if you have a hierarchy of labels that can be modeled as a binary tree. In this model, each branch of the tree either continues on to a new pair of branches, or stops, and at each branching you use a classifier to determine which branch to take.
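Here's a minimal sketch of the sentiment example in Python, assuming two already-trained binary classifiers with NLTK-style classify() methods:

def hierarchical_sentiment(subjectivity_classifier, polarity_classifier, feats):
    # first level: objective vs subjective
    if subjectivity_classifier.classify(feats) == 'objective':
        # stop here; the polarity classifier is never consulted
        return 'neutral'
    # second level: only subjective text reaches the polarity classifier
    return polarity_classifier.classify(feats)  # 'pos' or 'neg'

Deeper hierarchies work the same way: each node's classifier either produces a final label or decides which child classifier to consult next.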

4 Oct 2010

The Perils of Web Crawling

I've written a lot of web crawlers, and these are some of the problems I've encountered in the process. Unfortunately, most of these problems have no solution, other than quitting in disgust. Instead of looking for a way out when you're already deep in it, take this article as a warning of what to expect before you start down the perilous path of making a web spider. Now to be clear, I'm referring to deep web crawlers coded for a specific site, not a generic link harvester or image downloader. Generic crawlers probably have their own set of problems on top of everything mentioned below.

Most web pages suck

I have yet to encounter a website whose HTML didn't generate at least 3 WTFs.

(Image: "The only valid measure of code quality: WTFs/minute")

Seriously, I now have a deep respect for web browsers that are able to render the mess of HTML that exists out there. If only browsers were better at memory management, I might actually like them again (see unified theory of browser suckage).

Inconsistent information & navigation

Two pages on the same site, with the same URL pattern and the same kind of information, may still not be the same. One will have additional information that you didn't expect, breaking all your assumptions about what HTML is where. Another may be missing essential information you expected all similar pages to have, again throwing your HTML parsing assumptions out the window.

Making things worse, the same kind of page on different sections of a site may be completely different. This generally occurs on larger sites where different sections have different teams managing them. The worst is when the differences are non-obvious: you think the pages are the same, until you realize hours later that the underlying HTML structure is completely different, and the CSS has been fooling you, making things look alike when they're not.
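Practically, this means every parsing assumption needs a guard. Here's a defensive sketch, using BeautifulSoup purely as an example (the selector is a placeholder for whatever site you're scraping):

from bs4 import BeautifulSoup

def extract_price(html):
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('div.price')
    if node is None:
        # the "same" page didn't have the element this time;
        # log it and move on instead of crashing the whole crawl
        return None
    return node.get_text(strip=True)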

90% of writing a crawler is debugging

Your crawler will break. It's not a matter of if but when. Sometimes it breaks immediately after you write it, because it can't deal with the malformed HTML garbage that you're trying to parse. Sometimes it happens randomly, weeks or months later, when the site makes a minor change to their HTML that sends your crawler into a chaotic death spiral. Whatever the case, it'll happen, it may not be your fault, but you'll have to deal with it (I recommend cursing profusely). If you can't deal with random sessions of angry debugging because some stranger you don't know but want to punch in the face made a change that broke your crawler, quit now.

You can't make their connection any faster

If your crawler is slow, it's probably not your fault. Chances are, your incoming connection is much faster and less congested than their outgoing connection. The internet is a series of tubes, and while their tube may be a lot larger than yours, it's filled with other people's crap, slowing down your completely legitimate web crawl to, well, a crawl. The only solution is to request more pages in parallel, which leads directly to the next problem...
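A parallel crawl doesn't need to be fancy; something like this sketch (standard library only, with placeholder URLs) is often enough:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # download one page; the timeout keeps a slow server from
    # stalling the whole crawl
    with urllib.request.urlopen(url, timeout=30) as response:
        return url, response.read()

urls = ['http://example.com/page/%d' % i for i in range(1, 101)]

# fetch up to 8 pages concurrently
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, html in pool.map(fetch, urls):
        print(url, len(html))

Of course, the more workers you add, the faster you run into the next problem.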

Sites will ban you

Many web sites don't want to be crawled, at least not by you. They can't really stop you, but they'll try. In the end, they only win if you let them, but it can be quite annoying to deal with a site that doesn't want you to crawl it. There are a number of techniques for dealing with this:

  • route your crawls through Tor, if you don't mind crawling slower than a Pentium 1 with a 2400 bps modem (if you're too young to know what that means, think facebook or myspace, but 100 times slower, and you're crying because you're peeling a dozen onions while you wait for the page to load)
  • use anonymous proxies, which of course are always completely reliable, and never randomly go offline for no particular reason
  • slow down your crawl to the point that a one-armed blind man could do it faster
  • ignore robots.txt - what right do these sites have to tell you what you can & can't crawl?!? This is America, isn't it!?!

Conclusion

Don't write a web crawler. Seriously, it's the most annoying programming work in the world. Of course, chances are that if you're actually considering writing a crawler, it's because there's no other option. In that case, good luck, and stock up on liquor.

28 Jul 2009

How to Deploy hgwebdir.fcgi behind Nginx with Fab

If you're managing multiple mercurial repositories, it's nice to see them all in one place, using a simple web-based repository browser. There are various ways to publish mercurial repositories, but hgwebdir is the only method that supports multiple repos. Since I prefer fastcgi and nginx, I decided to use hgwebdir.fcgi, which unfortunately isn't documented on the mercurial wiki.

hgweb.config

Let's start by creating hgweb.config, which tells hgwebdir where the repos are and what the web UI should look like.

[paths]
/REPO1NAME = /PATH/TO/REPO1
/REPO2NAME = /PATH/TO/REPO2

[web]
base =
style = monoblue

There are a few different themes included to choose from; I like the monoblue style. The empty base = line is apparently required to make everything work.

hgwebdir.fcgi

Next, make a copy of hgwebdir.fcgi, which in Ubuntu can be found in /usr/share/doc/mercurial/examples. Below is a simplified version with all comments removed. The one line you may want to change is the path to hgweb.config on the server, but I'll assume you'll want it in /etc/mercurial.

from mercurial import demandimport; demandimport.enable()
from mercurial.hgweb.hgwebdir_mod import hgwebdir
from mercurial.hgweb.request import wsgiapplication
from flup.server.fcgi import WSGIServer

def make_web_app():
    return hgwebdir("/etc/mercurial/hgweb.config")

WSGIServer(wsgiapplication(make_web_app)).run()

hg_server.conf

This is a simple nginx fastcgi config you can modify for your own purposes. It forwards all requests for hg.DOMAIN.COM to the hgwebdir.fcgi socket we'll be starting below.

server {
	listen 80;
	server_name hg;
	server_name hg.DOMAIN.COM;

	access_log /var/log/hg_access.log;
	error_log /var/log/hg_error.log;

	location / {
		fastcgi_pass	unix:/var/run/hgwebdir.sock;
		fastcgi_param	PATH_INFO	$fastcgi_script_name;
		fastcgi_param	QUERY_STRING	$query_string;
		fastcgi_param	REQUEST_METHOD	$request_method;
		fastcgi_param	CONTENT_TYPE	$content_type;
		fastcgi_param	CONTENT_LENGTH	$content_length;
		fastcgi_param	SERVER_PROTOCOL	$server_protocol;
		fastcgi_param	SERVER_PORT	$server_port;
		fastcgi_param	SERVER_NAME	$server_name;
	}
}

The hg_server.conf file goes in /etc/nginx/sites-available and needs a symlink from /etc/nginx/sites-enabled, assuming you're using the default nginx config, which includes every server conf found in /etc/nginx/sites-enabled.
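Assuming those default paths, creating the link looks like this:

sudo ln -s /etc/nginx/sites-available/hg_server.conf /etc/nginx/sites-enabled/hg_server.conf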

fab hgweb restart_nginx

To make deployment easy, I use fab, so that if I make any changes to hgweb.config or hg_server.conf, I can simply run fab hgweb restart_nginx. For starting hgwebdir.fcgi, we can use spawn-fcgi, which usually comes with lighttpd, so you'll need that installed too.

hgweb copies hgweb.config and hgwebdir.fcgi to appropriate locations on the server, then starts the fastcgi process with a socket at /var/run/hgwebdir.sock.

restart_nginx copies hg_server.conf to the server and tells nginx to reload its config.

from fabric.api import env, put, sudo

def hgweb():
    env.runpath = '/var/run'
    # upload to /tmp first, since put() runs as the login user,
    # then sudo the files into place
    put('hgweb.config', '/tmp')
    put('hgwebdir.fcgi', '/tmp')
    sudo('mv /tmp/hgwebdir.fcgi /usr/local/bin/')
    sudo('chmod +x /usr/local/bin/hgwebdir.fcgi')
    sudo('mv /tmp/hgweb.config /etc/mercurial/hgweb.config')
    # stop the old fastcgi process, then spawn a new one as www-data
    sudo('kill `cat %s/hgwebdir.pid`' % env.runpath)
    sudo('spawn-fcgi -f /usr/local/bin/hgwebdir.fcgi -s %s/hgwebdir.sock -P %s/hgwebdir.pid' % (env.runpath, env.runpath), user='www-data')

def restart_nginx():
    put('hg_server.conf', '/tmp/')
    sudo('mv /tmp/hg_server.conf /etc/nginx/sites-available/')
    # HUP tells nginx to reload its config without dropping connections
    sudo('killall -HUP nginx')

Once you've got these commands in your fabfile.py, you can run fab hgweb restart_nginx to deploy.

hgrc

Now that you've got hgwebdir.fcgi running (you can make sure it works by going to http://hg.DOMAIN.COM), you'll probably want to customize the info about each repo by editing .hg/hgrc.

[web]
description = All about my repo
contact = Me

And that's it, you should now have a fast web-based browser for multiple repos :)

3 Mar 2009

mapfilter

Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter?

mapfilter?

mapfilter is a function that combines the traditional map & filter of functional programming by using the following logic:

  1. if your function returns false, then the element is discarded
  2. any other return value is mapped into the list

Why?

Doing a map and then a filter makes two passes over the list, whereas mapfilter makes only one. That's twice as fast! If you are dealing with large lists, this can be a huge time saver. And for the case where a large list contains small IDs for looking up a larger data structure, using mapfilter can result in half the number of database lookups.

Obviously, mapfilter won't work if you want to produce a list of boolean values, as it would filter out all the false values. But why would you want to map to a list of booleans?

Erlang Code

Here's some Erlang code I've been using for a while:

mapfilter(F, List) -> lists:reverse(mapfilter(F, List, [])).

mapfilter(_, [], Results) ->
        Results;
mapfilter(F, [Item | Rest], Results) ->
        case F(Item) of
                false -> mapfilter(F, Rest, Results);
                Term -> mapfilter(F, Rest, [Term | Results])
        end.

Has anyone else done this for themselves? Does mapfilter exist in any programming language? If so, please leave a comment. I think mapfilter is a very simple & useful concept that should be included in the standard library of every (functional) programming language. Erlang already has mapfoldl (map-reduce in one pass), so why not also have mapfilter?
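For comparison, a one-pass Python equivalent is easy to write as a generator. This is my own sketch following the same convention as the Erlang version above, not a standard library function:

def mapfilter(f, iterable):
    # single pass: apply f, discard results that are exactly False,
    # and yield everything else
    for item in iterable:
        result = f(item)
        if result is not False:
            yield result

# example: square only the even numbers, in one pass
squares = list(mapfilter(lambda n: n * n if n % 2 == 0 else False, range(10)))
# squares == [0, 4, 16, 36, 64]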

5 Feb 2009

Test Driven Development in Python

One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate: just write a function call and the expected output in the docstring. Here's a quick example of a Fibonacci function.

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        >>> fib(1)
        1
        >>> fib(2)
        1
        >>> fib(3)
        2
        >>> fib(4)
        3
        '''
        if n == 0:
                return 0
        elif n == 1:
                return 1
        else:
                return fib(n - 1) + fib(n - 2)

If you want to run your doctests, just add the following three lines to the bottom of your module.

if __name__ == '__main__':
        import doctest
        doctest.testmod()

Now you can run the doctests simply by running the module: python fib.py.

So how well does this fit in with the TDD philosophy? Here's the basic TDD practices.

  1. Think about what you want to test
  2. Write a small test
  3. Write just enough code to fail the test
  4. Run the test and watch it fail
  5. Write just enough code to pass the test
  6. Run the test and watch it pass (if it fails, go back to step 4)
  7. Go back to step 1 and repeat until done

And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.

1. Define a new empty method

def fib(n):
        '''Return the nth fibonacci number.'''
        pass

if __name__ == '__main__':
        import doctest
        doctest.testmod()

2. Write a doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        pass

3. Run the module and watch the doctest fail

python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.

4. Write just enough code to pass the failing doctest

def fib(n):
        '''Return the nth fibonacci number.
        >>> fib(0)
        0
        '''
        return 0

5. Run the module and watch the doctest pass

python fib.py

6. Go back to step 2 and repeat

Now you can start filling in the rest of the function, one test at a time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.

Unit Tests

Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you'll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.
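For example, here's a minimal unittest version of the fib tests; this is a sketch, assuming the module above was saved as fib.py:

import unittest

from fib import fib

class FibTest(unittest.TestCase):

    def test_base_cases(self):
        self.assertEqual(fib(0), 0)
        self.assertEqual(fib(1), 1)

    def test_recursion(self):
        self.assertEqual(fib(4), 3)

if __name__ == '__main__':
    unittest.main()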

Running Tests

For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg:

[nosetests]
verbosity=3
with-doctest=1

Then in my Makefile, I add a test command so I can run make test.

test:
        @nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2

PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.

And finally, if you're looking for a continuous integration server, try Buildbot.
