It all started with an email to the baypiggies mailing list. An acquisition editor for Packt was looking for authors to expand their line of python cookbooks. For some reason I can't remember, I thought they wanted to put together a multi-author cookbook, where each author contributes a few recipes. That sounded doable, because I'd already written a number of articles that could serve as the basis for a few recipes. So I replied with links to the following articles:
The reply back was:
The next step is to come up with around 8-14 topics/chapters and around 80-100 recipes for the book as a whole.
My first reaction was "WTF?? No way!" But luckily, I didn't send that email. Instead, I took a couple days to think it over, and realized that maybe I could come up with that many recipes, if I broke my knowledge down into small pieces. I also decided to choose recipes that I didn't already know how to write, and use them as motivation for learning & research. So I replied back with a list of 92 recipes, and got to work. Not surprisingly, the original list of 92 changed significantly while writing the book, and I believe the final recipe count is 81.
I was keenly aware that there'd be some necessary overlap with the original NLTK book, Natural Language Processing with Python. But I did my best to minimize that overlap, and to present a different take on similar content. And there are a number of recipes that (as far as I know) you can't find anywhere else, the largest group of which is in Chapter 6, Transforming Chunks and Trees. I'm very pleased with the result, and I hope everyone who buys the book is too. I'd like to think of Python Text Processing with NLTK 2.0 Cookbook as the practical companion to the more teaching-oriented Natural Language Processing with Python.
If you'd like a taste of the book, check out the online sample chapter (pdf), Chapter 3, Custom Corpora, which details how many of the included corpus readers work, how to use them, and how to create your own corpus readers. The last recipe shows you how to create a corpus reader on top of MongoDB, and it should be fairly easy to modify for use with any other database.
Packt has also published two excerpts from Chapter 8, Distributed Processing and Handling Large Datasets, which are partially based on those original 2 articles:
The original client, which still exists as erldis_client.erl, implements asynchronous pipelining. This means you send a bunch of redis commands, then collect all the results at the end. This didn't work for me, as I needed a client that could handle parallel synchronous requests from multiple concurrent processes. So I copied erldis_client.erl to erldis_sync_client.erl and modified it to send replies back as soon as they are received from redis (in FIFO order). Many thanks to dialtone_ for writing the original erldis app as I'm not sure I would've created the synchronous client without it. And thanks to cstar for patches, such as making erldis_sync_client the default client for all functions in erldis.erl.
In addition to the synchronous client, I've added some extra functions and modules to make interfacing with redis more erlangy. Here's a brief overview...
- calls your function with the client PID as the argument
- stops the client
- returns the result of your function
The goal is to reduce boilerplate start/stop code.
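In Python terms, this start/call/stop pattern is roughly what a context manager gives you. The sketch below is only an analogy to the Erlang functions above, not a port of them; `FakeClient` and `with_client` are made-up names for illustration:

```python
from contextlib import contextmanager

class FakeClient:
    """Illustrative stand-in for a client process; not part of erldis."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

@contextmanager
def with_client():
    # start a client, hand it to the caller's code, and always stop it after
    client = FakeClient()
    try:
        yield client
    finally:
        client.stop()

# the caller only writes the code that uses the client;
# startup and shutdown happen automatically
with with_client() as client:
    result = 'did work' if not client.stopped else 'client was stopped'
```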
Despite the low version numbers, I've been successfully using erldis as a component in parallel/distributed information retrieval (in conjunction with plists), and for accessing data shared with python / django apps. It's a fully compliant erlang application that you can include in your target system release structure.
Say you want to build a frequency distribution of many thousands of samples with the following characteristics:
- fast to build
- persistent data
- network accessible (with no locking requirements)
- can store large sliceable index lists
The only solution I know that meets those requirements is Redis. NLTK's FreqDist is not persistent, shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there's no other key-value store that makes sliceable lists really easy to create & access. So far I've been quite pleased with Redis, especially given how new it is. It's quite fast, is network accessible, atomic operations make locking unnecessary, it supports sortable and sliceable list structures, and it's very easy to configure.
Why build an NLTK FreqDist on Redis?
Building an NLTK FreqDist on top of Redis allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging through your samples.
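As a rough sketch of the data model (not the actual RedisFreqDist code), samples map to counts that are bumped with atomic increments, and a sorted index of samples supports paging; here a plain dict stands in for Redis so the example runs without a server:

```python
class MiniFreqDist:
    """Toy model of the sample -> count mapping. In RedisFreqDist the
    counts live in Redis (bumped via atomic INCR) and the sample index
    is kept under the __samples__ key."""

    def __init__(self):
        self.counts = {}

    def inc(self, sample):
        # with real Redis, INCR is atomic, so concurrent writers need no locking
        self.counts[sample] = self.counts.get(sample, 0) + 1

    def freq(self, sample):
        # relative frequency, which is what a ProbDist is built from
        total = sum(self.counts.values())
        return self.counts.get(sample, 0) / total if total else 0.0

    def samples(self):
        # sorted sample index, usable for paging
        return sorted(self.counts)

fd = MiniFreqDist()
for word in ['the', 'cat', 'the', 'dog', 'the']:
    fd.inc(word)
```

Because the counts are plain key/value pairs, the persistence and network access come for free once the dict is swapped for a real Redis connection.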
Here are some more concrete use cases for persistent frequency distributions:
I put the code I've been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains RedisFreqDist, which works just like the NLTK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here's some example code for using it. For more info, check out the wiki, or read the code.
def make_freq_dist(samples, host='localhost', port=6379, db=0):
    freqs = RedisFreqDist(host=host, port=port, db=db)

    for sample in samples:
        freqs.inc(sample)
Unfortunately, I had to muck about with some of FreqDist's internal implementation to remain compatible, so I can't promise the code will work beyond NLTK version 0.9.9. probablity.py also includes ConditionalRedisFreqDist for creating ConditionalProbDists.
For creating lists of samples, that very much depends on your use case, but here's some example code for doing so. r is a redis object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.
def index_samples(r, key, samples):
    r.delete(key)

    for sample in samples:
        r.push(key, sample, tail=True)

def get_samples(r, key, start, end):
    return r.lrange(key, start, end)
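One detail worth noting when paging with get_samples: Redis LRANGE treats both the start and end indexes as inclusive, unlike Python slices. A quick stand-alone illustration, with a plain list standing in for the Redis list:

```python
def lrange(lst, start, end):
    # mimic Redis LRANGE semantics: the end index is inclusive
    return lst[start:end + 1]

samples = ['apple', 'banana', 'cherry', 'date', 'elderberry']

# page through the index two samples at a time
page1 = lrange(samples, 0, 1)
page2 = lrange(samples, 2, 3)
```

So to fetch pages of size n, request the ranges (0, n-1), (n, 2n-1), and so on.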
Yes, Redis is still fairly alpha, so I wouldn't use it for critical systems. But I've had very few issues so far, especially compared to dealing with BerkeleyDB, and I highly recommend it for your non-critical computational needs. Update: Redis has been quite stable for a while now, and many sites are using it successfully in production.