Tag Archives: python

Scalable Database Links

Redis:
Cassandra:
Performance Tradeoffs:
Other:

Django IA: Registration-Activation

django-registration is a pluggable Django app that implements a common registration-activation flow. This flow is quite similar to the password reset flow, but slightly simpler with only 3 views:

  1. register
  2. registration_complete
  3. activate

The basic idea is that an anonymous user can create a new account, but cannot log in until they activate their account by clicking a link they’ll receive in an activation email. It’s a way to automatically verify that the new user has a valid email address, which is generally an acceptable proxy for proving that they’re human. Here’s an Information Architecture diagram, again using Jesse James Garrett’s Visual Vocabulary.

Django Registration IA

Here’s a more in-depth walk-thru with our fictional user named Bob:

  1. Bob encounters a section of the site that requires an account, and is redirected to the login page.
  2. But Bob does not have an account, so he goes to the registration page where he fills out a registration form.
  3. After submitting the registration form, Bob is taken to a page telling him that he needs to activate his account by clicking a link in an email that he should be receiving shortly.
  4. Bob checks his email, finds the activation email, and clicks the activation link.
  5. Bob is taken to a page that tells him his account is active, and he can now log in.

As with password reset, I think the last step is unnecessary, and Bob should be automatically logged in when his account is activated. But to do that, you’ll have to write your own custom activate view. Luckily, this isn’t very hard. If you take a look at the code for registration.views.activate, the core code is actually quite simple:

from registration.models import RegistrationProfile

def activate(request, activation_key):
    user = RegistrationProfile.objects.activate_user(activation_key.lower())

    if not user:
        # handle invalid activation key
        pass
    else:
        # do stuff with the user, such as automatically log them in, then redirect
        pass

The rest of the custom activate view is up to you.
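To make the activation-key mechanics from the walk-thru concrete, here’s a minimal, framework-free sketch. It is not django-registration’s actual implementation: RegistrationStore, its in-memory dicts, and the way the key is derived are all assumptions made for illustration. It only shows the general shape of the flow: accounts start inactive, a hard-to-guess key goes out in the email, and the key works exactly once.

```python
import hashlib
import secrets


class RegistrationStore:
    """Hypothetical in-memory stand-in for the user and activation tables."""

    def __init__(self):
        self.users = {}  # username -> {"email": ..., "is_active": bool}
        self.keys = {}   # activation key -> username

    def register(self, username, email):
        # New accounts start inactive, so the user cannot log in yet.
        self.users[username] = {"email": email, "is_active": False}
        # A random salt makes the key unguessable from the username alone.
        salt = secrets.token_hex(8)
        key = hashlib.sha1((salt + username).encode("utf-8")).hexdigest()
        self.keys[key] = username
        return key  # this is what would be embedded in the emailed link

    def activate(self, key):
        # Keys are single-use: pop() removes the key on first success.
        username = self.keys.pop(key.lower(), None)
        if username is None:
            return None
        self.users[username]["is_active"] = True
        return username
```

A custom activate view would do the equivalent of calling activate() with the key from the URL, then either show an error or log the returned user in and redirect.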

Django IA: Auth Password Reset

Django comes with a lot of great built-in functionality. One of the most useful contrib apps is authentication, which (among other things) provides views for login, logout, and password reset. Login & logout are self-explanatory, but resetting a password is, by nature, somewhat complicated. Because it’s a really bad idea to store passwords as plaintext, you can’t just send a user their password when they forget it. Instead, you have to provide a secure mechanism for users to change their password themselves, even if they can’t remember their original password.

Lucky for us, Django auth provides this functionality out of the box. All you need to do is create the templates and hook up the views. The code you need to write to make this happen is pretty simple, but it can be a bit tricky to understand how it all works together. There are actually 4 separate view functions that together provide a complete password reset mechanism. These view functions are:

  1. password_reset
  2. password_reset_done
  3. password_reset_confirm
  4. password_reset_complete

Here’s an Information Architecture diagram showing how these views fit together, using Jesse James Garrett’s Visual Vocabulary. The 2 black dots are starting points, and the circled black dot is an end point.

Django Auth Password Reset IA

Here’s a more in-depth walk-thru of what’s going on, with a fictional user named Bob:

  1. Bob tries to log in and fails, probably a couple times. Bob clicks a “Forgot your password?” link, which takes him to the password_reset view.
  2. Bob enters his email address, which is then used to find his User account.
  3. If Bob’s User account is found, a password reset email is sent, and Bob is redirected to the password_reset_done view, which should tell him to check his email.
  4. Bob leaves the site to check his email. He finds the password reset email, and clicks the password reset link.
  5. Bob is taken to the password_reset_confirm view, which first validates that he can reset his password (this is handled with a hashed link token). If the token is valid, Bob is allowed to enter a new password. Once a new password is submitted, Bob is redirected to the password_reset_complete view.
  6. Bob can now log in to your site with his new password.

This final step is the one minor issue I have with Django’s auth password reset. The user just changed their password, so why do they have to enter it again to log in? Why can’t we eliminate step 6 altogether, and automatically log the user in after they reset their password? In fact, you can eliminate step 6 with a bit of hacking on your own authentication backend, but that’s a topic for another post.
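The hashed link token from step 5 is the part that is easiest to misunderstand, so here’s a simplified sketch of the general idea. This is not Django’s code: SECRET_KEY, make_reset_token, check_reset_token, and the token layout are all assumptions for illustration. The point is that the token is a keyed hash over the user, their current password hash, and a timestamp, so nothing secret has to be stored and the link stops working once the password changes or the link gets too old.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical site-wide secret


def make_reset_token(user_id, password_hash, timestamp):
    # Binding the token to the current password hash means it stops
    # validating as soon as the password is changed.
    message = "{}:{}:{}".format(user_id, password_hash, timestamp).encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return "{}-{}".format(timestamp, digest)


def check_reset_token(token, user_id, password_hash, now, max_age=3 * 24 * 3600):
    # The timestamp travels in the token so the server can recompute the hash.
    try:
        ts_text, _ = token.split("-", 1)
        timestamp = int(ts_text)
    except ValueError:
        return False
    if now - timestamp > max_age:
        return False  # link has expired
    expected = make_reset_token(user_id, password_hash, timestamp)
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8"))
```

Because the password hash is part of the signed message, a reset link is effectively single-use: after Bob submits a new password, the same link no longer validates.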

Cloud Computing Links

Amazon Web Services:
Python Libraries:
GlusterFS:

Django Forms, Utilities, OAuth, and OpenID Links

Form Customization
Utility Apps
OAuth and OpenID

Building a NLTK FreqDist on Redis

Say you want to build a frequency distribution of many thousands of samples with the following characteristics:

  • fast to build
  • persistent data
  • network accessible (with no locking requirements)
  • can store large sliceable index lists

The only solution I know that meets those requirements is Redis. NLTK’s FreqDist is not persistent, shelve is far too slow, BerkeleyDB is not network accessible (and is generally a PITA to manage), and AFAIK there’s no other key-value store that makes sliceable lists really easy to create & access. So far I’ve been quite pleased with Redis, especially given how new it is. It’s quite fast, is network accessible, has atomic operations that make locking unnecessary, supports sortable and sliceable list structures, and is very easy to configure.

Why build a NLTK FreqDist on Redis

Building a NLTK FreqDist on top of Redis allows you to create a ProbDist, which in turn can be used for classification. Having it be persistent lets you examine the data later. And the ability to create sliceable lists allows you to make sorted indexes for paging thru your samples.
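As a rough illustration of how a frequency distribution turns into a probability distribution, here’s a sketch of the simplest estimate: relative frequency, which is the idea behind a maximum likelihood ProbDist. The mle_probabilities helper is a hypothetical name, and collections.Counter stands in for a FreqDist so the example runs without NLTK or Redis.

```python
from collections import Counter


def mle_probabilities(counts):
    """Turn raw sample counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    # Each sample's probability is its share of all observations.
    return {sample: count / total for sample, count in counts.items()}


counts = Counter(["the", "cat", "the", "hat"])
probs = mle_probabilities(counts)
```

The counts are the only input, which is why keeping them persistent is enough to rebuild or re-examine the probabilities later.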

Here’s some more concrete use cases for persistent frequency distributions:

RedisFreqDist

I put the code I’ve been using to build frequency distributions over large sets of words up at BitBucket. probablity.py contains RedisFreqDist, which works just like the NLTK FreqDist, except it stores samples and frequencies as keys and values in Redis. That means samples must be strings. Internally, RedisFreqDist also stores a set of all the samples under the key __samples__ for efficient lookup and sorting. Here’s some example code for using it. For more info, check out the wiki, or read the code.

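def make_freq_dist(samples, host='localhost', port=6379, db=0):
    # RedisFreqDist comes from probablity.py
    freqs = RedisFreqDist(host=host, port=port, db=db)

    for sample in samples:
        freqs.inc(sample)

    # return the distribution so the caller can query it
    return freqs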

Unfortunately, I had to muck about with some of FreqDist’s internal implementation to remain compatible, so I can’t promise the code will work beyond NLTK version 0.9.9. probablity.py also includes ConditionalRedisFreqDist for creating ConditionalProbDists.
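If you just want to see the storage layout described above without running Redis, here’s a minimal sketch. It is not the RedisFreqDist code: SketchFreqDist and DictRedis are hypothetical names, and DictRedis is an in-memory stand-in that only mimics the handful of Redis-style calls used here (incr, get, sadd, smembers). Each sample is a key whose value is its count, and the sample names are also kept in a set under __samples__.

```python
class DictRedis:
    """Hypothetical in-memory stand-in exposing a few Redis-style calls."""

    def __init__(self):
        self._values = {}
        self._sets = {}

    def incr(self, key):
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def get(self, key):
        return self._values.get(key)

    def sadd(self, key, member):
        self._sets.setdefault(key, set()).add(member)

    def smembers(self, key):
        return set(self._sets.get(key, set()))


class SketchFreqDist:
    """One counter key per sample, plus a set that remembers sample names."""

    SAMPLES_KEY = "__samples__"

    def __init__(self, client):
        self._r = client

    def inc(self, sample):
        # Track the sample name, then bump its counter.
        self._r.sadd(self.SAMPLES_KEY, sample)
        return self._r.incr(sample)

    def __getitem__(self, sample):
        value = self._r.get(sample)
        return int(value) if value is not None else 0

    def samples(self):
        return self._r.smembers(self.SAMPLES_KEY)

    def N(self):
        # Total number of observations across all samples.
        return sum(self[s] for s in self.samples())
```

Keeping the sample names in their own set is what makes it possible to enumerate and sort samples without scanning every key in the database.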

Lists

How you build lists of samples very much depends on your use case, but here’s some example code for doing so. r is a redis object, key is the index key for storing the list, and samples is assumed to be a sorted list. The get_samples function demonstrates how to get a slice of samples from the list.

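def index_samples(r, key, samples):
    # start from an empty list so re-indexing doesn't append duplicates
    r.delete(key)

    for sample in samples:
        # push onto the tail to preserve the sorted order of samples
        r.push(key, sample, tail=True)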

def get_samples(r, key, start, end):
	return r.lrange(key, start, end)
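One detail worth keeping in mind when paging: Redis list ranges include the end index, unlike Python slices. Here’s a small sketch of turning a page number into a start/end pair for get_samples. The page_bounds and inclusive_slice helpers are hypothetical, and inclusive_slice only imitates the inclusive range on a plain Python list so the example runs without a server.

```python
def page_bounds(page, per_page):
    """Return (start, end) for a 1-based page, with an inclusive end index."""
    start = (page - 1) * per_page
    end = start + per_page - 1
    return start, end


def inclusive_slice(items, start, end):
    # Imitates an inclusive range on a plain list; end = -1 means "to the last item".
    stop = None if end == -1 else end + 1
    return items[start:stop]


words = ["a", "b", "c", "d", "e"]
first_page = inclusive_slice(words, *page_bounds(1, 2))
last_page = inclusive_slice(words, *page_bounds(3, 2))
```

With a real list, the same bounds would be passed straight through as get_samples(r, key, start, end).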

Yes, Redis is still fairly alpha, so I wouldn’t use it for critical systems. But I’ve had very few issues so far, especially compared to dealing with BerkeleyDB. I highly recommend it for your non-critical computational needs 🙂

Update: Redis has been quite stable for a while now, and many sites are using it successfully in production.

Chunk Extraction with NLTK

Chunk extraction is a useful preliminary step to information extraction, which creates parse trees from unstructured text with a chunker. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a 3 step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I’ve already written about how to train a NLTK part of speech tagger and a chunker, so I’ll assume you’ve already done the training, and now you want to use your pos tagger and iob chunker to do something useful.

IOB Tag Chunker

The previously trained chunker is actually a chunk tagger. It’s a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I’ve created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
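To show what converting IOB chunk tags into chunks actually involves, here’s a pure-Python sketch that groups tagged words by their IOB tags. It is only an illustration of the idea, not what conllstr2tree does internally, and iob_to_chunks is a hypothetical helper. B- starts a chunk, I- continues the current chunk of the same type, and O marks a word outside any chunk.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [(word, pos), ...]) pairs."""
    chunks = []
    current_type, current_words = None, []

    for word, pos, iob in tagged:
        prefix, _, chunk_type = iob.partition("-")
        # An I- tag whose type differs from the open chunk is treated as a new chunk.
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)
        if prefix == "O" or starts_new:
            # Close whatever chunk was open before this word.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
        if prefix in ("B", "I"):
            current_type = chunk_type
            current_words.append((word, pos))

    if current_words:
        chunks.append((current_type, current_words))
    return chunks


sentence = [
    ("the", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
    ("sat", "VBD", "B-VP"),
    ("on", "IN", "O"),
    ("mats", "NNS", "B-NP"),
]
chunks = iob_to_chunks(sentence)
```

A real parse tree also keeps the O words as top-level leaves. This sketch drops them to keep the grouping logic easy to follow.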

Chunk Extraction

Now that we have a proper NLTK chunker, we can use it to extract chunks. Here’s a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since we’re training the chunker on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn’t similar to your training data, then you probably won’t be getting many chunks.