Chunk Extraction with NLTK
Chunk extraction is a useful preliminary step to information extraction: it creates parse trees from unstructured text with a chunker. Once you have a parse tree of a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.
Chunking is basically a three-step process:
- Tag a sentence
- Chunk the tagged sentence
- Analyze the parse tree to extract information
I've already written about how to train an NLTK part-of-speech tagger and a chunker, so I'll assume you've already done the training and now want to use your POS tagger and IOB chunker to do something useful.
IOB Tag Chunker
The previously trained chunker is actually a chunk tagger. It's a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I've created a wrapper class that complies with the nltk ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.
import nltk.chunk

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags (the built-in zip works on Python 2 and 3)
        wtc = zip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
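To make the IOB-to-chunk conversion concrete, here is a minimal sketch in plain Python of how IOB tags group words into chunks. It is only an illustration of the idea, not NLTK's implementation; the helper name iob_to_chunks and the hand-tagged sample sentence are made up for this example.

```python
def iob_to_chunks(rows):
    '''Group (word, pos_tag, iob_tag) rows into (chunk_type, tokens) pairs.

    tokens is a list of (word, pos_tag) tuples. Words tagged O are skipped.
    '''
    chunks = []
    current = None  # the (chunk_type, tokens) pair currently being built
    for word, pos, iob in rows:
        if iob == 'O':
            # outside any chunk: close whatever chunk was open
            current = None
            continue
        prefix, _, chunk_type = iob.partition('-')
        # B- always opens a new chunk; I- opens one only when it does not
        # continue a chunk of the same type
        if prefix == 'B' or current is None or current[0] != chunk_type:
            current = (chunk_type, [])
            chunks.append(current)
        current[1].append((word, pos))
    return chunks


# hypothetical, hand-tagged sample sentence
rows = [
    ('the', 'DT', 'B-NP'),
    ('little', 'JJ', 'I-NP'),
    ('dog', 'NN', 'I-NP'),
    ('barked', 'VBD', 'B-VP'),
    ('at', 'IN', 'B-PP'),
    ('the', 'DT', 'B-NP'),
    ('cat', 'NN', 'I-NP'),
    ('.', '.', 'O'),
]

# keep only the noun phrases, each as a list of (word, pos_tag) tuples
noun_phrases = [tokens for chunk_type, tokens in iob_to_chunks(rows)
                if chunk_type == 'NP']
```

Each B- tag starts a chunk, each I- tag continues it, and O ends it, which is the same grouping the CoNLL-formatted lines describe before they are turned into a tree.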
Chunk Extraction
Now that we have a proper NLTK chunker, we can use it to extract chunks. Here's a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.
# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
# (t.label() is the NLTK 3 API; older releases used t.node)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print(subtree.leaves())
Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since the chunker was trained on IOB tags, the phrase tags come from that training data, and NP stands for noun phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn't similar to your training data, then you probably won't get many chunks.
Test Driven Development in Python
One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here's a quick example of a fibonacci function.
def fib(n):
    '''Return the nth fibonacci number.

    >>> fib(0)
    0
    >>> fib(1)
    1
    >>> fib(2)
    1
    >>> fib(3)
    2
    >>> fib(4)
    3
    '''
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)
If you want to run your doctests, just add the following three lines to the bottom of your module.
if __name__ == '__main__':
    import doctest
    doctest.testmod()
Now you can run your module to run the doctests, like python fib.py.
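If you want a pass/fail result you can act on, for example to make a build fail, the doctest machinery also reports counts. The sketch below runs the doctests of a single function through doctest's parser and runner and exits non-zero when any example fails. The helper name run_doctests is hypothetical, not part of the doctest module.

```python
import doctest
import sys


def fib(n):
    '''Return the nth fibonacci number.

    >>> fib(0)
    0
    >>> fib(1)
    1
    >>> fib(2)
    1
    >>> fib(3)
    2
    >>> fib(4)
    3
    '''
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)


def run_doctests():
    """Run fib's doctests and return a (failed, attempted) pair."""
    parser = doctest.DocTestParser()
    # build a DocTest from the docstring, with fib visible to the examples
    test = parser.get_doctest(fib.__doc__, {'fib': fib}, 'fib', None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


if __name__ == '__main__':
    failed, attempted = run_doctests()
    if failed:
        # a non-zero exit status lets make or a CI server notice the failure
        sys.exit(1)
```

doctest.testmod() returns the same kind of failed/attempted result for a whole module, so the three-line version can be extended the same way.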
So how well does this fit in with the TDD philosophy? Here are the basic TDD practices.
1. Think about what you want to test
2. Write a small test
3. Write just enough code to fail the test
4. Run the test and watch it fail
5. Write just enough code to pass the test
6. Run the test and watch it pass (if it fails, go back to step 4)
7. Go back to step 1 and repeat until done
And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.
1. Define a new empty method
def fib(n):
    '''Return the nth fibonacci number.'''
    pass

if __name__ == '__main__':
    import doctest
    doctest.testmod()
2. Write a doctest
def fib(n):
    '''Return the nth fibonacci number.

    >>> fib(0)
    0
    '''
    pass
3. Run the module and watch the doctest fail
python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.
4. Write just enough code to pass the failing doctest
def fib(n):
    '''Return the nth fibonacci number.

    >>> fib(0)
    0
    '''
    return 0
5. Run the module and watch the doctest pass
python fib.py
6. Go back to step 2 and repeat
Now you can start filling in the rest of the function, one test at a time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.
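As an illustration of the repeat step, here is a sketch of what a second iteration might look like: a doctest for fib(1) is added, and the body changes just enough to pass both tests. This intermediate version is an assumption about how the walk-through could continue, not the finished function.

```python
def fib(n):
    '''Return the nth fibonacci number.

    >>> fib(0)
    0
    >>> fib(1)
    1
    '''
    # Just enough to pass the two doctests above; deliberately wrong for
    # n >= 2, which the next doctest would expose.
    return n
```

Adding a fib(2) doctest next would fail against this body, and that failure is what forces the recursive case into the code.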
Unit Tests
Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you'll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.
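For comparison, here is a minimal sketch of the same fibonacci tests written with the standard unittest module. The class name FibTestCase and the hard-coded list in setUp are assumptions for illustration; the list stands in for whatever external data a real test would load.

```python
import unittest


def fib(n):
    '''Return the nth fibonacci number.'''
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


class FibTestCase(unittest.TestCase):
    def setUp(self):
        # stand-in for external data; a real test might read this from a file
        self.known_values = [0, 1, 1, 2, 3, 5, 8, 13]

    def test_known_values(self):
        for n, expected in enumerate(self.known_values):
            self.assertEqual(fib(n), expected)

    def test_recurrence(self):
        # every value from n = 2 on is the sum of the two before it
        for n in range(2, 10):
            self.assertEqual(fib(n), fib(n - 1) + fib(n - 2))
```

The setUp method runs before each test, which is where shared fixtures that don't fit comfortably in a docstring tend to go. Saved in a module, a test case like this can be run with python -m unittest or picked up by a test runner.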
Running Tests
For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg:
[nosetests]
verbosity=3
with-doctest=1
Then in my Makefile, I add a test command so I can run make test.
test:
	@nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2
PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.
And finally, if you're looking for a continuous integration server, try Buildbot.




