
23 Feb 2009

Chunk Extraction with NLTK

Chunk extraction is a useful preliminary step for information extraction, in which a chunker creates parse trees from unstructured text. Once you have a parse tree for a sentence, you can do more specific information extraction, such as named entity recognition and relation extraction.

Chunking is basically a three-step process:

  1. Tag a sentence
  2. Chunk the tagged sentence
  3. Analyze the parse tree to extract information

I've already written about how to train an NLTK part-of-speech tagger and a chunker, so I'll assume you've already done the training, and now you want to use your part-of-speech tagger and IOB chunker to do something useful.

IOB Tag Chunker

The previously trained chunker is actually a chunk tagger: a Tagger that assigns IOB chunk tags to part-of-speech tags. In order to use it for proper chunking, we need some extra code to convert the IOB chunk tags into a parse tree. I've created a wrapper class that conforms to the NLTK ChunkParserI interface and uses the trained chunk tagger to get IOB tags and convert them to a proper parse tree.

import nltk.chunk
import itertools

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # split words and part of speech tags
        (words, tags) = zip(*tokens)
        # get IOB chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # join words with chunk tags
        wtc = itertools.izip(words, chunks)
        # w = word, t = part-of-speech tag, c = chunk tag
        lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]
        # create tree from conll formatted chunk lines
        return nltk.chunk.conllstr2tree('\n'.join(lines))
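
To use this wrapper you need a trained chunk tagger. Here's a minimal training sketch, using the conll2000 corpus and a unigram tagger as stand-ins for whatever you trained in the earlier articles (both are assumptions, not the exact setup from those posts):

import nltk
from nltk.corpus import conll2000

# flatten each chunked sentence into (word, pos, iob) triples, then train
# a unigram tagger to map part-of-speech tags to IOB chunk tags
train_sents = [[(t, c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
               for sent in conll2000.chunked_sents('train.txt')]
chunk_tagger = nltk.UnigramTagger(train_sents)
chunker = TagChunker(chunk_tagger)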

Chunk Extraction

Now that we have a proper NLTK chunker, we can use it to extract chunks. Here's a simple example that tags a sentence, chunks the tagged sentence, then prints out each noun phrase.

# sentence should be a list of words
tagged = tagger.tag(sentence)
tree = chunker.parse(tagged)
# for each noun phrase sub tree in the parse tree
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()

Each sub tree has a phrase tag, and the leaves of a sub tree are the tagged words that make up that chunk. Since the chunker was trained on IOB tags, NP stands for Noun Phrase. As noted before, the results of this natural language processing are heavily dependent on the training data. If your input text isn't similar to your training data, then you probably won't be getting many chunks.
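
If you want each phrase as plain text instead of tagged words, you can drop the part-of-speech tags when printing; a small variation on the loop above:

# print each noun phrase as a plain string, dropping the tags
for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
    print ' '.join(word for (word, tag) in subtree.leaves())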

  • http://www.linkedin.com/in/willdampier willdampier

    I never realized what the chunk-parser did … I was trying to extract phrases using itertools.groupby to group consecutive adjectives and nouns.

  • http://weotta.com Jacob

    If you know the exact patterns you’re looking for, you can also use the RegexpParser (there’s a quick sketch after these comments). It’ll be a lot more accurate than itertools.groupby :)

  • http://www.linkedin.com/in/willdampier willdampier

    I’m trying to use the results of the pos tagger (and now chunker) to pull out words and phrases to use as features in a classification problem as I talk about in http://megamicrobase.wordpress.com/2009/02/26/featuring-the-featureset/ … since I’m dealing with mostly biomedical annotations I think I’ll gain a lot of specificity by switching to a medical corpus for the pos tagging and chunk parser.

  • Patrick

    Nice article.

    Slight typo (missing a closing parenthesis after c]) in line #16
    16. lines = [' '.join([w, t, c] for (w, (t, c)) in wtc if c]
    to
    16. lines = [' '.join([w, t, c]) for (w, (t, c)) in wtc if c]

  • Jacob

    Patrick, thanks for catching the typo. I’ve updated the article with the correct code.

  • Pingback: Learning to do natural language processing with NLTK | JetLlib Journal

  • Pingback: What's the best way to extract phrases from a corpus of text using Python? - Quora

  • Nk

    Your code doesn’t seem to work, there is a syntax error at line 16: “lines = [' '.join([w, t, c] for (w, (t, c)) in wtc if c]”

  • http://streamhacker.com/ Jacob Perkins

    you missed a closing parens for join, should be ' '.join([w, t, c])

  • Pingback: dvdgrs » Graduation project

  • Pingback: Graduation project | david.graus

  • Santosh

    Hi, I am extracting causal sentences from accident reports on water, using NLTK as a tool. I manually created my regexp grammar by taking 20 causal sentence structures [see examples below]. The constructed grammar is of the type grammar = r'''Cause: {??+?+}'''. The grammar has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences) but low precision. I would like to ask: 1) how can I train NLTK to build the regexp grammar automatically for extracting this particular type of sentence, and 2) has anyone ever tried to extract causal sentences?

    The example causal sentences are: a) There was poor sanitation in the village; as a consequence, she had health problems. b) The water was impure in her village; for this reason, she suffered from parasites. c) She had health problems because of poor sanitation in the village. I would want to extract only the above type of sentences from a large text.

  • adam

    I ran the above and had the following error:

    >>> tagged = tagger.tag(sentence)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'tagger' is not defined

  • http://streamhacker.com/ Jacob Perkins

    You need to have a trained tagger first, and I’ve written other articles on how to do that, starting with http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/ (a quick-start sketch follows below).
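
For reference, here's a minimal sketch of the RegexpParser approach mentioned in the comments above. The noun phrase pattern is a hypothetical example (an optional determiner, any number of adjectives, then one or more nouns), not a pattern from the original articles:

import nltk

# chunk a part-of-speech tagged sentence with a hand-written pattern
grammar = r'NP: {<DT>?<JJ>*<NN.*>+}'
cp = nltk.RegexpParser(grammar)
tree = cp.parse(tagged)  # tagged is a pos tagged sentence, as above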
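
And for the NameError above: tagger must be a trained part-of-speech tagger before the chunking example will run. The linked article covers proper training; as a rough stand-in, here's a sketch of a unigram tagger trained on the treebank corpus (an assumption, not the setup from that article):

import nltk
from nltk.corpus import treebank

# a unigram tagger with a default-tag backoff, trained on the treebank corpus
tagger = nltk.UnigramTagger(treebank.tagged_sents(),
                            backoff=nltk.DefaultTagger('NN'))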
