How to Train an NLTK Chunker

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.
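
To make that concrete, here's a quick look at what a chunked corpus sentence actually is (a minimal sketch, assuming the conll2000 corpus data has been downloaded via nltk.download()): each sentence is a shallow tree whose NP, VP, and PP subtrees are the chunks, with unchunked words left at the top level.

import nltk.corpus

# each chunked sentence is a two-level Tree: chunk subtrees over
# (word, tag) leaves, with unchunked words at the top level
sent = nltk.corpus.conll2000.chunked_sents('train.txt')[0]
print sent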

Training a Chunker

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is an IOB chunk tag (Begin, Inside, or Outside a chunk) as used in the conll2000 corpus. Here's a function that takes a list of chunked sentences (from a chunked corpus like conll2000 or treebank) and returns a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    # flatten each chunk tree into (word, pos, iob) triples...
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    # ...then drop the words, keeping only the (pos, iob) pairs for training
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
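
As a quick sanity check (again assuming conll2000 is available), the first training sentence comes out as a sequence of (pos, iob) pairs:

import nltk.corpus

train_sents = nltk.corpus.conll2000.chunked_sents('train.txt')
train_chunks = conll_tag_chunks(train_sents)
# something like [('NN', 'B-NP'), ('IN', 'B-PP'), ('DT', 'B-NP')]
print train_chunks[0][:3]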

Chunker Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    # convert chunk trees into (pos, iob) sequences for training and testing
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    # u = unigram, b = bigram, t = trigram; each combined tagger backs
    # off to the previous one when it has no answer
    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll2000 chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
[Chart: Accuracy for Trained Chunker]

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.
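
To actually use the trained ub_chunker on new text, you need a little glue. Here's a minimal sketch (the TagChunker name is my own, and it assumes your NLTK version provides nltk.chunk.conlltags2tree): strip the words off a part-of-speech tagged sentence, chunk the tag sequence, then zip the words back on and convert the conll-style triples into a tree.

import nltk.chunk

class TagChunker(nltk.chunk.ChunkParserI):
    def __init__(self, chunk_tagger):
        self._chunk_tagger = chunk_tagger

    def parse(self, tokens):
        # tokens is a part-of-speech tagged sentence: [(word, pos), ...]
        (words, tags) = zip(*tokens)
        # tag the pos sequence with iob chunk tags
        chunks = self._chunk_tagger.tag(tags)
        # re-zip the words and convert (word, pos, iob) triples to a Tree
        wtc = [(w, t, c) for (w, (t, c)) in zip(words, chunks)]
        return nltk.chunk.conlltags2tree(wtc)

# usage: tree = TagChunker(ub_chunker).parse(tagged_sentence)

One caveat: a pos tag the chunker has never seen will come back with a None chunk tag, which conlltags2tree won't accept, so in practice you may want something like nltk.tag.DefaultTagger('O') at the bottom of the backoff chain.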

Conclusion

Training a chunker this way is much easier than writing chunk expressions or rules by hand, it can approach 100% accuracy, and the process is reusable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.

  • Col Wilson

    Hi there, thanks for the article, but I can’t seem to get it to work. I have written a class like this, around what you suggest (I think):

    class Chunker:

        def __init__(self):
            def conll_tag_chunks(chunk_sents):
                tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
                return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]

            train_sents = nltk.corpus.conll2000.chunked_sents()
            train_chunks = conll_tag_chunks(train_sents)
            logger.debug('training u_chunker')
            u_chunker = UnigramTagger(train=train_chunks)
            logger.debug('training ub_chunker')
            ub_chunker = BigramTagger(train=train_chunks, backoff=u_chunker)
            #ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
            #ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
            #utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
            logger.debug('finished training')
            self.chunker = ub_chunker

        def chunk(self, tokens):
            return self.chunker.tag(tokens)

    and tried to do this:

    chunker = Chunker()
    s = "Since then, we've changed how we use Python a ton internally."
    tokens = s.split()
    chunked = chunker.chunk(tokens)
    print chunked

    which gives:

    [(u'Since', None), (u'then,', None), (u"we've", None), (u'changed', None), (u'how', None), (u'we', None), (u'use', None), (u'Python', None), (u'a', None), (u'ton', None), (u'internally.', None)]

    In other words, nothing at all gets chunked.

    Have I missed something?

    Col

  • http://streamhacker.com/ Jacob Perkins

    Hi Col,

    It looks like you left out a step: part of speech tagging. The chunker requires tagged tokens, like [('foo', 'JJ'), ('bar', 'NN')] in order to extract chunks. So you’ll have to train a part of speech tagger as well as the chunker, then run the tokens thru the tagger, and use that output as input to the chunker. Check out my articles about part of speech tagging, starting with Part 1. You also may want to look at the NLTK Chunking Guide.

  • Col Wilson

    I tried that without success. My Tagger class (from your earlier article) looks like this:

    import nltk
    from nltk.tag import brill
    import logging
    logger = logging.getLogger("ballyclare.tagger")
    # see: http://streamhacker.wordpress.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/

    class Tagger:

        def __init__(self, sentences=1000, corpus=nltk.corpus.brown):
            logger.debug('training with ' + str(sentences) + ' sentences')
            train_sents = corpus.tagged_sents()[:sentences]

            def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
                if not backoff:
                    backoff = tagger_classes[0](tagged_sents)
                    del tagger_classes[0]

                for cls in tagger_classes:
                    tagger = cls(tagged_sents, backoff=backoff)
                    backoff = tagger

                return backoff

            word_patterns = [
                (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
                (r'.*ould$', 'MD'),
                (r'.*ing$', 'VBG'),
                (r'.*ed$', 'VBD'),
                (r'.*ness$', 'NN'),
                (r'.*ment$', 'NN'),
                (r'.*ful$', 'JJ'),
                (r'.*ious$', 'JJ'),
                (r'.*ble$', 'JJ'),
                (r'.*ic$', 'JJ'),
                (r'.*ive$', 'JJ'),
                (r'.*ic$', 'JJ'),
                (r'.*est$', 'JJ'),
                (r'^a$', 'PREP'),
            ]

            raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
                nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
                backoff=nltk.tag.RegexpTagger(word_patterns))

            templates = [
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
                brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
                brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
                brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
            ]

            trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
            logger.debug('starting training')
            braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)
            logger.debug('finished training')
            self.tagger = braubt_tagger

        def tag(self, sentence):
            return self.tagger.tag(sentence)

    and it gives me something like:

    [('Further', 'AP'), ('snow', None), ('is', 'BEZ'), ('expected', 'VBN'), ('to', 'TO'), ('push', None), ('into', 'IN'), ('many', 'AP'), ('southern', 'JJ-TL'), ('and', 'CC'), ('eastern', 'JJ-TL'), ('parts', 'NNS'), ('of', 'IN'), ('England,', None), ('including', 'IN'), ('London,', None), ('overnight', 'NN'), ('and', 'CC'), ('during', 'IN'), ('the', 'AT'), ('day', 'NN'), ('on', 'IN'), ('Friday.', None)]

    However when I feed this into the chunker I still get nothing:

    [(('Further', 'AP'), None), (('snow', None), None), (('is', 'BEZ'), None), (('expected', 'VBN'), None), (('to', 'TO'), None), (('push', None), None), (('into', 'IN'), None), (('many', 'AP'), None), (('southern', 'JJ-TL'), None), (('and', 'CC'), None), (('eastern', 'JJ-TL'), None), (('parts', 'NNS'), None), (('of', 'IN'), None), (('England,', None), None), (('including', 'IN'), None), (('London,', None), None), (('overnight', 'NN'), None), (('and', 'CC'), None), (('during', 'IN'), None), (('the', 'AT'), None), (('day', 'NN'), None), (('on', 'IN'), None), (('Friday.', None), None)]

    Is it, I wonder, because not all tokens get tags?

    Thanks for your help so far.

  • http://streamhacker.com/ Jacob Perkins

    Ok, I forgot to mention a major detail: notice how the train_chunks are created by taking [(t, c) for (w, t, c) in chunk_tags]? You need to do the same thing with your part of speech tagged tokens. Unzip the words from the part of speech tags, run the tags thru the chunker, giving you part of speech tags + chunk tags, then re-zip the words. Here’s some code to illustrate:

    tagged_toks = self.tagger.tag(sentence)
    (words, tags) = zip(*tagged_toks)
    chunks = self.chunker.tag(tags)
    return [(w, t, c) for (w, (t, c)) in zip(words, chunks)]

    Hope that helps. Perhaps I should write an article about putting it all together.

  • Col Wilson

    Aha! Results. Not very good, because the text is quite different from the training texts, but results nonetheless.

    Thanks.

    Yes, it would be nice to see a working example for the more challenged of us.

  • http://www.smithware.co.uk/ James Smith

    Did you ever get round to writing an article about putting it all together? Really great stuff here.

  • http://streamhacker.com/ Jacob Perkins

    Thanks James. Unfortunately I have not gotten around to that article yet, but thanks for reminding me. Maybe I can make that a New Year's resolution :)

  • http://www.smithware.co.uk/ James Smith

    Haha. We should all be so lucky to be able to stick to our new years resolutions.

    Whilst I have your attention, do you know if it's possible to print a list of the NE tags used in Chunk, or to extend the tags? I'm a little new to this and have been reading through the NLTK book but couldn't find anything this specific.

  • http://streamhacker.com/ Jacob Perkins

    If you’re referring to Named Entity recognition with NLTK, afraid I can’t help you there as I haven’t done it. All I can recommend is digging into the source code and/or experimenting with the API.

  • AdamL

    The NEs are stored as pickles in your install of nltk. Simply iterate through the pickle to get the list of NEs.

    _BINARY_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
    _MULTICLASS_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'
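
    For example (assuming a standard NLTK data install), nltk.data.load can read those pickles directly:

    import nltk.data

    # load the pre-trained multiclass named entity chunker
    ne_chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')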

  • smashthewindow

    Uh, so how the hell do you train chunkers? All you talk about is it’s efficient etc.

  • http://streamhacker.com/ Jacob Perkins

    In the accuracy testing code, multiple taggers are trained for chunking, such as: nltk.tag.UnigramTagger(train_chunks)

  • Martin Thomas

    The link to “train a chunker” is now sadly broken since the book moved from Google Code to GitHub.

  • Spartan

    Hi, I want to understand the structure within these pickle files. I mean, what exactly is stored within them?
    I'm looking to create my own training data for a specific domain. Any help would be appreciated.

  • http://streamhacker.com/ Jacob Perkins

    Pickle files are how Python serializes objects; it's a binary format for Python data. But the specifics don't really matter for creating custom training data. All that matters is that, to load a pickle file, you need the same classes that were used to create it. So if you're saving an NLTK chunker object as a pickle file, then you need the chunker's class code in your Python path to load the pickle file again, which you should have, since you were able to save it.
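
    A minimal sketch of that round trip (the file name here is arbitrary):

    import pickle

    # save the trained chunker to disk...
    with open('chunker.pickle', 'wb') as f:
        pickle.dump(ub_chunker, f)

    # ...then load it later; the tagger classes must be importable
    with open('chunker.pickle', 'rb') as f:
        chunker = pickle.load(f)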