<?xml version="1.0" encoding="UTF-8"?><rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
> <channel><title>Comments on: How to Train a NLTK Chunker</title> <atom:link href="http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/feed/" rel="self" type="application/rss+xml" /><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/#utm_source=feed&amp;utm_medium=feed&amp;utm_campaign=feed</link> <description>Weotta be Hacking</description> <lastBuildDate>Thu, 08 Jul 2010 07:41:51 +0000</lastBuildDate> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.0</generator> <atom:link rel="hub" href="http://pubsubhubbub.appspot.com" /> <atom:link rel="hub" href="http://superfeedr.com/hubbub" /> <item><title>By: Learning to do natural language processing with NLTK &#124; JetLlib Journal</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-543</link> <dc:creator>Learning to do natural language processing with NLTK &#124; JetLlib Journal</dc:creator> <pubDate>Sun, 04 Apr 2010 22:15:45 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-543</guid> <description>[...] How to Train an NLTK Chunker [...]</description> <content:encoded><![CDATA[<p>[...] How to Train an NLTK Chunker [...]</p> ]]></content:encoded> </item> <item><title>By: Jacob</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-297</link> <dc:creator>Jacob</dc:creator> <pubDate>Thu, 17 Dec 2009 23:09:13 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-297</guid> <description>If you&#039;re referring to Named Entity recognition with NLTK, afraid I can&#039;t help you there as I haven&#039;t done it. All I can recommend is digging into the source code and/or experimenting with the API.</description> <content:encoded><![CDATA[<p>If you&#8217;re referring to Named Entity recognition with NLTK, afraid I can&#8217;t help you there as I haven&#8217;t done it. All I can recommend is digging into the source code and/or experimenting with the API.</p> ]]></content:encoded> </item> <item><title>By: James Smith</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-296</link> <dc:creator>James Smith</dc:creator> <pubDate>Thu, 17 Dec 2009 20:45:53 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-296</guid> <description>Haha. We should all be so lucky to be able to stick to our New Year&#039;s resolutions. Whilst I have your attention, do you know if it&#039;s possible to print a list of NE tags used in Chunk or extend the tags? I&#039;m a little new to this and have been reading through the NLTK book but couldn&#039;t find anything this specific.</description> <content:encoded><![CDATA[<p>Haha. We should all be so lucky to be able to stick to our New Year&#8217;s resolutions.</p><p>Whilst I have your attention, do you know if it&#8217;s possible to print a list of NE tags used in Chunk or extend the tags? I&#8217;m a little new to this and have been reading through the NLTK book but couldn&#8217;t find anything this specific.</p> ]]></content:encoded> </item> <item><title>By: Jacob</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-295</link> <dc:creator>Jacob</dc:creator> <pubDate>Wed, 16 Dec 2009 17:25:08 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-295</guid> <description>Thanks James. Unfortunately I have not gotten around to that article yet, but thanks for reminding me. Maybe I can make that a New Year&#039;s resolution :)</description> <content:encoded><![CDATA[<p>Thanks James. Unfortunately I have not gotten around to that article yet, but thanks for reminding me. Maybe I can make that a New Year&#8217;s resolution <img
src='http://streamhacker.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p> ]]></content:encoded> </item> <item><title>By: James Smith</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-294</link> <dc:creator>James Smith</dc:creator> <pubDate>Wed, 16 Dec 2009 17:15:33 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-294</guid> <description>Did you ever get round to writing an article about putting it all together? Really great stuff here.</description> <content:encoded><![CDATA[<p>Did you ever get round to writing an article about putting it all together? Really great stuff here.</p> ]]></content:encoded> </item> <item><title>By: Col Wilson</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-16</link> <dc:creator>Col Wilson</dc:creator> <pubDate>Fri, 06 Feb 2009 06:16:53 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-16</guid> <description>Aha! Results. Not very good because the text is quite different from the training texts, but results nonetheless. Thanks. Yes, it would be nice to see a working example for the more challenged of us.</description> <content:encoded><![CDATA[<p>Aha! Results. Not very good because the text is quite different from the training texts, but results nonetheless.</p><p>Thanks.</p><p>Yes, it would be nice to see a working example for the more challenged of us.</p> ]]></content:encoded> </item> <item><title>By: Jacob</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-15</link> <dc:creator>Jacob</dc:creator> <pubDate>Fri, 06 Feb 2009 01:48:42 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-15</guid> <description>OK, I forgot to mention a major detail: notice how the train_chunks are created by taking [(t, c) for (w, t, c) in chunk_tags]? You need to do the same thing with your part of speech tagged tokens. Unzip the words from the part of speech tags, run the tags through the chunker, giving you part of speech tags + chunk tags, then re-zip the words. Here&#039;s some code to illustrate:&lt;code&gt;
tagged_toks = self.tagger.tag(sentence)
(words, tags) = zip(*tagged_toks)
chunks = self.chunker.tag(tags)
return [(w, t, c) for (w, (t, c)) in zip(words, chunks)]
&lt;/code&gt; Hope that helps. Perhaps I should write an article about putting it all together.</description> <content:encoded><![CDATA[<p>OK, I forgot to mention a major detail: notice how the train_chunks are created by taking [(t, c) for (w, t, c) in chunk_tags]? You need to do the same thing with your part of speech tagged tokens. Unzip the words from the part of speech tags, run the tags through the chunker, giving you part of speech tags + chunk tags, then re-zip the words. Here&#8217;s some code to illustrate:</p><p><code><br
/> tagged_toks = self.tagger.tag(sentence)<br
/> (words, tags) = zip(*tagged_toks)<br
/> chunks = self.chunker.tag(tags)<br
/> return [(w, t, c) for (w, (t, c)) in zip(words, chunks)]<br
/> </code></p><p>Hope that helps. Perhaps I should write an article about putting it all together.</p> ]]></content:encoded> </item> <item><title>By: Col Wilson</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-14</link> <dc:creator>Col Wilson</dc:creator> <pubDate>Fri, 06 Feb 2009 01:29:01 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-14</guid> <description>I tried that without success. My Tagger class (from your earlier article) looks like this:
import nltk
from nltk.tag import brill
import logging
logger = logging.getLogger(&quot;ballyclare.tagger&quot;)
# see: http://streamhacker.wordpress.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
class Tagger:
def __init__(self, sentences=1000, corpus=nltk.corpus.brown):
logger.debug(&#039;training with &#039; + str(sentences) + &#039; sentences&#039;)
train_sents = corpus.tagged_sents()[:sentences]
def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
if not backoff:
backoff = tagger_classes[0](tagged_sents)
del tagger_classes[0]
for cls in tagger_classes:
tagger = cls(tagged_sents, backoff=backoff)
backoff = tagger
return backoff
word_patterns = [
(r&#039;^-?[0-9]+(.[0-9]+)?$&#039;, &#039;CD&#039;),
(r&#039;.*ould$&#039;, &#039;MD&#039;),
(r&#039;.*ing$&#039;, &#039;VBG&#039;),
(r&#039;.*ed$&#039;, &#039;VBD&#039;),
(r&#039;.*ness$&#039;, &#039;NN&#039;),
(r&#039;.*ment$&#039;, &#039;NN&#039;),
(r&#039;.*ful$&#039;, &#039;JJ&#039;),
(r&#039;.*ious$&#039;, &#039;JJ&#039;),
(r&#039;.*ble$&#039;, &#039;JJ&#039;),
(r&#039;.*ic$&#039;, &#039;JJ&#039;),
(r&#039;.*ive$&#039;, &#039;JJ&#039;),
(r&#039;.*ic$&#039;, &#039;JJ&#039;),
(r&#039;.*est$&#039;, &#039;JJ&#039;),
(r&#039;^a$&#039;, &#039;PREP&#039;),
]
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
backoff=nltk.tag.RegexpTagger(word_patterns))
templates = [
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]
trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
logger.debug(&#039;starting training&#039;)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)
logger.debug(&#039;finished training&#039;)
self.tagger = braubt_tagger
def tag(self,sentence):
return self.tagger.tag(sentence)
and it gives me something like:
[(&#039;Further&#039;, &#039;AP&#039;), (&#039;snow&#039;, None), (&#039;is&#039;, &#039;BEZ&#039;), (&#039;expected&#039;, &#039;VBN&#039;), (&#039;to&#039;, &#039;TO&#039;), (&#039;push&#039;, None), (&#039;into&#039;, &#039;IN&#039;), (&#039;many&#039;, &#039;AP&#039;), (&#039;southern&#039;, &#039;JJ-TL&#039;), (&#039;and&#039;, &#039;CC&#039;), (&#039;eastern&#039;, &#039;JJ-TL&#039;), (&#039;parts&#039;, &#039;NNS&#039;), (&#039;of&#039;, &#039;IN&#039;), (&#039;England,&#039;, None), (&#039;including&#039;, &#039;IN&#039;), (&#039;London,&#039;, None), (&#039;overnight&#039;, &#039;NN&#039;), (&#039;and&#039;, &#039;CC&#039;), (&#039;during&#039;, &#039;IN&#039;), (&#039;the&#039;, &#039;AT&#039;), (&#039;day&#039;, &#039;NN&#039;), (&#039;on&#039;, &#039;IN&#039;), (&#039;Friday.&#039;, None)]
However when I feed this into the chunker I still get nothing:
[((&#039;Further&#039;, &#039;AP&#039;), None), ((&#039;snow&#039;, None), None), ((&#039;is&#039;, &#039;BEZ&#039;), None), ((&#039;expected&#039;, &#039;VBN&#039;), None), ((&#039;to&#039;, &#039;TO&#039;), None), ((&#039;push&#039;, None), None), ((&#039;into&#039;, &#039;IN&#039;), None), ((&#039;many&#039;, &#039;AP&#039;), None), ((&#039;southern&#039;, &#039;JJ-TL&#039;), None), ((&#039;and&#039;, &#039;CC&#039;), None), ((&#039;eastern&#039;, &#039;JJ-TL&#039;), None), ((&#039;parts&#039;, &#039;NNS&#039;), None), ((&#039;of&#039;, &#039;IN&#039;), None), ((&#039;England,&#039;, None), None), ((&#039;including&#039;, &#039;IN&#039;), None), ((&#039;London,&#039;, None), None), ((&#039;overnight&#039;, &#039;NN&#039;), None), ((&#039;and&#039;, &#039;CC&#039;), None), ((&#039;during&#039;, &#039;IN&#039;), None), ((&#039;the&#039;, &#039;AT&#039;), None), ((&#039;day&#039;, &#039;NN&#039;), None), ((&#039;on&#039;, &#039;IN&#039;), None), ((&#039;Friday.&#039;, None), None)]
Is it I wonder because not all tokens get tags?
Thanks for your help so far.</description> <content:encoded><![CDATA[<p>I tried that without success. My Tagger class (from your earlier article) looks like this:</p><p>import nltk<br
/> from nltk.tag import brill<br
/> import logging<br
/> logger = logging.getLogger("ballyclare.tagger")<br
/> # see: <a
href="http://streamhacker.wordpress.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/" rel="nofollow">http://streamhacker.wordpress.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/</a></p><p>class Tagger:</p><p> def __init__(self, sentences=1000, corpus=nltk.corpus.brown):<br
/> logger.debug('training with ' + str(sentences) + ' sentences')<br
/> train_sents = corpus.tagged_sents()[:sentences]</p><p> def backoff_tagger(tagged_sents, tagger_classes, backoff=None):<br
/> if not backoff:<br
/> backoff = tagger_classes[0](tagged_sents)<br
/> del tagger_classes[0]</p><p> for cls in tagger_classes:<br
/> tagger = cls(tagged_sents, backoff=backoff)<br
/> backoff = tagger</p><p> return backoff</p><p> word_patterns = [<br
/> (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),<br
/> (r'.*ould$', 'MD'),<br
/> (r'.*ing$', 'VBG'),<br
/> (r'.*ed$', 'VBD'),<br
/> (r'.*ness$', 'NN'),<br
/> (r'.*ment$', 'NN'),<br
/> (r'.*ful$', 'JJ'),<br
/> (r'.*ious$', 'JJ'),<br
/> (r'.*ble$', 'JJ'),<br
/> (r'.*ic$', 'JJ'),<br
/> (r'.*ive$', 'JJ'),<br
/> (r'.*ic$', 'JJ'),<br
/> (r'.*est$', 'JJ'),<br
/> (r'^a$', 'PREP'),<br
/> ]</p><p> raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,<br
/> nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],<br
/> backoff=nltk.tag.RegexpTagger(word_patterns))</p><p> templates = [<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),<br
/> brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),<br
/> brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),<br
/> brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))<br
/> ]</p><p> trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)<br
/> logger.debug('starting training')<br
/> braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)<br
/> logger.debug('finished training')<br
/> self.tagger = braubt_tagger</p><p> def tag(self,sentence):<br
/> return self.tagger.tag(sentence)</p><p>and it gives me something like:</p><p> [('Further', 'AP'), ('snow', None), ('is', 'BEZ'), ('expected', 'VBN'), ('to', 'TO'), ('push', None), ('into', 'IN'), ('many', 'AP'), ('southern', 'JJ-TL'), ('and', 'CC'), ('eastern', 'JJ-TL'), ('parts', 'NNS'), ('of', 'IN'), ('England,', None), ('including', 'IN'), ('London,', None), ('overnight', 'NN'), ('and', 'CC'), ('during', 'IN'), ('the', 'AT'), ('day', 'NN'), ('on', 'IN'), ('Friday.', None)]</p><p>However when I feed this into the chunker I still get nothing:</p><p> [(('Further', 'AP'), None), (('snow', None), None), (('is', 'BEZ'), None), (('expected', 'VBN'), None), (('to', 'TO'), None), (('push', None), None), (('into', 'IN'), None), (('many', 'AP'), None), (('southern', 'JJ-TL'), None), (('and', 'CC'), None), (('eastern', 'JJ-TL'), None), (('parts', 'NNS'), None), (('of', 'IN'), None), (('England,', None), None), (('including', 'IN'), None), (('London,', None), None), (('overnight', 'NN'), None), (('and', 'CC'), None), (('during', 'IN'), None), (('the', 'AT'), None), (('day', 'NN'), None), (('on', 'IN'), None), (('Friday.', None), None)]</p><p>Is it I wonder because not all tokens get tags?</p><p>Thanks for your help so far.</p> ]]></content:encoded> </item> <item><title>By: Jacob</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-13</link> <dc:creator>Jacob</dc:creator> <pubDate>Fri, 06 Feb 2009 00:38:31 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-13</guid> <description>Hi Col, It looks like you left out a step: part of speech tagging. The chunker requires tagged tokens, like &lt;code&gt;[(&#039;foo&#039;, &#039;JJ&#039;), (&#039;bar&#039;, &#039;NN&#039;)]&lt;/code&gt; in order to extract chunks. So you&#039;ll have to train a part of speech tagger as well as the chunker, then run the tokens through the tagger, and use that output as input to the chunker. Check out my articles about part of speech tagging, starting with &lt;a href=&quot;http://streamhacker.wordpress.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/&quot; rel=&quot;nofollow&quot;&gt;Part 1&lt;/a&gt;. You also may want to look at the &lt;a href=&quot;http://nltk.googlecode.com/svn/trunk/doc/en/ch07.html#chunking&quot; rel=&quot;nofollow&quot;&gt;NLTK Chunking Guide&lt;/a&gt;.</description> <content:encoded><![CDATA[<p>Hi Col,</p><p>It looks like you left out a step: part of speech tagging. The chunker requires tagged tokens, like <code>[('foo', 'JJ'), ('bar', 'NN')]</code> in order to extract chunks. So you&#8217;ll have to train a part of speech tagger as well as the chunker, then run the tokens through the tagger, and use that output as input to the chunker. Check out my articles about part of speech tagging, starting with <a
href="http://streamhacker.wordpress.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/" rel="nofollow">Part 1</a>. You also may want to look at the <a
href="http://nltk.googlecode.com/svn/trunk/doc/en/ch07.html#chunking" rel="nofollow">NLTK Chunking Guide</a>.</p> ]]></content:encoded> </item> <item><title>By: Col Wilson</title><link>http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/comment-page-1/#comment-12</link> <dc:creator>Col Wilson</dc:creator> <pubDate>Fri, 06 Feb 2009 00:07:40 +0000</pubDate> <guid
isPermaLink="false">http://streamhacker.wordpress.com/?p=81#comment-12</guid> <description>Hi there, thanks for the article, but I can&#039;t seem to get it to work. I have written a class like this, around what you suggest (I think):
class Chunker:
def __init__(self):
def conll_tag_chunks(chunk_sents):
tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
train_sents = nltk.corpus.conll2000.chunked_sents()
train_chunks = conll_tag_chunks(train_sents)
logger.debug(&#039;training u_chunker&#039;)
u_chunker = UnigramTagger(train=train_chunks)
logger.debug(&#039;training ub_chunker&#039;)
ub_chunker = BigramTagger(train=train_chunks, backoff=u_chunker)
#ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
#ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
#utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
logger.debug(&#039;finished training&#039;)
self.chunker = ub_chunker
def chunk(self, tokens):
return self.chunker.tag(tokens)
and tried to do this:
chunker = Chunker()
s = &quot;Since then, we&#039;ve changed how we use Python a ton internally.&quot;
tokens = s.split()
chunked = chunker.chunk(tokens)
print chunked
which gives:
[(u&#039;Since&#039;, None), (u&#039;then,&#039;, None), (u&quot;we&#039;ve&quot;, None), (u&#039;changed&#039;, None), (u&#039;how&#039;, None), (u&#039;we&#039;, None), (u&#039;use&#039;, None), (u&#039;Python&#039;, None), (u&#039;a&#039;, None), (u&#039;ton&#039;, None), (u&#039;internally.&#039;, None)]
In other words, nothing at all gets chunked. Have I missed something? Col</description> <content:encoded><![CDATA[<p>Hi there, thanks for the article, but I can&#8217;t seem to get it to work. I have written a class like this, around what you suggest (I think):</p><p>class Chunker:</p><p> def __init__(self):<br
/> def conll_tag_chunks(chunk_sents):<br
/> tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]<br
/> return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]<br
/> train_sents = nltk.corpus.conll2000.chunked_sents()<br
/> train_chunks = conll_tag_chunks(train_sents)<br
/> logger.debug('training u_chunker')<br
/> u_chunker = UnigramTagger(train=train_chunks)<br
/> logger.debug('training ub_chunker')<br
/> ub_chunker = BigramTagger(train=train_chunks, backoff=u_chunker)<br
/> #ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)<br
/> #ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)<br
/> #utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)<br
/> logger.debug('finished training')<br
/> self.chunker = ub_chunker</p><p> def chunk(self, tokens):<br
/> return self.chunker.tag(tokens)</p><p>and tried to do this:</p><p>chunker = Chunker()<br
/> s = "Since then, we've changed how we use Python a ton internally."<br
/> tokens = s.split()<br
/> chunked = chunker.chunk(tokens)<br
/> print chunked</p><p>which gives:</p><p>[(u'Since', None), (u'then,', None), (u"we've", None), (u'changed', None), (u'how', None), (u'we', None), (u'use', None), (u'Python', None), (u'a', None), (u'ton', None), (u'internally.', None)]</p><p>In other words, nothing at all gets chunked.</p><p>Have I missed something?</p><p>Col</p> ]]></content:encoded> </item> </channel> </rss>