<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Part of Speech Tagging with NLTK Part 4 &#8211; Brill Tagger vs Classifier Taggers</title>
	<atom:link href="http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/feed/" rel="self" type="application/rss+xml" />
	<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/#utm_source=feed&#038;utm_medium=feed&#038;utm_campaign=feed</link>
	<description>Weotta be Hacking</description>
	<lastBuildDate>Sun, 05 Feb 2012 22:47:34 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com" />
	<atom:link rel="hub" href="http://superfeedr.com/hubbub" />
		<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-903</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Thu, 22 Sep 2011 15:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-903</guid>
		<description>Hi Max,

Joining phrases with &quot;_&quot; then splitting out later can definitely work, because it would let you use a manual dictionary UnigramTagger to define the tags, ensuring they don&#039;t get missed. After chunking, it should be easy to split up any word with &quot;_&quot; in it.

An alternative solution is to transform your list of phrases into a corpus of tagged &amp; chunked phrases, then train a tagger and chunker on it. Basically, each line could look like &quot;[General/NN Electric/NN ...]&quot;. The brackets are used by the BracketParseCorpusReader (used by treebank) to define noun phrases. This method can definitely work, but it may take a lot more effort on your part to tag &amp; chunk every phrase.</description>
		<content:encoded><![CDATA[<p>Hi Max,</p>
<p>Joining phrases with &#8220;_&#8221; then splitting out later can definitely work, because it would let you use a manual dictionary UnigramTagger to define the tags, ensuring they don&#8217;t get missed. After chunking, it should be easy to split up any word with &#8220;_&#8221; in it.</p>
<p>An alternative solution is to transform your list of phrases into a corpus of tagged &amp; chunked phrases, then train a tagger and chunker on it. Basically, each line could look like &#8220;[General/NN Electric/NN ...]&#8221;. The brackets are used by the BracketParseCorpusReader (used by treebank) to define noun phrases. This method can definitely work, but it may take a lot more effort on your part to tag &amp; chunk every phrase.</p>
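<p>A minimal sketch of the join-then-split idea from the first paragraph, using only the standard library. The phrase list, the tags, and the function names (<code>join_phrases</code>, <code>tag_model</code>, <code>split_joined</code>) are assumptions made for illustration. The dictionary returned by <code>tag_model</code> is the kind of word-to-tag mapping that could be passed as the <code>model</code> argument of NLTK's <code>UnigramTagger</code>, though that step is not shown here.</p>

```python
import re

# Hypothetical phrase list: each multi-word phrase maps to (tag, category).
PHRASES = {
    "General Electric": ("NNP", "Companies"),
}


def join_phrases(text, phrases=PHRASES, sep="_"):
    """Replace the spaces inside known phrases with `sep` before tokenizing."""
    # Longest phrases first, so a longer phrase wins over a shorter one it contains.
    for phrase in sorted(phrases, key=len, reverse=True):
        joined = phrase.replace(" ", sep)
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # A function replacement avoids any escape processing of `joined`.
        text = re.sub(pattern, lambda match, j=joined: j, text)
    return text


def tag_model(phrases=PHRASES, sep="_"):
    """Build a {joined_token: tag} dictionary for a manual dictionary tagger."""
    return {phrase.replace(" ", sep): tag for phrase, (tag, _) in phrases.items()}


def split_joined(tagged_tokens, sep="_"):
    """After chunking, expand joined tokens back into their original words."""
    result = []
    for word, tag in tagged_tokens:
        # Keep a bare separator token as-is instead of splitting it away.
        parts = [part for part in word.split(sep) if part] or [word]
        result.extend((part, tag) for part in parts)
    return result
```

<p>Each expanded word keeps the tag of the joined token, so the category lookup (here "Companies") can still be done against the original phrase list.</p>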
]]></content:encoded>
	</item>
	<item>
		<title>By: Max</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-902</link>
		<dc:creator>Max</dc:creator>
		<pubDate>Thu, 22 Sep 2011 11:01:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-902</guid>
		<description>Hi Jacob,

Thanks for your fast reply! I was away (on vacation) but now I&#039;m back to work. I&#039;m following your recommendations, and I&#039;ve also learned NLTK Trainer and am using it for training.

As far as I know, key-phrase extraction depends heavily on how well the POS tagger performs. The thing is that a POS tagger cannot work perfectly due to unknown words and words or phrases that have special meaning (names of trademarks, companies, products and so on).
I want to use categorized phrase/word lists in order to parse information more efficiently. For example, I want to ensure that the phrase &quot;General Electric&quot; won&#039;t be broken up or that &quot;iphone&quot; won&#039;t be assigned the &quot;CD&quot; POS tag (as it is now in my application).
So I want to assign correct POS tags to a set of non-standard words. Tagging single words is clear to me. But when it comes to multi-word phrases, I&#039;m not sure about the solution. I want some multi-word phrases to be kept as they are and treated just like single nouns at the chunking stage (I think this may improve chunking results).

I&#039;ve been thinking about how to do this and that&#039;s the most obvious solution I see:
Before the stage of tokenizing sentences into words the application finds all the occurrences from the lists (i.e. categorized lists of phrases/words) and replaces spaces with &quot;_&quot; (for example) so that &quot;General Electric&quot; will become &quot;General_Electric&quot; and won&#039;t be split into separate words (at least by TreebankWordTokenizer.tokenize()). This way using my own POS tagger as the first in the chain will tag &quot;General_Electric&quot; as a single noun and chunking stage may be more successful. Also I&#039;ll have to somehow remember that &quot;General_Electric&quot; refers to the original &quot;General Electric&quot; and belongs to &quot;Companies&quot; list.
OR
Apply some function that will join 2 words &quot;General&quot; &quot;Electric&quot; into a single &quot;General Electric&quot;.

That&#039;s not the solution that I like. I&#039;m sure Python or NLTK provide some suitable functionality. Could you suggest a better way to do such things? I would appreciate any thoughts.</description>
		<content:encoded><![CDATA[<p>Hi Jacob,</p>
<p>Thanks for your fast reply! I was away (on vacation) but now I&#8217;m back to work. I&#8217;m following your recommendations, and I&#8217;ve also learned NLTK Trainer and am using it for training.</p>
<p>As far as I know, key-phrase extraction depends heavily on how well the POS tagger performs. The thing is that a POS tagger cannot work perfectly due to unknown words and words or phrases that have special meaning (names of trademarks, companies, products and so on).<br />
I want to use categorized phrase/word lists in order to parse information more efficiently. For example, I want to ensure that the phrase &#8220;General Electric&#8221; won&#8217;t be broken up or that &#8220;iphone&#8221; won&#8217;t be assigned the &#8220;CD&#8221; POS tag (as it is now in my application).<br />
So I want to assign correct POS tags to a set of non-standard words. Tagging single words is clear to me. But when it comes to multi-word phrases, I&#8217;m not sure about the solution. I want some multi-word phrases to be kept as they are and treated just like single nouns at the chunking stage (I think this may improve chunking results).</p>
<p>I&#8217;ve been thinking about how to do this and that&#8217;s the most obvious solution I see:<br />
Before the stage of tokenizing sentences into words the application finds all the occurrences from the lists (i.e. categorized lists of phrases/words) and replaces spaces with &#8220;_&#8221; (for example) so that &#8220;General Electric&#8221; will become &#8220;General_Electric&#8221; and won&#8217;t be split into separate words (at least by TreebankWordTokenizer.tokenize()). This way using my own POS tagger as the first in the chain will tag &#8220;General_Electric&#8221; as a single noun and chunking stage may be more successful. Also I&#8217;ll have to somehow remember that &#8220;General_Electric&#8221; refers to the original &#8220;General Electric&#8221; and belongs to &#8220;Companies&#8221; list.<br />
OR<br />
Apply some function that will join 2 words &#8220;General&#8221; &#8220;Electric&#8221; into a single &#8220;General Electric&#8221;.</p>
<p>That&#8217;s not the solution that I like. I&#8217;m sure Python or NLTK provide some suitable functionality. Could you suggest a better way to do such things? I would appreciate any thoughts.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-893</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Tue, 30 Aug 2011 18:43:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-893</guid>
		<description>Hi Max,

Sounds like you&#039;re clear on the overall structure, though you may want to think about removing the stemming &amp; stop word filtering steps, as transforming chunks &amp; words can often change the meaning. But that depends on your needs.

To improve chunking, yes, you could define a few select chunk rules where you&#039;re close to 100% sure that if they work, then it&#039;s a good chunk, and if you get nothing, then use a trained chunker. In other words, you define a few high-precision regular expressions, and then rely on the trained chunker to find/recall chunks the manual chunker misses.

The other thing I recommend is that if you&#039;re using a treebank trained chunker, then you should use a treebank trained tagger. Or consider training on conll2000 as well. Check out https://github.com/japerk/nltk-trainer for some scripts I made to make training easier.

However, the best option (but also the most time consuming) is to create your own tagged &amp; chunked corpus, then train a tagger &amp; chunker on that. I recommend a bootstrap approach, where you&#039;d use an existing tagger &amp; chunker to create an initial corpus, then go in and hand-correct before training a custom tagger &amp; chunker. This is the only way I know of to end up with highly accurate results, and also allows you to define custom chunk structures that aren&#039;t found in treebank, because you have control of the training data.</description>
		<content:encoded><![CDATA[<p>Hi Max,</p>
<p>Sounds like you&#8217;re clear on the overall structure, though you may want to think about removing the stemming &amp; stop word filtering steps, as transforming chunks &amp; words can often change the meaning. But that depends on your needs.</p>
<p>To improve chunking, yes, you could define a few select chunk rules where you&#8217;re close to 100% sure that if they work, then it&#8217;s a good chunk, and if you get nothing, then use a trained chunker. In other words, you define a few high-precision regular expressions, and then rely on the trained chunker to find/recall chunks the manual chunker misses.</p>
<p>The other thing I recommend is that if you&#8217;re using a treebank trained chunker, then you should use a treebank trained tagger. Or consider training on conll2000 as well. Check out https://github.com/japerk/nltk-trainer for some scripts I made to make training easier.</p>
<p>However, the best option (but also the most time consuming) is to create your own tagged &amp; chunked corpus, then train a tagger &amp; chunker on that. I recommend a bootstrap approach, where you&#8217;d use an existing tagger &amp; chunker to create an initial corpus, then go in and hand-correct before training a custom tagger &amp; chunker. This is the only way I know of to end up with highly accurate results, and also allows you to define custom chunk structures that aren&#8217;t found in treebank, because you have control of the training data.</p>
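<p>The rule-first, trained-chunker-second idea from the second paragraph can be sketched roughly as follows, using only the standard library. The single rule (runs of two or more <code>NNP</code> tokens) and the stand-in <code>fake_trained_chunker</code> are assumptions for illustration; in practice the fallback would be a real trained chunker and the rules would be whatever patterns you trust.</p>

```python
def rule_chunks(tagged):
    """High-precision rule: runs of two or more consecutive NNP tokens."""
    chunks, run = [], []
    # The sentinel pair at the end flushes a run that reaches the last token.
    for word, tag in list(tagged) + [(None, None)]:
        if tag == "NNP":
            run.append(word)
        else:
            if len(run) >= 2:
                chunks.append(" ".join(run))
            run = []
    return chunks


def fake_trained_chunker(tagged):
    """Hypothetical stand-in for a trained chunker: returns every noun."""
    return [word for word, tag in tagged if tag.startswith("NN")]


def chunk_with_fallback(tagged, trained_chunker):
    """Use the manual rules first; consult the trained chunker only if they find nothing."""
    chunks = rule_chunks(tagged)
    return chunks if chunks else trained_chunker(tagged)
```

<p>Because the fallback runs only when the rules return nothing, the rules need to be precise rather than broad: a wrong rule match hides whatever the trained chunker would have found.</p>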
]]></content:encoded>
	</item>
	<item>
		<title>By: Max</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-892</link>
		<dc:creator>Max</dc:creator>
		<pubDate>Tue, 30 Aug 2011 18:19:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-892</guid>
		<description>Thanks a lot!
I&#039;ve got a question but to be more precise, let me explain a bit:
I&#039;m working on a solution that extracts meaningful pieces of information from text (phrases, names...) that will be used for further text mining stages.
The current algorithm consists of the following stages, as I followed the book:
#Tokenizing Text into Sentences
#Tokenizing Sentences into Words 
#POS Tagging - using brown-simplifed-tags from infochimps
#Chunking - currently using ClassifierChunker trained by Treebank chunked sentences
#Stop-words filtering
#Apply extra filters - remove words that don&#039;t comply with word length requirements depending on POS and other
#Stemming, lemmatization may also be added (I just don&#039;t want to apply it to ALL the words as it &quot;destroys&quot; the meaning of important terms)
Currently, I&#039;m focused on POS-tagging and chunking.
Is there a way to somehow combine regular expressions for chunking with ordinary regular expressions (for example to specify that a noun starts with a capital letter or parse constructions and exact words like &quot;+ is to  +&quot;)?
The only solution that comes to mind is to create a Part-of-Speech Tagged Word Corpus and assign &quot;custom tags&quot; to the words that I need and then pass them to the chunker, but I&#039;m not sure about such an approach.
I know that training a chunker is a better way (compared to defining rules) to extract chunks, but there are a lot of cases where I&#039;m not at all satisfied with the chunks it returns. As I can&#039;t fully rely on it, I think I could define some simple grammar rules manually and, if they don&#039;t return results, a trained chunker could be used.
I&#039;m new to NLTK and I would appreciate any suggestions on the overall process.</description>
		<content:encoded><![CDATA[<p>Thanks a lot!<br />
I&#8217;ve got a question but to be more precise, let me explain a bit:<br />
I&#8217;m working on a solution that extracts meaningful pieces of information from text (phrases, names&#8230;) that will be used for further text mining stages.<br />
The current algorithm consists of the following stages, as I followed the book:<br />
#Tokenizing Text into Sentences<br />
#Tokenizing Sentences into Words<br />
#POS Tagging &#8211; using brown-simplifed-tags from infochimps<br />
#Chunking &#8211; currently using ClassifierChunker trained by Treebank chunked sentences<br />
#Stop-words filtering<br />
#Apply extra filters &#8211; remove words that don&#8217;t comply with word length requirements depending on POS and other<br />
#Stemming, lemmatization may also be added (I just don&#8217;t want to apply it to ALL the words as it &#8220;destroys&#8221; the meaning of important terms)<br />
Currently, I&#8217;m focused on POS-tagging and chunking.<br />
Is there a way to somehow combine regular expressions for chunking with ordinary regular expressions (for example to specify that a noun starts with a capital letter or parse constructions and exact words like &#8220;+ is to  +&#8221;)?<br />
The only solution that comes to mind is to create a Part-of-Speech Tagged Word Corpus and assign &#8220;custom tags&#8221; to the words that I need and then pass them to the chunker, but I&#8217;m not sure about such an approach.<br />
I know that training a chunker is a better way (compared to defining rules) to extract chunks, but there are a lot of cases where I&#8217;m not at all satisfied with the chunks it returns. As I can&#8217;t fully rely on it, I think I could define some simple grammar rules manually and, if they don&#8217;t return results, a trained chunker could be used.<br />
I&#8217;m new to NLTK and I would appreciate any suggestions on the overall process.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-891</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Tue, 30 Aug 2011 04:13:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-891</guid>
		<description>Hi Max, I just updated the infochimps page to list all the tags. The VB+ tags are pretty rare, but most of the rest are fairly common in the brown corpus.</description>
		<content:encoded><![CDATA[<p>Hi Max, I just updated the infochimps page to list all the tags. The VB+ tags are pretty rare, but most of the rest are fairly common in the brown corpus.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Max</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-890</link>
		<dc:creator>Max</dc:creator>
		<pubDate>Mon, 29 Aug 2011 12:05:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-890</guid>
		<description>Hi, Jacob! Thanks very much for your book and blog!
I used ClassifierBasedPOSTagger for POS-tagging but changed it for this
http://www.infochimps.com/datasets/brown-simplifed-tags-part-of-speech-tagger-for-python-nltk after I saw it on your blog. Where can I get the list of all possible tags (simplified tags) that this tagger can assign to words?
</description>
		<content:encoded><![CDATA[<p>Hi, Jacob! Thanks very much for your book and blog!<br />
I used ClassifierBasedPOSTagger for POS-tagging but changed it for this<br />
<a href="http://www.infochimps.com/datasets/brown-simplifed-tags-part-of-speech-tagger-for-python-nltk" rel="nofollow">http://www.infochimps.com/datasets/brown-simplifed-tags-part-of-speech-tagger-for-python-nltk</a> after I saw it on your blog. Where can I get the list of all possible tags (simplified tags) that this tagger can assign to words?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-779</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Fri, 07 Jan 2011 15:42:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-779</guid>
		<description>The ClassifierBasedPOSTagger is the most accurate method I know of.</description>
		<content:encoded><![CDATA[<p>The ClassifierBasedPOSTagger is the most accurate method I know of.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mvdeshpande28</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-778</link>
		<dc:creator>Mvdeshpande28</dc:creator>
		<pubDate>Fri, 07 Jan 2011 08:38:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-778</guid>
		<description>CAN ANYONE PLEASE SUGGEST ME THE BEST TAGGING METHOD.. I HAVE TO MAKE A PROJECT.. ANYONE PLZZZZZZZZZZZZZ HELP</description>
		<content:encoded><![CDATA[<p>CAN ANYONE PLEASE SUGGEST ME THE BEST TAGGING METHOD.. I HAVE TO MAKE A PROJECT.. ANYONE PLZZZZZZZZZZZZZ HELP</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-630</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Fri, 13 Aug 2010 19:31:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-630</guid>
		<description>Yes, I found that cutoff_prob parameter later and did some experiments, coming to the same conclusion as you: a backoff tagger with a classifier based tagger generally doesn&#039;t help.</description>
		<content:encoded><![CDATA[<p>Yes, I found that cutoff_prob parameter later and did some experiments, coming to the same conclusion as you: a backoff tagger with a classifier based tagger generally doesn&#8217;t help.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dannii</title>
		<link>http://streamhacker.com/2010/04/12/pos-tag-nltk-brill-classifier/comment-page-1/#comment-628</link>
		<dc:creator>Dannii</dc:creator>
		<pubDate>Fri, 13 Aug 2010 14:11:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1116#comment-628</guid>
		<description>You said: &quot;A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.&quot;

This is probably because you didn&#039;t set the classifier&#039;s cutoff_prob parameter. Without it the tagger will never consult its backoff. I don&#039;t use a backoff tagger with a classifier tagger, as anything else is only going to be less accurate.</description>
		<content:encoded><![CDATA[<p>You said: &#8220;A ClassifierBasedPOSTagger does not need a backoff tagger, since cpos accuracy is exactly the same as for craubt across all corpora.&#8221;</p>
<p>This is probably because you didn&#8217;t set the classifier&#8217;s cutoff_prob parameter. Without it the tagger will never consult its backoff. I don&#8217;t use a backoff tagger with a classifier tagger, as anything else is only going to be less accurate.</p>
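<p>A simplified model of the behaviour described above, written with the standard library only. It is not NLTK's code: <code>choose_tag</code> and <code>tag_word</code> are hypothetical helpers, and the probability table is made up. The point it illustrates is that without a cutoff the classifier's best guess is always returned, so the backoff is never reached.</p>

```python
def choose_tag(tag_probs, cutoff_prob=None):
    """Return the most probable tag, or None if it falls below the cutoff."""
    best = max(tag_probs, key=tag_probs.get)
    if cutoff_prob is None:
        # No cutoff: the classifier always answers, whatever its confidence.
        return best
    return best if tag_probs[best] >= cutoff_prob else None


def tag_word(tag_probs, backoff_tag, cutoff_prob=None):
    """Fall back to `backoff_tag` only when the classifier declines to answer."""
    tag = choose_tag(tag_probs, cutoff_prob)
    return tag if tag is not None else backoff_tag


# Made-up probabilities for one word, for illustration only.
probs = {"NN": 0.40, "VB": 0.35, "JJ": 0.25}
no_cutoff = tag_word(probs, backoff_tag="CD")
with_cutoff = tag_word(probs, backoff_tag="CD", cutoff_prob=0.5)
```

<p>Here <code>no_cutoff</code> keeps the classifier's tag, while <code>with_cutoff</code> takes the backoff's tag because the best probability is below 0.5.</p>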
]]></content:encoded>
	</item>
</channel>
</rss>

