<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Text Classification for Sentiment Analysis &#8211; Stopwords and Collocations</title>
	<atom:link href="http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/feed/" rel="self" type="application/rss+xml" />
	<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/#utm_source=feed&#038;utm_medium=feed&#038;utm_campaign=feed</link>
	<description>Weotta be Hacking</description>
	<lastBuildDate>Thu, 19 Apr 2012 12:53:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com" />
	<atom:link rel="hub" href="http://superfeedr.com/hubbub" />
		<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-944</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Wed, 08 Feb 2012 15:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-944</guid>
		<description>words is not explicitly defined above, but it&#039;s a function parameter that is expected to be a list of strings. featx is also a function parameter, but it&#039;s expected to be a function that accepts words and returns a dict. This way, you can pass different featx functions to evaluate_classifier to see the different results.</description>
		<content:encoded><![CDATA[<p>words is not explicitly defined above, but it&#8217;s a function parameter that is expected to be a list of strings. featx is also a function parameter, but it&#8217;s expected to be a function that accepts words and returns a dict. This way, you can pass different featx functions to evaluate_classifier to see the different results.</p>
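<p>A minimal sketch of the idea, not the code from the post: build_featuresets, lowercase_word_feats and documents are hypothetical names that stand in for evaluate_classifier and the movie_reviews corpus. It only shows that featx is an ordinary parameter that happens to hold a function, and that words gets its value when that function is called.</p>
<pre>
```python
def word_feats(words):
    # Feature extractor: map every word to True (a bag-of-words dict).
    return dict((word, True) for word in words)


def lowercase_word_feats(words):
    # Alternative extractor (hypothetical): lowercase each word first.
    return dict((word.lower(), True) for word in words)


def build_featuresets(featx, documents):
    # featx is whichever function the caller passed in.
    # It is called once per document, and that call binds `words`.
    return [(featx(words), label) for (words, label) in documents]


# Made-up stand-in data: (list of words, label) pairs.
documents = [(['Great', 'movie'], 'pos'), (['Terrible', 'plot'], 'neg')]

plain = build_featuresets(word_feats, documents)
lowered = build_featuresets(lowercase_word_feats, documents)
```
</pre>
<p>Passing word_feats without parentheses hands over the function itself, so swapping in a different extractor is just a different argument.</p>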
]]></content:encoded>
	</item>
	<item>
		<title>By: Fredrik</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-943</link>
		<dc:creator>Fredrik</dc:creator>
		<pubDate>Wed, 08 Feb 2012 14:37:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-943</guid>
		<description>I am quite new to Python, and some parts of the code seem more or less like magic to me... I understand that functions are just ordinary objects/values in Python, and I guess that this is the trick, but can you explain, or suggest a good link that explains, how the following parts of the code work? The name word_feats seems to be bound to the function word_feats, but what is words bound to? I guess it is bound to featx through the function evaluate_classifier, but I really don&#039;t get how featx is assigned a value in
negfeats = [(featx(movie_reviews.words(fileids=[f])), &#039;neg&#039;) for f in negids] (to me it looks like featx is a function here, but I guess it is not?). I guess that I should do some basic reading about Python, but any clarification would be helpful.

</description>
		<content:encoded><![CDATA[<p>I am quite new to Python, and some parts of the code seem more or less like magic to me&#8230; I understand that functions are just ordinary objects/values in Python, and I guess that this is the trick, but can you explain, or suggest a good link that explains, how the following parts of the code work? The name word_feats seems to be bound to the function word_feats, but what is words bound to? I guess it is bound to featx through the function evaluate_classifier, but I really don&#8217;t get how featx is assigned a value in <br />
negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids] (to me it looks like featx is a function here, but I guess it is not?). I guess that I should do some basic reading about Python, but any clarification would be helpful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ???? ?????? &#171; ?????</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-938</link>
		<dc:creator>???? ?????? &#171; ?????</dc:creator>
		<pubDate>Thu, 01 Dec 2011 16:21:37 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-938</guid>
		<description>[...] 1. NLTK: http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/ [...]</description>
		<content:encoded><![CDATA[<p>[...] 1. NLTK: http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/ [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Schillermika</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-937</link>
		<dc:creator>Schillermika</dc:creator>
		<pubDate>Mon, 21 Nov 2011 05:54:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-937</guid>
		<description>thanks...def need to polish up my python skills</description>
		<content:encoded><![CDATA[<p>thanks&#8230;def need to polish up my python skills</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-936</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Mon, 21 Nov 2011 04:52:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-936</guid>
		<description>featuresets = [(bag_of_words([word.lower() for word in sentence]), label) for (sentence, label) in raw_dataset]

or

raw_dataset = [([word.lower() for word in sentence], &quot;physics&quot;) for sentence in physics.sents()]</description>
		<content:encoded><![CDATA[<p>featuresets = [(bag_of_words([word.lower() for word in sentence]), label) for (sentence, label) in raw_dataset]</p>
<p>or</p>
<p>raw_dataset = [([word.lower() for word in sentence], "physics") for sentence in physics.sents()]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Schillermika</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-935</link>
		<dc:creator>Schillermika</dc:creator>
		<pubDate>Mon, 21 Nov 2011 04:39:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-935</guid>
		<description>Here&#039;s the problem I&#039;m having. I&#039;ll use a test corpus I play around with to demonstrate. So, first, the corpus reader object and bag of words function

physics = LazyCorpusLoader(&#039;cookbook&#039;, PlaintextCorpusReader, [&#039;physics.txt&#039;])
def bag_of_words(sentence):	return dict([(word, True) for word in sentence])


Then I label the training data

raw_dataset = [(sentence, &quot;physics&quot;) for sentence in physics.sents()]

I would have preferred that raw_dataset be this instead:

raw_dataset2 = [(word.lower(), &quot;physics&quot;) for word in physics.words()]

But the problem is that if I use raw_dataset2 to create my featuresets to train the classifier like this:

featuresets = [(bag_of_words(word), label) for (word, label) in raw_dataset2]

Then I get this:

[({&#039;h&#039;: True, &#039;e&#039;: True, &#039;T&#039;: True}, &#039;physics&#039;), ({&#039;a&#039;: True, &#039;c&#039;: True, &#039;i&#039;: True, &#039;h&#039;: True, &#039;l&#039;: True, &#039;p&#039;: True, &#039;s&#039;: True, &#039;y&#039;: True}, &#039;physics&#039;)

Not what I want. But with plain old raw_dataset:

raw_dataset = [(sentence, &quot;physics&quot;) for sentence in physics.sents()]

featuresets = [(bag_of_words(sentence), label) for (sentence, label) in raw_dataset]

It returns whole words as I want:

({&#039;and&#039;: True, &#039;distances&#039;: True, &#039;scales&#039;: True, &#039;subatomic&#039;: True, &#039;over&#039;: True, &#039;challenges&#039;: True, &#039;meters&#039;: True}, &#039;physics&#039;)

So my dilemma is that I&#039;m stuck with physics.sents() so that bag_of_words returns whole words rather than letters. But I can&#039;t lowercase sentences, so a list comprehension like [word.lower() for word in physics.sents()] is not an option, and that&#039;s why I put word.lower() in the bag_of_words() function. I&#039;m having trouble seeing where I can apply word.lower(). I tried converting raw_dataset to a string so I could lowercase the words and then convert back to a list, but I should have known that was inane. Any insights?

thnx</description>
		<content:encoded><![CDATA[<p>Here&#8217;s the problem I&#8217;m having. I&#8217;ll use a test corpus I play around with to demonstrate. So, first, the corpus reader object and bag of words function</p>
<p>physics = LazyCorpusLoader('cookbook', PlaintextCorpusReader, ['physics.txt'])<br />
def bag_of_words(sentence):	return dict([(word, True) for word in sentence])</p>
<p>Then I label the training data</p>
<p>raw_dataset = [(sentence, "physics") for sentence in physics.sents()]</p>
<p>I would have preferred that raw_dataset be this instead:</p>
<p>raw_dataset2 = [(word.lower(), "physics") for word in physics.words()]</p>
<p>But the problem is that if I use raw_dataset2 to create my featuresets to train the classifier like this:</p>
<p>featuresets = [(bag_of_words(word), label) for (word, label) in raw_dataset2]</p>
<p>Then I get this:</p>
<p>[({'h': True, 'e': True, 'T': True}, 'physics'), ({'a': True, 'c': True, 'i': True, 'h': True, 'l': True, 'p': True, 's': True, 'y': True}, 'physics')</p>
<p>Not what I want. But with plain old raw_dataset:</p>
<p>raw_dataset = [(sentence, "physics") for sentence in physics.sents()]</p>
<p>featuresets = [(bag_of_words(sentence), label) for (sentence, label) in raw_dataset]</p>
<p>It returns whole words as I want:</p>
<p>({'and': True, 'distances': True, 'scales': True, 'subatomic': True, 'over': True, 'challenges': True, 'meters': True}, 'physics')</p>
<p>So my dilemma is that I&#8217;m stuck with physics.sents() so that bag_of_words returns whole words rather than letters. But I can&#8217;t lowercase sentences, so a list comprehension like [word.lower() for word in physics.sents()] is not an option, and that&#8217;s why I put word.lower() in the bag_of_words() function. I&#8217;m having trouble seeing where I can apply word.lower(). I tried converting raw_dataset to a string so I could lowercase the words and then convert back to a list, but I should have known that was inane. Any insights?</p>
<p>thnx</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-934</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Mon, 21 Nov 2011 01:10:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-934</guid>
		<description>You should remove word.lower() from bag_of_words(), and instead lowercase everything yourself. The best way to do this would be to lowercase every word in the sentence first, before finding bigrams or calling bag_of_words(). This is a simple list comprehension, like sentence = [word.lower() for word in sentence]</description>
		<content:encoded><![CDATA[<p>You should remove word.lower() from bag_of_words(), and instead lowercase everything yourself. The best way to do this would be to lowercase every word in the sentence first, before finding bigrams or calling bag_of_words(). This is a simple list comprehension, like sentence = [word.lower() for word in sentence]</p>
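<p>A rough sketch of that ordering, under one loud assumption: the adjacent-pair zip below is only a stand-in for BigramCollocationFinder.nbest(), so the example runs without NLTK. The point is that lowercasing happens while every item is still a string, and bag_of_words() never calls lower() on a bigram tuple.</p>
<pre>
```python
def bag_of_words(words):
    # No lowercasing here, so tuples (bigrams) are accepted as features too.
    return dict((word, True) for word in words)


def bag_of_bigrams_words(sentence):
    # Lowercase once, up front, while every item is still a string.
    sentence = [word.lower() for word in sentence]
    # Stand-in for the collocation finder: take all adjacent word pairs.
    bigrams = list(zip(sentence, sentence[1:]))
    return bag_of_words(sentence + bigrams)


feats = bag_of_bigrams_words(['Joey', 'Plays', 'the', 'Guitar'])
```
</pre>
<p>With a real collocation finder, the same lowercased list would be passed to BigramCollocationFinder.from_words() in place of the zip line.</p>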
]]></content:encoded>
	</item>
	<item>
		<title>By: Schillermika</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-933</link>
		<dc:creator>Schillermika</dc:creator>
		<pubDate>Sun, 20 Nov 2011 21:08:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-933</guid>
		<description>Hey Jacob,

I&#039;m trying to classify song lyrics, where bigrams matter. My program reads from custom corpora full of lyrics from the Web. I need to lowercase all the unigrams and bigrams, and that&#039;s where there&#039;s an issue. First, the bag of words function

def bag_of_words(sentence):	return dict([(word.lower(), True) for word in sentence])

Then the bigram extractor

def bag_of_bigrams_words(sentence, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(sentence)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(sentence + bigrams)

So, I use bag_of_bigrams_words() on a simple sentence like 

bag_of_bigrams_words([&#039;Joey&#039;, &#039;Plays&#039;, &#039;the&#039;, &#039;Guitar&#039;])


 and I get the following error

Traceback (most recent call last):
  File &quot;&quot;, line 1, in
    bag_of_bigrams_words([&#039;Joey&#039;, &#039;Plays&#039;, &#039;the&#039;, &#039;Guitar&#039;])
  File &quot;&quot;, line 4, in bag_of_bigrams_words
    return bag_of_words(sentence + bigrams)
  File &quot;&quot;, line 2, in bag_of_words
    return dict([(word.lower(), True) for word in sentence])
AttributeError: &#039;tuple&#039; object has no attribute &#039;lower&#039;


It seems that as long as word.lower() is in bag_of_words(), it&#039;s incompatible with the bigram tuples. What&#039;s the best way around this, considering that I need word.lower() in bag_of_words() in order to reduce dimensionality?

</description>
		<content:encoded><![CDATA[<p>Hey Jacob,</p>
<p>I&#8217;m trying to classify song lyrics, where bigrams matter. My program reads from custom corpora full of lyrics from the Web. I need to lowercase all the unigrams and bigrams, and that&#8217;s where there&#8217;s an issue. First, the bag of words function</p>
<p>def bag_of_words(sentence):	return dict([(word.lower(), True) for word in sentence])</p>
<p>Then the bigram extractor</p>
<p>def bag_of_bigrams_words(sentence, score_fn=BigramAssocMeasures.chi_sq, n=200):<br />
    bigram_finder = BigramCollocationFinder.from_words(sentence)<br />
    bigrams = bigram_finder.nbest(score_fn, n)<br />
    return bag_of_words(sentence + bigrams)</p>
<p>So, I use bag_of_bigrams_words() on a simple sentence like </p>
<p>bag_of_bigrams_words(['Joey', 'Plays', 'the', 'Guitar'])</p>
<p> and I get the following error</p>
<p>Traceback (most recent call last):<br />
  File "", line 1, in<br />
    bag_of_bigrams_words(['Joey', 'Plays', 'the', 'Guitar'])<br />
  File "", line 4, in bag_of_bigrams_words<br />
    return bag_of_words(sentence + bigrams)<br />
  File "", line 2, in bag_of_words<br />
    return dict([(word.lower(), True) for word in sentence])<br />
AttributeError: 'tuple' object has no attribute 'lower'</p>
<p>It seems that as long as word.lower() is in bag_of_words(), it&#8217;s incompatible with the bigram tuples. What&#8217;s the best way around this, considering that I need word.lower() in bag_of_words() in order to reduce dimensionality?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Perkins</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-841</link>
		<dc:creator>Jacob Perkins</dc:creator>
		<pubDate>Sun, 19 Jun 2011 20:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-841</guid>
		<description>It&#039;s hard to tell what&#039;s wrong, so all I can suggest is to make sure all instances of &quot;movie_reviews&quot; have been changed to &quot;mysentiment&quot;, and to remove the movie_reviews import. If that doesn&#039;t do it, then check that the corpus reader is defined correctly: create it, then call &quot;mysentiment.categories()&quot; and &quot;mysentiment.fileids()&quot; to ensure it&#039;s producing the right results.</description>
		<content:encoded><![CDATA[<p>It&#8217;s hard to tell what&#8217;s wrong, so all I can suggest is to make sure all instances of &#8220;movie_reviews&#8221; have been changed to &#8220;mysentiment&#8221;, and to remove the movie_reviews import. If that doesn&#8217;t do it, then check that the corpus reader is defined correctly: create it, then call &#8220;mysentiment.categories()&#8221; and &#8220;mysentiment.fileids()&#8221; to ensure it&#8217;s producing the right results.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aiden</title>
		<link>http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/comment-page-1/#comment-840</link>
		<dc:creator>Aiden</dc:creator>
		<pubDate>Sun, 19 Jun 2011 20:38:00 +0000</pubDate>
		<guid isPermaLink="false">http://streamhacker.com/?p=1227#comment-840</guid>
		<description>The code didn&#039;t look like a mess when I posted it, sorry about that; I don&#039;t know why it appeared like that. It&#039;s the same as your first code above, except with a corpus reader at the beginning.
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
mysentiment = CategorizedPlaintextCorpusReader(r&#039;c:/users/Aiden/nltk_data/corpora/sentiment&#039;, r&#039;(pos&#124;neg)/.*.txt&#039;, cat_pattern=r&#039;(pos&#124;neg)/.*.txt&#039;)</description>
		<content:encoded><![CDATA[<p>The code didn&#8217;t look like a mess when I posted it, sorry about that; I don&#8217;t know why it appeared like that. It&#8217;s the same as your first code above, except with a corpus reader at the beginning.<br />
from nltk.corpus.reader import CategorizedPlaintextCorpusReader<br />
mysentiment = CategorizedPlaintextCorpusReader(r'c:/users/Aiden/nltk_data/corpora/sentiment', r'(pos|neg)/.*.txt', cat_pattern=r'(pos|neg)/.*.txt')</p>
]]></content:encoded>
	</item>
</channel>
</rss>

