Text Classification for Sentiment Analysis – Stopwords and Collocations

Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I’ll be evaluating two modifications of the word_feats feature extraction method:

  1. filter out stopwords
  2. include bigram collocations

To do this effectively, we’ll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we’ll use these features to train a Naive Bayes Classifier.

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
	negids = movie_reviews.fileids('neg')
	posids = movie_reviews.fileids('pos')

	negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
	posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

	negcutoff = len(negfeats)*3/4
	poscutoff = len(posfeats)*3/4

	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

	classifier = NaiveBayesClassifier.train(trainfeats)
	refsets = collections.defaultdict(set)
	testsets = collections.defaultdict(set)

	for i, (feats, label) in enumerate(testfeats):
		refsets[label].add(i)
		observed = classifier.classify(feats)
		testsets[observed].add(i)

	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
	classifier.show_most_informative_features()

Baseline Bag of Words Feature Extraction

Here’s the baseline bag of words feature extractor.

def word_feats(words):
	return dict([(word, True) for word in words])

evaluate_classifier(word_feats)

The results are the same as in the previous articles, but I’ve included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

Stopword Filtering

Stopwords are words that are generally considered useless. Most search engines ignore them because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let’s see what happens when we filter out these words.

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
	return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)

And the results for a stopword filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

Accuracy went down 0.2%, and pos precision and neg recall dropped as well! Apparently stopwords do add information for sentiment analysis classification. I did not include the most informative features since they did not change.

Bigram Collocations

As mentioned at the end of the article on precision and recall, it’s possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like “not great”, which is a negative expression that the bag of words model could interpret as positive since it sees “great” as a separate word.

To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains 2 internal FreqDists, one for individual word frequencies and another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-squared. These scoring functions measure the association between 2 words: essentially, whether the bigram occurs more often than you’d expect from the frequencies of the individual words alone.

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)

After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

Yes, you read that right: Matt Damon is apparently one of the best predictors of positive sentiment in movie reviews. Chuckle-worthy features aside:

  • accuracy is up almost 9%
  • pos precision has increased over 10% with only a 4% drop in recall
  • neg recall has increased over 21% with just under a 4% drop in precision

So it appears the bigram hypothesis is correct: including significant bigrams can increase classifier effectiveness. Note that it’s the significant bigrams that matter. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This supports the idea that including only significant features can improve accuracy compared to using all features. In a future article, I’ll try trimming down the single word features to include only significant words.
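For reference, here’s what the all-bigrams variant looks like – a minimal sketch, with a function name of my own choosing:

import itertools
from nltk.util import bigrams

def all_bigram_word_feats(words):
	# every adjacent word pair becomes a feature, significant or not
	return dict([(ngram, True) for ngram in itertools.chain(words, bigrams(words))])

evaluate_classifier(all_bigram_word_feats)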

  • mdgt_sadhu

    This is very interesting – thanks for writing it up. It seems, however, that general n-grams can't be chosen out of the box? i.e. NLTK only offers a choice between bigrams and trigrams?

  • http://streamhacker.com/ Jacob Perkins

    That's true, but I think you could use nltk.collocations.AbstractCollocationFinder with FreqDists you create yourself. The harder part would be a generic ngram scoring function, but it looks like if you extended nltk.metrics.NgramAssocMeasures to implement _contingency and _marginals for ngrams, all the other scoring functions would work.
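    For trigrams specifically, the pieces already exist; here's an untested sketch mirroring bigram_word_feats from the article:

    import itertools
    from nltk.collocations import TrigramCollocationFinder
    from nltk.metrics import TrigramAssocMeasures

    def trigram_word_feats(words, score_fn=TrigramAssocMeasures.chi_sq, n=200):
    	trigram_finder = TrigramCollocationFinder.from_words(words)
    	trigrams = trigram_finder.nbest(score_fn, n)
    	return dict([(ngram, True) for ngram in itertools.chain(words, trigrams)])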

  • Ryan He

    Interesting! But in Bo Pang et al. 2008, Thumbs up? Sentiment Classification using Machine Learning Techniques, the unigrams presence feature is evaluated to be the best feature set. Using just the unigrams presence feature, Naive Bayes, Maxent and SVM classifiers can all get accuracies better than 80%. Adding the bigrams feature doesn't help. Why do you think the result of your comparison between feature sets is different from theirs?

  • Ryan He

    Sorry, a typo there. It should be Bo Pang et al. 2002…

  • http://streamhacker.com/ Jacob Perkins

    Hi Ryan, thanks for pointing me to that research. They used 3-fold cross-validation over 1400 reviews, and made sure not to favor prolific reviewers. Whereas each file of NLTK's movie_reviews corpus contains many reviews with no knowledge of who the reviewer is. And I fed the classifier 1500 reviews in a single batch, with no cross-validation. I also chose the 200 best bigrams on a per-file basis using chi-squared, whereas they apparently did not do any special bigram selection. As I mentioned at the end of the article, including all bigrams helps a little, but not much. So I think the key differences are most likely cross-validation and bigram selection.
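    If you want to replicate a cross-validated setup, here's an untested sketch of simple 3-fold cross-validation over the combined feature sets (the helper name is mine):

    def cross_validate(featsets, folds=3):
    	# the stride split interleaves examples, so pos and neg both land in every fold
    	accuracies = []
    	for k in range(folds):
    		test = featsets[k::folds]
    		train = [feats for i, feats in enumerate(featsets) if i % folds != k]
    		classifier = NaiveBayesClassifier.train(train)
    		accuracies.append(nltk.classify.util.accuracy(classifier, test))
    	return sum(accuracies) / len(accuracies)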

  • pierre_rosado

    Hi Jacob,

    Your results with the stopwords filter made me curious; they seem counterintuitive. To find out why, I tagged the stopwords with their POS. Then, for each POS category, I ran evaluate_classifier filtering only the stopwords in that category.

    The result was that, of all the stopwords, only the adverbs and wh-determiners seem to add information. Excluding them from the stopset and running evaluate_classifier with the filter gave the following results:

    accuracy: 0.73
    pos precision: 0.653333333333
    pos recall: 0.98
    neg precision: 0.96
    neg recall: 0.48

    What do you think?

    Finally, thanks for the posts; they are very informative.
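    In case it's useful, here's an untested sketch of the final filter (tagging isolated words with nltk.pos_tag is crude, but it sufficed for grouping; RB/RBR/RBS are the adverb tags and WDT the wh-determiner tag):

    import nltk
    from nltk.corpus import stopwords

    tagged_stopwords = nltk.pos_tag(stopwords.words('english'))
    keep = set(['RB', 'RBR', 'RBS', 'WDT'])  # these stay in the features
    reduced_stopset = set(word for word, tag in tagged_stopwords if tag not in keep)

    def reduced_stopword_filtered_word_feats(words):
    	return dict([(word, True) for word in words if word not in reduced_stopset])

    evaluate_classifier(reduced_stopword_filtered_word_feats)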

  • http://streamhacker.com/ Jacob Perkins

    Hi Pierre,

    Thanks for looking into the stopwords. I was definitely surprised when I got the original result, but your findings help make sense of it. The adverbs support verbs, and perhaps the wh-determiners imply a rhetorical question (instead of a statement). Makes me think that generic stopword lists can't be fully trusted in all contexts.

  • Aiden

    Hi, I'm having a bit of trouble running this with my data set. I tried out your Naive Bayes Classifier and it worked perfectly with my custom corpus, but for some reason when I run this code, Python looks like it's working for a few seconds (the * appears, indicating that it's running the code), then it finishes without giving any output, not even an error! I'm baffled as to why this is. This is my code; I'd appreciate it if you could tell me what's wrong with it.

    import collections
    import nltk.classify.util, nltk.metrics
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    mysentiment = CategorizedPlaintextCorpusReader('c:/users/Aiden/nltk_data/corpora/sentiment', r'(pos|neg)/.*.txt', cat_pattern=r'(pos|neg)/.*.txt')

    def evaluate_classifier(featx):
    	negids = mysentiment.fileids('neg')
    	posids = mysentiment.fileids('pos')

    	negfeats = [(featx(mysentiment.words(fileids=[f])), 'neg') for f in negids]
    	posfeats = [(featx(mysentiment.words(fileids=[f])), 'pos') for f in posids]

    	negcutoff = len(negfeats)*3/4
    	poscutoff = len(posfeats)*3/4

    	trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    	testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    	classifier = NaiveBayesClassifier.train(trainfeats)
    	refsets = collections.defaultdict(set)
    	testsets = collections.defaultdict(set)

    	for i, (feats, label) in enumerate(testfeats):
    		refsets[label].add(i)
    		observed = classifier.classify(feats)
    		testsets[observed].add(i)

    	print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    	print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    	print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    	print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    	print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    	return classifier.show_most_informative_features()

  • Aiden

    The code didn't look like a mess when I posted it, sorry about that; I don't know why it appeared like that. It's the same as your first code above, except with a corpus reader at the beginning.
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    mysentiment = CategorizedPlaintextCorpusReader(r'c:/users/Aiden/nltk_data/corpora/sentiment', r'(pos|neg)/.*.txt', cat_pattern=r'(pos|neg)/.*.txt')

  • http://streamhacker.com/ Jacob Perkins

    It’s hard to tell what’s wrong, so all I can suggest is make sure all instances of “movie_reviews” have been changed to “mysentiment” and remove the movie_reviews import. If that doesn’t do it, then make sure the corpus reader is defined correctly by creating it, then doing “mysentiment.categories()” and “mysentiment.fileids()” to ensure it’s producing the right results.

  • Schillermika

    Hey Jacob,

    I’m trying to classify song lyrics, where bigrams matter. My program is reading from custom corpuses full of lyrics from the Web. I need to lowercase all the unigrams and bigrams, and that’s where there’s an issue. First, the bag of words function

    def bag_of_words(sentence):
    	return dict([(word.lower(), True) for word in sentence])

    Then the bigram extractor

    def bag_of_bigrams_words(sentence, score_fn=BigramAssocMeasures.chi_sq, n=200):
    	bigram_finder = BigramCollocationFinder.from_words(sentence)
    	bigrams = bigram_finder.nbest(score_fn, n)
    	return bag_of_words(sentence + bigrams)

    So, I use bag_of_bigrams_words() on a simple sentence like 

    bag_of_bigrams_words(['Joey', 'Plays', 'the', 'Guitar'])

     and I get the following error

    Traceback (most recent call last):
      File "", line 1, in
        bag_of_bigrams_words(['Joey', 'Plays', 'the', 'Guitar'])
      File "", line 4, in bag_of_bigrams_words
        return bag_of_words(sentence + bigrams)
      File "", line 2, in bag_of_words
        return dict([(word.lower(), True) for word in sentence])
    AttributeError: 'tuple' object has no attribute 'lower'

    It seems that as long as word.lower() is in bag_of_words(), it's incompatible with the bigram tuples. What's the best way around this, considering that I need word.lower() in bag_of_words() in order to reduce dimensionality?

  • http://streamhacker.com/ Jacob Perkins

    You should remove word.lower() from bag_of_words(), and instead lowercase everything yourself. The best way to do this would be to lowercase every word in the sentence first, before finding bigrams or calling bag_of_words(). This is a simple list comprehension, like sentence = [word.lower() for word in sentence]
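    Putting that together, an untested sketch of the fixed extractor:

    def bag_of_bigrams_words(sentence, score_fn=BigramAssocMeasures.chi_sq, n=200):
    	# lowercase once, up front, so both the unigrams and the bigram
    	# tuples are built from lowercased words
    	words = [word.lower() for word in sentence]
    	bigram_finder = BigramCollocationFinder.from_words(words)
    	bigrams = bigram_finder.nbest(score_fn, n)
    	return dict([(token, True) for token in words + bigrams])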

  • Schillermika

    Here’s the problem I’m having. I’ll use a test corpus I play around with to demonstrate. So, first, the corpus reader object and bag of words function

    physics_corpus = LazyCorpusLoader('cookbook', PlaintextCorpusReader, ['physics.txt'])

    def bag_of_words(sentence):
    	return dict([(word, True) for word in sentence])

    Then I label the training data

    raw_dataset = [(sentence, "physics") for sentence in physics.sents()]

    I would have preferred that raw_dataset be this instead:

    raw_dataset2 = [(word.lower(), "physics") for word in physics.words()]

    But the problem is that if I use raw_dataset2 to create my featuresets to train the classifier like this:

    featuresets = [(bag_of_words(word), label) for (word, label) in raw_dataset2]

    Then I get this:

    [({'h': True, 'e': True, 'T': True}, 'physics'), ({'a': True, 'c': True, 'i': True, 'h': True, 'l': True, 'p': True, 's': True, 'y': True}, 'physics')

    Not what I want. But with plain old raw_dataset:

    raw_dataset = [(sentence, "physics") for sentence in physics.sents()]

    featuresets = [(bag_of_words(sentence), label) for (sentence, label) in raw_dataset]

    It returns whole words as I want:

    ({'and': True, 'distances': True, 'scales': True, 'subatomic': True, 'over': True, 'challenges': True, 'meters': True}, 'physics')

    So my dilemma is that I'm stuck with physics.sents() so that bag_of_words returns whole words rather than letters. But I can't lowercase sentences, so a list comprehension like [word.lower() for word in physics.sents()] is not an option. And that's why I put word.lower() in the bag_of_words() function. I'm having trouble seeing where I can apply word.lower(). I tried converting raw_dataset to a string so I could lowercase the words and then convert back to a list, but I should have known that's inane. Any insights?

    thnx

  • http://streamhacker.com/ Jacob Perkins

    featuresets = [(bag_of_words([word.lower() for word in sentence]), label) for (sentence, label) in raw_dataset]

    or

    raw_dataset = [([word.lower() for word in sentence], "physics") for sentence in physics.sents()]

  • Schillermika

    thanks… definitely need to polish up my Python skills

  • Fredrik

    I am quite new to Python, and some parts of the code seem more or less like magic to me. I understand that functions are just ordinary objects/values in Python, and I guess that's the trick, but can you explain, or suggest a good link explaining, how the following parts of the code work? The name word_feats seems to be bound to the function word_feats, but what is words bound to? I guess it is bound to featx through the function evaluate_classifier, but I really don't get how featx is assigned a value in
    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids] (to me it looks like featx is a function here, but I guess it is not?). I know I should do some basic reading about Python, but any clarification would be helpful.

  • http://streamhacker.com/ Jacob Perkins

    words is not explicitly defined above, but it’s a function parameter that is expected to be a list of strings. featx is also a function parameter, but it’s expected to be a function that accepts words and returns a dict. This way, you can pass different featx functions to evaluate_classifier to see the different results.
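    For example, a trivial sketch of two interchangeable featx functions you could pass in:

    def unigram_feats(words):
    	return dict([(word, True) for word in words])

    def first_100_feats(words):
    	# only the first 100 words of each file become features
    	return dict([(word, True) for word in words[:100]])

    evaluate_classifier(unigram_feats)
    evaluate_classifier(first_100_feats)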

  • BitcoinKing

    I am a newbie, with 1 month of experience in Python, and my results with my own corpus are quite bad: accuracy 0.366. I think this is because of a small corpus: 115 txt files across both categories.

    I suggest adding (for my case, Russian language):

    from nltk.stem import SnowballStemmer
    russian_stemmer = SnowballStemmer('russian')

    and changing the code as follows:

    def word_feats(words):
    	return dict([(russian_stemmer.stem(word.lower()), True) for word in words])

    But the main reason accuracy is bad is that the task is more complex: it is not enough to use simple words or even bigrams; I need to detect one category within a bunch of other varied texts.

    Jacob, could you suggest example code with more complex features? I mean, maybe regexp usage will help, for example: [list of predefined words] some regexp [list of predefined words]. So this would match triples such as: Bill Gates (has, will) visit(-ing) the Hermitage, Google launches Android. What do you think?

  • http://streamhacker.com/ Jacob Perkins

    I'm not sure what you're asking. Are you trying to categorize a single piece of text? Find multiple categories in a piece of text? Annotate certain words with categories?

    If you’re just trying to classify a piece of text, then I recommend that you try using stemmed words, non stemmed words, and bigrams. And often the best thing you can do is get or create more training data.
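    As a starting point, here's an untested sketch of a feature extractor combining stemmed words, non stemmed words, and significant bigrams (it reuses the imports from bigram_word_feats in the article):

    from nltk.stem import SnowballStemmer

    russian_stemmer = SnowballStemmer('russian')

    def stemmed_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    	stems = [russian_stemmer.stem(word.lower()) for word in words]
    	bigram_finder = BigramCollocationFinder.from_words(words)
    	bigrams = bigram_finder.nbest(score_fn, n)
    	return dict([(token, True) for token in itertools.chain(words, stems, bigrams)])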

  • BitcoinKing

    I have a corpus with 2 folders: adv, with advertisement txt files (manually gathered), and nonadv, all other files parsed from the Internet, where there may be advertisements, product reviews, etc.; for simplicity, no political or economic texts, only texts about products or services.
    The task is to identify advertisement texts in the nonadv folder and output a list of texts, with the probability that each fileid is an advertisement.
    As you can see, the task is harder and cannot be solved using bag of word frequencies alone.
    So I want to "help" the feature extractor with my knowledge: I know particular word combinations that are peculiar to adv texts. These combinations I can represent as regexp searches such as:

    re.search(r'(Android|Iphone) .{3,5} (best|cool|beware)$', text)

    So I want to add several such regexps to the feature extractor; I think it will improve things significantly.
    If several such searches match in a text, it can be considered an advertisement.
    Also, I assume I need training, because I am not sure which of those combinations are better.

    I read the book at http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html, chapter "Exploiting Context", and did not find such examples.

  • http://streamhacker.com/ Jacob Perkins

    What you’re trying to do is sometimes called “one class classification” https://en.wikipedia.org/wiki/One-class_classification. You can’t train a standard binary classifier until you have known nonadv examples. So I suggest you either take the time to create that training data, look into the research for one-class classification, or maybe try using clustering with 3 clusters that will hopefully end up being adv, nonadv, other.
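    If you try the clustering route, here's an untested sketch using NLTK's k-means clusterer; the toy vectors stand in for real per-document word count vectors:

    import numpy
    from nltk.cluster import KMeansClusterer
    from nltk.cluster.util import cosine_distance

    # in practice, build one count vector per document over a shared vocabulary
    vectors = [numpy.array(v) for v in
    	[[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3], [0, 1, 1, 2], [1, 2, 0, 1]]]
    clusterer = KMeansClusterer(3, cosine_distance, avoid_empty_clusters=True)
    assignments = clusterer.cluster(vectors, assign_clusters=True)
    print assignments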

  • BitcoinKing

    Thanks for the reply; it's a surprise to me that this problem belongs to a different class.
    Can you point me to GitHub projects with example code? Unfortunately, my Python skills are not enough to write the code myself from scratch.

  • http://streamhacker.com/ Jacob Perkins

    I don't know of any Python code or github projects to do this. It's a fairly rare case for text classification. I think your best bet is to actually create some real nonadv training data so you can train a binary classifier, then use that to find adv text in your remaining raw text.

  • BitcoinKing

    I have created a nonadv corpus of texts. Now I am going with my own function that detects features in nonadv texts:

    dict_features = defaultdict(dict)

    def regexp_features(corpus):
    	for fileid in corpus.fileids():
    		if re.search(r'(??|??) -?.* ????????|??????|????|???|???????|???????|??????|??????', corpus.raw(fileid)):
    			dict_features[fileid]['oskorblenie'] = 1
    		else:
    			dict_features[fileid]['oskorblenie'] = 0

    		if re.search(r'[??]?????[?|?|???].????(?|??)', corpus.raw(fileid)):
    			dict_features[fileid]['samoprezentacia'] = 1
    		else:
    			dict_features[fileid]['samoprezentacia'] = 0
    	return dict_features

    Here is the idea: I add 1 for a feature if it is found in the text, and 0 if not. Then I wrote another function that sums the values of each feature for each file, and if the sum is > 5, for example, I assume the text is adv.
    The question is how to integrate that function into your article's example, because right now I am not using machine learning at all, but I want to.

  • http://streamhacker.com/ Jacob Perkins

    What you could do is: after you get the feature dictionary from my example code (using word_feats or another function), update it with your own feature dictionary for that file. That way you get the combined features.
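    A minimal sketch of that merge, assuming regexp_features(corpus) has already populated dict_features:

    def combined_feats(words, fileid):
    	feats = word_feats(words)
    	# handcrafted regexp features extend the bag of words features
    	feats.update(dict_features[fileid])
    	return feats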

  • Arjun

    Hi… Can you tell me how to get a label for my test data using this classifier? Also, how can I run SVM with this method?

  • http://streamhacker.com/ Jacob Perkins

    Are you asking how to classify a piece of text? If so, then you need to convert the text into a feature dictionary, as shown in the stopword_filtered_word_feats() function. Then pass it into the classifier’s classify() method.
    To use SVM, you can use NLTK’s SVM classifier, or the scikit-learn classifier wrapper.
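    An untested sketch of that, assuming classifier is the trained NaiveBayesClassifier built inside evaluate_classifier:

    from nltk import word_tokenize

    text = "This movie was not great, but not terrible either."
    words = word_tokenize(text)
    print classifier.classify(stopword_filtered_word_feats(words))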

  • Arjun

    Hi Jacob, thanks for the reply, but I am not able to do what your reply says. I have a text file which is a mixture of negative and positive sentences, and I want to classify them, labeling each sentence as either 'positive' or 'negative'. Can you please help me with the code? Thanks in advance.

  • http://streamhacker.com/ Jacob Perkins

    So if you have a file, then you need to read that file & split it into sentences. You can do this manually by reading in the file, then running the text thru a sentence tokenizer. Then you split each sentence into words using a word tokenizer, and you can pass the words into a feature dictionary function. Or you can use a NLTK corpus reader on your file, many of which will do the sentence & word splitting automatically. If you’re not sure how to do any of this, I highly recommend going thru the NLTK book at http://www.nltk.org/book/ or buying my NLTK cookbook.
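    The manual route looks roughly like this untested sketch (the file name is made up, and classifier is again a previously trained classifier):

    from nltk import sent_tokenize, word_tokenize

    text = open('mixed_reviews.txt').read()
    for sentence in sent_tokenize(text):
    	words = word_tokenize(sentence)
    	print classifier.classify(word_feats(words)), sentence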

  • Arjun

    Hi Jacob, I have the development and the test set as shown in the diagram. I have done the analysis for the training and dev-test sets, but I want to get the label (negative or positive) for the test data set. I have 1000 tweets in the test set for which I need this label, and then I will manually compute the accuracy for this test set. Can you please help me with where to feed in this test set and get the labels for these 1000 tweets as my output? Thanks in advance.

  • http://streamhacker.com/ Jacob Perkins

    You need to load your test data so that each tweet becomes a list of words, which can be translated into a feature dictionary for classification. As I said before, this can be done using a NLTK corpus reader.

    Once you have your classification, then you can write the tweet out to a file, using one file or directory for each label. NLTK’s movie_reviews corpus provides a simple example for how to organize your labeled corpus files.
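    An untested sketch, assuming tweets is a corpus reader over your tweet files and the classified/pos and classified/neg directories already exist:

    import os

    for fileid in tweets.fileids():
    	label = classifier.classify(word_feats(tweets.words(fileid)))
    	outpath = os.path.join('classified', label, os.path.basename(fileid))
    	open(outpath, 'w').write(tweets.raw(fileid))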

  • Arjun

    Hi Jacob,

    Thanks for the reply; I finally got what I wanted. I am in a dilemma now, since the best-words accuracy is 1 percent lower, at 84%, than the single-word accuracy at 85%. Sounds crazy.

  • Selva Saravanakumar

    Hi, I'm using both NLTK's NaiveBayesClassifier and SklearnClassifier for classifying sentences. Is there a way to find out which classification is best? For example, if I give "You are looking not so great", one classifies it as "Positive" and the other as "Negative". I just want to know which is correct, because I will automate this for more than 2k data points, where manual checking is tedious.

    Thanks in advance.

  • str1ct

    @Selva: you need a suitably big training dataset with labeled positive and negative sentences, for example 1500 neg sentences and 1000 pos, and then classify with NB or sklearn… I tried doing this and the results were similar.

    You should find a sentiment corpus somewhere.

    http://stackoverflow.com/questions/7551262/training-data-for-sentiment-analysis

  • quad

    Hey! We are working on classifying movie reviews as either positive or negative. We are using NLTK and an SVM classifier (LinearSVC). With just unigrams we have an accuracy of 70%. We want some help with using bigrams.
    We have split each movie review into sentences, and further into lists of words, and these lists are in an array. How do we use bigrams to improve accuracy? How do we proceed?

  • http://streamhacker.com/ Jacob Perkins

    nltk.util.ngrams(sentence, 2) will generate a list of bigrams, which you can then use as features, just like words
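    For example:

    from nltk.util import ngrams

    words = ['not', 'so', 'great']
    print list(ngrams(words, 2))
    # [('not', 'so'), ('so', 'great')]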

  • quad

    thanks for the reply. you mean just compare the bigrams with the list of predefined bigrams?

  • http://streamhacker.com/ Jacob Perkins

    Yes, you can do that to limit the bigrams to the significant ones, or you could try using all bigrams.

  • quad

    Thanks. Does using all bigrams involve weighting measures like TfidfVectorizer? If yes, could you help us combine that output with the custom features we have extracted, so that we can feed it to the SVM? For now, the input to the SVM is a feature vector of our handcrafted features.

  • http://streamhacker.com/ Jacob Perkins

    Using all bigrams is just like using all words, it has nothing specific to do with the TfidfVectorizer.

  • quad

    thanks!

  • corn

    Quick question: when I use my own custom dataset, the precision, recall, accuracy and whatnot are all fine, but when I try to get the most informative features, it only returns words of the negative persuasion. Why is this? I have 80k positive files, 2k negative files, and 3k neutral files. Any help would be much appreciated.