Tag Archives: spam

Programming Collective Intelligence Review

Programming Collective Intelligence

Programming Collective Intelligence is a great conceptual introduction to many common machine learning algorithms and techniques. It covers classification algorithms such as Naive Bayes and Neural Networks, and algorithmic optimization approaches like Genetic Programming. The book also manages to pick interesting example applications, such as stock price prediction and topic identification.

There are two chapters in particular that stand out to me. First is Chapter 6, which covers Naive Bayes classification. What stood out was that the algorithm presented is an online learner, which means it can be updated as data comes in, unlike the NLTK NaiveBayesClassifier, which can be trained only once. Another thing that caught my attention was Fisher’s method, which is not implemented in NLTK, but could be with a little work. Apparently Fisher’s method is great for spam filtering, and is used by the SpamBayes Outlook plugin (which is also written in Python).

Second, I found Chapter 9, which covers Support Vector Machines and Kernel Methods, to be quite intuitive. It explains the idea by starting with examples of linear classification and its shortfalls. But then the examples show that by scaling the data in a particular way first, linear classification suddenly becomes possible. And the kernel trick is simply a neat and efficient way to reduce the amount of calculation necessary to train a classifier on scaled data.

The final chapter summarizes all the key algorithms, and for many it includes commentary on their strengths and weaknesses. This seems like valuable reference material, especially for when you have a new data set to learn from, and you’re not sure which algorithms will help get the results you’re looking for. Overall, I found Programming Collective Intelligence to be an enjoyable read on my Kindle 3, and highly recommend it to anyone getting started with machine learning and Python, as well as anyone interested in a general survey of machine learning algorithms.