
Instant PyGame Book Review

This is a review of the book Instant Pygame for Python Game Development How-to, by Ivan Idris. Packt asked me to review the book, and I agreed because like many developers, I’ve thought about writing my own game, and I’ve been curious about the capabilities of pygame. It’s a short book, ~120 pages, so this is a short review.

The book covers pygame basics like drawing images, rendering text, playing sounds, creating animations, and altering the mouse cursor. The author has helpfully posted some video demos of some of the exercises, which are linked from the book. I think this is a great way to show what’s possible, while also giving the reader a clear idea of what they are creating & what should happen. After the basic intro exercises, I think the best content was how to manipulate pixel arrays with numpy (the author has also written two books on numpy: NumPy Beginner’s Guide & NumPy Cookbook), how to create & use sprites, and how to make your own version of the game of life.

There were 3 chapters whose content puzzled me. When you’ve got such a short book on a specific topic, why bring up matplotlib, profiling, and debugging? These chapters seemed off-topic and just thrown in there randomly. The organization of the book could have been much better too, leading the reader from the basics all the way to a full-fledged game, with each chapter adding to the previous chapters. Instead, the chapters sometimes felt like unrelated low-level examples.

Overall, the book was a quick & easy read that rapidly introduces you to basic pygame functionality and leads you on to more complex activities. My main takeaway is that pygame provides an easy-to-use, low-level framework for building simple games, and it can be used to create more complex games too (though probably not FPS or similarly graphics-intensive games). The ideal games would probably be puzzle-based and/or dialogue-heavy, requiring only simple interactions from the user. So if you’re interested in building such a game in Python, you should definitely get a copy of Instant Pygame for Python Game Development How-to.

Text Classification for Sentiment Analysis – NLTK + Scikit-Learn

Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?
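If you haven’t used the wrapper yet, here’s a minimal sketch of how it works (the tiny training set below is made up purely for illustration):

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# toy labeled featuresets, just for illustration
train_feats = [
	({'great': True, 'movie': True}, 'pos'),
	({'awful': True, 'movie': True}, 'neg'),
]

classif = SklearnClassifier(MultinomialNB())
classif.train(train_feats)  # same train()/classify() interface as the other NLTK classifiers
print(classif.classify({'great': True, 'film': True}))  # most likely 'pos'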

Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the feats column for each algorithm. bow means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not. int means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with bow it would still get a 1). And tfidf means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
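To make those concrete, here’s a rough illustration of the three feature value types for a toy document (the exact feature dicts nltk-trainer builds may differ; this just shows the idea):

words = 'great great fun movie'.split()

# bow: every word present gets a 1
bow_feats = dict((w, 1) for w in words)
# {'great': 1, 'fun': 1, 'movie': 1}

# int: every word gets its count
int_feats = {}
for w in words:
	int_feats[w] = int_feats.get(w, 0) + 1
# {'great': 2, 'fun': 1, 'movie': 1}

# tfidf: the counts are then run through sklearn's TfidfTransformer, which
# turns them into floats that down-weight words common to many documents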

All numbers were determined using nltk-trainer, specifically, python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75. For int features, the option --value-type int was used, and for tfidf features, the options --value-type float --tfidf were used. This was with NLTK 2.0.3 and sklearn 0.12.1.

algorithm feats accuracy neg f-measure pos f-measure
BernoulliNB bow 82.2 82.7 81.6
BernoulliNB int 82.2 82.7 81.6
BernoulliNB tfidf 82.2 82.7 81.6
GaussianNB bow 66.4 65.1 67.6
GaussianNB int 66.8 66.3 67.3
MultinomialNB bow 82.2 82.7 81.6
MultinomialNB int 81.2 81.5 80.1
MultinomialNB tfidf 81.6 83.0 80.0
LogisticRegression bow 85.6 85.8 85.4
LogisticRegression int 83.2 83.0 83.4
LogisticRegression tfidf 82.0 81.5 82.5
SVC bow 67.6 75.3 52.9
SVC int 67.8 71.7 62.6
SVC tfidf 50.2 0.8 66.7
LinearSVC bow 86.0 86.2 85.8
LinearSVC int 81.8 81.7 81.9
LinearSVC tfidf 85.8 85.5 86.0
NuSVC bow 85.0 85.5 84.5
NuSVC int 81.4 81.7 81.1
NuSVC tfidf 50.2 0.8 66.7

As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly, int and tfidf features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.

Bigrams

Below is a table showing the accuracy of the top 5 algorithms using just unigrams (the default, a.k.a single words), and using unigrams + bigrams (pairs of words) with the option --ngrams 1 2.

algorithm unigrams bigrams
BernoulliNB 82.2 86.0
MultinomialNB 82.2 86.0
LogisticRegression 85.6 86.6
LinearSVC 86.0 86.4
NuSVC 85.0 85.2

Only BernoulliNB & MultinomialNB got a modest boost in accuracy, putting them on-par with the rest of the algorithms. But we can do better than this using feature scoring.

Feature Scoring

As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option --min_score SCORE (and keeping the --ngrams 1 2 option to get bigram features).

algorithm score 1 score 2 score 3
BernoulliNB 62.8 97.2 95.8
MultinomialNB 62.8 97.2 95.8
LogisticRegression 90.4 91.6 91.4
LinearSVC 89.8 91.4 90.2
NuSVC 89.4 90.8 91.0

LogisticRegression, LinearSVC, and NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the BernoulliNB & MultinomialNB algorithms, which drop down significantly at --min_score 1, but then skyrocket up to 97% with --min_score 2. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
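For context, the scoring behind --min_score follows the same idea as the chi-square word scoring from the earlier high information feature articles: only words scoring at or above the threshold are kept as features. Here’s a rough sketch of that scoring, using the NLTK 2 FreqDist API (nltk-trainer’s actual implementation may differ in its details):

from nltk.metrics import BigramAssocMeasures
from nltk.probability import ConditionalFreqDist, FreqDist

def score_words(pos_words, neg_words):
	# pos_words & neg_words are flat lists of words from the pos & neg training docs
	word_fd = FreqDist()
	label_word_fd = ConditionalFreqDist()

	for word in pos_words:
		word_fd.inc(word)
		label_word_fd['pos'].inc(word)

	for word in neg_words:
		word_fd.inc(word)
		label_word_fd['neg'].inc(word)

	pos_count = label_word_fd['pos'].N()
	neg_count = label_word_fd['neg'].N()
	total_count = pos_count + neg_count

	word_scores = {}
	for word, freq in word_fd.items():
		pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word], (freq, pos_count), total_count)
		neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word], (freq, neg_count), total_count)
		word_scores[word] = pos_score + neg_score

	return word_scores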

Scikit-Learn

If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The Scikits classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.

NLTK 2 Release Highlights

NLTK 2.0.1, a.k.a NLTK 2, was recently released, and what follows are my favorite changes, new features, and highlights from the ChangeLog.

New Classifiers

The SVMClassifier adds support vector machine classification through SVMLight, using PySVMLight. This is a much-needed addition to the set of supported classification algorithms. But even more interesting…

The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.

Github

NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it’s become the de-facto “social coding” site.

Sphinx

Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and reStructuredText (which I used to write this post), I’m not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you’re looking for, I worry that new/interested users will have a harder time getting started.

New Corpora

Since the 0.9.9 release, a number of new corpora and corpus readers have been added:

ChangeLog Highlights

And here’s a few final highlights:

The Future

I think NLTK’s ideal role is to be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.

PyCon NLTK Tutorial Assistants

My PyCon tutorial, Introduction to NLTK, now has over 40 people registered. This is about twice as many people as I was expecting, but I’m glad so many people want to learn NLTK :) Because of the large class size, it’d be really helpful to have a couple of assistants with at least some NLTK experience, including, but not limited to:

* installing NLTK
* installing & using NLTK on Windows
* installing & using nltk-trainer
* creating custom corpora
* using WordNet

If you’re interested in helping out, please read Tutorial Assistants and contact me, japerk — at — gmail. Thanks!

Upcoming Talks

At the end of February and the beginning of March, I’ll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order…

How Weotta uses MongoDB

Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about How Weotta uses MongoDB. We’ll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we’ll finish with a preview of Weotta’s upcoming MongoDB powered local search APIs.

NLTK Jam Session at NICAR 2012

On Thursday, February 23, in St Louis, MO, I’ll be demonstrating how to use NLTK as part of the NewsCamp workshop at NICAR 2012. This will be a version of my PyCon NLTK Tutorial with a focus on news text and corpora like treebank.

Corpus Bootstrapping with NLTK at Strata 2012

As part of the Strata 2012 Deep Data program, I’ll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there’s plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn’t exist. So how can you quickly create a quality corpus at minimal cost? I’ll cover specific real-world examples to answer this question.

NLTK Tutorial at PyCon 2012

Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You’ll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you’ll walk out with at least one new NLP superpower that you can put to use immediately.

Fuzzy String Matching in Python

Fuzzy matching is a general term for finding strings that are almost equal, or mostly the same. Of course almost and mostly are ambiguous terms themselves, so you’ll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create doctests for this, like so:

def fuzzy_match(s1, s2):
	'''
	>>> fuzzy_match('Happy Days', ' happy days ')
	True
	>>> fuzzy_match('happy days', 'sad days')
	False
	'''
	# TODO: fuzzy matching code
	return s1 == s2
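
To actually run those doctests, save the function in a module and use Python’s built-in doctest runner, either via python -m doctest or with the usual bit of boilerplate at the bottom of the file:

if __name__ == '__main__':
	import doctest
	doctest.testmod(verbose=True)  # the first test will fail until fuzzy matching is implemented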

Once you’ve got a good set of test cases, then it’s much easier to tailor your fuzzy matching code to get the best results.

Normalization

The first step before doing any string matching is normalization. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While 'Happy Days' != ' happy days ', with simple normalization you can get 'Happy Days'.lower() == ' happy days '.strip().

The most basic normalization you can do is to lowercase and strip whitespace. But chances are you’ll want to do more. For example, here’s a simple normalization function that also removes all punctuation in a string.

import string

def normalize(s):
	for p in string.punctuation:
		s = s.replace(p, '')

	return s.lower().strip()

Using this normalize function, we can make the above fuzzy matching function pass our simple tests.

def fuzzy_match(s1, s2):
	'''
	>>> fuzzy_match('Happy Days', ' happy days ')
	True
	>>> fuzzy_match('happy days', 'sad days')
	False
	'''
	return normalize(s1) == normalize(s2)

If you want to get more advanced, keep reading…

Regular Expressions

Beyond just stripping whitespace from the ends of strings, it’s also a good idea to replace all whitespace occurrences with a single space character. The regex call for doing this is re.sub(r'\s+', ' ', s). This will replace every occurrence of one or more spaces, newlines, tabs, etc., essentially eliminating the significance of whitespace for matching.
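
Folding that into the earlier normalize function might look like this (a sketch building on the version above):

import re
import string

def normalize(s):
	for p in string.punctuation:
		s = s.replace(p, '')

	# collapse every run of whitespace (spaces, tabs, newlines) into a single space
	s = re.sub(r'\s+', ' ', s)
	return s.lower().strip()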

You may also be able to use regular expressions for partial fuzzy matching. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a simple regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you’ll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.
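
As a trivial example of the splitting approach, a single regular expression can break a string into word-like parts for further matching:

import re

# split on any run of non-alphanumeric characters
parts = re.split(r'\W+', 'Happy Days: Season 1')
# parts == ['Happy', 'Days', 'Season', '1']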

Edit Distance

The edit distance (aka Levenshtein distance) is the number of single character edits it would take to transform one string into another. Therefore, the smaller the edit distance, the more similar two strings are.

If you want to do edit distance calculations, check out the standalone editdist module. Its distance function takes 2 strings and returns the Levenshtein edit distance. It’s also implemented in C, and so is quite fast.
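
Usage is about as simple as it gets (assuming the module is installed and imports as editdist):

import editdist  # the standalone C module mentioned above

print(editdist.distance('kitten', 'sitting'))  # 3: two substitutions plus one insertion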

Fuzzywuzzy

Fuzzywuzzy is a great all-purpose library for fuzzy string matching, built (in part) on top of Python’s difflib. It has a number of different fuzzy matching functions, and it’s definitely worth experimenting with all of them. I’ve personally found ratio and token_set_ratio to be the most useful.
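
For example, both functions return an integer score from 0 to 100 (exact values depend on the fuzzywuzzy version):

from fuzzywuzzy import fuzz

print(fuzz.ratio('Happy Days', ' happy days '))            # lower: case & whitespace count against it
print(fuzz.token_set_ratio('Happy Days', ' happy days '))  # 100: tokens match after fuzzywuzzy's own normalization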

NLTK

If you want to do some custom fuzzy string matching, then NLTK is a great library to use. There’s word tokenizers, stemmers, and it even has its own edit distance implementation. Here’s a way you could combine all 3 to create a fuzzy string matching function.

from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

def normalize(s):
	words = tokenize.wordpunct_tokenize(s.lower().strip())
	return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
	return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
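
A quick sanity check shows it behaves the way the earlier doctests expect:

print(fuzzy_match('Happy Days', ' happy days '))  # True: both normalize to 'happi day'
print(fuzzy_match('happy days', 'sad days'))      # False: edit distance of 4 exceeds the default max_dist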

Phonetics

Finally, an interesting and perhaps non-obvious way to compare strings is with phonetic algorithms. The idea is that 2 strings that sound the same may be the same (or at least similar enough). One of the most well known phonetic algorithms is Soundex, with a python soundex algorithm here. Another is Double Metaphone, with a python metaphone module here. You can also find code for these and other phonetic algorithms in the nltk-trainer phonetics module (copied from a now defunct sourceforge project called advas). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that’s probably pushing the bounds of fuzziness a bit too far.
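
As a rough sketch, assuming the metaphone package from PyPI (whose doublemetaphone function returns a (primary, secondary) tuple of encodings), a phonetic comparison can be as simple as comparing primary encodings:

from metaphone import doublemetaphone  # assumption: the PyPI Metaphone package

def phonetic_match(s1, s2):
	# compare the primary Double Metaphone encodings of the two strings
	return doublemetaphone(s1)[0] == doublemetaphone(s2)[0]

print(phonetic_match('Smith', 'Smyth'))  # should be True: they sound the same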

NLTK Overview at SF Python

On September 14, 2011, I’ll be giving a 20 minute overview of NLTK for the San Francisco Python Meetup Group. Since it’s only 20 minutes, I can’t get into too much detail, but I plan to quickly cover the basics of:

I’ll also be soliciting feedback for an NLTK tutorial at PyCon 2012. So if you’ll be at the meetup and are interested in attending an NLTK tutorial, come find me and tell me what you’d want to learn.

Updated 9/15/2011: Slides from the talk are online – NLTK in 20 minutes

Python 3 Web Development Review

The problem with Python 3 Web Development Beginner’s Guide, by Michel Anders, is one of expectations (disclaimer: I received a free eBook from Packt for review). Let’s start with the title… First we have Python 3 Web Development. This immediately sets the wrong expectations because:

  1. There’s almost as much jQuery & Javascript as there is Python.
  2. Most of the Python code is not Python 3 specific, and the code that is could easily be translated to Python 2.
  3. Much of the Python code either uses CherryPy or is for generating HTML. This is not immediately obvious, but becomes apparent in Chapter 3 (which is available as a free PDF download: Chapter 3 – Tasklist I Persistence).

Second, this book is also supposed to be a Beginner’s Guide, but that is definitely not the case. To really grasp what’s going on, you need to already know the basics of HTML, jQuery interaction, and how HTTP works. Chapter 1 is an excellent introduction to HTTP and web application development, but the book as a whole is not beginner material. I think that anything that uses Python metaclasses automatically becomes at least intermediate level, if not expert, and the main thrust of Chapter 7 is refactoring all your straightforward database code to use complicated metaclasses.

However, if you mentally rewrite the title to be “Web Framework Development from scratch using CherryPy and jQuery”, then you’ve got the right idea. The book steps you through web app development with CherryPy, database models with sqlite3, and plenty of HTML and jQuery for interface generation and interaction. While creating example applications, you slowly build up a re-usable framework. It’s an interesting approach, but unfortunately it gets muddied up with inline HTML rendering. I never thought a language as simple and elegant as Python could be reduced to the ugliness of common PHP, but generating HTML with string interpolation inside the same functions that are accessing the database gets pretty close. I kept expecting the author to introduce template rendering, which is a major part of most modern web development frameworks, but it never happened, despite the plethora of excellent Python templating libraries.

While reading this book, I often had the recurring thought “I’m so glad I use Django”. If your aim is rapid application development, this is not the book for you. However, if you’re interested in creating your own web development framework, or would at least like to understand how a framework like Django could be created, then buy a copy of Python 3 Web Development.

PyCon NLTK Tutorial Suggestions

PyCon 2012 just released a CFP, and NLTK shows up 3 times in the suggested topics. While I’ve never done this before, I know stuff about Text Processing with NLTK so I’m going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3 hour class on NLTK, what knowledge & skills would you like to come away with? Here are a few specific topics I could cover:

  • part-of-speech tagging & chunking
  • text classification
  • creating a custom corpus and corpus reader
  • training custom models (manually and/or with nltk-trainer)
  • bootstrapping a custom corpus for text classification

Or I could do a high-level survey of many NLTK modules and corpora. Please let me know what you think in the comments, if you plan on going to PyCon 2012, and if you’d want to attend a tutorial on NLTK. You can also contact me directly if you prefer.

Co-Hosting

If you’ve done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you’re a computational linguist or NLP’er and/or have used NLTK professionally), I’d be happy to work with a co-host. Contact me if you’re interested, or leave a note in the comments.

Python Testing Cookbook Review

Python Testing Cookbook, by Greg L Turnquist (@gregturn), goes far beyond Unit Testing, but overall it’s a mixed bag. Here’s a breakdown for each chapter (disclaimer: I received a free eBook from Packt for review):

  1. Basic introduction to testing with unittest, which is great if you’re just starting with Python and testing.
  2. Good coverage of nose. I was pleasantly surprised at how easy it is to write nose plugins.
  3. Deep coverage of using doctest and writing testable docstrings. You can download a free PDF of Chapter 3 here.
  4. BDD with a cool nose plugin, and how to use mock or mockito for testing with mock objects. I wish the author had expressed an opinion in favor of either mock or mockito, but he didn’t, so I will: use Fudge. Chapter 4 also covers the Lettuce DSL, which I think is pretty neat, but since every step requires writing a function handler, I’m not convinced it’s actually easier or better than writing doctests or unit tests.
  5. Acceptance testing with Pyccuracy and Robot Framework, which both give you a way to use Selenium from Python. I thought the DSLs used seemed too “magic”, but that’s probably because I didn’t know the command words, and they weren’t highlighted or adequately explained.
  6. How to install and use Jenkins and TeamCity, and how to display XML reports produced using NoseXUnit. This is a very useful chapter for anyone thinking about or setting up continuous integration.
  7. This chapter is supposed to be about test coverage, and does introduce coverage, but the examples get needlessly complicated. Previous chapters used a simple shopping cart example, but this chapter uses network events, which really distracts from the tests. The author also writes unit tests that just print the results instead of actually testing results with assertions.
  8. More network event complexity while trying to demonstrate smoke testing and load testing. This chapter would have made a lot more sense in a book about network programming and how to test network events. Pyro is used with very little explanation, and MySQL and SQLite are brought in too, increasing the complexity even more.
  9. This chapter is filled with useful advice, but there’s no actual code examples. Instead, the advice is shoehorned into the cookbook format, which I felt detracted from the otherwise great content.

Throughout the book, the author presents a kind of “main script” that he updates at the end of many of the chapters. However, the script also contains stub functions that are never touched and barely explained, making their existence completely unnecessary. There are also far too many uses of import *, which can make it difficult to understand the code. But I did learn enough new things that I think Python Testing Cookbook is worth buying and reading. Leaving out Chapters 7 and 8, I think the book is a great resource if you’re just getting started with testing, you want to do continuous integration, and/or you want to get non-programmers involved in the testing process. There’s also a blog about the book, which has links to other reviews.