NLTK 2.0.1, a.k.a NLTK 2, was recently released, and what follows is my favorite changes, new features, and highlights from the ChangeLog.
New Classifiers
The SVMClassifier adds support vector machine classification thru SVMLight with PySVMLight. This is a much needed addition to the set of supported classification algorithms. But even more interesting…
The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.
Github
NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it’s become the de-facto “social coding” site.
Sphinx
Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and restructured text (which I used to write this post), I’m not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you’re looking for, I worry that new/interested users will have a harder time getting started.
New Corpora
Since the 0.9.9 release, a number of new corpora and corpus readers have been added:
ChangeLog Highlights
And here’s a few final highlights:
- The HunposTagger, which wraps hunpos.
- The StanfordTagger plus 2 subclasses for NER and POS tagging with the Stanford POS Tagger.
- The SnowballStemmer, which supports 13 different languages. You can try it out at my online stemming demo.
The Future
I think NLTK’s ideal role is be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.