When I first created text-processing.com, in the summer of 2010, my initial intention was to provide an online demo of NLTK's capabilities. I trained a bunch of models on various NLTK corpora using nltk-trainer, then started making some simple Django forms to display the results. But as I was doing this, I realized I could fairly easily create an API based on these models. Instead of rendering HTML, I could just return the results as JSON.
I wasn't sure if anyone would actually use the API, but I knew the best way to find out was to just put it out there. So I did, initially making it completely open, with a rate limit of 1000 calls per day per IP address. I figured at the very least, I might get some PHP or Ruby users that wanted the power of NLTK without having to interface with Python. Within a month, people were regularly exceeding that limit, and I quietly increased it to 5000 calls/day, while I started searching for the simplest way to monetize the API. I didn't like what I found.
Before Mashape, your options for monetizing APIs were either building a custom solution for authentication, billing, and tracking, or paying thousands of dollars a month for an "enterprise" solution from Mashery or Apigee. While I have no doubt Mashery & Apigee provide quality services, they are not in the price range of most developers. And building a custom solution is far more work than I wanted to put into it. Even now, when companies like Stripe exist to make billing easier, you'd still have to do authentication & call tracking. But Stripe didn't exist 2 years ago, and the best billing option I could find was Paypal, whose API documentation is great at inducing headaches. Lucky for me, Mashape was just opening up for beta testing, and appeared to be in the process of solving all of my problems.
Mashape was just what I needed to monetize the text-processing API, and it's improved tremendously since I started using it. They handle all the necessary details, like integrated billing, plus a lot more, such as usage charts, latency & uptime measurements, and automatic client library generation. That last feature is one of my favorites, because the client libraries are generated from your API documentation, which provides a great incentive to accurately document the ins & outs of your API. Once you've documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective C.
Here's a little history for the curious: Mashape originally did authentication and tracking by exchanging tokens through an API call. So you had to write some code to call their token API on every one of your API calls, then check the results to see if the call was valid, or if the caller had reached their limit. They didn't have all of the nice charts they have now, and their billing solution was the CEO manually handling Paypal payments. But none of that mattered, because it worked, and from conversations with them, I knew they were focused on more important things: building up their infrastructure and positioning themselves as a kind of app-store for APIs.
Mashape has been out of beta for a while now, with automated billing, and a custom proxy server for authenticating, routing, and tracking all API calls. They're releasing new features on a regular basis, and sponsoring events like MusicHackDay. I'm very impressed with everything they're doing, and on top of that, they're good hard-working people. I've been over to their "hacker house" in San Francisco a few times, and they're very friendly and accommodating. And if you're ever in the neighborhood, I'm sure they'd be open to a visit.
Once I had integrated Mashape, which was maybe 20 lines of code, the money started rolling in :). Just kidding, but using the typical definition of profit, when income exceeds costs, the text-processing API was profitable within a few months, and has remained so ever since. My only monetary cost is a single Linode server, so as long as people keep paying for the API, text-processing.com will remain online. And while it has a very nice profit margin, total monthly income barely approaches the cost of living in San Francisco. But what really matters to me is that text-processing.com has become a self-sustaining excuse for me to experiment with natural language processing techniques & data sets, test my models against the market, and provide developers with a simple way to integrate NLP into their own projects.
So if you've got an idea for an API, especially if it's something you could charge money for, I encourage you to build it and put it up on Mashape. All you need is a working API, a unique image & name, and a Paypal account for receiving payments. Like other app stores, Mashape takes a 20% cut of all revenue, but I think it's well worth it compared to the cost of replicating everything they provide. And unlike some app stores, you're not locked in. Many of the APIs on Mashape also provide alternative usage options (including text-processing), but they're on Mashape because of the increased exposure, distribution, and additional features, like client library generation. SaaS APIs are becoming a significant part of modern computing infrastructure, and Mashape provides a great platform for getting started.
Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it's much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?
Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the `feats` column for each algorithm:

- `bow` means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not.
- `int` means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with `bow` it would still get a 1).
- `tfidf` means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
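As a rough illustration (the snippet and values are made up for demonstration, not taken from the experiments), here's what those three value types look like for a toy document:

```python
from collections import Counter

words = ['great', 'great', 'movie']

bow_feats = {w: 1 for w in words}   # bow: presence only -> {'great': 1, 'movie': 1}
int_feats = dict(Counter(words))    # int: raw counts    -> {'great': 2, 'movie': 1}

# tfidf: scikit-learn's TfidfTransformer rescales count vectors like int_feats into
# floating point tf-idf weights, so words common across all documents count for less
print(bow_feats, int_feats)
```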
All numbers were determined using nltk-trainer, specifically `python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75`. For `int` features, the option `--value-type int` was used, and for `tfidf` features, the options `--value-type float --tfidf` were used. This was with NLTK 2.0.3 and sklearn 0.12.1.
| algorithm | feats | accuracy | neg f-measure | pos f-measure |
As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly, `tfidf` features either provide a very small performance increase, or significantly decrease performance. So let's see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.
Below is a table showing the accuracy of the top 5 algorithms using just unigrams (the default, a.k.a. single words), and using unigrams + bigrams (pairs of words) with the option `--ngrams 1 2`.
MultinomialNB got a modest boost in accuracy, putting it on par with the rest of the algorithms. But we can do better than this using feature scoring.
As I've shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option `--min_score SCORE` (and keeping the `--ngrams 1 2` option to get bigram features).
| algorithm | score 1 | score 2 | score 3 |
LogisticRegression, LinearSVC, and NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the Naive Bayes algorithms, which drop down significantly at `--min_score 1`, but then skyrocket up to 97% with `--min_score 2`. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).
If you haven't yet tried using scikit-learn for text classification, then I hope this article convinces you that it's worth learning. NLTK's SklearnClassifier makes the process much easier, since you don't have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The scikit-learn classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.
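For reference, here's a minimal sketch of what using the SklearnClassifier wrapper looks like outside of nltk-trainer. The bag-of-words features and the 75% train split mirror the setup above, but the code itself is my own illustration rather than nltk-trainer's internals:

```python
import random

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.util import accuracy
from nltk.corpus import movie_reviews
from sklearn.svm import LinearSVC

def bag_of_words(words):
    # every word present in the document gets a value of 1 ("bow" features)
    return dict.fromkeys(words, 1)

# build (featureset, label) pairs from the movie_reviews corpus
labeled_feats = [(bag_of_words(movie_reviews.words(fileid)), label)
                 for label in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(label)]
random.shuffle(labeled_feats)

# train on 75% of the data, test on the remaining 25%
cutoff = int(len(labeled_feats) * 0.75)
classifier = SklearnClassifier(LinearSVC()).train(labeled_feats[:cutoff])
print(accuracy(classifier, labeled_feats[cutoff:]))
```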
NLTK 2.0.1, a.k.a. NLTK 2, was recently released, and what follows are my favorite changes, new features, and highlights from the ChangeLog.
The SVMClassifier adds support vector machine classification through SVMLight with PySVMLight. This is a much needed addition to the set of supported classification algorithms. But even more interesting...
The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.
NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it's become the de-facto "social coding" site.
Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and restructured text (which I used to write this post), I'm not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you're looking for, I worry that new/interested users will have a harder time getting started.
Since the 0.9.9 release, a number of new corpora and corpus readers have been added:
And here are a few final highlights:
- The HunposTagger, which wraps hunpos.
- The StanfordTagger plus 2 subclasses for NER and POS tagging with the Stanford POS Tagger.
- The SnowballStemmer, which supports 13 different languages. You can try it out at my online stemming demo.
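If you want a quick taste of the SnowballStemmer, here's a minimal example (the sample words are my own, not from the release notes):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('running'))  # 'run'
print(stemmer.stem('cats'))     # 'cat'
```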
I think NLTK's ideal role is to be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.
I've given a few talks & presentations recently, so for anyone that doesn't follow japerk on twitter, here are some links:
- Weotta's MongoDB presentation from Tuesday, Feb 21 at the SF MongoDB meetup
- Corpus Bootstrapping with NLTK from Tuesday, Feb 28, during the Deep Data session at Strata
- PyCon NLTK Tutorial code from Thursday, March 8 at PyCon 2012
I also want to recommend 2 books that helped me mentally prepare for these talks:
My PyCon tutorial, Introduction to NLTK, now has over 40 people registered. This is about twice as many people as I was expecting, but I'm glad so many people want to learn NLTK. Because of the large class size, it'd be really helpful to have a couple of assistants with at least some NLTK experience, including, but not limited to:
* installing NLTK
* installing & using NLTK on Windows
* installing & using nltk-trainer
* creating custom corpora
* using WordNet
If you're interested in helping out, please read Tutorial Assistants and contact me, japerk -- at -- gmail. Thanks!
At the end of February and the beginning of March, I'll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order...
How Weotta uses MongoDB
Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about How Weotta uses MongoDB. We'll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we'll finish with a preview of Weotta's upcoming MongoDB powered local search APIs.
NLTK Jam Session at NICAR 2012
On Thursday, February 23, in St Louis, MO, I'll be demonstrating how to use NLTK as part of the NewsCamp workshop at NICAR 2012. This will be a version of my PyCon NLTK Tutorial with a focus on news text and corpora like treebank.
Corpus Bootstrapping with NLTK at Strata 2012
As part of the Strata 2012 Deep Data program, I'll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there's plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn't exist. So how can you quickly create a quality corpus at minimal cost? I'll cover specific real-world examples to answer this question.
NLTK Tutorial at PyCon 2012
Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You'll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you'll walk out with at least one new NLP superpower that you can put to use immediately.
Fuzzy matching is a general term for finding strings that are almost equal, or mostly the same. Of course almost and mostly are ambiguous terms themselves, so you'll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create doctests for this, like so:
```python
def fuzzy_match(s1, s2):
    '''
    >>> fuzzy_match('Happy Days', ' happy days ')
    True
    >>> fuzzy_match('happy days', 'sad days')
    False
    '''
    # TODO: fuzzy matching code
    return s1 == s2
```
Once you've got a good set of test cases, then it's much easier to tailor your fuzzy matching code to get the best results.
The first step before doing any string matching is normalization. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While `'Happy Days' != ' happy days '`, with simple normalization you can get `'Happy Days'.lower() == ' happy days '.strip()`.
The most basic normalization you can do is to lowercase and strip whitespace. But chances are you'll want to do more. For example, here's a simple normalization function that also removes all punctuation in a string.
```python
import string

def normalize(s):
    for p in string.punctuation:
        s = s.replace(p, '')
    return s.lower().strip()
```
Using this normalize function, we can make the above fuzzy matching function pass our simple tests.
```python
def fuzzy_match(s1, s2):
    '''
    >>> fuzzy_match('Happy Days', ' happy days ')
    True
    >>> fuzzy_match('happy days', 'sad days')
    False
    '''
    return normalize(s1) == normalize(s2)
```
If you want to get more advanced, keep reading...
Beyond just stripping whitespace from the ends of strings, it's also a good idea to replace all internal runs of whitespace with a single space character. The regex call for doing this is `re.sub(r'\s+', ' ', s)`. This will replace every occurrence of one or more spaces, newlines, tabs, etc., essentially eliminating the significance of whitespace for matching.
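Here's one way that whitespace collapsing could be folded into the normalize function from above (my own combination of the two steps, shown as a sketch):

```python
import re
import string

def normalize(s):
    # strip punctuation as before
    for p in string.punctuation:
        s = s.replace(p, '')
    # collapse any run of whitespace (spaces, tabs, newlines) into a single space
    return re.sub(r'\s+', ' ', s.lower().strip())

print(normalize('Happy\t  Days! '))  # 'happy days'
```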
You may also be able to use regular expressions for partial fuzzy matching. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a simple regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you'll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.
The edit distance (a.k.a. Levenshtein distance) is the number of single character edits it would take to transform one string into another. Therefore, the smaller the edit distance, the more similar two strings are. For example, transforming 'kitten' into 'sitting' takes 3 edits, while transforming 'kitten' into 'mitten' takes only 1.
If you want to do edit distance calculations, check out the standalone editdist module. Its `distance` function takes 2 strings and returns the Levenshtein edit distance. It's also implemented in C, and so is quite fast.
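As a rough sketch (assuming the module is installed separately, e.g. from PyPI), a threshold-based fuzzy matcher built on it might look like this:

```python
import editdist  # standalone C extension module

def fuzzy_match(s1, s2, max_dist=3):
    # strings within a few single-character edits of each other are "close enough"
    return editdist.distance(s1, s2) <= max_dist

print(fuzzy_match('happy days', 'happy dayz'))  # True: edit distance 1
print(fuzzy_match('happy days', 'sad days'))    # False: edit distance 4
```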
Fuzzywuzzy is a great all-purpose library for fuzzy string matching, built (in part) on top of Python's difflib. It has a number of different fuzzy matching functions, and it's definitely worth experimenting with all of them. I've personally found `token_set_ratio` to be the most useful.
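As a quick illustration (the strings are my own examples), each scoring function returns a similarity score from 0 to 100:

```python
from fuzzywuzzy import fuzz

print(fuzz.ratio('Happy Days', 'happy days'))             # plain character-based ratio
print(fuzz.token_sort_ratio('happy days', 'days happy'))  # ignores word order
print(fuzz.token_set_ratio('the happy days tv show', 'happy days'))  # also ignores extra tokens
```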
If you want to do some custom fuzzy string matching, then NLTK is a great library to use. There's word tokenizers, stemmers, and it even has its own edit distance implementation. Here's a way you could combine all 3 to create a fuzzy string matching function.
```python
from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

def normalize(s):
    words = tokenize.wordpunct_tokenize(s.lower().strip())
    return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
    return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
```
Finally, an interesting and perhaps non-obvious way to compare strings is with phonetic algorithms. The idea is that 2 strings that sound the same may be the same (or at least similar enough). One of the most well known phonetic algorithms is Soundex, with a python soundex algorithm here. Another is Double Metaphone, with a python metaphone module here. You can also find code for these and other phonetic algorithms in the nltk-trainer phonetics module (copied from a now defunct sourceforge project called advas). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that's probably pushing the bounds of fuzziness a bit too far.
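To make the idea concrete, here's a self-contained sketch using a simplified Soundex encoding. This is my own illustrative implementation (it skips some of the official rules, like the h/w separator handling), not code from any of the modules mentioned above:

```python
# map each consonant to its Soundex digit; vowels, h, w, and y get no code
LETTER_CODE = {}
for letters, code in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                      ('l', '4'), ('mn', '5'), ('r', '6')]:
    for letter in letters:
        LETTER_CODE[letter] = code

def soundex(word):
    word = word.lower()
    encoded = word[0].upper()
    prev = LETTER_CODE.get(word[0], '')
    for char in word[1:]:
        code = LETTER_CODE.get(char, '')
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + '000')[:4]

def phonetic_match(s1, s2):
    # two strings "match" if their phonetic encodings are equal
    return soundex(s1) == soundex(s2)

print(soundex('Smith'), soundex('Smyth'))  # both encode to S530
print(phonetic_match('Robert', 'Rupert'))  # True: both encode to R163
```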
On September 14, 2011, I'll be giving a 20 minute overview of NLTK for the San Francisco Python Meetup Group. Since it's only 20 minutes, I can't get into too much detail, but I plan to quickly cover the basics of:
- tokenization and why it's not as easy as splitting on whitespace
- part-of-speech tagging and why it's important
- chunking and named entity recognition
- text classification and how it works for sentiment analysis
- training your own models with nltk-trainer
I'll also be soliciting feedback for an NLTK tutorial at PyCon 2012. So if you'll be at the meetup and are interested in attending an NLTK tutorial, come find me and tell me what you'd want to learn.
Updated 9/15/2011: Slides from the talk are online - NLTK in 20 minutes
PyCon 2012 just released a CFP, and NLTK shows up 3 times in the suggested topics. While I've never done this before, I know stuff about Text Processing with NLTK so I'm going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3 hour class on NLTK, what knowledge & skills would you like to come away with? Here are a few specific topics I could cover:
- part-of-speech tagging & chunking
- text classification
- creating a custom corpus and corpus reader
- training custom models (manually and/or with nltk-trainer)
- bootstrapping a custom corpus for text classification
Or I could do a high-level survey of many NLTK modules and corpora. Please let me know what you think in the comments, if you plan on going to PyCon 2012, and if you'd want to attend a tutorial on NLTK. You can also contact me directly if you prefer.
If you've done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you're a computational linguist or NLP'er and/or have used NLTK professionally), I'd be happy to work with a co-host. Contact me if you're interested, or leave a note in the comments.
Programming Collective Intelligence is a great conceptual introduction to many common machine learning algorithms and techniques. It covers classification algorithms such as Naive Bayes and Neural Networks, and algorithmic optimization approaches like Genetic Programming. The book also manages to pick interesting example applications, such as stock price prediction and topic identification.
There are two chapters in particular that stand out to me. First is Chapter 6, which covers Naive Bayes classification. What stood out was that the algorithm presented is an online learner, which means it can be updated as data comes in, unlike the NLTK NaiveBayesClassifier, which can be trained only once. Another thing that caught my attention was Fisher's method, which is not implemented in NLTK, but could be with a little work. Apparently Fisher's method is great for spam filtering, and is used by the SpamBayes Outlook plugin (which is also written in Python).
Second, I found Chapter 9, which covers Support Vector Machines and Kernel Methods, to be quite intuitive. It explains the idea by starting with examples of linear classification and its shortfalls. But then the examples show that by scaling the data in a particular way first, linear classification suddenly becomes possible. And the kernel trick is simply a neat and efficient way to reduce the amount of calculation necessary to train a classifier on scaled data.
The final chapter summarizes all the key algorithms, and for many it includes commentary on their strengths and weaknesses. This seems like valuable reference material, especially for when you have a new data set to learn from, and you're not sure which algorithms will help get the results you're looking for. Overall, I found Programming Collective Intelligence to be an enjoyable read on my Kindle 3, and highly recommend it to anyone getting started with machine learning and Python, as well as anyone interested in a general survey of machine learning algorithms.