PyCon NLTK Tutorial Suggestions
PyCon 2012 just released a CFP, and NLTK shows up 3 times in the suggested topics. While I've never done this before, I know stuff about Text Processing with NLTK so I'm going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3 hour class on NLTK, what knowledge & skills would you like to come away with? Here are a few specific topics I could cover:
- part-of-speech tagging & chunking
- text classification
- creating a custom corpus and corpus reader
- training custom models (manually and/or with nltk-trainer)
- bootstrapping a custom corpus for text classification
Or I could do a high-level survey of many NLTK modules and corpora. Please let me know what you think in the comments, if you plan on going to PyCon 2012, and if you'd want to attend a tutorial on NLTK. You can also contact me directly if you prefer.
Co-Hosting
If you've done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you're a computational linguist or NLP'er and/or have used NLTK professionally), I'd be happy to work with a co-host. Contact me if you're interested, or leave a note in the comments.
Testing Command Line Scripts with Roundup
As nltk-trainer became more stable, I realized that I needed some way to test the command line scripts. My previous ad-hoc method of "test whatever script options I can remember" was becoming unwieldy and unreliable. But how do you make repeatable tests for a command line script? It doesn't really fit into the standard unit testing model.
Enter roundup by Blake Mizerany. (NOTE: do not try to do apt-get install roundup. You will get an issue tracking system, not a script testing tool).
Roundup provides a great way to prevent shell bugs by creating simple test functions within a shell script. Here's the first dozen lines of train_classifier.sh, which you can probably guess tests train_classifier.py:
#!/usr/bin/env roundup
describe "train_classifier.py"
it_displays_usage_when_no_arguments() {
./train_classifier.py 2>&1 | grep -q "usage: train_classifier.py"
}
it_cannot_find_foo() {
last_line=$(./train_classifier.py foo 2>&1 | tail -n 1)
test "$last_line" "=" "ValueError: cannot find corpus path for foo"
}
describe is like the name of a module or test case, and all test functions begin with it_. Within the test functions, you use standard shell commands that should produce no output on success (like grep -q or the test command). You can also match multiple lines of output, as in:
it_trains_movie_reviews_paras() {
test "$(./train_classifier.py movie_reviews --no-pickle --no-eval --fraction 0.5 --instances paras)" "=" "loading movie_reviews
2 labels: ['neg', 'pos']
1000 training feats, 1000 testing feats
training NaiveBayes classifier"
}
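The same idea, run the command and compare what it prints, can also be sketched outside of roundup. The following is only an illustration in Python and is not part of nltk-trainer: the helper name last_line_of and the inline stand-in script are hypothetical, chosen so the example runs without train_classifier.py being present.

```python
import subprocess
import sys

def last_line_of(args):
    """Run a command, merge stderr into stdout (like 2>&1), and return the last output line."""
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = proc.stdout.strip().splitlines()
    return lines[-1] if lines else ""

# Hypothetical stand-in for a script that fails with a traceback-style message.
fake_script = (
    "import sys; "
    "sys.stderr.write('Traceback...\\nValueError: cannot find corpus path for foo\\n'); "
    "sys.exit(1)"
)

last = last_line_of([sys.executable, "-c", fake_script])
# Equivalent in spirit to: test "$last_line" "=" "ValueError: ..."
assert last == "ValueError: cannot find corpus path for foo"
```

This mirrors it_cannot_find_foo above: only the final line of the combined output is compared, so the rest of the traceback can change without breaking the test.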
Once you've got all your test functions defined, make sure your test script is executable and roundup is installed, then run your test script. You'll get nice output that looks like:
nltk-trainer$ tests/train_classifier.sh
train_classifier.py
it_displays_usage_when_no_arguments: [PASS]
it_cannot_find_foo: [PASS]
it_cannot_import_reader: [PASS]
it_trains_movie_reviews_paras: [PASS]
it_trains_corpora_movie_reviews_paras: [PASS]
it_cross_fold_validates: [PASS]
it_trains_movie_reviews_sents: [PASS]
it_trains_movie_reviews_maxent: [PASS]
it_shows_most_informative: [PASS]
=========================================================
Tests: 9 | Passed: 9 | Failed: 0
So far, roundup has been a perfect tool for testing all the nltk-trainer scripts, and the only downside is the one-time manual installation. I highly recommend it for anyone writing custom commands and scripts, no matter what language you use to write them.
Python Testing Cookbook Review
Python Testing Cookbook, by Greg L Turnquist (@gregturn), goes far beyond Unit Testing, but overall it's a mixed bag. Here's a breakdown for each chapter (disclaimer: I received a free eBook from Packt for review):
- Basic introduction to testing with unittest, which is great if you're just starting with Python and testing.
- Good coverage of nose. I was pleasantly surprised at how easy it is to write nose plugins.
- Deep coverage of using doctest and writing testable docstrings. You can download a free PDF of Chapter 3 here.
- BDD with a cool nose plugin, and how to use mock or mockito for testing with mock objects. I wish the author had expressed an opinion in favor of either mock or mockito, but he didn't, so I will: use Fudge. Chapter 4 also covers the Lettuce DSL, which I think is pretty neat, but since every step requires writing a function handler, I'm not convinced it's actually easier or better than writing doctests or unit tests.
- Acceptance testing with Pyccuracy and Robot Framework, which both give you a way to use Selenium from Python. I thought the DSLs used seemed too "magic", but that's probably because I didn't know the command words, and they weren't highlighted or adequately explained.
- How to install and use Jenkins and TeamCity, and how to display XML reports produced using NoseXUnit. This is a very useful chapter for anyone thinking about or setting up continuous integration.
- This chapter is supposed to be about test coverage, and does introduce coverage, but the examples get needlessly complicated. Previous chapters used a simple shopping cart example, but this chapter uses network events, which really distracts from the tests. The author also writes unittests that just print the results instead of actually testing results with assertions.
- More network event complexity while trying to demonstrate smoke testing and load testing. This chapter would have made a lot more sense in a book about network programming and how to test network events. Pyro is used with very little explanation, and MySQL and SQLite are brought in too, increasing the complexity even more.
- This chapter is filled with useful advice, but there are no actual code examples. Instead, the advice is shoehorned into the cookbook format, which I felt detracted from the otherwise great content.
Throughout the book, the author presents a kind of "main script" that he updates at the end of many of the chapters. However, the script also contains stub functions that are never touched and barely explained, making their existence completely unnecessary. There are also far too many import * statements, which can make it difficult to understand the code. But I did learn enough new things that I think Python Testing Cookbook is worth buying and reading. Leaving out Chapters 7 and 8, I think the book is a great resource if you're just getting started with testing, you want to do continuous integration, and/or you want to get non-programmers involved in the testing process. There's also a blog about the book, which has links to other reviews.
Programming Collective Intelligence Review
Programming Collective Intelligence is a great conceptual introduction to many common machine learning algorithms and techniques. It covers classification algorithms such as Naive Bayes and Neural Networks, and algorithmic optimization approaches like Genetic Programming. The book also manages to pick interesting example applications, such as stock price prediction and topic identification.
There are two chapters in particular that stand out to me. First is Chapter 6, which covers Naive Bayes classification. What stood out was that the algorithm presented is an online learner, which means it can be updated as data comes in, unlike the NLTK NaiveBayesClassifier, which can be trained only once. Another thing that caught my attention was Fisher's method, which is not implemented in NLTK, but could be with a little work. Apparently Fisher's method is great for spam filtering, and is used by the SpamBayes Outlook plugin (which is also written in Python).
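To make the "online learner" idea concrete, here is a minimal sketch of a Naive Bayes classifier that keeps only running counts, so new examples can be added at any time. It is not the book's code and does not implement Fisher's method. The class name OnlineNaiveBayes, the add-one smoothing, and the toy training data are all my own assumptions for illustration.

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Toy Naive Bayes classifier whose counts can be updated at any time."""

    def __init__(self):
        self.label_counts = defaultdict(int)  # examples seen per label
        self.feature_counts = defaultdict(lambda: defaultdict(int))  # label -> feature -> count
        self.vocabulary = set()

    def train(self, features, label):
        """Add one labeled example by updating counts; nothing is retrained from scratch."""
        self.label_counts[label] += 1
        for feat in features:
            self.feature_counts[label][feat] += 1
            self.vocabulary.add(feat)

    def classify(self, features):
        """Return the label with the highest log prior plus smoothed log likelihood."""
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total_examples)
            total_feats = sum(self.feature_counts[label].values())
            for feat in features:
                seen = self.feature_counts[label].get(feat, 0)
                # Add-one smoothing so unseen features do not zero out the score.
                score += math.log((seen + 1) / (total_feats + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = OnlineNaiveBayes()
clf.train(["great", "fun", "movie"], "pos")
clf.train(["boring", "bad", "movie"], "neg")
first_guess = clf.classify(["bad", "acting"])

# More data arrives later: keep updating the same classifier.
clf.train(["terrible", "acting"], "neg")
second_guess = clf.classify(["terrible"])
```

Because the model is nothing more than counts, calling train again after classify is cheap, which is the property that distinguishes it from a classifier that must be rebuilt from the full training set.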
Second, I found Chapter 9, which covers Support Vector Machines and Kernel Methods, to be quite intuitive. It explains the idea by starting with examples of linear classification and its shortfalls. But then the examples show that by scaling the data in a particular way first, linear classification suddenly becomes possible. And the kernel trick is simply a neat and efficient way to reduce the amount of calculation necessary to train a classifier on scaled data.
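A small numeric sketch may help here. It assumes a degree-2 polynomial kernel and a hand-picked set of points, both chosen by me for illustration and not necessarily the ones the book uses. The explicit mapping phi makes the two classes separable by a simple threshold, and the kernel function returns the same inner product as the mapping without ever computing the mapped coordinates.

```python
import math

def phi(point):
    """Explicitly map a 2-D point into a 3-D space of squared terms."""
    x, y = point
    return (x * x, math.sqrt(2) * x * y, y * y)

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(p, q):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(p, q) ** 2

inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]  # class A: close to the origin
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]  # class B: surrounds class A

def radius_squared(point):
    """In the mapped space, first + third coordinate equals x^2 + y^2."""
    mapped = phi(point)
    return mapped[0] + mapped[2]

# Class A sits inside the triangle formed by class B, so no straight line in
# 2-D separates them, but a single threshold in the mapped space does.
inner_max = max(radius_squared(p) for p in inner)
outer_min = min(radius_squared(p) for p in outer)
separable = inner_max < 1.0 < outer_min

# The kernel matches the inner product of the explicit mappings.
p, q = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(p), phi(q))
implicit = poly_kernel(p, q)
```

For 2-D input the saving is trivial, but the mapped space grows quickly with the number of input dimensions, while the kernel still costs only one dot product in the original space.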
The final chapter summarizes all the key algorithms, and for many it includes commentary on their strengths and weaknesses. This seems like valuable reference material, especially for when you have a new data set to learn from, and you're not sure which algorithms will help get the results you're looking for. Overall, I found Programming Collective Intelligence to be an enjoyable read on my Kindle 3, and highly recommend it to anyone getting started with machine learning and Python, as well as anyone interested in a general survey of machine learning algorithms.
Bay Area NLP Meetup
This Thursday, July 7, 2011, will be the first meeting of the Bay Area NLP group, at Chomp HQ in San Francisco, where I will be giving a talk on NLTK titled "NLTK: the Good, the Bad, and the Awesome". I'll be sharing some of the things I've learned using NLTK, operating text-processing.com, and doing random consulting on natural language processing. I'll also explain why NLTK-Trainer exists and how awesome it is for training NLP models. So if you're in the area and have some time Thursday evening, come by and say hi.
Update on 07/10/2011: slides are online from my talk: NLTK: the Good, the Bad, and the Awesome.
Upcoming Python Book Reviews
Programming Collective Intelligence
I recently finished reading Programming Collective Intelligence and will be posting a review soon. The TL;DR review is: get it if you want a great introduction to machine learning with Python. It covers a lot of complex algorithms in a simple way, and provides some great example use cases.
Python Testing Cookbook
Testing is something nearly every developer can do more of, and this Python Testing Cookbook looks to be full of techniques for integrating testing at various levels of a project. As a preview, you can download a PDF of Chapter 3 - Creating Testable Documentation with doctest.
Python 3 Web Development Beginner's Guide
I haven't used Python 3 yet, so Python 3 Web Development Beginner's Guide is a good excuse to do so. I also haven't done any web development outside of Django in a few years, and I'm interested to see how it compares to doing it from scratch. As a preview, you can download a PDF of Chapter 3 - Tasklist I Persistence.
Kindle 3
I'm reading all of these on a Kindle 3, which has worked out surprisingly well. It's obviously not good for copy & pasting code snippets, but that's generally a bad idea anyway. And if you don't want to type code in yourself, you can always download it from the publisher's site.
Weotta at TechCrunch Disrupt
For those that missed it, my company, Weotta, launched at TechCrunch Disrupt NY 2011. The experience was at turns exciting, stressful, and fun. We met many cool people (like the teams from Skylines and Rexly) and had some delicious food at restaurants like Song, Fatty Crab, and Momofuku.
On the first day, we gave our demo, and I nearly swore on stage when I saw the big red X's that you get when the Google Static Maps API rate limits your IP address. I had checked before the session started, and everything seemed okay, but I guess you can't escape Murphy's Law (especially when hundreds of people are sharing the same IP address). Afterwards, we immediately scrambled to get the site ready to allow people in. So many people were sharing our beta invite link on Facebook that the Weotta Facebook App was temporarily disabled due to unusual behavior. Luckily, our excellent advisor Mike Hart connected us with some great people at the Facebook API team, and they quickly got us back online.
The next day we discovered, and quickly fixed, an inaccurate geocode that was causing certain plans not to generate. Then I found out that anonymized Facebook emails are much longer than Django's 75 character default EmailField max_length. Not wanting to do a database migration while so many people were using the site, I waited until getting back home to fix this issue. But despite these small problems, hundreds of people were able to get in to Weotta, make plans, and discover fun things to do with their friends.
Weotta has been running smoothly ever since, and now that the conference craziness is over, we can start focusing on our #1 feedback: when will Weotta be in my city? We got requests for everywhere from Chicago and Denver to Sydney and Singapore. We hear you, and will be expanding outside of SF and NY as fast as we can. While our methods are very algorithmic and we don't depend on UGC, it still takes human effort to give you focused, localized, highly relevant content so you can easily discover and plan amazing occasions. And if you'd like to help us expand and improve Weotta, get in touch. On the technical side, we're looking for at least 2 developers: a crawler/content person familiar with Scrapy, and a Django/jQuery web developer. If you're interested, contact me on github, LinkedIn, or directly at jacob@weotta.com.
We hope that everyone who signed up for the beta has received their invite; if you haven't (or want one), then you can sign up for Weotta here (only a limited number will get in). And if you want to learn more about Weotta, then check out the Weotta press coverage. Weotta currently covers San Francisco and New York, so if you're interested in a "personal concierge" like service that can provide recommended plans/itineraries of things to do in a city, then sign up for Weotta here.
Interview and Article about NLTK and Text-Processing
I recently did an interview with Zoltan Varju (@zoltanvarju) about Python, NLTK, and my demos & APIs at text-processing.com, which you can read here. There's even a bit about Erlang & functional programming, as well as some insight into what I've been working on at Weotta. And last week, the text-processing.com API got a write up (and a nice traffic boost) from Garrett Wilkin (@garrettwilkin) on programmableweb.com.
Text Processing API Survey
If you've been using the text-processing.com API, or are thinking about using it, I'd appreciate it if you take this survey. Usage of the API has gone up recently (especially sentiment analysis), and a number of people have gone over the 1k requests/day/IP limit, so I'm considering a freemium model and/or commercial licensing for a self-hosted version. So if you'd like to use the API to do more than 1k reqs/day and/or analyze text whose length is greater than 10k characters, please take this short survey.
Analyzing Tagged Corpora and NLTK Part of Speech Taggers
NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.
Analyze a Tagged Corpus
You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here's the tag counts for the treebank corpus:
$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags
Tag Count
======= =========
# 16
$ 724
'' 694
, 4886
-LRB- 120
-NONE- 6592
-RRB- 126
. 3874
: 563
CC 2265
CD 3546
DT 8165
EX 88
FW 4
IN 9857
JJ 5834
JJR 381
JJS 182
LS 13
MD 927
NN 13166
NNP 9410
NNPS 244
NNS 6047
PDT 27
POS 824
PRP 1716
PRP$ 766
RB 2822
RBR 136
RBS 35
RP 216
SYM 1
TO 2179
UH 3
VB 2554
VBD 3043
VBG 1460
VBN 2134
VBP 1321
VBZ 2125
WDT 445
WP 241
WP$ 14
WRB 178
`` 712
======= =========
By default, analyze_tagged_corpus.py sorts by tags, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:
$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags
Tag Count
======= =========
7416
# 16
$ 724
'' 694
( 120
) 126
, 4886
. 3874
: 563
ADJ 6397
ADV 2993
CNJ 2265
DET 8192
EX 88
FW 4
L 13
MOD 927
N 19213
NP 9654
NUM 3546
P 9857
PRO 2698
S 1
TO 2179
UH 3
V 6000
VD 3043
VG 1460
VN 2134
WH 878
`` 712
======= =========
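The kind of counting behind these tables can be sketched in a few lines of Python. This is not the source of analyze_tagged_corpus.py: the tagged_words list is a hypothetical stand-in for what a tagged corpus reader would provide, and sorted_rows is a made-up helper that imitates the effect of the sort options described above.

```python
from collections import Counter

# Hypothetical stand-in for a corpus reader's (word, tag) pairs.
tagged_words = [
    ("the", "DT"), ("cat", "NN"), ("saw", "VBD"),
    ("the", "DT"), ("dog", "NN"), (".", "."),
]

tag_counts = Counter(tag for _, tag in tagged_words)
unique_words = {word for word, _ in tagged_words}

def sorted_rows(counts, by="tag", reverse=False):
    """Return (tag, count) rows sorted by tag name or by count."""
    key = (lambda row: row[1]) if by == "count" else (lambda row: row[0])
    return sorted(counts.items(), key=key, reverse=reverse)

print("%d total words" % len(tagged_words))
print("%d unique words" % len(unique_words))
print("%d tags" % len(tag_counts))
# Highest counts first, similar in effect to --sort count --reverse.
for tag, count in sorted_rows(tag_counts, by="count", reverse=True):
    print(tag, count)
```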
Analyze Tagger Coverage
You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here's the results for the treebank corpus using NLTK's default part-of-speech tagger:
$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Tag Found
======= =========
# 16
$ 724
'' 694
, 4887
-LRB- 120
-NONE- 6591
-RRB- 126
. 3874
: 563
CC 2271
CD 3547
DT 8170
EX 88
FW 4
IN 9880
JJ 5803
JJR 386
JJS 185
LS 12
MD 927
NN 13166
NNP 9427
NNPS 246
NNS 6055
PDT 21
POS 824
PRP 1716
PRP$ 766
RB 2800
RBR 130
RBS 33
RP 213
SYM 1
TO 2180
UH 3
VB 2562
VBD 3035
VBG 1458
VBN 2145
VBP 1318
VBZ 2124
WDT 440
WP 241
WP$ 14
WRB 178
`` 712
======= =========
If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:
$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Accuracy: 0.995689
Unknown words: 440
Tag Found Actual Precision Recall
======= ========= ========== ============= ==========
# 16 16 1.0 1.0
$ 724 724 1.0 1.0
'' 694 694 1.0 1.0
, 4887 4886 1.0 1.0
-LRB- 120 120 1.0 1.0
-NONE- 6591 6592 1.0 1.0
-RRB- 126 126 1.0 1.0
. 3874 3874 1.0 1.0
: 563 563 1.0 1.0
CC 2271 2265 1.0 1.0
CD 3547 3546 0.99895833333 0.99895833333
DT 8170 8165 1.0 1.0
EX 88 88 1.0 1.0
FW 4 4 1.0 1.0
IN 9880 9857 0.99130434782 0.95798319327
JJ 5803 5834 0.99134948096 0.97892938496
JJR 386 381 1.0 0.91489361702
JJS 185 182 0.96666666666 1.0
LS 12 13 1.0 0.85714285714
MD 927 927 1.0 1.0
NN 13166 13166 0.99166034874 0.98791540785
NNP 9427 9410 0.99477911646 0.99398073836
NNPS 246 244 0.99029126213 0.95327102803
NNS 6055 6047 0.99515235457 0.99722414989
PDT 21 27 1.0 0.66666666666
POS 824 824 1.0 1.0
PRP 1716 1716 1.0 1.0
PRP$ 766 766 1.0 1.0
RB 2800 2822 0.99305555555 0.975
RBR 130 136 1.0 0.875
RBS 33 35 1.0 0.5
RP 213 216 1.0 1.0
SYM 1 1 1.0 1.0
TO 2180 2179 1.0 1.0
UH 3 3 1.0 1.0
VB 2562 2554 0.99142857142 1.0
VBD 3035 3043 0.990234375 0.98065764023
VBG 1458 1460 0.99650349650 0.99824868651
VBN 2145 2134 0.98852223816 0.99566473988
VBP 1318 1321 0.99305555555 0.98281786941
VBZ 2124 2125 0.99373040752 0.990625
WDT 440 445 1.0 0.83333333333
WP 241 241 1.0 1.0
WP$ 14 14 1.0 1.0
WRB 178 178 1.0 1.0
`` 712 712 1.0 1.0
======= ========= ========== ============= ==========
These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question "for each word that was given this tag, was it correct?", while Recall answers the question "for all words that should have gotten this tag, did they get it?". If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.
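The two questions translate directly into code. The sketch below computes per-tag precision and recall token by token from aligned gold and predicted tag lists. It is a generic illustration under my own assumptions (the function name tag_metrics and the toy tag sequences are hypothetical), and analyze_tagger_coverage.py may well compute its figures differently.

```python
from collections import Counter

def tag_metrics(gold_tags, predicted_tags):
    """Return {tag: (found, actual, precision, recall)} for aligned tag sequences."""
    found = Counter(predicted_tags)   # how often the tagger assigned each tag
    actual = Counter(gold_tags)       # how often each tag appears in the gold data
    correct = Counter(g for g, p in zip(gold_tags, predicted_tags) if g == p)
    metrics = {}
    for tag in set(found) | set(actual):
        # Precision: of the tokens given this tag, what fraction was right?
        precision = correct[tag] / found[tag] if found[tag] else None
        # Recall: of the tokens that should have this tag, what fraction got it?
        recall = correct[tag] / actual[tag] if actual[tag] else None
        metrics[tag] = (found[tag], actual[tag], precision, recall)
    return metrics

# Toy data: the tagger mislabels one of the two PDT tokens as DT.
gold = ["DT", "NN", "VBD", "PDT", "DT", "NN", "PDT", "JJ"]
pred = ["DT", "NN", "VBD", "PDT", "DT", "NN", "DT", "JJ"]

metrics = tag_metrics(gold, pred)
pdt_found, pdt_actual, pdt_precision, pdt_recall = metrics["PDT"]
dt_found, dt_actual, dt_precision, dt_recall = metrics["DT"]
```

In this toy run PDT has perfect precision but only half recall, while DT shows the opposite pattern: it picks up an extra token it should not have, so its precision drops while its recall stays at 100%.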





