PyCon NLTK Tutorial Suggestions

PyCon 2012 just released a CFP, and NLTK shows up 3 times in the suggested topics. While I’ve never done this before, I know stuff about Text Processing with NLTK, so I’m going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3-hour class on NLTK, what knowledge & skills would you like to come away with? Here are a few specific topics I could cover:

  • part-of-speech tagging & chunking
  • text classification
  • creating a custom corpus and corpus reader
  • training custom models (manually and/or with nltk-trainer)
  • bootstrapping a custom corpus for text classification

Or I could do a high-level survey of many NLTK modules and corpora. Please let me know what you think in the comments, if you plan on going to PyCon 2012, and if you’d want to attend a tutorial on NLTK. You can also contact me directly if you prefer.
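
To give a concrete sense of the first topic above, here is a minimal sketch of part-of-speech tagging plus noun phrase chunking with NLTK's default tagger and a regular-expression chunker; the sentence and chunk grammar are purely illustrative, and the standard nltk_data tokenizer and tagger models are assumed to be installed.

    import nltk

    sentence = "NLTK makes natural language processing in Python surprisingly easy."

    # Tokenize the raw string, then tag each token with a part of speech.
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)

    # A simple noun phrase chunker: an optional determiner, any number of
    # adjectives, then one or more nouns.
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(tagged)

    # Print each noun phrase the chunker found.
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))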

Co-Hosting

If you’ve done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you’re a computational linguist or NLP’er and/or have used NLTK professionally), I’d be happy to work with a co-host. Contact me if you’re interested, or leave a note in the comments.

  • http://twitter.com/MikeScarp Mike Scarpati

    I doubt that I’ll be able to make it, but I’d be interested to see some sort of “case study” on a custom corpus. I had a project recently where I wanted to do some text classification and went with R’s tm package instead of NLTK because it was easier to find documentation and examples of preprocessing in tm. Best of luck!

  • http://victorypoints.com Pete Mancini

    I’m not going to PyCon, but I did use NLTK in a big project this year. My suggestion would be a tutorial on noise control. I found that by implementing strong noise control on the incoming text, the processed structures coming out of NLTK were vastly improved. I was doing work with concept maps and creating noun groups. Noise control was essential to getting usable results.

  • Peter Herndon

    I’d say pick a small set of vaguely defined, ill-worded business goals, and demonstrate how to get there using NLTK. Not just the academic examples that come with NLTK, but stuff that a business could actually use and would want to do.  Examples: an ecommerce outfit has customer-supplied product reviews. How can we use NLTK to classify those reviews? 

    Or, how can we use NLTK to categorize a set of blog articles, where the categories emerge from the corpus? That is, the categories are not pre-defined.
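
    As a rough sketch of the first example (classifying reviews), this is roughly what a bag-of-words Naive Bayes classifier looks like in NLTK, using the bundled movie_reviews corpus as a stand-in for the hypothetical customer product reviews; the train/test split is arbitrary.

        import random
        from nltk.classify import NaiveBayesClassifier
        from nltk.classify.util import accuracy
        from nltk.corpus import movie_reviews

        # Every word in a review becomes a boolean feature.
        def bag_of_words(words):
            return {word: True for word in words}

        # Build (featureset, label) pairs from the pos/neg movie reviews.
        labeled = [(bag_of_words(movie_reviews.words(fileid)), category)
                   for category in movie_reviews.categories()
                   for fileid in movie_reviews.fileids(category)]

        random.shuffle(labeled)
        train, test = labeled[:1500], labeled[1500:]

        classifier = NaiveBayesClassifier.train(train)
        print(accuracy(classifier, test))
        classifier.show_most_informative_features(10)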

  • http://twitter.com/CaptSolo Uldis Bojars

    It would be interesting to learn how Twitter messages can be analyzed using NLTK. They are different from other corpora in that they are shorter, abbreviated, and often do not follow correct grammar. Yet it would be interesting to mine them using natural language processing techniques.

  • http://victorypoints.com Pete Mancini

    That description makes them similar to phone conversation transcripts! However, there is much less of the call-and-response back and forth you get in phone transcripts.

  • http://streamhacker.com/ Jacob Perkins

    Good suggestion, Pete. What sort of noise control methods did you use? Filtering by information gain or TF/IDF?

  • http://streamhacker.com/ Jacob Perkins

    Yes, I think something practical with real potential business value would be very useful. And I get the impression that most people don’t know where to start when it comes to effectively using NLP/NL on their data.

  • http://streamhacker.com/ Jacob Perkins

    There’s actually a tagged phone transcript corpus that comes with NLTK. I’ve been meaning to look into it more specifically for bootstrapping models for tweets because of its potential similarity.
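
    A minimal sketch of what that bootstrapping could look like, using NLTK's NPS Chat corpus (a tagged chat corpus, used here as a stand-in rather than the phone transcripts mentioned above); the backoff chain and the example tweet are just illustrative.

        import nltk
        from nltk.corpus import nps_chat

        # Train a simple backoff tagger on tagged conversational posts:
        # bigram -> unigram -> default 'NN'.
        tagged_posts = nps_chat.tagged_posts()
        default = nltk.DefaultTagger('NN')
        unigram = nltk.UnigramTagger(tagged_posts, backoff=default)
        bigram = nltk.BigramTagger(tagged_posts, backoff=unigram)

        # Apply it to a tweet-like message.
        tweet = "lol just got my new phone, best day ever".split()
        print(bigram.tag(tweet))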

  • http://streamhacker.com/ Jacob Perkins

    This is a popular area of research right now, and I’m sure we’ll see more in the next few years. But I’m not sure if it would be a good subject for a PyCon tutorial.

  • http://streamhacker.com/ Jacob Perkins

    What sort of preprocessing were you interested in? Perhaps you can do this in NLTK, but it’s just not as easy or well documented.

  • http://victorypoints.com Pete Mancini

    For noise control I was looking at the types of errors that were commonly occurring and filtering them on a case-by-case basis. I also created a simplified character filter that removed or translated esoteric characters. This was necessary because a lot of the text came from docx or pdf files and there was a ton of junk in them. Formatting issues I didn’t deal with, and NLTK seemed to work fine in spite of them.

    Mainly I was using NLTK to automate the taxonomy of files uploaded to SharePoint 2010. It allowed for automatic clustering of files and gave us a head start. I used the coefficient of variability to determine if there was enough signal in the document to produce the concept map, and then applied cutoffs in the histogram to remove infrequently appearing concepts. I gave special preference to bi-grams and larger n-grams since they are generally more important.

    For phone transcripts I did analysis on chat rooms and other active communications while working for the Army.

    For a real challenge, point NLTK at rap lyrics. Ex.: “Got the title from my Mama; put the whip in my own name now.” Forget ARTINT, HUMINT has a hard time deciphering these!
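
    A rough sketch of this kind of noise control and concept extraction with NLTK and the standard library; the character filter, frequency cutoff, and bigram weighting below are purely illustrative, not the actual pipeline described above.

        import re
        import unicodedata
        import nltk

        def clean_text(text):
            # Translate esoteric characters to their closest ASCII equivalents
            # and replace anything that still isn't a word character, space,
            # apostrophe, or hyphen.
            text = unicodedata.normalize('NFKD', text)
            text = text.encode('ascii', 'ignore').decode('ascii')
            return re.sub(r"[^\w\s'-]", ' ', text)

        def candidate_concepts(text, min_count=3):
            words = nltk.word_tokenize(clean_text(text).lower())
            unigrams = nltk.FreqDist(words)
            bigrams = nltk.FreqDist(nltk.bigrams(words))
            # Cut off infrequent terms, and give bigrams extra weight since
            # larger n-grams tend to be more informative concepts.
            scored = {w: c for w, c in unigrams.items() if c >= min_count}
            scored.update({' '.join(b): c * 2
                           for b, c in bigrams.items() if c >= min_count})
            return sorted(scored.items(), key=lambda item: item[1], reverse=True)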

  • http://victorypoints.com Pete Mancini

    I think it may work. Tweets and SMS messages are in the same category, with SMS messages having more of the call-and-response feature than tweets.

    Phone conversations have repairs that tweets and SMS messages don’t. People will say something, then “uhm,” and then rephrase it.

    Flight transcripts during emergencies or hijackings are probably quite interesting.

    Basically I think the talk should focus on why you’d want to use NLTK and what it can do. I think the typical audience doesn’t have a clue what is possible. When I was doing my recent consulting, I had two meetings (one with IT and the other with upper management) where I laid out what NLP was good for, some of the math, and lots of graphs and examples using their own data. It was hard for some to get, but they understood the results, if not how we would get them.