Tag Archives: nlp

Highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs

Below are more highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages

it is manifestly clear that similarity in form is neither a necessary nor sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences

our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words are orthographically distant (e.g., October and January)

The goal of our work is not to overcome existing benchmarks, but show that much of the feature engineering done in the benchmarks can be learnt automatically from the task specific data. More importantly, we wish to show large dimensionality word look tables can be compacted into a lookup table using characters and a compositional model allowing the model scale better with the size of the training data. This is a desirable property of the model as data becomes more abundant in many NLP tasks.

The authors have also released Java code for training neural networks.

NLTK 3 Changes

NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.

Probability Classes

The FreqDist api has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1.

fd.samples() doesn’t exist anymore. Instead, you can use fd.most_common(), which is a method of collections.Counter that returns a list that looks like [(word, count)].

ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.

WordNet API

NLTK 3 has changed many wordnet Synset attributes to methods:

  • syn.definition -> syn.definition()
  • syn.examples -> syn.examples()
  • syn.lemmas -> syn.lemmas()
  • syn.name -> syn.name()
  • syn.pos -> syn.pos()

Same goes for the Lemma class. For example, lemma.antonyms() is now a method.


The batch_tag() method is now tag_sents(). The brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface with brill.Pos or brill.Word as implementations of the interface.

Universal Tagset

Simplified tags have been replaced with the universal tagset. So tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (PennTreeBank) and the brown tagset is called en-brown. These files are simply 2 column, tab separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source tag) is used to perform the mapping.

Chunking & Parse Trees

The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.


The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.

Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because dictionary ordering is random in Python 3. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.

Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need convert these to normal strings before doing any further string processing.

Here’s a few other Python 3 changes I ran into:

  • itertools.izip -> zip
  • dict.iteritems() doesn’t exist, use dict.items() instead
  • dict.keys() does not produce a list (it returns a view). If you want a list, use dict.dict_keys()


Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.

Python 3 Text Processing with NLTK 3 Cookbook

Python Text Processing with NLTK 3 Cookbook

After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here’s some details on the changes & updates in the 2nd edition:

First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.

In Chapter 1, Tokenizing Text and WordNet Basics, I added a recipe for training a sentence tokenizer using the PunktSentenceTokenizer. This is surprisingly easy, and you can find the code in chapter1.py.

Chapter 2, Replacing and Correcting Words, shows the additional languages supported by the SnowballStemmer. An unfortunate removal from this chapter is babelizer, which was a fun library to use, but is no longer supported by Yahoo.

NLTK 3 replaced simplify_tags with universal tagset mappings, so I updated Chapter 3, Creating Custom Corpora to show how to use these tagset mappings to get the universal tags.

In Chapter 4, Part-of-Speech Tagging, the last recipe shows how to use train_tagger.py from NLTK-Trainer to replicate most of the tagger training recipes detailed earlier in the chapter. NLTK-Trainer was largely inspired by my experience writing Python Text Processing with NLTK 2.0 Cookbook, after realizing that many aspects of training part-of-speech taggers could be encapsulated in a command line script.

Chapter 5, Extracing Chunks, adds examples for using train_chunker.py to train phrase chunkers.

Chapter 7, Text Classification, adds coverage of train_classifier.py, along with examples of using the SklearnClassifier, which provides access to many of the scikit-learn classification algorithms. The scikit-learn classifiers tend to be at least as accurate as NLTK’s classifiers, are often faster to train, and have much smaller memory & disk footprints. And since NLTK 3 removed support for scipy based MaxentClassifier algorithms and SVM classifiers, the choice of which classifers to use has become very easy: when in doubt, choose SklearnClassifier (code examples can be found in chapter7.py).

There are a few library changes in Chapter 9, Parsing Specific Data Types:

  • timex and SimpleParse recipes have been removed due to lack of Python 3 compatibility
  • uses beautifulsoup4 with examples of UnicodeDammit
  • chardet was replaced with charade, which is compatible with both Python 2 & 3. But since publication, charade was merged back into chardet and is no longer maintained. I recommend installing chardet and replacing all instances of the charade module name with chardet.

So if you want to learn the latest & greatest NLTK 3, pickup your copy of Python 3 Text Processing with NLTK 3 Cookbook, and checkout the code at nltk3-cookbook. If you like the book, please review it at Amazon or goodreads.

Deep Search

As we prepare to launch Weotta, we’ve struggled with how to describe what we’ve built. Is our technology big data? Yes. Do we use machine learning and natural language processing? Yes. Could you call us a search engine? Absolutely. But we think the sum is more than those parts.

We finally decided that the term that best describes what we do is deep search — a concise description of a complex search system that goes far beyond basic text search. Needless to say, we aren’t the only ones in this area by any means; Google, and plenty of other companies do various aspects of deep search. But no one has created a deep search system quite like ours — a search technology built to handle the kinds of everyday queries that don’t make sense to a normal text search engine.

Basic Search

Text search engines such as sphinx or lucene/solr, use faceted filtering: collections of documents which each have a set of fields, often specified in an XML format and each indexed so that they can be efficiently retrieved given a search query and optional facet parameters. (I recommend reading Introduction to Information Retrieval to learn specific implementation details).

Basic Search Engine

In text indexing you can usually specify different fields, each with its own weight. For instance, if you choose a heavily weighted title field and a lower weighted text field, documents with your search term in their title will get a higher score than documents that just have it in the text field.

To retrieve indexed documents you use search query strings that are analogous to SQL queries, in that there’s usually special syntax to control how the search engine decides which documents to retrieve. For example, you might be able to specify which words are required and which are optional, or how far apart the words can be.

Now, all of this is fine for programmers and technical users, but it’s hardly ideal for typical consumers, who don’t even want to know that special syntax exists, let alone how to use it. Thankfully, a deep search query engine’s superior query parsing and understanding of natural language makes that special syntax unnecessary.

Facets provide ways of filtering search results. A faceted search could look for a specific value in the field, or something more complex like a date/time range or distance from a given point. Facets don’t usually affect a document’s score, but simply reduce the set of documents that get returned. Faceted search can also be called Faceted Navigation, because facets often enable a combination search/browse interface. A basic search system, if it offers facets at all, will generally do so via checkboxes, dropdowns or similar controls. ebay is a perfect example of this, offering many facets to drill down & filter your search. A deep search system, by contrast, moves facets to the background.

These text search engines are powerful, but in the context of deep search, they’re really just another kind of database. Both text search and deep search use indexes to optimize retrieval, but instead of using SQL to retrieve data, text search uses specially formatted query strings and facet specifications. A SQL database is optimized for row-based or column-based data, while a text search engine is optimized for plain text data, using inverted indexes. Either way, from the developer standpoint, both are low-level data stores, best suited for different use cases.

Advanced Search

An advanced search system uses a text search engine at its lowest levels, but integrates additional ranking signals. An obvious example of this is Google’s Page Rank, which combines text search with keyword relevance, website authority, and many other signals in order to sort results. Where basic text search only knows about individual documents, and statistics about collections of documents, an advanced search system also considers external signals like trustworthiness, popularity and link strength. Amazon, for instance, lets users sort results by average rating or popularity. But this still isn’t deep search, because there’s no deeper understanding of the data or the query, just more powerful controls for sorting results.

Deep Search

I believe deep search has four fundamental requirements:

  1. A simple search input. This means natural language understanding (NLU) of queries, so that lower levels of the system know which facets to invoke.
  2. Multi-category search. If you’re only searching for one thing, your search system can be relatively simple. But as soon as a search contains multiple variables with no explicit facets given by a user, you need NLU to know precisely what’s being searched for, and how to search for it. You also need to effectively and automatically integrate multiple data sets into one system.
  3. Feature engineering for deep data understanding. Contrary to popular belief, big data isn’t enough. Simply having access to tons of data doesn’t automatically mean you know how to get meaningful insights out of it. A good metaphor is that of an iceberg: users can only see the tip, while most of the berg lies hidden below the water. In this metaphor, data is the ice, and feature engineering is how you shape the ice below the water, in order to surface the best results where users will see them.
  4. Contextual understanding. The more you know about the user, the more knowledge you have with which to tailor search results. This could mean knowing the user’s location, their past search history, and/or explicit preferences. Context is king!

Many of today’s search systems don’t meet any of these requirements. Some implement one or two, but very few meet them all. Siri has device context and does NLU to understand queries, but instead of actually doing the search, it routes it to another application or search engine. Google and Weotta meet all the requirements, but have very different implementation, approaches, and use cases.

How does one build a deep search system? As with simple text search, there are two major stages: indexing and querying. Here’s an overview of both, from a deep search perspective.

Deep Search

Deep Indexing

Deep search requires a deep understanding of your data: what it is, what it looks like, what it’s good for, and how to transform it into a format that machines can understand. Here are a few examples:

  • places have addresses and geographic points
  • products have a weight and size
  • movies have actors and directors

Once you’ve got your low-level data structure, you transform it into a document structure suitable for text and facet indexing. But deep search also requires higher level knowledge and understanding, which is where feature engineering comes into play. You have to think deeply about what kinds of searches your customers may do, and what level of quality they expect in the results. Then you have to figure out how to translate that into indexable document features.

Here are two examples of this thinking.

A restaurant serves chicken wings. Okay, but are they any good. How much do people like or dislike them? Are they the best in the city? Questions like this could be answered through a twist on menu-based sentiment analysis.

A specific concert may be a one-time event, but the bands have probably played other shows before. How did people like those previous gigs? What are their fan’s general demographics? What’s the venue like? Answering these questions may require combining multiple datasets in order to cross-correlate performers with concerts and venues.

Deep indexing is all about answering these kinds of questions, and converting the answers into values that are usable for ranking and/or filtering search results. This may involve applied data science, linear regression or sentiment analysis. There’s no specific methodology, because the questions and answers depend on the nature of your data and what kind of results you need. But with the proper methods, you can achieve insights that weren’t possible before. For example, with latent semantic analysis you can discover features that aren’t explicit in the data, which allows queries that would be impossible with basic text indexing. Unsurprisingly, you can expect to spend most of your time deep in the data trenches. To quote Pedro Domingos, from his paper A Few Useful Things to Know about Machine Learning:

“First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design.”

“70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.”

A major part of feature engineering is getting more data and better data. You need large, diverse datasets to get the necessary context. In Weotta’s case, that includes geographic info, demographic profiles, POI and location databases, and the social graph. But you also need a deep understanding of how to integrate and correlate this data, which machine learning algorithms to apply, and most important, which questions to ask of it and which can be answered. All of this goes into engineering an integrated system that can do so automatically. “We don’t have better algorithms,” says Google Research Director Peter Norvig. “We just have more data.”

At Weotta, we believe that high-quality data is paramount, so we spend a surprising amount of effort filtering out noisy data to extract meaningful signals. A huge part of any significant feature engineering, in fact, is data cleansing. After all, garbage in, garbage out.

You also need an automated process for continuous learning. As data comes in and is integrated, your system should automatically improve. “Machine learning isn’t a one-shot process of building a data set and running a learner,” says Pedro Domingos, “but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.”

And people are an essential part of this process. You must be able to incorporate human knowledge and expertise into your data pipeline at almost every level; it is the right balance and combination of humans and machines that will determine a deep search system’s true capabilities and ability to adapt to change.

Deep Querying

Once you’ve got a deep index powered by deep data, you need to use it effectively. Simple text queries won’t suffice; you need to understand exactly what you’re searching for in order to get the right results. That means query parsing and natural language understanding.

We’ve spent a lot of time at Weotta refining our query parsing to handle queries such as restaurants for my anniversary or concerts happening this weekend for a date. Other search systems have different query parsing abilities: Siri recognizes the word call plus a name, while Google Knowledge Graph can recognize almost any entity in Wikipedia.

Once you’ve parsed the query and know what to search for, the next step is retrieving results. Since we’re doing multi-category search, that means querying multiple indexes. At this point the NLU query parsing becomes essential, because you need to know what kinds of query parameters each index supports, so the system can slice and dice the query intelligently.

But if you’re retrieving different kinds of information, how do you compose them into one set of results? How do you rank and order different kinds of things? These are fundamentally interface design and user experience questions. Google uses different parts of their results page for different kinds of results, such as maps and knowledge graph.

At Weotta, we’ve decided the card analogy makes a lot of sense. On mobile we have one stack, and on web up to five cards in a row. Why? This presentation visually focuses the user on a few results at a time while letting us show multi- category results. That’s how you can do a search like dinner drinks and a movie and get three different kinds of results, all mixed together.

Remember facets from earlier? With deep search, facets are hidden to the user, but they’re still essential to the query engine. Instead of relying on explicit checkboxes, the query parser uses natural language understanding to decide which facets to use based on the query. This decision can also be driven by the nature of the data and the product. At Weotta, when we know a query is about food, we use a facet to restrict the results to restaurants. Google does things differently; while they may know that a query has food words, because their data are so much larger and more diverse, they are unable or unwilling to make a clear decision about what kinds of results to show, so you often end up with a mix. For example, I just tried a search for sushi and along with a list of web pages, I got a ribbon of local restaurants, a map, and a knowledge graph box. Since Weotta is focused on local search and “what to do,” we know you’re looking for sushi restaurants, and that’s what we’ll produce for you. Better yet, with Weotta Deep Search, a user can be even more specific and get relevant results for restaurants that have hamachi sushi.

Another key to our deep query understanding is context: Who is doing the search? Where are they? What time is it? What’s the weather there right now? What searches have they done in past? Who are their friends or contacts? What are their stated preferences? What are their implicit preferences?

The answers to these questions could have a significant effect on results. If you know someone is in New York, you may not want to show places or events happening elsewhere. If it’s raining outside, you may want to feature indoor events or nearby places. If you know someone dislikes fast food, you don’t want to show them McDonald’s.

People tend to like what their friends like. It may not be a strong signal, but social proof does matter to almost everyone. Plus, people often do things with their friends and family, so if you take all their preferences into account, you may be able to find more relevant results. In fact, if you use Facebook to signup for Weotta, you’ll be able to search for places and events your friends like.

Deep Search Stack


A deep search system goes beyond basic text search and advanced search with the following requirements:

  1. No explicit facets
  2. Multi-category search
  3. Deep feature engineering
  4. Context

To implement these, you’ll need to make use of natural language understanding, machine learning, and big data. It’s even more work to implement than you’d think, but the benefits are quite clear: you can do natural language queries with a simpler interface and get more relevant, personalized results.

As we build ever more machines to adapt to human needs, I believe deep search technology will become an integral part of our daily lives in countless ways. For now, you can get a taste of its capabilities with Weotta.

Avogadro Corp Book Review / AI Speculation

Avogadro CorpAvogadro Corp: The Singularity Is Closer Than It Appears, by William Hertling, is the first sci-fi book I’ve read with a semi-plausible AI origin story. That’s because the premise isn’t so simple as “increased computing power -> emergent AI”. It’s a much more well defined formula: ever increasing computing power + powerful language processing + never ending stream of training data + goal oriented behavior + deep integration into internet infrastructure -> AI. The AI in the story is called ELOPe, which stands for Email Language Optimization Program, and its function is essentially to improve the quality of emails. WARNING there will be spoilers below, but only enough to describe ELOPe and speculate about how it might be implemented.

What is ELOPe

The idea behind ELOPe is to provide writing suggestions as a feature of a popular web-based email service. These writing suggestions are designed to improve the outcome of your email, whatever that may be. To take an example from the book, if you’re requesting more compute resources for a project, then ELOPe’s job is to offer writing suggestions that are most likely to get your request approved. By taking into account your own past writings, who you’re sending the email to, and what you’re asking for, it can go as far as completely re-writing the email to achieve the optimal outcome.

Using the existence of ELOPe as a given, the author writes a enjoyable story that is (mostly) technically accurate with plenty of details, without being boring. If you liked Daemon by Daniel Suarez, or you work with any kind of natural language / text-processing technology, you’ll probably enjoy the story. I won’t get into how an email writing suggestion program goes from that to full AI & takes over the world as a benevolent ghost in the wires – for that you need to read the book. What I do want to talk about is how this email optimization system could be implemented.

How ELOPe might work

Let’s start by defining the high-level requirements. ELOPe is an email optimizer, so we have the sender, the receiver, and the email being written as inputs. The output is a re-written email that preserves the “voice” of the sender while using language that will be much more likely to achieve the sender’s desired outcome, given who they’re sending the email to. That means we need the following:

  1. ability to analyze the email to determine what outcome is desired
  2. prior knowledge of how the receiver has responded to other emails with similar outcome topics, in order to know what language produced the best outcomes (and what language produced bad outcomes)
  3. ability to re-write (or generate) an email whose language is consistent with the sender, while also using language optimized to get the best response from the receiver

Topic Analysis

Determining the desired outcome for an email seems to me like a sophisticated combination of topic modeling and deep linguistic parsing. The goal would be to identify the core reason for the email: what is the sender asking for, and what would be an optimal response?

Being able to do this from a single email is probably impossible, but if you have access to thousands, or even millions of email chains, accurate topic modeling is much more do-able. Nearly every email someone sends will have some similarity to past emails sent by other people in similar situations. So you could create feature vectors for every email chain (using deep semantic parsing), then cluster the chains using feature similarity. Now you have topic clusters, and from that you could create training data for thousands of topic classifiers. Once you have the classifiers, you can run those in parallel to determine the most likely topic(s) of a single email.

Obviously it would be very difficult to create accurate clusters, and even harder to do so at scale. Language is very fuzzy, humans are inconsistent, and a huge fraction of email is spam. But the core of the necessary technology exists, and can work very well in limited conditions. The ability to parse emails, extract textual features, and cluster & classify feature vectors are functionality that’s available in at least a few modern programming libraries today (i.e. Python, NLTK & scikit-learn). These are areas of software technology that are getting a lot of attention right now, and all signs indicate that attention will only increase over time, so it’s quite likely that the difficulty level will decrease significantly over the next 10 years. Moving on, let’s assume we can do accurate email topic analysis. The next hurdle is outcome analysis.

Outcome Analysis

Once you can determine topics, now you need to learn about outcomes. Two email chains about acquiring compute resources have the same topic, but one chain ends with someone successfully getting access to more compute resources, while the other ends in failure. How do you differentiate between these? This sounds like next-generation sentiment analysis. You need to go deeper than simple failure vs. success, positive vs. negative, since you want to know which email chains within a given topic produced the best responses, and what language they have in common. In other words, you need a language model that weights successful outcome language much higher than failure outcome language. The only way I can think of doing this with a decent level of accuracy is massive amounts of human verified training data. Technically do-able, but very expensive in terms of time and effort.

What really pushes the bounds of plausibility is that the language model can’t be universal. Everyone has their own likes, dislikes, biases, and preferences. So you need language models that are specific to individuals, or clusters of individuals that respond similarly on the same topic. Since these clusters are topic specific, every individual would belong to many (topic, cluster) pairs. Given N topics and an average of M clusters within each topic, that’s N*M language models that need to be created. And one of the major plot points of the book falls out naturally: ELOPe needs access to huge amounts of high end compute resources.

This is definitely the least do-able aspect of ELOPe, and I’m ignoring all the implicit conceptual knowledge that would be required to know what an optimal outcome is, but let’s move on 🙂

Language Generation

Assuming that we can do topic & outcome analysis, the final step is using language models to generate more persuasive emails. This is perhaps the simplest part of ELOPe, assuming everything else works well. That’s because natural language generation is the kind of technology that works much better with more data, and it already exists in various forms. Google translate is a kind of language generator, chatbots have been around for decades, and spammers use software to spin new articles & text based on existing writings. The differences in this case are that every individual would need their own language generator, and it would have to be parameterized with pluggable language models based on the topic, desired outcome, and receiver. But assuming we have good topic & receiver specific outcome analysis, plus hundreds or thousands of emails from the sender to learn from, then generating new emails, or just new phrases within an email, seems almost trivial compared to what I’ve outlined above.

Final Words

I’m still highly skeptical that strong AI will ever exist. We humans barely understand the mechanisms of own intelligence, so to think that we can create comparable artificial intelligence smells of hubris. But it can be fun to think about, and the point of sci-fi is to tell stories about possible futures, so I have no doubt various forms of AI will play a strong role in sci-fi stories for years to come.

NLTK 2 Release Highlights

NLTK 2.0.1, a.k.a NLTK 2, was recently released, and what follows is my favorite changes, new features, and highlights from the ChangeLog.

New Classifiers

The SVMClassifier adds support vector machine classification thru SVMLight with PySVMLight. This is a much needed addition to the set of supported classification algorithms. But even more interesting…

The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.


NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it’s become the de-facto “social coding” site.


Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and restructured text (which I used to write this post), I’m not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you’re looking for, I worry that new/interested users will have a harder time getting started.

New Corpora

Since the 0.9.9 release, a number of new corpora and corpus readers have been added:

ChangeLog Highlights

And here’s a few final highlights:

The Future

I think NLTK’s ideal role is be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.

Upcoming Talks

At the end of February and the beginning of March, I’ll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order…

How Weotta uses MongoDB

Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about
How Weotta uses MongoDB. We’ll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we’ll finish with a preview of Weotta’s upcoming MongoDB powered local search APIs.

NLTK Jam Session at NICAR 2012

On Thursday, February 23, in St Louis, MO, I’ll be demonstrating how to use NLTK as part of the NewsCamp workshop at NICAR 2012. This will be a version of my PyCon NLTK Tutorial with a focus on news text and corpora like treebank.

Corpus Bootstrapping with NLTK at Strata 2012

As part of the Strata 2012 Deep Data program, I’ll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there’s plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn’t exist. So how can you quickly create a quality corpus at minimal cost? I’ll cover specific real-world examples to answer this question.

NLTK Tutorial at PyCon 2012

Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You’ll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you’ll walk out with at least one new NLP superpower that you can put to use immediately.

Bay Area NLP Meetup

This Thursday, June 7 2011, will be the first meeting of the Bay Area NLP group, at Chomp HQ in San Francisco, where I will be giving a talk on NLTK titled “NLTK: the Good, the Bad, and the Awesome”. I’ll be sharing some of the things I’ve learned using NLTK, operating text-processing.com, and doing random consulting on natural language processing. I’ll also explain why NLTK-Trainer exists and how awesome it is for training NLP models. So if you’re in the area and have some time Thursday evening, come by and say hi.

Update on 07/10/2011: slides are online from my talk: NLTK: the Good, the Bad, and the Awesome.

Interview and Article about NLTK and Text-Processing

I recently did an interview with Zoltan Varju (@zoltanvarju) about Python, NLTK, and my demos & APIs at text-processing.com, which you can read here. There’s even a bit about Erlang & functional programming, as well as some insight into what I’ve been working on at Weotta. And last week, the text-processing.com API got a write up (and a nice traffic boost) from Garrett Wilkin (@garrettwilkin) on programmableweb.com.

Analyzing Tagged Corpora and NLTK Part of Speech Taggers

NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.

Analyze a Tagged Corpus

You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here’s the tag counts for the treebank corpus:

$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags

  Tag      Count
=======  =========
#               16
$              724
''             694
,             4886
-LRB-          120
-NONE-        6592
-RRB-          126
.             3874
:              563
CC            2265
CD            3546
DT            8165
EX              88
FW               4
IN            9857
JJ            5834
JJR            381
JJS            182
LS              13
MD             927
NN           13166
NNP           9410
NNPS           244
NNS           6047
PDT             27
POS            824
PRP           1716
PRP$           766
RB            2822
RBR            136
RBS             35
RP             216
SYM              1
TO            2179
UH               3
VB            2554
VBD           3043
VBG           1460
VBN           2134
VBP           1321
VBZ           2125
WDT            445
WP             241
WP$             14
WRB            178
``             712
=======  =========

By default, analyze_tagged_corpus.py sorts by tags, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:

$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags

  Tag      Count
=======  =========
#               16
$              724
''             694
(              120
)              126
,             4886
.             3874
:              563
ADJ           6397
ADV           2993
CNJ           2265
DET           8192
EX              88
FW               4
L               13
MOD            927
N            19213
NP            9654
NUM           3546
P             9857
PRO           2698
S                1
TO            2179
UH               3
V             6000
VD            3043
VG            1460
VN            2134
WH             878
``             712
=======  =========

Analyze Tagger Coverage

You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here’s the results for the treebank corpus using NLTK’s default part-of-speech tagger:

$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger

  Tag      Found
=======  =========
#               16
$              724
''             694
,             4887
-LRB-          120
-NONE-        6591
-RRB-          126
.             3874
:              563
CC            2271
CD            3547
DT            8170
EX              88
FW               4
IN            9880
JJ            5803
JJR            386
JJS            185
LS              12
MD             927
NN           13166
NNP           9427
NNPS           246
NNS           6055
PDT             21
POS            824
PRP           1716
PRP$           766
RB            2800
RBR            130
RBS             33
RP             213
SYM              1
TO            2180
UH               3
VB            2562
VBD           3035
VBG           1458
VBN           2145
VBP           1318
VBZ           2124
WDT            440
WP             241
WP$             14
WRB            178
``             712
=======  =========

If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:

$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger

Accuracy: 0.995689
Unknown words: 440

  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               16          16  1.0            1.0
$              724         724  1.0            1.0
''             694         694  1.0            1.0
,             4887        4886  1.0            1.0
-LRB-          120         120  1.0            1.0
-NONE-        6591        6592  1.0            1.0
-RRB-          126         126  1.0            1.0
.             3874        3874  1.0            1.0
:              563         563  1.0            1.0
CC            2271        2265  1.0            1.0
CD            3547        3546  0.99895833333  0.99895833333
DT            8170        8165  1.0            1.0
EX              88          88  1.0            1.0
FW               4           4  1.0            1.0
IN            9880        9857  0.99130434782  0.95798319327
JJ            5803        5834  0.99134948096  0.97892938496
JJR            386         381  1.0            0.91489361702
JJS            185         182  0.96666666666  1.0
LS              12          13  1.0            0.85714285714
MD             927         927  1.0            1.0
NN           13166       13166  0.99166034874  0.98791540785
NNP           9427        9410  0.99477911646  0.99398073836
NNPS           246         244  0.99029126213  0.95327102803
NNS           6055        6047  0.99515235457  0.99722414989
PDT             21          27  1.0            0.66666666666
POS            824         824  1.0            1.0
PRP           1716        1716  1.0            1.0
PRP$           766         766  1.0            1.0
RB            2800        2822  0.99305555555  0.975
RBR            130         136  1.0            0.875
RBS             33          35  1.0            0.5
RP             213         216  1.0            1.0
SYM              1           1  1.0            1.0
TO            2180        2179  1.0            1.0
UH               3           3  1.0            1.0
VB            2562        2554  0.99142857142  1.0
VBD           3035        3043  0.990234375    0.98065764023
VBG           1458        1460  0.99650349650  0.99824868651
VBN           2145        2134  0.98852223816  0.99566473988
VBP           1318        1321  0.99305555555  0.98281786941
VBZ           2124        2125  0.99373040752  0.990625
WDT            440         445  1.0            0.83333333333
WP             241         241  1.0            1.0
WP$             14          14  1.0            1.0
WRB            178         178  1.0            1.0
``             712         712  1.0            1.0
=======  =========  ==========  =============  ==========

These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question “for each word that was given this tag, was it correct?”, while Recall answers the question “for all words that should have gotten this tag, did they get it?”. If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.