Recent Advances in Deep Learning for Natural Language Processing

This article was originally published at The New Stack under the title “How Deep Learning Supercharges Natural Language Processing”.

Voice search, intelligent assistants, and chatbots are becoming common features of modern technology. Users and customers are demanding a better, more human experience when interacting with computers. According to Tableau’s business trends report, IDC predicts that by 2019, intelligent assistants will become commonly accessible to enterprise workers, while Gartner predicts that by 2020, 50 percent of analytics queries will involve some form of natural language processing. Chatbots, intelligent assistants, natural language queries, and voice-enabled applications all involve various forms of natural language processing. To fully realize these new user experiences, we will need to build upon the latest methods, some of which I will cover here.

Let’s start with the basics: what is natural language processing? Natural language processing (NLP) is a collection of techniques for helping machines understand human language. For example, one of the essential techniques is tokenization: breaking up text into “tokens,” such as words. Given individual words in sequence, you can start to reason about them, and do things like sentiment analysis to determine whether a piece of text is positive or negative. But even a task as simple as word identification can be quite tricky. Is the word what’s really one word or two (what + is, or what + has)? What about languages whose characters can represent multi-word concepts, such as Japanese kanji?
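To make the tokenization question concrete, here is a minimal sketch using only Python's re module. The pattern and both function names are my own illustration, not a standard tokenizer. It shows the two policies for a contraction: keep it as one token, or split it into a stem and a clitic.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by an
# apostrophe part (straight or curly), or else a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens, keeping contractions whole."""
    return TOKEN_RE.findall(text)

def split_contractions(tokens):
    """Alternative policy: split each contraction into its stem and clitic."""
    out = []
    for tok in tokens:
        match = re.fullmatch(r"(\w+)(['’]\w+)", tok)
        if match:
            out.extend(match.groups())
        else:
            out.append(tok)
    return out

tokens = tokenize("What's the best tokenizer?")
print(tokens)                      # contraction kept as one token
print(split_contractions(tokens))  # contraction split into two tokens
```

Neither policy is right in general. Which one you want depends on what the downstream task expects.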

Deep learning is an advanced type of machine learning using neural networks. It became popular due to the success of the techniques at solving problems such as image classification (labeling an image based on visual content) and speech recognition (converting sounds into text). Many people thought that deep learning techniques, when applied to natural language, would quickly achieve similar levels of performance. But because of all the idiosyncrasies of natural language, the field has not seen the same kind of breakthrough success with deep learning as other fields, like image processing. However, that appears to be changing. In the past few years, researchers have been applying newer deep learning methods to natural language processing, and I will share some of these recent successes.

Deep learning — through recent improvements to word embeddings, a focus on attention, mobile enablement, and its appearance in the home — is starting to capture natural language processing like it previously captured image processing. In this article, I will cover some recent deep learning-based NLP research successes that have made an impact on the field. Because of these improvements, we will see simpler and more natural user experiences, better software performance, and more powerful home and mobile applications.

Word Embeddings

Words are essential to every natural language processing system. Traditional NLP looks at words as strings, but deep learning techniques can only process numeric vectors. Word embeddings were invented as a way to transform words into vectors, enabling new kinds of mathematical feature analysis. But the vector representation of words is only as good as the text it was trained on.

Many common word embeddings are trained on Wikipedia, but Wikipedia text may not be representative of whatever text you’re processing. It’s generally written as well-structured factual statements, which is nothing like text found on Twitter, and both of these are different from restaurant reviews. So vectors trained on Wikipedia might be mathematically misleading if you use those vectors to analyze a different style of text. Text from the Common Crawl provides a more diverse set of text for training a word embedding model. The FastText library provides some great pre-trained English word vectors, along with tools for training your own. Training your own vectors is essential if you’re processing any language other than English.
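Here is a small sketch of the kind of math that becomes possible once words are vectors. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, NOT taken from any real embedding model.
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, 0.2],
}

def most_similar(word, vectors):
    """Return the other word whose vector is closest by cosine similarity."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )

print(most_similar("good", toy_vectors))
```

If the vectors were trained on text unlike yours, the nearest neighbors can be misleading in exactly this calculation.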

Character level embeddings have also shown surprising results. This technique tries to learn vectors for individual characters, where words would be represented as a composition of the individual character vectors. In an effort to learn how to predict the next character in reviews, researchers discovered a sentiment neuron, which they could control to produce positive or negative review output. Using the sentiment neuron, they were able to beat the previous top accuracy score on the sentiment treebank. This is quite an impressive result for something discovered as a side effect of other research.

CNNs, RNNs, and Attention

Moving beyond vectors, deep learning requires training neural networks for various tasks. Vectors are the input and output, in between are layers of nodes connected together in a network. The nodes represent functions on the input data, with each function taking the input from the previous layer and producing output for the next layer. The structure of the network and how the nodes are connected very much determines the learning capabilities and performance.
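As a rough sketch of "layers of nodes," here is a two-layer forward pass in plain Python. The weights are arbitrary numbers I picked for illustration; in a real network they are learned during training.

```python
import math

def dense_layer(inputs, weights, biases, activation):
    """One layer: each node applies activation(weights . inputs + bias)."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary illustrative parameters: 3 inputs -> 2 hidden nodes -> 1 output.
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
HIDDEN_B = [0.0, 0.1]
OUT_W = [[1.0, -1.0]]
OUT_B = [0.0]

def forward(inputs):
    """Feed an input vector through the hidden layer, then the output layer."""
    hidden = dense_layer(inputs, HIDDEN_W, HIDDEN_B, relu)
    return dense_layer(hidden, OUT_W, OUT_B, sigmoid)

output = forward([1.0, 2.0, 3.0])
print(output)
```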

In general, the deeper and more complicated a network, the longer it takes to train. When using large datasets, many networks can only be effectively trained using clusters of graphics processors (GPUs), because GPUs are optimized for the necessary floating point math. This puts some types of deep learning outside the reach of anyone not at large companies or institutions that can afford the expensive GPU clusters necessary for deep learning on big data.

Standard neural networks are feedforward networks, where each node in a layer is forward connected to every node in the next layer. A Recurrent Neural Network (RNN) is a network where the nodes in each layer also connect back to the previous layer. This creates a kind of memory that can be great for learning from sequences, such as words in a sentence.

A Convolutional Neural Network (CNN) is a type of feedforward network, but with more layers, and where the forward connections have been manipulated, or convolved, to achieve certain properties. CNNs tend to be good at extracting position-invariant features, meaning they do not care so much about sequence ordering. Because of this, CNNs can be trained in a more parallel manner, leading to faster training and optimization compared to RNNs.

While CNNs may win in raw speed, both types of neural networks tend to have comparable performance characteristics.  In fact, RNNs have a slight edge when it comes to sequence oriented tasks like Part-of-Speech tagging, where you are trying to identify the part of speech (such as “noun” or “verb”) for each word in a sentence. For a detailed performance comparison of CNNs and RNNs applied to NLP see: Comparative Study of CNN and RNN for Natural Language Processing.

The most successful RNN models are the LSTM (long short-term memory) and GRU (gated recurrent unit). These use gates, which act as a kind of short-term memory for the network. However, a newer research paper implies that attention, a mechanism that lets a model weigh every position of its input when producing each output, may be all you need. By doing away with recurrence and convolution, and keeping only attention mechanisms, these models can be trained in parallel like a CNN, but even faster, and have comparable or better performance than RNNs on some sequence learning tasks, such as machine translation.
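For a feel of what an attention mechanism computes, here is a plain-Python sketch of scaled dot-product attention. It leaves out the learned projections, multiple heads, and masking that a full model would add, so read it as an illustration of the core operation only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    Weights come from the dot product of the query with each key,
    scaled by sqrt(key dimension) and normalized with softmax.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Two toy "token" vectors attending to each other (self-attention).
toy_tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = attention(toy_tokens, toy_tokens, toy_tokens)
print(attended)
```

Each query is handled independently of the others, which is why this kind of model parallelizes so well.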

Reducing the training cost while maintaining comparable performance means that smaller companies and individuals can throw more data at their deep learning models, and potentially compete more effectively with larger companies and institutions.

Software 2.0

One of the nice properties of neural network models is that the core algorithms and math are mostly the same. Once you have the infrastructure, model definition, and training algorithms all set up, these models are very reusable. “Software 2.0” is the idea that significant components of an application or system can be replaced by neural network models. Instead of writing code, developers:

  1. Collect training data
  2. Clean and label the data
  3. Train a model
  4. Integrate the model

While the most interesting parts are often steps three and four, most of the work happens in the data preparation steps one and two. Collecting and curating good, useful, clean data can be a significant amount of work, which is why methods like corpus bootstrapping are important for getting to good data faster. In the long run, it is often easier to make better data than it is to design better algorithms.

The past few years have demonstrated that neural networks can achieve much better performance than many alternatives, sometimes even in areas not traditionally touched by machine learning. One of the most recent and interesting advances is in learning data indexing structures. B-tree indexes are a commonly used data structure that provides an efficient way of finding data, assuming the tree is structured well. However, these newly learned indexes significantly outperformed the traditional B-tree indexes in both speed and memory usage. Such low-level data structure performance improvements could have far-reaching impacts if they can be integrated into standard development practices.
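To give a sense of the idea behind a learned index, here is a toy sketch: a model predicts where a key sits in a sorted array, and a small bounded search corrects the guess. A single least-squares line stands in for the learned model here; that is my simplification for illustration, and the class name is hypothetical.

```python
import bisect

class LinearLearnedIndex:
    """Toy learned index over a non-empty sorted list of numeric keys."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_key = sum(sorted_keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in sorted_keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(sorted_keys))
        # Least-squares fit of position as a linear function of key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst prediction error seen on the stored keys bounds the search.
        self.max_error = max(
            abs(self._predict(k) - i) for i, k in enumerate(sorted_keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_error), len(self.keys))
        hi = max(lo, min(len(self.keys), guess + self.max_error + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

index = LinearLearnedIndex(list(range(0, 1000, 7)))
print(index.lookup(70), index.lookup(71))
```

The better the model fits the key distribution, the smaller max_error gets and the less searching is left to do.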

As research progresses, and the necessary infrastructure becomes cheaper and more available, deep learning models are likely to be used in more and more parts of the software stack, including mobile applications.

Mobile Machine Learning

Most deep learning requires clusters of expensive GPUs and lots of RAM. This level of compute power is only accessible to those who can afford it, usually in the cloud. But consumers are increasingly using mobile devices, and much of the world does not have reliable and affordable full-time wireless connectivity. Getting machine learning into mobile devices will enable more developers to create all sorts of new applications.

  • Apple’s CoreML framework enables a number of NLP capabilities on iOS devices, such as language identification and named entity recognition.
  • Baidu developed a CNN library for mobile deep learning that works on both iOS and Android.
  • Qualcomm created a Neural Processing Engine for its mobile processors, enabling popular deep learning frameworks to operate on mobile devices.

Expect a lot more of this in the near future, as mobile devices continue to become more powerful and ubiquitous. Marc Andreessen famously said that “software is eating the world,” and now machine learning appears to be eating software. Not only is it in our pocket, it is also in our homes.

Deep Learning in the Home

Alexa and other voice assistants became mainstream in 2017, bringing NLP into millions of homes. Mobile users are already familiar with Siri and Google Assistant, but the popularity of Alexa and Google Home shows how many people have become comfortable having conversations with voice-activated dialogue systems. How much these systems rely on deep learning is somewhat unknown, but it is fairly certain that significant parts of their dialogue systems use deep learning models for core functions such as speech to text, part of speech tagging, natural language generation, and text to speech.

As research advances and these companies collect increasing amounts of data from their users, deep learning capabilities will improve as well, and implementations of “software 2.0” will become pervasive. While a few large companies are creating powerful data moats, there is always room on the edges for highly specialized, domain-specific applications of natural language processing, such as cybersecurity, IT operations, and data analytics.

Deep learning has become a core component of modern natural language processing systems.

However, many traditional natural language processing techniques are still quite effective and useful, especially in areas that lack the huge amounts of training data necessary for deep learning. I will cover these traditional statistical techniques in an upcoming article.

Highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs

Below are more highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages

it is manifestly clear that similarity in form is neither a necessary nor sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences

our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words are orthographically distant (e.g., October and January)

The goal of our work is not to overcome existing benchmarks, but show that much of the feature engineering done in the benchmarks can be learnt automatically from the task specific data. More importantly, we wish to show large dimensionality word look tables can be compacted into a lookup table using characters and a compositional model allowing the model scale better with the size of the training data. This is a desirable property of the model as data becomes more abundant in many NLP tasks.

The authors have also released Java code for training neural networks.

NLTK 3 Changes

NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.

Probability Classes

The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1.

fd.samples() doesn’t exist anymore. Instead, you can use fd.most_common(), which is a method of collections.Counter that returns a list that looks like [(word, count)].

ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.
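Since both classes now build on the standard library, you can try the new idioms without NLTK installed. In this sketch, collections.Counter stands in for FreqDist and defaultdict(Counter) stands in for ConditionalFreqDist; the tag data is made up.

```python
from collections import Counter, defaultdict

tags = ["NN", "VB", "NN", "DT", "NN", "VB"]

# Counter stands in for FreqDist here.
fd = Counter()
for tag in tags:
    fd[tag] += 1              # NLTK 2 style was fd.inc(tag)

print(fd.most_common(2))      # [('NN', 3), ('VB', 2)]

# defaultdict(Counter) stands in for ConditionalFreqDist:
# one frequency distribution per condition (here, per word).
cfd = defaultdict(Counter)
for word, tag in [("run", "VB"), ("run", "NN"), ("run", "VB"), ("dog", "NN")]:
    cfd[word][tag] += 1

print(cfd["run"].most_common(1))  # [('VB', 2)]
```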

WordNet API

NLTK 3 has changed many wordnet Synset attributes to methods:

  • syn.definition -> syn.definition()
  • syn.examples -> syn.examples()
  • syn.lemmas -> syn.lemmas()
  • syn.name -> syn.name()
  • syn.pos -> syn.pos()

Same goes for the Lemma class. For example, lemma.antonyms() is now a method.

Tagging

The batch_tag() method is now tag_sents(). The Brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the Brill templates have been replaced by the tbl.feature.Feature interface with brill.Pos or brill.Word as implementations of the interface.

Universal Tagset

Simplified tags have been replaced with the universal tagset. So tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (PennTreeBank) and the brown tagset is called en-brown. These files are simply 2 column, tab separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) is used to perform the mapping.
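To show what a 2 column, tab separated mapping amounts to, here is a standalone sketch that parses one and applies it to a tagged sentence. The four sample rows and both function names are my own stand-ins for illustration; this is not NLTK's loader and not the real file contents.

```python
# Hand-written sample rows in the "source tag <TAB> universal tag" layout.
SAMPLE_MAPPING = "NN\tNOUN\nVB\tVERB\nJJ\tADJ\nDT\tDET\n"

def parse_tagset_mapping(text):
    """Parse tab-separated 'source<TAB>universal' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        source_tag, universal_tag = line.split("\t")
        mapping[source_tag] = universal_tag
    return mapping

def to_universal(tagged_sent, mapping, unknown="X"):
    """Map each (word, tag) pair, falling back to 'X' for unmapped tags."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]

mapping = parse_tagset_mapping(SAMPLE_MAPPING)
print(to_universal([("the", "DT"), ("quick", "JJ"), ("fox", "NN")], mapping))
```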

Chunking & Parse Trees

The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.

Classification

The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.

Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics. This happens because string hashing is randomized by default in Python 3, so the iteration order of hash-based collections such as dictionaries can change from run to run. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.
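If you want to see the effect of PYTHONHASHSEED for yourself, this sketch launches fresh interpreters with a fixed seed and compares the hash of the same string. The helper name is hypothetical, and it assumes sys.executable points at a working Python.

```python
import os
import subprocess
import sys

def string_hash_in_fresh_interpreter(text, seed):
    """Run a new Python process with PYTHONHASHSEED set and return hash(text)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)

# With the same fixed seed, separate processes agree on the hash value.
first = string_hash_in_fresh_interpreter("nltk", seed=0)
second = string_hash_in_fresh_interpreter("nltk", seed=0)
print(first == second)
```

The variable has to be set before the interpreter starts, which is why the sketch spawns a new process instead of changing os.environ in place.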

Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
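Converting is a one-liner once you know the encoding. This sketch assumes UTF-8, which you should check against your own corpus; ensure_text is a hypothetical helper name.

```python
def ensure_text(value, encoding="utf-8"):
    """Return a str, decoding the value first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b"caf\xc3\xa9 au lait"   # UTF-8 encoded bytes
print(ensure_text(raw))        # decoded to a normal string
print(ensure_text("already text"))
```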

Here are a few other Python 3 changes I ran into:

  • itertools.izip -> zip
  • dict.iteritems() doesn’t exist, use dict.items() instead
  • dict.keys() does not produce a list (it returns a view). If you want a list, use list(dict.keys())
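The Python 3 forms of those three idioms look like this, using made-up data:

```python
words = ["the", "cat", "sat"]
tags = ["DT", "NN", "VB"]

# Python 2's itertools.izip(words, tags) becomes the built-in zip,
# which is already lazy in Python 3.
pairs = list(zip(words, tags))

counts = {"the": 2, "cat": 1}

# Python 2's counts.iteritems() becomes counts.items().
total = sum(count for word, count in counts.items())

# counts.keys() is a view; wrap it in list() when you need a real list.
key_list = list(counts.keys())

print(pairs, total, key_list)
```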

Upgrading

Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.

Python 3 Text Processing with NLTK 3 Cookbook

After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on GitHub at nltk3-cookbook. Here are some details on the changes & updates in the 2nd edition:

First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.

In Chapter 1, Tokenizing Text and WordNet Basics, I added a recipe for training a sentence tokenizer using the PunktSentenceTokenizer. This is surprisingly easy, and you can find the code in chapter1.py.

Chapter 2, Replacing and Correcting Words, shows the additional languages supported by the SnowballStemmer. An unfortunate removal from this chapter is babelizer, which was a fun library to use, but is no longer supported by Yahoo.

NLTK 3 replaced simplify_tags with universal tagset mappings, so I updated Chapter 3, Creating Custom Corpora to show how to use these tagset mappings to get the universal tags.

In Chapter 4, Part-of-Speech Tagging, the last recipe shows how to use train_tagger.py from NLTK-Trainer to replicate most of the tagger training recipes detailed earlier in the chapter. NLTK-Trainer was largely inspired by my experience writing Python Text Processing with NLTK 2.0 Cookbook, after realizing that many aspects of training part-of-speech taggers could be encapsulated in a command line script.

Chapter 5, Extracting Chunks, adds examples for using train_chunker.py to train phrase chunkers.

Chapter 7, Text Classification, adds coverage of train_classifier.py, along with examples of using the SklearnClassifier, which provides access to many of the scikit-learn classification algorithms. The scikit-learn classifiers tend to be at least as accurate as NLTK’s classifiers, are often faster to train, and have much smaller memory & disk footprints. And since NLTK 3 removed support for scipy based MaxentClassifier algorithms and SVM classifiers, the choice of which classifiers to use has become very easy: when in doubt, choose SklearnClassifier (code examples can be found in chapter7.py).

There are a few library changes in Chapter 9, Parsing Specific Data Types:

  • timex and SimpleParse recipes have been removed due to lack of Python 3 compatibility
  • uses beautifulsoup4 with examples of UnicodeDammit
  • chardet was replaced with charade, which is compatible with both Python 2 & 3. But since publication, charade was merged back into chardet and is no longer maintained. I recommend installing chardet and replacing all instances of the charade module name with chardet.

So if you want to learn the latest & greatest NLTK 3, pick up your copy of Python 3 Text Processing with NLTK 3 Cookbook, and check out the code at nltk3-cookbook. If you like the book, please review it at Amazon or Goodreads.

Deep Search

As we prepare to launch Weotta, we’ve struggled with how to describe what we’ve built. Is our technology big data? Yes. Do we use machine learning and natural language processing? Yes. Could you call us a search engine? Absolutely. But we think the sum is more than those parts.

We finally decided that the term that best describes what we do is deep search — a concise description of a complex search system that goes far beyond basic text search. Needless to say, we aren’t the only ones in this area by any means; Google, and plenty of other companies do various aspects of deep search. But no one has created a deep search system quite like ours — a search technology built to handle the kinds of everyday queries that don’t make sense to a normal text search engine.

Basic Search

Text search engines such as Sphinx or Lucene/Solr provide text search with faceted filtering over collections of documents. Each document has a set of fields, often specified in an XML format, and each field is indexed so that documents can be efficiently retrieved given a search query and optional facet parameters. (I recommend reading Introduction to Information Retrieval to learn specific implementation details.)

In text indexing you can usually specify different fields, each with its own weight. For instance, if you choose a heavily weighted title field and a lower weighted text field, documents with your search term in their title will get a higher score than documents that just have it in the text field.
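Here is a toy sketch of field weighting. The weights and the scoring formula (weight times term count) are assumptions I made for illustration; real engines use more sophisticated relevance functions.

```python
# Hypothetical weights: a hit in the title counts three times a hit in the text.
FIELD_WEIGHTS = {"title": 3.0, "text": 1.0}

def score(document, term, weights=FIELD_WEIGHTS):
    """Sum weight * occurrences of the term over each weighted field."""
    term = term.lower()
    total = 0.0
    for field, weight in weights.items():
        occurrences = document.get(field, "").lower().split().count(term)
        total += weight * occurrences
    return total

doc_a = {"title": "Best pizza in town", "text": "A guide to local restaurants"}
doc_b = {"title": "Local restaurant guide", "text": "Where to find pizza nearby"}

print(score(doc_a, "pizza"), score(doc_b, "pizza"))
```

Both documents mention the term once, but the one with the title hit ranks higher.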

To retrieve indexed documents you use search query strings that are analogous to SQL queries, in that there’s usually special syntax to control how the search engine decides which documents to retrieve. For example, you might be able to specify which words are required and which are optional, or how far apart the words can be.

Now, all of this is fine for programmers and technical users, but it’s hardly ideal for typical consumers, who don’t even want to know that special syntax exists, let alone how to use it. Thankfully, a deep search query engine’s superior query parsing and understanding of natural language makes that special syntax unnecessary.

Facets provide ways of filtering search results. A faceted search could look for a specific value in the field, or something more complex like a date/time range or distance from a given point. Facets don’t usually affect a document’s score, but simply reduce the set of documents that get returned. Faceted search can also be called Faceted Navigation, because facets often enable a combination search/browse interface. A basic search system, if it offers facets at all, will generally do so via checkboxes, dropdowns or similar controls. eBay is a perfect example of this, offering many facets to drill down & filter your search. A deep search system, by contrast, moves facets to the background.
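A facet filter can be sketched in a few lines. In this illustration a facet constraint is either an exact value or a (low, high) range tuple; that convention, the function names, and the listings are all made up.

```python
def matches(value, constraint):
    """A tuple constraint is an inclusive (low, high) range; else exact match."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return low <= value <= high
    return value == constraint

def facet_filter(documents, facets):
    """Keep documents that satisfy every facet; scores are left untouched."""
    return [
        doc for doc in documents
        if all(field in doc and matches(doc[field], constraint)
               for field, constraint in facets.items())
    ]

listings = [
    {"title": "Road bike", "category": "bikes", "price": 450},
    {"title": "Kids bike", "category": "bikes", "price": 80},
    {"title": "Bike helmet", "category": "accessories", "price": 40},
]

print(facet_filter(listings, {"category": "bikes", "price": (50, 100)}))
```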

These text search engines are powerful, but in the context of deep search, they’re really just another kind of database. Both text search and deep search use indexes to optimize retrieval, but instead of using SQL to retrieve data, text search uses specially formatted query strings and facet specifications. A SQL database is optimized for row-based or column-based data, while a text search engine is optimized for plain text data, using inverted indexes. Either way, from the developer standpoint, both are low-level data stores, best suited for different use cases.
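For readers who haven't seen one, here is a minimal inverted index: a map from each term to the set of documents containing it. Tokenizing by whitespace and treating a query as an AND of its terms are simplifying assumptions on my part.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    1: "cheap chicken wings downtown",
    2: "best chicken sandwich",
    3: "live music downtown",
}
inverted = build_inverted_index(docs)
print(search(inverted, "chicken downtown"))
```

Lookup cost depends on the number of query terms and the size of their posting sets, not on the total number of documents.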

Advanced Search

An advanced search system uses a text search engine at its lowest levels, but integrates additional ranking signals. An obvious example of this is Google, which combines text search and keyword relevance with PageRank-based website authority and many other signals in order to sort results. Where basic text search only knows about individual documents, and statistics about collections of documents, an advanced search system also considers external signals like trustworthiness, popularity and link strength. Amazon, for instance, lets users sort results by average rating or popularity. But this still isn’t deep search, because there’s no deeper understanding of the data or the query, just more powerful controls for sorting results.

Deep Search

I believe deep search has four fundamental requirements:

  1. A simple search input. This means natural language understanding (NLU) of queries, so that lower levels of the system know which facets to invoke.
  2. Multi-category search. If you’re only searching for one thing, your search system can be relatively simple. But as soon as a search contains multiple variables with no explicit facets given by a user, you need NLU to know precisely what’s being searched for, and how to search for it. You also need to effectively and automatically integrate multiple data sets into one system.
  3. Feature engineering for deep data understanding. Contrary to popular belief, big data isn’t enough. Simply having access to tons of data doesn’t automatically mean you know how to get meaningful insights out of it. A good metaphor is that of an iceberg: users can only see the tip, while most of the berg lies hidden below the water. In this metaphor, data is the ice, and feature engineering is how you shape the ice below the water, in order to surface the best results where users will see them.
  4. Contextual understanding. The more you know about the user, the more knowledge you have with which to tailor search results. This could mean knowing the user’s location, their past search history, and/or explicit preferences. Context is king!

Many of today’s search systems don’t meet any of these requirements. Some implement one or two, but very few meet them all. Siri has device context and does NLU to understand queries, but instead of actually doing the search, it routes it to another application or search engine. Google and Weotta meet all the requirements, but have very different implementations, approaches, and use cases.

How does one build a deep search system? As with simple text search, there are two major stages: indexing and querying. Here’s an overview of both, from a deep search perspective.

Deep Indexing

Deep search requires a deep understanding of your data: what it is, what it looks like, what it’s good for, and how to transform it into a format that machines can understand. Here are a few examples:

  • places have addresses and geographic points
  • products have a weight and size
  • movies have actors and directors

Once you’ve got your low-level data structure, you transform it into a document structure suitable for text and facet indexing. But deep search also requires higher level knowledge and understanding, which is where feature engineering comes into play. You have to think deeply about what kinds of searches your customers may do, and what level of quality they expect in the results. Then you have to figure out how to translate that into indexable document features.

Here are two examples of this thinking.

A restaurant serves chicken wings. Okay, but are they any good? How much do people like or dislike them? Are they the best in the city? Questions like this could be answered through a twist on menu-based sentiment analysis.

A specific concert may be a one-time event, but the bands have probably played other shows before. How did people like those previous gigs? What are their fans’ general demographics? What’s the venue like? Answering these questions may require combining multiple datasets in order to cross-correlate performers with concerts and venues.

Deep indexing is all about answering these kinds of questions, and converting the answers into values that are usable for ranking and/or filtering search results. This may involve applied data science, linear regression or sentiment analysis. There’s no specific methodology, because the questions and answers depend on the nature of your data and what kind of results you need. But with the proper methods, you can achieve insights that weren’t possible before. For example, with latent semantic analysis you can discover features that aren’t explicit in the data, which allows queries that would be impossible with basic text indexing. Unsurprisingly, you can expect to spend most of your time deep in the data trenches. To quote Pedro Domingos, from his paper A Few Useful Things to Know about Machine Learning:

“First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design.”

“70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.”

A major part of feature engineering is getting more data and better data. You need large, diverse datasets to get the necessary context. In Weotta’s case, that includes geographic info, demographic profiles, POI and location databases, and the social graph. But you also need a deep understanding of how to integrate and correlate this data, which machine learning algorithms to apply, and most important, which questions to ask of it and which can be answered. All of this goes into engineering an integrated system that can do so automatically. “We don’t have better algorithms,” says Google Research Director Peter Norvig. “We just have more data.”

At Weotta, we believe that high-quality data is paramount, so we spend a surprising amount of effort filtering out noisy data to extract meaningful signals. A huge part of any significant feature engineering, in fact, is data cleansing. After all, garbage in, garbage out.

You also need an automated process for continuous learning. As data comes in and is integrated, your system should automatically improve. “Machine learning isn’t a one-shot process of building a data set and running a learner,” says Pedro Domingos, “but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.”

And people are an essential part of this process. You must be able to incorporate human knowledge and expertise into your data pipeline at almost every level; it is the right balance and combination of humans and machines that will determine a deep search system’s true capabilities and ability to adapt to change.

Deep Querying

Once you’ve got a deep index powered by deep data, you need to use it effectively. Simple text queries won’t suffice; you need to understand exactly what you’re searching for in order to get the right results. That means query parsing and natural language understanding.

We’ve spent a lot of time at Weotta refining our query parsing to handle queries such as restaurants for my anniversary or concerts happening this weekend for a date. Other search systems have different query parsing abilities: Siri recognizes the word call plus a name, while Google Knowledge Graph can recognize almost any entity in Wikipedia.

Once you’ve parsed the query and know what to search for, the next step is retrieving results. Since we’re doing multi-category search, that means querying multiple indexes. At this point the NLU query parsing becomes essential, because you need to know what kinds of query parameters each index supports, so the system can slice and dice the query intelligently.
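To make the "slice and dice" step a bit more concrete, here is a minimal sketch of routing a parsed query to several indexes. Everything in it is hypothetical: the `Index` class, the parameter names (`cuisine`, `when`, `city`), and the toy documents are made up for illustration and are not Weotta's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    """A hypothetical search index that only understands some query parameters."""
    name: str
    supported_params: frozenset
    docs: list = field(default_factory=list)

    def search(self, params):
        # Keep documents that match every parameter passed to this index.
        return [d for d in self.docs if all(d.get(k) == v for k, v in params.items())]


def multi_category_search(parsed_query, indexes):
    """Send each index only the parameters it supports, then collect results per index."""
    results = {}
    for index in indexes:
        params = {k: v for k, v in parsed_query.items() if k in index.supported_params}
        if params:  # skip indexes that cannot use any part of the query
            results[index.name] = index.search(params)
    return results


places = Index("places", frozenset({"cuisine", "city"}),
               [{"title": "Sushi Bar", "cuisine": "sushi", "city": "SF"},
                {"title": "Taqueria", "cuisine": "mexican", "city": "SF"}])
events = Index("events", frozenset({"when", "city"}),
               [{"title": "Jazz Night", "when": "weekend", "city": "SF"},
                {"title": "Poetry Slam", "when": "weekday", "city": "SF"}])

# Pretend the NLU parser already produced these parameters.
parsed = {"cuisine": "sushi", "when": "weekend", "city": "SF"}
hits = multi_category_search(parsed, [places, events])
```

Here the places index never sees `when` and the events index never sees `cuisine`, so each one answers only the part of the query it understands.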

But if you’re retrieving different kinds of information, how do you compose them into one set of results? How do you rank and order different kinds of things? These are fundamentally interface design and user experience questions. Google uses different parts of their results page for different kinds of results, such as maps and knowledge graph.

At Weotta, we’ve decided the card analogy makes a lot of sense. On mobile we have one stack, and on web up to five cards in a row. Why? This presentation visually focuses the user on a few results at a time while letting us show multi-category results. That’s how you can do a search like dinner drinks and a movie and get three different kinds of results, all mixed together.

Remember facets from earlier? With deep search, facets are hidden to the user, but they’re still essential to the query engine. Instead of relying on explicit checkboxes, the query parser uses natural language understanding to decide which facets to use based on the query. This decision can also be driven by the nature of the data and the product. At Weotta, when we know a query is about food, we use a facet to restrict the results to restaurants. Google does things differently; while they may know that a query has food words, because their data are so much larger and more diverse, they are unable or unwilling to make a clear decision about what kinds of results to show, so you often end up with a mix. For example, I just tried a search for sushi and along with a list of web pages, I got a ribbon of local restaurants, a map, and a knowledge graph box. Since Weotta is focused on local search and “what to do,” we know you’re looking for sushi restaurants, and that’s what we’ll produce for you. Better yet, with Weotta Deep Search, a user can be even more specific and get relevant results for restaurants that have hamachi sushi.
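As a rough illustration of hidden facets, the sketch below picks facets from the words in a query instead of from checkboxes. The `FACET_WORDS` vocabulary and the category names are assumptions invented for this example; a real query parser would rely on much richer natural language understanding than keyword lookup.

```python
# Hypothetical vocabulary: which query words imply which hidden facet.
FACET_WORDS = {
    "restaurants": {"sushi", "dinner", "pizza", "brunch", "hamachi"},
    "bars": {"drinks", "cocktails", "beer"},
    "movies": {"movie", "film"},
}


def choose_facets(query):
    """Return the facets implied by the words in a free-text query, in a stable order."""
    words = set(query.lower().replace(",", " ").split())
    return [facet for facet, vocab in FACET_WORDS.items() if words & vocab]


def search(query, documents):
    """Restrict documents to the inferred facets; fall back to everything if none match."""
    facets = choose_facets(query)
    if not facets:
        return documents
    return [doc for doc in documents if doc["category"] in facets]


docs = [
    {"name": "Hamachi House", "category": "restaurants"},
    {"name": "Roxie Theater", "category": "movies"},
    {"name": "Hardware Store", "category": "shopping"},
]
sushi_results = search("sushi", docs)
night_out = search("dinner drinks and a movie", docs)
```

With this toy data, a query for sushi returns only the restaurant, while dinner drinks and a movie pulls in both the restaurant and the theater.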

Another key to our deep query understanding is context: Who is doing the search? Where are they? What time is it? What’s the weather there right now? What searches have they done in the past? Who are their friends or contacts? What are their stated preferences? What are their implicit preferences?

The answers to these questions could have a significant effect on results. If you know someone is in New York, you may not want to show places or events happening elsewhere. If it’s raining outside, you may want to feature indoor events or nearby places. If you know someone dislikes fast food, you don’t want to show them McDonald’s.
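The three examples above can be sketched as a small filter-and-boost step. The field names (`city`, `indoor`, `category`, `score`), the context keys, and the size of the rain boost are all assumptions chosen for illustration, not a description of how Weotta scores results.

```python
def apply_context(results, context):
    """Filter and re-rank results using simple, hand-written context rules."""
    ranked = []
    for item in results:
        # Hard filters: wrong city or an explicitly disliked category.
        if item["city"] != context["city"]:
            continue
        if item["category"] in context.get("dislikes", set()):
            continue
        score = item.get("score", 0.0)
        # Soft boost: prefer indoor options when it is raining.
        if context.get("raining") and item.get("indoor"):
            score += 1.0
        ranked.append((score, item["name"]))
    # Highest score first; the name breaks ties so the order is deterministic.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked]


results = [
    {"name": "Rooftop Concert", "city": "New York", "category": "music",
     "indoor": False, "score": 1.5},
    {"name": "Museum Late Night", "city": "New York", "category": "art",
     "indoor": True, "score": 1.0},
    {"name": "Burger Shack", "city": "New York", "category": "fast food",
     "indoor": True, "score": 2.0},
    {"name": "Pier Market", "city": "San Francisco", "category": "shopping",
     "indoor": False, "score": 3.0},
]
context = {"city": "New York", "raining": True, "dislikes": {"fast food"}}
ordered = apply_context(results, context)
```

In this toy run the San Francisco result and the fast food place are dropped, and the rain boost moves the museum ahead of the rooftop concert.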

People tend to like what their friends like. It may not be a strong signal, but social proof does matter to almost everyone. Plus, people often do things with their friends and family, so if you take all their preferences into account, you may be able to find more relevant results. In fact, if you use Facebook to sign up for Weotta, you’ll be able to search for places and events your friends like.

Deep Search Stack

Summary

A deep search system goes beyond basic text search and advanced search with the following requirements:

  1. No explicit facets
  2. Multi-category search
  3. Deep feature engineering
  4. Context

To implement these, you’ll need to make use of natural language understanding, machine learning, and big data. It’s even more work to implement than you’d think, but the benefits are quite clear: you can do natural language queries with a simpler interface and get more relevant, personalized results.

As we build ever more machines to adapt to human needs, I believe deep search technology will become an integral part of our daily lives in countless ways. For now, you can get a taste of its capabilities with Weotta.

Avogadro Corp Book Review / AI Speculation

Avogadro Corp: The Singularity Is Closer Than It Appears, by William Hertling, is the first sci-fi book I’ve read with a semi-plausible AI origin story. That’s because the premise isn’t as simple as “increased computing power -> emergent AI”. It’s a much better-defined formula: ever-increasing computing power + powerful language processing + never-ending stream of training data + goal-oriented behavior + deep integration into internet infrastructure -> AI. The AI in the story is called ELOPe, which stands for Email Language Optimization Program, and its function is essentially to improve the quality of emails. WARNING: there will be spoilers below, but only enough to describe ELOPe and speculate about how it might be implemented.

What is ELOPe

The idea behind ELOPe is to provide writing suggestions as a feature of a popular web-based email service. These writing suggestions are designed to improve the outcome of your email, whatever that may be. To take an example from the book, if you’re requesting more compute resources for a project, then ELOPe’s job is to offer writing suggestions that are most likely to get your request approved. By taking into account your own past writings, who you’re sending the email to, and what you’re asking for, it can go as far as completely re-writing the email to achieve the optimal outcome.

Using the existence of ELOPe as a given, the author writes an enjoyable story that is (mostly) technically accurate with plenty of details, without being boring. If you liked Daemon by Daniel Suarez, or you work with any kind of natural language / text-processing technology, you’ll probably enjoy the story. I won’t get into how an email writing suggestion program goes from that to full AI & takes over the world as a benevolent ghost in the wires – for that you need to read the book. What I do want to talk about is how this email optimization system could be implemented.

How ELOPe might work

Let’s start by defining the high-level requirements. ELOPe is an email optimizer, so we have the sender, the receiver, and the email being written as inputs. The output is a re-written email that preserves the “voice” of the sender while using language that will be much more likely to achieve the sender’s desired outcome, given who they’re sending the email to. That means we need the following:

  1. ability to analyze the email to determine what outcome is desired
  2. prior knowledge of how the receiver has responded to other emails with similar outcome topics, in order to know what language produced the best outcomes (and what language produced bad outcomes)
  3. ability to re-write (or generate) an email whose language is consistent with the sender, while also using language optimized to get the best response from the receiver

Topic Analysis

Determining the desired outcome for an email seems to me like a sophisticated combination of topic modeling and deep linguistic parsing. The goal would be to identify the core reason for the email: what is the sender asking for, and what would be an optimal response?

Being able to do this from a single email is probably impossible, but if you have access to thousands, or even millions of email chains, accurate topic modeling is much more do-able. Nearly every email someone sends will have some similarity to past emails sent by other people in similar situations. So you could create feature vectors for every email chain (using deep semantic parsing), then cluster the chains using feature similarity. Now you have topic clusters, and from that you could create training data for thousands of topic classifiers. Once you have the classifiers, you can run those in parallel to determine the most likely topic(s) of a single email.
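Here is a toy version of the "feature vectors, then cluster by similarity" idea, using only the standard library. It swaps in bag-of-words counts for deep semantic parsing and a greedy single-pass clustering for a real clustering algorithm, and the threshold of 0.3 is an arbitrary assumption, so treat it as a sketch of the shape of the pipeline rather than something that would work on real email.

```python
import math
from collections import Counter


def features(text):
    """Bag-of-words counts; a crude stand-in for deep semantic parsing."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is (seed_vector, [member indexes])
    for i, text in enumerate(texts):
        vec = features(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was close enough, so this text seeds a new one.
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


emails = [
    "please approve more compute servers for my project",
    "requesting more compute servers for the search project",
    "lunch on friday at the taco place",
    "want to grab lunch at the taco place on thursday",
]
topic_clusters = cluster(emails)
```

The two resource requests end up in one cluster and the two lunch emails in another. Each cluster could then supply the positive examples for one topic classifier.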

Obviously it would be very difficult to create accurate clusters, and even harder to do so at scale. Language is very fuzzy, humans are inconsistent, and a huge fraction of email is spam. But the core of the necessary technology exists, and can work very well in limited conditions. The ability to parse emails, extract textual features, and cluster & classify feature vectors is functionality that’s available in at least a few modern programming libraries today (e.g. NLTK & scikit-learn for Python). These are areas of software technology that are getting a lot of attention right now, and all signs indicate that attention will only increase over time, so it’s quite likely that the difficulty level will decrease significantly over the next 10 years. Moving on, let’s assume we can do accurate email topic analysis. The next hurdle is outcome analysis.

Outcome Analysis

Once you can determine topics, you need to learn about outcomes. Two email chains about acquiring compute resources have the same topic, but one chain ends with someone successfully getting access to more compute resources, while the other ends in failure. How do you differentiate between these? This sounds like next-generation sentiment analysis. You need to go deeper than simple failure vs. success, positive vs. negative, since you want to know which email chains within a given topic produced the best responses, and what language they have in common. In other words, you need a language model that weights successful outcome language much higher than failure outcome language. The only way I can think of doing this with a decent level of accuracy is massive amounts of human-verified training data. Technically do-able, but very expensive in terms of time and effort.
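One very simple way to "weight successful outcome language higher" is a smoothed log-odds score per word, computed from chains already labeled as successes or failures. This is my own stand-in for illustration, far cruder than anything ELOPe would need, and the example emails and the smoothing value are made up.

```python
import math
from collections import Counter


def outcome_weights(success_texts, failure_texts, smoothing=1.0):
    """Smoothed log-odds per word: positive means the word is more typical of successes."""
    success = Counter(w for t in success_texts for w in t.lower().split())
    failure = Counter(w for t in failure_texts for w in t.lower().split())
    vocab = set(success) | set(failure)
    # Add-one style smoothing so unseen words never produce log(0).
    s_total = sum(success.values()) + smoothing * len(vocab)
    f_total = sum(failure.values()) + smoothing * len(vocab)
    return {
        w: math.log((success[w] + smoothing) / s_total)
        - math.log((failure[w] + smoothing) / f_total)
        for w in vocab
    }


def score(text, weights):
    """Sum of word weights; unknown words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())


# Hypothetical, human-verified training data for a single topic.
successes = ["this will reduce costs for the team",
             "approving this will reduce downtime"]
failures = ["i need more servers now", "i need this now"]
weights = outcome_weights(successes, failures)
```

With this data, a draft like "this will reduce costs" scores above zero and "i need servers now" scores below it, which is the kind of signal a rewriting step could use to prefer one phrasing over another.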

What really pushes the bounds of plausibility is that the language model can’t be universal. Everyone has their own likes, dislikes, biases, and preferences. So you need language models that are specific to individuals, or clusters of individuals that respond similarly on the same topic. Since these clusters are topic specific, every individual would belong to many (topic, cluster) pairs. Given N topics and an average of M clusters within each topic, that’s N*M language models that need to be created. And one of the major plot points of the book falls out naturally: ELOPe needs access to huge amounts of high end compute resources.

This is definitely the least do-able aspect of ELOPe, and I’m ignoring all the implicit conceptual knowledge that would be required to know what an optimal outcome is, but let’s move on 🙂

Language Generation

Assuming that we can do topic & outcome analysis, the final step is using language models to generate more persuasive emails. This is perhaps the simplest part of ELOPe, assuming everything else works well. That’s because natural language generation is the kind of technology that works much better with more data, and it already exists in various forms. Google Translate is a kind of language generator, chatbots have been around for decades, and spammers use software to spin new articles & text based on existing writings. The differences in this case are that every individual would need their own language generator, and it would have to be parameterized with pluggable language models based on the topic, desired outcome, and receiver. But assuming we have good topic & receiver specific outcome analysis, plus hundreds or thousands of emails from the sender to learn from, then generating new emails, or just new phrases within an email, seems almost trivial compared to what I’ve outlined above.
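For a feel of how little code the most basic kind of generator needs, here is a bigram Markov chain trained on a sender's past emails. It only imitates the sender's word sequences and knows nothing about topic, outcome, or receiver, so it is a sketch of the "voice" part alone. The sample emails and function names are hypothetical.

```python
import random
from collections import defaultdict


def train_bigrams(texts):
    """Map each word to the list of words that followed it in the sender's emails."""
    model = defaultdict(list)
    for text in texts:
        words = text.split()
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate(model, start, max_words=12, seed=None):
    """Random walk over the bigram model, stopping at a dead end or the length limit."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words and model.get(words[-1]):
        words.append(rng.choice(model[words[-1]]))
    return " ".join(words)


sender_emails = [
    "thanks for the quick reply",
    "thanks for the update on the project",
    "the project is on track",
]
model = train_bigrams(sender_emails)
sample = generate(model, "thanks", seed=7)
```

Every adjacent pair of words in `sample` appeared somewhere in the training emails, which is why the output sounds like the sender even when the sentence as a whole is new. Plugging in the topic and outcome models would mean biasing which follower gets chosen at each step instead of picking uniformly.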

Final Words

I’m still highly skeptical that strong AI will ever exist. We humans barely understand the mechanisms of our own intelligence, so to think that we can create comparable artificial intelligence smells of hubris. But it can be fun to think about, and the point of sci-fi is to tell stories about possible futures, so I have no doubt various forms of AI will play a strong role in sci-fi stories for years to come.

NLTK 2 Release Highlights

NLTK 2.0.1, a.k.a. NLTK 2, was recently released, and what follows are my favorite changes, new features, and highlights from the ChangeLog.

New Classifiers

The SVMClassifier adds support vector machine classification thru SVMLight with PySVMLight. This is a much needed addition to the set of supported classification algorithms. But even more interesting…

The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.

Github

NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it’s become the de-facto “social coding” site.

Sphinx

Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and restructured text (which I used to write this post), I’m not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you’re looking for, I worry that new/interested users will have a harder time getting started.

New Corpora

Since the 0.9.9 release, a number of new corpora and corpus readers have been added:

ChangeLog Highlights

And here’s a few final highlights:

The Future

I think NLTK’s ideal role is to be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.
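To show what I mean by a common interface, here is a small sketch in plain Python. None of these classes are NLTK's; `ListCorpus`, `TsvCorpus`, `MajorityLabel`, and the `labeled_texts()` method are hypothetical names invented for this example. The point is only that once two differently formatted corpora expose the same method, one algorithm can be trained and evaluated on either.

```python
from collections import Counter


class ListCorpus:
    """Hypothetical reader for an in-memory list of (text, label) pairs."""

    def __init__(self, pairs):
        self._pairs = list(pairs)

    def labeled_texts(self):
        return iter(self._pairs)


class TsvCorpus:
    """Hypothetical reader for lines formatted as 'label<TAB>text'."""

    def __init__(self, lines):
        self._lines = list(lines)

    def labeled_texts(self):
        for line in self._lines:
            label, text = line.split("\t", 1)
            yield text, label


class MajorityLabel:
    """Baseline algorithm: always predict the most common training label."""

    def train(self, corpus):
        counts = Counter(label for _, label in corpus.labeled_texts())
        self.label = counts.most_common(1)[0][0]
        return self

    def classify(self, text):
        return self.label


def accuracy(model, corpus):
    """Fraction of corpus items the model labels correctly."""
    pairs = list(corpus.labeled_texts())
    return sum(model.classify(text) == label for text, label in pairs) / len(pairs)


reviews = ListCorpus([("great", "pos"), ("loved it", "pos"), ("awful", "neg")])
tweets = TsvCorpus(["neg\tso bad", "pos\tnice", "neg\tterrible"])
model = MajorityLabel().train(reviews)
```

The same `accuracy(model, ...)` call works on both `reviews` and `tweets`, and swapping `MajorityLabel` for a smarter algorithm would not require touching either corpus reader.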

Upcoming Talks

At the end of February and the beginning of March, I’ll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order…

How Weotta uses MongoDB

Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about How Weotta uses MongoDB. We’ll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we’ll finish with a preview of Weotta’s upcoming MongoDB powered local search APIs.

NLTK Jam Session at NICAR 2012

On Thursday, February 23, in St Louis, MO, I’ll be demonstrating how to use NLTK as part of the NewsCamp workshop at NICAR 2012. This will be a version of my PyCon NLTK Tutorial with a focus on news text and corpora like treebank.

Corpus Bootstrapping with NLTK at Strata 2012

As part of the Strata 2012 Deep Data program, I’ll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there are plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn’t exist. So how can you quickly create a quality corpus at minimal cost? I’ll cover specific real-world examples to answer this question.

NLTK Tutorial at PyCon 2012

Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You’ll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you’ll walk out with at least one new NLP superpower that you can put to use immediately.