All posts by Jacob

NLP for Log Analysis – Tokenization

This is part 1 of a series of posts based on a presentation I gave at the Silicon Valley Cyber Security Meetup on behalf of my company, Insight Engines. Some of the ideas are speculative and I do not know if they are used in practice. If you have any experience applying these techniques on logs, please share in the comments below.

Natural language processing is the art of applying software algorithms to human language. However, the techniques operate on text, and there’s a lot of text that is not natural language. These techniques have been applied to code authorship classification, so why not apply them to log analysis?

Tokenization

To process any kind of text, you need to tokenize it. For natural language, this means splitting the text into sentences and words. But for logs, the tokens are different. Some tokens may be words, but other tokens could be symbols, timestamps, numbers, and more.

Another difference is punctuation. For many human languages, punctuation is mostly regular and predictable, although social media & short text writing have been challenging this assumption.

Logs come in a whole variety of formats. If you only have one type of log, such as Apache access logs, then you may be able to tokenize it with a regular expression. But when you have multiple types of logs, regular expressions can become overwhelming or even unusable. Many logs are written by humans, and there are few rules or conventions when it comes to formatting or use of punctuation. A generic tokenizer could be a useful first pass at parsing arbitrary logs.

Whitespace Tokenizer

Tokenizing on whitespace is an obvious thing to try first. Here’s an example log and the result when run through my NLTK tokenization demo.

Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644

NLTK Whitespace Tokenizer log example

As you can see, it does get some tokens, but there's punctuation in weird places that would have to be cleaned up later.
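If you want to reproduce this yourself, here's a minimal sketch using NLTK's WhitespaceTokenizer, which should do the same splitting as the demo:

>>> from nltk.tokenize import WhitespaceTokenizer
>>> log = 'Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644'
>>> WhitespaceTokenizer().tokenize(log)
['Sep', '19', '19:18:40', 'acmepayroll', 'syslog:', '04/30/10', '12:18:51', '39480627', 'wksh:', 'HANDLING', 'TELNET', 'CALL', '(User:', 'root,', 'Branch:', 'ABCDE,', 'Client:', '10101)', 'pid=9644']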

Wordpunct Tokenizer

My preferred NLTK tokenizer is the WordPunctTokenizer, since it’s fast and the behavior is predictable: split on whitespace and punctuation. But this is a terrible choice for log tokenization.

NLTK WordPunct Tokenizer log example

Logs are filled with punctuation, so this produces far too many tokens.
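Here's the equivalent call on just part of the log line, if you want to see the token explosion for yourself:

>>> from nltk.tokenize import WordPunctTokenizer
>>> WordPunctTokenizer().tokenize('(User: root, Branch: ABCDE, Client: 10101) pid=9644')
['(', 'User', ':', 'root', ',', 'Branch', ':', 'ABCDE', ',', 'Client', ':', '10101', ')', 'pid', '=', '9644']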

Treebank Tokenizer

Out of curiosity, I tried the TreebankWordTokenizer on the log example. This tokenizer follows the Penn Treebank conventions for tokenizing news text, and it does surprisingly well.

NLTK Treebank Word Tokenizer log example

There's no weird punctuation in the tokens; some punctuation is separated out, and the rest remains within tokens. It all looks pretty logical & useful. This was an unexpected result, and it indicates that logs may be closer to natural language text than one might think.
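You can reproduce this too; note how the timestamp and the pid=9644 pair typically come through as single tokens, while most other punctuation is split off:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokens = TreebankWordTokenizer().tokenize(log)   # same log string as above
>>> '12:18:51' in tokens, 'pid=9644' in tokens
(True, True)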

Next Steps

After tokenization, you’ll want to do something with the log tokens. Maybe extract certain features, and then cluster or classify the log. These topics will be covered in a future post.

Recent Advances in Deep Learning for Natural Language Processing

This article was originally published at The New Stack under the title “How Deep Learning Supercharges Natural Language Processing”.

Voice search, intelligent assistants, and chatbots are becoming common features of modern technology. Users and customers are demanding a better, more human experience when interacting with computers. According to Tableau’s business trends report, IDC predicts that by 2019, intelligent assistants will become commonly accessible to enterprise workers, while Gartner predicts that by 2020, 50 percent of analytics queries will involve some form of natural language processing. Chatbots, intelligent assistants, natural language queries, and voice-enabled applications all involve various forms of natural language processing. To fully realize these new user experiences, we will need to build upon the latest methods, some of which I will cover here.

Let’s start with the basics: what is natural language processing? Natural language processing (NLP) is a collection of techniques for helping machines understand human language. For example, one of the essential techniques is tokenization: breaking up text into “tokens,” such as words. Given individual words in sequence, you can start to apply reason to them, and do things like sentiment analysis to determine if a piece of text is positive or negative. But even a task as simple as word identification can be quite tricky. Is the word what’s really one word or two (what + is, or what + was)? What about languages that use characters to represent multi-word concepts, like Kanji?

Deep learning is an advanced type of machine learning using neural networks. It became popular due to the success of the techniques at solving problems such as image classification (labeling an image based on visual content) and speech recognition (converting sounds into text). Many people thought that deep learning techniques, when applied to natural language, would quickly achieve similar levels of performance. But because of all the idiosyncrasies of natural language, the field has not seen the same kind of breakthrough success with deep learning as other fields, like image processing. However, that appears to be changing. In the past few years, researchers have been applying newer deep learning methods to natural language processing, and I will share some of these recent successes.

Deep learning — through recent improvements to word embeddings, a focus on attention, mobile enablement, and its appearance in the home — is starting to capture natural language processing like it previously captured image processing. In this article, I will cover some recent deep learning-based NLP research successes that have made an impact on the field. Because of these improvements, we will see simpler and more natural user experiences, better software performance, and more powerful home and mobile applications.

Word Embeddings

Words are essential to every natural language processing system. Traditional NLP looks at words as strings, but deep learning techniques can only process numeric vectors. Word embeddings were invented as a way to transform words into vectors, enabling new kinds of mathematical feature analysis. But the vector representation of words is only as good as the text it was trained on.

The more common word embeddings are trained on Wikipedia, but Wikipedia text may not be representative of whatever text you’re processing. It’s generally written as well-structured factual statements, which is nothing like text found on Twitter, and both of these are different from restaurant reviews. So vectors trained on Wikipedia might be mathematically misleading if you use those vectors to analyze a different style of text. Text from the Common Crawl provides a more diverse set of text for training a word embedding model. The FastText library provides some great pre-trained English word vectors, along with tools for training your own. Training your own vectors is essential if you’re processing any language other than English.
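As a rough illustration, here's how you might train your own vectors with gensim's FastText implementation on text from your own domain; the toy sentences and parameters below are just placeholders:

from gensim.models import FastText

# `sentences` should be an iterable of tokenized sentences from the text you
# actually need to process (tweets, reviews, logs, ...), not this toy example.
sentences = [
    ['the', 'pasta', 'was', 'amazing'],
    ['terrible', 'service', 'never', 'again'],
]

# vector_size is the embedding dimension (the keyword is `size` in older gensim releases)
model = FastText(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv['pasta']   # a 100-dimensional vector for the word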

Character level embeddings have also shown surprising results. This technique tries to learn vectors for individual characters, where words would be represented as a composition of the individual character vectors. In an effort to learn how to predict the next character in reviews, researchers discovered a sentiment neuron, which they could control to produce positive or negative review output. Using the sentiment neuron, they were able to beat the previous top accuracy score on the sentiment treebank. This is quite an impressive result for something discovered as a side effect of other research.

CNNs, RNNs, and Attention

Moving beyond vectors, deep learning requires training neural networks for various tasks. Vectors are the input and output, in between are layers of nodes connected together in a network. The nodes represent functions on the input data, with each function taking the input from the previous layer and producing output for the next layer. The structure of the network and how the nodes are connected very much determines the learning capabilities and performance.

In general, the deeper and more complicated a network, the longer it takes to train. When using large datasets, many networks can only be effectively trained using clusters of graphics processors (GPUs), because GPUs are optimized for the necessary floating point math. This puts some types of deep learning outside the reach of anyone not at large companies or institutions that can afford the expensive GPU clusters necessary for deep learning on big data.

Standard neural networks are feedforward networks, where each node in a layer is forward connected to every node in the next layer. A Recurrent Neural Network (RNN) is a network where the nodes in each layer also connect back to the previous layer. This creates a kind of memory that can be great for learning from sequences, such as words in a sentence.

A Convolutional Neural Network (CNN) is a type of feedforward network, but with more layers, and where the forward connections have been manipulated, or convoluted, to achieve certain properties. CNNs tend to be good at extracting position-invariant features, meaning they do not care so much about sequence ordering. Because of this, CNNs can be trained in a more parallel manner, leading to faster training and optimization compared to RNNs.

While CNNs may win in raw speed, both types of neural networks tend to have comparable performance characteristics. In fact, RNNs have a slight edge when it comes to sequence-oriented tasks like part-of-speech tagging, where you are trying to identify the part of speech (such as “noun” or “verb”) for each word in a sentence. For a detailed performance comparison of CNNs and RNNs applied to NLP, see: Comparative Study of CNN and RNN for Natural Language Processing.
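To make the architectural difference concrete, here is a minimal, hypothetical Keras sketch of both styles applied to binary text classification; the layer sizes are arbitrary and nothing here is tied to a particular paper:

import tensorflow as tf

vocab_size, embedding_dim = 10000, 128

# CNN: convolutions extract position-invariant n-gram features in parallel.
cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# RNN: an LSTM reads the tokens in order, keeping a memory of the sequence.
rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])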

The most successful RNN models are the LSTM (long short-term memory) and GRU (gated recurrent unit). These use gates that act as a kind of memory for the network, controlling what gets remembered and what gets forgotten. However, a newer research paper implies that attention may be all you need. By doing away with recurrence and convolutions, and keeping only attention mechanisms, these models can be trained in parallel like a CNN, but even faster, and have comparable or better performance than RNNs on some sequence learning tasks, such as machine translation.
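The core building block of those attention-only models is scaled dot-product attention, which can be sketched in a few lines of NumPy:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each query attends to every key; the softmax weights are used to mix the values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

# toy example: 3 query positions, 4 key/value positions, dimension 8
Q, K, V = np.random.rand(3, 8), np.random.rand(4, 8), np.random.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)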

Reducing the training cost while maintaining comparable performance means that smaller companies and individuals can throw more data at their deep learning models, and potentially compete more effectively with larger companies and institutions.

Software 2.0

One of the nice properties of neural network models is that the core algorithms and math are mostly the same. Once you have the infrastructure, model definition, and training algorithms all set up, these models are very reusable. “Software 2.0” is the idea that significant components of an application or system can be replaced by neural network models. Instead of writing code, developers:

  1. Collect training data
  2. Clean and label the data
  3. Train a model
  4. Integrate the model

While the most interesting parts are often steps three and four, most of the work happens in the data preparation steps one and two. Collecting and curating good, useful, clean data can be a significant amount of work, which is why methods like corpus bootstrapping are important for getting to good data faster. In the long run, it is often easier to make better data than it is to design better algorithms.

The past few years have demonstrated that neural networks can achieve much better performance than many alternatives, sometimes even in areas not traditionally touched by machine learning. One of the most recent and interesting advances is in learning data indexing structures. B-tree indexes are a commonly used data structure that provides an efficient way of finding data, assuming the tree is structured well. However, the newly learned indexes significantly outperformed traditional B-tree indexes in both speed and memory usage. Such low-level data structure performance improvements could have far-reaching impacts if they can be integrated into standard development practices.

As research progresses, and the necessary infrastructure becomes cheaper and more available, deep learning models are likely to be used in more and more parts of the software stack, including mobile applications.

Mobile Machine Learning

Most deep learning requires clusters of expensive GPUs and lots of RAM. This level of compute power is only accessible to those who can afford it, usually in the cloud. But consumers are increasingly using mobile devices, and much of the world does not have reliable and affordable full-time wireless connectivity. Getting machine learning into mobile devices will enable more developers to create all sorts of new applications.

  • Apple’s CoreML framework enables a number of NLP capabilities on iOS devices, such as language identification and named entity recognition.
  • Baidu developed a CNN library for mobile deep learning that works on both iOS and Android.
  • Qualcomm created a Neural Processing Engine for its mobile processors, enabling popular deep learning frameworks to operate on mobile devices.

Expect a lot more of this in the near future, as mobile devices continue to become more powerful and ubiquitous. Marc Andreessen famously said that “software is eating the world,” and now machine learning appears to be eating software. Not only is it in our pocket, it is also in our homes.

Deep Learning in the Home

Alexa and other voice assistants became mainstream in 2017, bringing NLP into millions of homes. Mobile users are already familiar with Siri and Google Assistant, but the popularity of Alexa and Google Home shows how many people have become comfortable having conversations with voice-activated dialogue systems. How much these systems rely on deep learning is somewhat unknown, but it is fairly certain that significant parts of their dialogue systems use deep learning models for core functions such as speech to text, part of speech tagging, natural language generation, and text to speech.

As research advances and these companies collect increasing amounts of data from their users, deep learning capabilities will improve as well, and implementations of “software 2.0” will become pervasive. While a few large companies are creating powerful data moats, there is always room on the edges for highly specialized, domain-specific applications of natural language processing, such as cybersecurity, IT operations, and data analytics.

Deep learning has become a core component of modern natural language processing systems.

However, many traditional natural language processing techniques are still quite effective and useful, especially in areas that lack the huge amounts of training data necessary for deep learning. I will cover these traditional statistical techniques in an upcoming article.

Highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs

Below are more highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages

it is manifestly clear that similarity in form is neither a necessary nor sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences

our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words that are orthographically distant (e.g., October and January)

The goal of our work is not to overcome existing benchmarks, but to show that much of the feature engineering done in the benchmarks can be learnt automatically from the task specific data. More importantly, we wish to show that large dimensionality word lookup tables can be compacted into a lookup table using characters and a compositional model, allowing the model to scale better with the size of the training data. This is a desirable property of the model as data becomes more abundant in many NLP tasks.

The authors have also released Java code for training neural networks.

Using word2vec with NLTK

word2vec is an algorithm for constructing vector representations of words, also known as word embeddings. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector representations. Once you map words into vector space, you can then use vector math to find words that have similar semantics.

gensim provides a nice Python implementation of Word2Vec that works perfectly with NLTK corpora. The model takes a list of sentences, and each sentence is expected to be a list of words. This is exactly what is returned by the sents() method of NLTK corpus readers. So let’s compare the semantics of a couple words in a few different NLTK corpora:

>>> from gensim.models import Word2Vec
>>> from nltk.corpus import brown, movie_reviews, treebank
>>> b = Word2Vec(brown.sents())
>>> mr = Word2Vec(movie_reviews.sents())
>>> t = Word2Vec(treebank.sents())

>>> b.most_similar('money', topn=5)
[('pay', 0.6832243204116821), ('ready', 0.6152011156082153), ('try', 0.5845392942428589), ('care', 0.5826011896133423), ('move', 0.5752171277999878)]
>>> mr.most_similar('money', topn=5)
[('unstoppable', 0.6900672316551208), ('pain', 0.6289106607437134), ('obtain', 0.62665855884552), ('jail', 0.6140228509902954), ('patients', 0.6089504957199097)]
>>> t.most_similar('money', topn=5)
[('short-term', 0.9459682106971741), ('-LCB-', 0.9449775218963623), ('rights', 0.9442864656448364), ('interested', 0.9430986642837524), ('national', 0.9396077990531921)]

>>> b.most_similar('great', topn=5)
[('new', 0.6999611854553223), ('experience', 0.6718623042106628), ('social', 0.6702290177345276), ('group', 0.6684836149215698), ('life', 0.6667487025260925)]
>>> mr.most_similar('great', topn=5)
[('wonderful', 0.7548679113388062), ('good', 0.6538234949111938), ('strong', 0.6523671746253967), ('phenomenal', 0.6296845078468323), ('fine', 0.5932096242904663)]
>>> t.most_similar('great', topn=5)
[('won', 0.9452997446060181), ('set', 0.9445616006851196), ('target', 0.9342271089553833), ('received', 0.9333916306495667), ('long', 0.9224691390991211)]

>>> b.most_similar('company', topn=5)
[('industry', 0.6164317727088928), ('technical', 0.6059585809707642), ('orthodontist', 0.5982754826545715), ('foamed', 0.5929019451141357), ('trail', 0.5763031840324402)]
>>> mr.most_similar('company', topn=5)
[('colony', 0.6689200401306152), ('temple', 0.6546304225921631), ('arrival', 0.6497283577919006), ('army', 0.6339291334152222), ('planet', 0.6184555292129517)]
>>> t.most_similar('company', topn=5)
[('panel', 0.7949466705322266), ('Herald', 0.7674347162246704), ('Analysts', 0.7463694214820862), ('amendment', 0.7282689809799194), ('Treasury', 0.719698429107666)]

I hope it’s pretty clear from the above examples that the semantic similarity of words can vary greatly depending on the textual context. In this case, we’re comparing a wide selection of text from the brown corpus with movie reviews and financial news from the treebank corpus.

Note that if you call most_similar() with a word that was not present in the sentences, you will get a KeyError exception. This can be a common occurrence with smaller corpora like treebank.
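A simple, version-agnostic way to guard against that is to catch the exception:

>>> word = 'selfie'  # hypothetical query word
>>> try:
...     print(t.most_similar(word, topn=5))
... except KeyError:
...     print('%s is not in the treebank vocabulary' % word)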

For more details, see this tutorial on using Word2Vec. And Word Embeddings for Fashion is a great introduction to these concepts, although it uses a different word embedding algorithm called GloVe.

NLTK 3 Changes

NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.

Probability Classes

The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1.

<span class="pre">fd.samples()</span> doesn’t exist anymore. Instead, you can use <span class="pre">fd.most_common()</span>, which is a method of collections.Counter that returns a list that looks like <span class="pre">[(word,</span> <span class="pre">count)]</span>.

ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.
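For example, counting tags with the new API looks like this:

>>> from nltk import FreqDist, ConditionalFreqDist
>>> fd = FreqDist()
>>> for tag in ['NN', 'VB', 'NN']:
...     fd[tag] += 1
...
>>> fd.most_common()
[('NN', 2), ('VB', 1)]
>>> cfd = ConditionalFreqDist()
>>> cfd['news']['NN'] += 1
>>> cfd['news'].most_common()
[('NN', 1)]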

WordNet API

NLTK 3 has changed many wordnet Synset attributes to methods:

  • <span class="pre">syn.definition</span> -> <span class="pre">syn.definition()</span>
  • <span class="pre">syn.examples</span> -> <span class="pre">syn.examples()</span>
  • <span class="pre">syn.lemmas</span> -> <span class="pre">syn.lemmas()</span>
  • <span class="pre">syn.name</span> -> <span class="pre">syn.name()</span>
  • <span class="pre">syn.pos</span> -> <span class="pre">syn.pos()</span>

Same goes for the Lemma class. For example, lemma.antonyms() is now a method.
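A quick interactive example of the new method-based API:

>>> from nltk.corpus import wordnet
>>> syn = wordnet.synsets('book')[0]
>>> syn.name()       # attribute access in NLTK 2, method call in NLTK 3
'book.n.01'
>>> syn.pos()
'n'
>>> definition = syn.definition()   # likewise now a method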

Tagging

The <span class="pre">batch_tag()</span> method is now <span class="pre">tag_sents()</span>. The brill tagger API has changed significantly: <span class="pre">brill.FastBrillTaggerTrainer</span> is now <span class="pre">brill_trainer.BrillTaggerTrainer</span>, and the brill templates have been replaced by the tbl.feature.Feature interface with <span class="pre">brill.Pos</span> or <span class="pre">brill.Word</span> as implementations of the interface.

Universal Tagset

Simplified tags have been replaced with the universal tagset. So tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (PennTreeBank) and the brown tagset is called en-brown. These files are simply 2 column, tab separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) is used to perform the mapping.
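For example:

>>> from nltk.corpus import treebank
>>> treebank.tagged_sents(tagset='universal')[0][:3]
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.')]
>>> from nltk.tag.mapping import map_tag
>>> map_tag('en-ptb', 'universal', 'NNP')
'NOUN'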

Chunking & Parse Trees

The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.
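For example:

>>> from nltk.tree import Tree
>>> np = Tree('NP', [('the', 'DT'), ('cat', 'NN')])
>>> np.label()   # this was np.node in NLTK 2
'NP'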

Classification

The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
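Here's a tiny sketch of wrapping a scikit-learn estimator; the feature sets below are placeholders for whatever features you extract:

>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> from sklearn.naive_bayes import MultinomialNB
>>> train_feats = [({'great': True}, 'pos'), ({'awful': True}, 'neg')]
>>> classifier = SklearnClassifier(MultinomialNB()).train(train_feats)
>>> classifier.classify({'great': True})
'pos'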

Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because hash randomization is enabled by default in Python 3, making dictionary iteration order unpredictable between runs. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.

Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.

Here are a few other Python 3 changes I ran into:

  • <span class="pre">itertools.izip</span> -> <span class="pre">zip</span>
  • <span class="pre">dict.iteritems()</span> doesn’t exist, use <span class="pre">dict.items()</span> instead
  • <span class="pre">dict.keys()</span> does not produce a list (it returns a view). If you want a list, use <span class="pre">dict.dict_keys()</span>

Upgrading

Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.

Python 3 Text Processing with NLTK 3 Cookbook

Python Text Processing with NLTK 3 Cookbook

After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here are some details on the changes & updates in the 2nd edition:

First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.

In Chapter 1, Tokenizing Text and WordNet Basics, I added a recipe for training a sentence tokenizer using the PunktSentenceTokenizer. This is surprisingly easy, and you can find the code in chapter1.py.
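Here's the general idea as a minimal sketch (not the book's exact recipe; the training file name is a placeholder):

from nltk.tokenize import PunktSentenceTokenizer

# Train on a large sample of raw text from your own domain.
raw_text = open('my_corpus.txt').read()            # hypothetical file
sent_tokenizer = PunktSentenceTokenizer(raw_text)  # trains during construction

sentences = sent_tokenizer.tokenize('Dr. Smith arrived late. The meeting had already started.')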

Chapter 2, Replacing and Correcting Words, shows the additional languages supported by the SnowballStemmer. An unfortunate removal from this chapter is babelizer, which was a fun library to use, but is no longer supported by Yahoo.

NLTK 3 replaced <span class="pre">simplify_tags</span> with universal tagset mappings, so I updated Chapter 3, Creating Custom Corpora to show how to use these tagset mappings to get the universal tags.

In Chapter 4, Part-of-Speech Tagging, the last recipe shows how to use train_tagger.py from NLTK-Trainer to replicate most of the tagger training recipes detailed earlier in the chapter. NLTK-Trainer was largely inspired by my experience writing Python Text Processing with NLTK 2.0 Cookbook, after realizing that many aspects of training part-of-speech taggers could be encapsulated in a command line script.

Chapter 5, Extracting Chunks, adds examples for using train_chunker.py to train phrase chunkers.

Chapter 7, Text Classification, adds coverage of train_classifier.py, along with examples of using the SklearnClassifier, which provides access to many of the scikit-learn classification algorithms. The scikit-learn classifiers tend to be at least as accurate as NLTK’s classifiers, are often faster to train, and have much smaller memory & disk footprints. And since NLTK 3 removed support for scipy based MaxentClassifier algorithms and SVM classifiers, the choice of which classifiers to use has become very easy: when in doubt, choose SklearnClassifier (code examples can be found in chapter7.py).

There are a few library changes in Chapter 9, Parsing Specific Data Types:

  • <span class="pre">timex</span> and SimpleParse recipes have been removed due to lack of Python 3 compatibility
  • uses beautifulsoup4 with examples of UnicodeDammit
  • chardet was replaced with charade, which is compatible with both Python 2 & 3. But since publication, charade was merged back into chardet and is no longer maintained. I recommend installing chardet and replacing all instances of the <span class="pre">charade</span> module name with <span class="pre">chardet</span>.

So if you want to learn the latest & greatest NLTK 3, pick up your copy of Python 3 Text Processing with NLTK 3 Cookbook, and check out the code at nltk3-cookbook. If you like the book, please review it at Amazon or goodreads.

Deep Search

As we prepare to launch Weotta, we’ve struggled with how to describe what we’ve built. Is our technology big data? Yes. Do we use machine learning and natural language processing? Yes. Could you call us a search engine? Absolutely. But we think the sum is more than those parts.

We finally decided that the term that best describes what we do is deep search — a concise description of a complex search system that goes far beyond basic text search. Needless to say, we aren’t the only ones in this area by any means; Google and plenty of other companies do various aspects of deep search. But no one has created a deep search system quite like ours — a search technology built to handle the kinds of everyday queries that don’t make sense to a normal text search engine.

Basic Search

Text search engines such as Sphinx or Lucene/Solr use faceted filtering: they index collections of documents, each with a set of fields (often specified in an XML format), so that documents can be efficiently retrieved given a search query and optional facet parameters. (I recommend reading Introduction to Information Retrieval to learn specific implementation details.)

Basic Search Engine

In text indexing you can usually specify different fields, each with its own weight. For instance, if you choose a heavily weighted title field and a lower weighted text field, documents with your search term in their title will get a higher score than documents that just have it in the text field.

To retrieve indexed documents you use search query strings that are analogous to SQL queries, in that there’s usually special syntax to control how the search engine decides which documents to retrieve. For example, you might be able to specify which words are required and which are optional, or how far apart the words can be.

Now, all of this is fine for programmers and technical users, but it’s hardly ideal for typical consumers, who don’t even want to know that special syntax exists, let alone how to use it. Thankfully, a deep search query engine’s superior query parsing and understanding of natural language makes that special syntax unnecessary.

Facets provide ways of filtering search results. A faceted search could look for a specific value in a field, or something more complex like a date/time range or distance from a given point. Facets don’t usually affect a document’s score, but simply reduce the set of documents that get returned. Faceted search can also be called Faceted Navigation, because facets often enable a combination search/browse interface. A basic search system, if it offers facets at all, will generally do so via checkboxes, dropdowns, or similar controls. eBay is a perfect example of this, offering many facets to drill down & filter your search. A deep search system, by contrast, moves facets to the background.
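To make the distinction concrete, here's a toy illustration (plain Python, no search engine, made-up documents) of how facets narrow a result set without touching relevance scores:

documents = [
    {'title': 'Blue running shoes',  'score': 2.1, 'brand': 'Acme', 'price': 59},
    {'title': 'Red running shoes',   'score': 1.8, 'brand': 'Zoom', 'price': 120},
    {'title': 'Trail running shoes', 'score': 1.5, 'brand': 'Acme', 'price': 89},
]

def facet_filter(docs, **facets):
    # Keep only documents whose fields match every requested facet value.
    return [d for d in docs if all(d.get(k) == v for k, v in facets.items())]

results = facet_filter(documents, brand='Acme')
results.sort(key=lambda d: d['score'], reverse=True)   # scores are unaffected by the facets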

These text search engines are powerful, but in the context of deep search, they’re really just another kind of database. Both text search and deep search use indexes to optimize retrieval, but instead of using SQL to retrieve data, text search uses specially formatted query strings and facet specifications. A SQL database is optimized for row-based or column-based data, while a text search engine is optimized for plain text data, using inverted indexes. Either way, from the developer standpoint, both are low-level data stores, best suited for different use cases.

Advanced Search

An advanced search system uses a text search engine at its lowest levels, but integrates additional ranking signals. An obvious example of this is Google’s Page Rank, which combines text search with keyword relevance, website authority, and many other signals in order to sort results. Where basic text search only knows about individual documents, and statistics about collections of documents, an advanced search system also considers external signals like trustworthiness, popularity and link strength. Amazon, for instance, lets users sort results by average rating or popularity. But this still isn’t deep search, because there’s no deeper understanding of the data or the query, just more powerful controls for sorting results.

Deep Search

I believe deep search has four fundamental requirements:

  1. A simple search input. This means natural language understanding (NLU) of queries, so that lower levels of the system know which facets to invoke.
  2. Multi-category search. If you’re only searching for one thing, your search system can be relatively simple. But as soon as a search contains multiple variables with no explicit facets given by a user, you need NLU to know precisely what’s being searched for, and how to search for it. You also need to effectively and automatically integrate multiple data sets into one system.
  3. Feature engineering for deep data understanding. Contrary to popular belief, big data isn’t enough. Simply having access to tons of data doesn’t automatically mean you know how to get meaningful insights out of it. A good metaphor is that of an iceberg: users can only see the tip, while most of the berg lies hidden below the water. In this metaphor, data is the ice, and feature engineering is how you shape the ice below the water, in order to surface the best results where users will see them.
  4. Contextual understanding. The more you know about the user, the more knowledge you have with which to tailor search results. This could mean knowing the user’s location, their past search history, and/or explicit preferences. Context is king!

Many of today’s search systems don’t meet any of these requirements. Some implement one or two, but very few meet them all. Siri has device context and does NLU to understand queries, but instead of actually doing the search, it routes it to another application or search engine. Google and Weotta meet all the requirements, but have very different implementations, approaches, and use cases.

How does one build a deep search system? As with simple text search, there are two major stages: indexing and querying. Here’s an overview of both, from a deep search perspective.

Deep Search

Deep Indexing

Deep search requires a deep understanding of your data: what it is, what it looks like, what it’s good for, and how to transform it into a format that machines can understand. Here are a few examples:

  • places have addresses and geographic points
  • products have a weight and size
  • movies have actors and directors

Once you’ve got your low-level data structure, you transform it into a document structure suitable for text and facet indexing. But deep search also requires higher level knowledge and understanding, which is where feature engineering comes into play. You have to think deeply about what kinds of searches your customers may do, and what level of quality they expect in the results. Then you have to figure out how to translate that into indexable document features.

Here are two examples of this thinking.

A restaurant serves chicken wings. Okay, but are they any good? How much do people like or dislike them? Are they the best in the city? Questions like this could be answered through a twist on menu-based sentiment analysis.

A specific concert may be a one-time event, but the bands have probably played other shows before. How did people like those previous gigs? What are their fans’ general demographics? What’s the venue like? Answering these questions may require combining multiple datasets in order to cross-correlate performers with concerts and venues.

Deep indexing is all about answering these kinds of questions, and converting the answers into values that are usable for ranking and/or filtering search results. This may involve applied data science, linear regression or sentiment analysis. There’s no specific methodology, because the questions and answers depend on the nature of your data and what kind of results you need. But with the proper methods, you can achieve insights that weren’t possible before. For example, with latent semantic analysis you can discover features that aren’t explicit in the data, which allows queries that would be impossible with basic text indexing. Unsurprisingly, you can expect to spend most of your time deep in the data trenches. To quote Pedro Domingos, from his paper A Few Useful Things to Know about Machine Learning:

“First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design.”

“70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.”

A major part of feature engineering is getting more data and better data. You need large, diverse datasets to get the necessary context. In Weotta’s case, that includes geographic info, demographic profiles, POI and location databases, and the social graph. But you also need a deep understanding of how to integrate and correlate this data, which machine learning algorithms to apply, and most important, which questions to ask of it and which can be answered. All of this goes into engineering an integrated system that can do so automatically. “We don’t have better algorithms,” says Google Research Director Peter Norvig. “We just have more data.”

At Weotta, we believe that high-quality data is paramount, so we spend a surprising amount of effort filtering out noisy data to extract meaningful signals. A huge part of any significant feature engineering, in fact, is data cleansing. After all, garbage in, garbage out.

You also need an automated process for continuous learning. As data comes in and is integrated, your system should automatically improve. “Machine learning isn’t a one-shot process of building a data set and running a learner,” says Pedro Domingos, “but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.”

And people are an essential part of this process. You must be able to incorporate human knowledge and expertise into your data pipeline at almost every level; it is the right balance and combination of humans and machines that will determine a deep search system’s true capabilities and ability to adapt to change.

Deep Querying

Once you’ve got a deep index powered by deep data, you need to use it effectively. Simple text queries won’t suffice; you need to understand exactly what you’re searching for in order to get the right results. That means query parsing and natural language understanding.

We’ve spent a lot of time at Weotta refining our query parsing to handle queries such as restaurants for my anniversary or concerts happening this weekend for a date. Other search systems have different query parsing abilities: Siri recognizes the word call plus a name, while Google Knowledge Graph can recognize almost any entity in Wikipedia.
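As a highly simplified, purely hypothetical illustration of what query parsing means here (Weotta's real parser is far more sophisticated, and these keyword tables are invented for the example):

import re

# Hypothetical keyword tables mapping query words to categories and facets.
CATEGORY_KEYWORDS = {'restaurants': 'places', 'restaurant': 'places',
                     'concerts': 'events', 'movies': 'events'}
OCCASION_KEYWORDS = {'anniversary', 'birthday', 'date'}

def parse_query(query):
    # Map a natural language query onto search categories and facets.
    words = re.findall(r"\w+", query.lower())
    parsed = {'categories': set(), 'facets': {}}
    for word in words:
        if word in CATEGORY_KEYWORDS:
            parsed['categories'].add(CATEGORY_KEYWORDS[word])
        if word in OCCASION_KEYWORDS:
            parsed['facets']['occasion'] = word
    if 'this weekend' in query.lower():
        parsed['facets']['date_range'] = 'this_weekend'
    return parsed

print(parse_query('restaurants for my anniversary'))
# {'categories': {'places'}, 'facets': {'occasion': 'anniversary'}}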

Once you’ve parsed the query and know what to search for, the next step is retrieving results. Since we’re doing multi-category search, that means querying multiple indexes. At this point the NLU query parsing becomes essential, because you need to know what kinds of query parameters each index supports, so the system can slice and dice the query intelligently.

But if you’re retrieving different kinds of information, how do you compose them into one set of results? How do you rank and order different kinds of things? These are fundamentally interface design and user experience questions. Google uses different parts of their results page for different kinds of results, such as maps and knowledge graph.

At Weotta, we’ve decided the card analogy makes a lot of sense. On mobile we have one stack, and on web up to five cards in a row. Why? This presentation visually focuses the user on a few results at a time while letting us show multi-category results. That’s how you can do a search like dinner drinks and a movie and get three different kinds of results, all mixed together.

Remember facets from earlier? With deep search, facets are hidden to the user, but they’re still essential to the query engine. Instead of relying on explicit checkboxes, the query parser uses natural language understanding to decide which facets to use based on the query. This decision can also be driven by the nature of the data and the product. At Weotta, when we know a query is about food, we use a facet to restrict the results to restaurants. Google does things differently; while they may know that a query has food words, because their data are so much larger and more diverse, they are unable or unwilling to make a clear decision about what kinds of results to show, so you often end up with a mix. For example, I just tried a search for sushi and along with a list of web pages, I got a ribbon of local restaurants, a map, and a knowledge graph box. Since Weotta is focused on local search and “what to do,” we know you’re looking for sushi restaurants, and that’s what we’ll produce for you. Better yet, with Weotta Deep Search, a user can be even more specific and get relevant results for restaurants that have hamachi sushi.

Another key to our deep query understanding is context: Who is doing the search? Where are they? What time is it? What’s the weather there right now? What searches have they done in past? Who are their friends or contacts? What are their stated preferences? What are their implicit preferences?

The answers to these questions could have a significant effect on results. If you know someone is in New York, you may not want to show places or events happening elsewhere. If it’s raining outside, you may want to feature indoor events or nearby places. If you know someone dislikes fast food, you don’t want to show them McDonald’s.

People tend to like what their friends like. It may not be a strong signal, but social proof does matter to almost everyone. Plus, people often do things with their friends and family, so if you take all their preferences into account, you may be able to find more relevant results. In fact, if you use Facebook to sign up for Weotta, you’ll be able to search for places and events your friends like.

Deep Search Stack

Summary

A deep search system goes beyond basic text search and advanced search with the following requirements:

  1. No explicit facets
  2. Multi-category search
  3. Deep feature engineering
  4. Context

To implement these, you’ll need to make use of natural language understanding, machine learning, and big data. It’s even more work to implement than you’d think, but the benefits are quite clear: you can do natural language queries with a simpler interface and get more relevant, personalized results.

As we build ever more machines to adapt to human needs, I believe deep search technology will become an integral part of our daily lives in countless ways. For now, you can get a taste of its capabilities with Weotta.