My company, Insight Engines, recently announced Series A funding to make big data easily queryable by everyone. We’re bringing natural language technology to the cybersecurity domain, so you can use plain English search queries to navigate large datasets for security investigations. If you’re also interested in the intersection between NLP and cybersecurity, we’re hiring.
We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs
Below are more highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages
it is manifestly clear that similarity in form is neither a necessary nor sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences
our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words are orthographically distant (e.g., October and January)
The goal of our work is not to overcome existing benchmarks, but show that much of the feature engineering done in the benchmarks can be learnt automatically from the task specific data. More importantly, we wish to show large dimensionality word look tables can be compacted into a lookup table using characters and a compositional model allowing the model scale better with the size of the training data. This is a desirable property of the model as data becomes more abundant in many NLP tasks.
The authors have also released Java code for training neural networks.
word2vec is an algorithm for constructing vector representations of words, also known as word embeddings. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector representations. Once you map words into vector space, you can then use vector math to find words that have similar semantics.
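To make the "vector math" concrete, here is a minimal sketch of ranking words by cosine similarity. The three-dimensional vectors are invented for illustration (real word2vec embeddings are learned and usually have a hundred or more dimensions), and `most_similar` here is a toy stand-in, not the gensim method of the same name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "money": [0.9, 0.1, 0.0],
    "cash": [0.8, 0.2, 0.1],
    "planet": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("money", toy_vectors)
```

With these made-up vectors, "cash" ranks above "planet" for "money" because its vector points in nearly the same direction.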
gensim provides a nice Python implementation of Word2Vec that works perfectly with NLTK corpora. The model takes a list of sentences, and each sentence is expected to be a list of words. This is exactly what is returned by the sents() method of NLTK corpus readers. So let’s compare the semantics of a couple words in a few different NLTK corpora:
>>> from gensim.models import Word2Vec
>>> from nltk.corpus import brown, movie_reviews, treebank
>>> b = Word2Vec(brown.sents())
>>> mr = Word2Vec(movie_reviews.sents())
>>> t = Word2Vec(treebank.sents())
>>> b.most_similar('money', topn=5)
[('pay', 0.6832243204116821), ('ready', 0.6152011156082153), ('try', 0.5845392942428589), ('care', 0.5826011896133423), ('move', 0.5752171277999878)]
>>> mr.most_similar('money', topn=5)
[('unstoppable', 0.6900672316551208), ('pain', 0.6289106607437134), ('obtain', 0.62665855884552), ('jail', 0.6140228509902954), ('patients', 0.6089504957199097)]
>>> t.most_similar('money', topn=5)
[('short-term', 0.9459682106971741), ('-LCB-', 0.9449775218963623), ('rights', 0.9442864656448364), ('interested', 0.9430986642837524), ('national', 0.9396077990531921)]
>>> b.most_similar('great', topn=5)
[('new', 0.6999611854553223), ('experience', 0.6718623042106628), ('social', 0.6702290177345276), ('group', 0.6684836149215698), ('life', 0.6667487025260925)]
>>> mr.most_similar('great', topn=5)
[('wonderful', 0.7548679113388062), ('good', 0.6538234949111938), ('strong', 0.6523671746253967), ('phenomenal', 0.6296845078468323), ('fine', 0.5932096242904663)]
>>> t.most_similar('great', topn=5)
[('won', 0.9452997446060181), ('set', 0.9445616006851196), ('target', 0.9342271089553833), ('received', 0.9333916306495667), ('long', 0.9224691390991211)]
>>> b.most_similar('company', topn=5)
[('industry', 0.6164317727088928), ('technical', 0.6059585809707642), ('orthodontist', 0.5982754826545715), ('foamed', 0.5929019451141357), ('trail', 0.5763031840324402)]
>>> mr.most_similar('company', topn=5)
[('colony', 0.6689200401306152), ('temple', 0.6546304225921631), ('arrival', 0.6497283577919006), ('army', 0.6339291334152222), ('planet', 0.6184555292129517)]
>>> t.most_similar('company', topn=5)
[('panel', 0.7949466705322266), ('Herald', 0.7674347162246704), ('Analysts', 0.7463694214820862), ('amendment', 0.7282689809799194), ('Treasury', 0.719698429107666)]
I hope it’s pretty clear from the above examples that the semantic similarity of words can vary greatly depending on the textual context. In this case, we’re comparing a wide selection of text from the brown corpus with movie reviews and financial news from the treebank corpus.
Note that if you call most_similar() with a word that was not present in the sentences, you will get a KeyError exception. This can be a common occurrence with smaller corpora like
NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.
The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1. fd.samples() doesn’t exist anymore. Instead, you can use fd.most_common(), which is a method of collections.Counter that returns a list that looks like [(word, count)].
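As a small illustration of the new counting idiom, the sketch below uses a plain collections.Counter as a stand-in for FreqDist (which inherits from it in NLTK 3), so it runs without NLTK installed. The tag list is made up.

```python
from collections import Counter

# Counter stands in for nltk.FreqDist here, which inherits from it in NLTK 3.
fd = Counter()
for tag in ["NN", "VB", "NN", "DT", "NN", "VB"]:
    fd[tag] += 1          # replaces the old fd.inc(tag)

pairs = fd.most_common()  # list of (sample, count), most frequent first
samples = [sample for sample, count in pairs]  # roughly what fd.samples() gave
```

If you only need the sample names, unpacking the pairs from most_common() as shown is one way to recover a list similar to what samples() used to return.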
NLTK 3 has changed many wordnet Synset attributes to methods:
Same goes for the Lemma class. For example, lemma.antonyms() is now a method.
The batch_tag() method is now tag_sents(). The brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface with brill.Pos or brill.Word as implementations of the interface.
Simplified tags have been replaced with the universal tagset. So tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (PennTreeBank) and the brown tagset is called en-brown. These files are simply 2 column, tab separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source tag) is used to perform the mapping.
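To show what a two-column, tab-separated mapping amounts to, here is a rough sketch that parses such text and applies it to a tagged sentence. It does not call NLTK. The sample rows, the helper names, and the "X" fallback for unmapped tags are all assumptions made for illustration, not the contents of the real mapping files.

```python
# Hypothetical excerpt in the two-column, tab-separated layout described above.
SAMPLE_MAPPING = "NN\tNOUN\nNNS\tNOUN\nVB\tVERB\nJJ\tADJ\n"

def load_tag_mapping(text):
    """Parse 'source<TAB>universal' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        source_tag, universal_tag = line.split("\t")
        mapping[source_tag] = universal_tag
    return mapping

def to_universal(tagged_sent, mapping, unknown="X"):
    """Map each (word, tag) pair; unmapped tags fall back to `unknown`."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]

mapping = load_tag_mapping(SAMPLE_MAPPING)
converted = to_universal([("dogs", "NNS"), ("bark", "VB"), ("!", "??")], mapping)
```

In real code you would let the corpus reader or map_tag do this for you. The sketch is only meant to show that the mapping itself is a simple lookup.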
Chunking & Parse Trees
The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.
The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because dictionary ordering is random in Python 3. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.
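One way to see the effect of PYTHONHASHSEED is to start fresh interpreters with the variable set and compare a string hash, which is the value the seed actually pins down. This is only a sketch: the helper name is made up, and it assumes Python 3.7 or later for the capture_output argument.

```python
import os
import subprocess
import sys

def string_hash_in_fresh_interpreter(seed):
    """Start a new Python process with PYTHONHASHSEED set and return hash('nltk')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('nltk'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

# Two separate processes with the same seed report the same hash.
first = string_hash_in_fresh_interpreter(0)
second = string_hash_in_fresh_interpreter(0)
```

Without the variable, each new process would normally pick its own random seed, so anything that depends on hash order can differ from run to run.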
Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
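Converting a byte string is a one-line decode call. The sketch below assumes UTF-8, which may not match every corpus, and ensure_text is a hypothetical helper rather than anything NLTK provides.

```python
raw = b"raw string"            # the kind of value some corpus functions return

text = raw.decode("utf-8")     # assumes UTF-8; use the corpus's real encoding

def ensure_text(value, encoding="utf-8"):
    """Return str unchanged and decode bytes; hypothetical helper."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

A helper like this can be handy when the same code path sometimes receives bytes and sometimes str.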
Here’s a few other Python 3 changes I ran into:
dict.iteritems() doesn’t exist, use
dict.keys() does not produce a list (it returns a view). If you want a list, use
Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.
After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here’s some details on the changes & updates in the 2nd edition:
First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.
In Chapter 1, Tokenizing Text and WordNet Basics, I added a recipe for training a sentence tokenizer using the PunktSentenceTokenizer. This is surprisingly easy, and you can find the code in chapter1.py.
Chapter 2, Replacing and Correcting Words, shows the additional languages supported by the SnowballStemmer. An unfortunate removal from this chapter is babelizer, which was a fun library to use, but is no longer supported by Yahoo.
NLTK 3 replaced simplify_tags with universal tagset mappings, so I updated Chapter 3, Creating Custom Corpora, to show how to use these tagset mappings to get the universal tags.
In Chapter 4, Part-of-Speech Tagging, the last recipe shows how to use train_tagger.py from NLTK-Trainer to replicate most of the tagger training recipes detailed earlier in the chapter. NLTK-Trainer was largely inspired by my experience writing Python Text Processing with NLTK 2.0 Cookbook, after realizing that many aspects of training part-of-speech taggers could be encapsulated in a command line script.
Chapter 5, Extracting Chunks, adds examples for using train_chunker.py to train phrase chunkers.
Chapter 7, Text Classification, adds coverage of train_classifier.py, along with examples of using the SklearnClassifier, which provides access to many of the scikit-learn classification algorithms. The scikit-learn classifiers tend to be at least as accurate as NLTK’s classifiers, are often faster to train, and have much smaller memory & disk footprints. And since NLTK 3 removed support for scipy based MaxentClassifier algorithms and SVM classifiers, the choice of which classifiers to use has become very easy: when in doubt, choose SklearnClassifier (code examples can be found in chapter7.py).
There are a few library changes in Chapter 9, Parsing Specific Data Types:
- timex and SimpleParse recipes have been removed due to lack of Python 3 compatibility
- uses beautifulsoup4 with examples of UnicodeDammit
- chardet was replaced with charade, which is compatible with both Python 2 & 3. But since publication, charade was merged back into chardet and is no longer maintained. I recommend installing chardet and replacing all instances of the charade module name with
So if you want to learn the latest & greatest NLTK 3, pick up your copy of Python 3 Text Processing with NLTK 3 Cookbook, and check out the code at nltk3-cookbook. If you like the book, please review it at Amazon or goodreads.
As we prepare to launch Weotta, we’ve struggled with how to describe what we’ve built. Is our technology big data? Yes. Do we use machine learning and natural language processing? Yes. Could you call us a search engine? Absolutely. But we think the sum is more than those parts.
We finally decided that the term that best describes what we do is deep search — a concise description of a complex search system that goes far beyond basic text search. Needless to say, we aren’t the only ones in this area by any means; Google, and plenty of other companies do various aspects of deep search. But no one has created a deep search system quite like ours — a search technology built to handle the kinds of everyday queries that don’t make sense to a normal text search engine.
Text search engines such as sphinx or lucene/solr use faceted filtering: collections of documents which each have a set of fields, often specified in an XML format and each indexed so that they can be efficiently retrieved given a search query and optional facet parameters. (I recommend reading Introduction to Information Retrieval to learn specific implementation details.)
In text indexing you can usually specify different fields, each with its own weight. For instance, if you choose a heavily weighted title field and a lower weighted text field, documents with your search term in their title will get a higher score than documents that just have it in the text field.
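Here is a toy sketch of field weighting, assuming a title field weighted three times as heavily as the text field. The weights, documents, and scoring function are invented for illustration. Real engines use more sophisticated relevance formulas than raw term counts.

```python
# Hypothetical field weights: a title match counts three times a text match.
FIELD_WEIGHTS = {"title": 3.0, "text": 1.0}

def score(document, term, weights=FIELD_WEIGHTS):
    """Sum weight * term frequency over each weighted field."""
    term = term.lower()
    total = 0.0
    for field, weight in weights.items():
        words = document.get(field, "").lower().split()
        total += weight * words.count(term)
    return total

docs = [
    {"title": "City guide", "text": "The best sushi and ramen"},
    {"title": "Sushi guide", "text": "Where to eat in the city"},
]
ranked = sorted(docs, key=lambda d: score(d, "sushi"), reverse=True)
```

The document with the term in its title sorts first, even though both documents mention the term exactly once.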
To retrieve indexed documents you use search query strings that are analogous to SQL queries, in that there’s usually special syntax to control how the search engine decides which documents to retrieve. For example, you might be able to specify which words are required and which are optional, or how far apart the words can be.
Now, all of this is fine for programmers and technical users, but it’s hardly ideal for typical consumers, who don’t even want to know that special syntax exists, let alone how to use it. Thankfully, a deep search query engine’s superior query parsing and understanding of natural language makes that special syntax unnecessary.
Facets provide ways of filtering search results. A faceted search could look for a specific value in the field, or something more complex like a date/time range or distance from a given point. Facets don’t usually affect a document’s score, but simply reduce the set of documents that get returned. Faceted search can also be called Faceted Navigation, because facets often enable a combination search/browse interface. A basic search system, if it offers facets at all, will generally do so via checkboxes, dropdowns or similar controls. eBay is a perfect example of this, offering many facets to drill down & filter your search. A deep search system, by contrast, moves facets to the background.
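A facet can be sketched as a plain filter that narrows the result set without touching scores. The event records, field names, and the apply_facets helper below are hypothetical. They show a category facet combined with a date-range facet.

```python
from datetime import date

events = [
    {"name": "Jazz night", "category": "concert", "day": date(2014, 3, 1)},
    {"name": "Food fair", "category": "festival", "day": date(2014, 3, 8)},
    {"name": "Rock show", "category": "concert", "day": date(2014, 4, 2)},
]

def apply_facets(documents, category=None, start=None, end=None):
    """Keep documents matching every supplied facet; scores are untouched."""
    results = []
    for doc in documents:
        if category is not None and doc["category"] != category:
            continue
        if start is not None and doc["day"] < start:
            continue
        if end is not None and doc["day"] > end:
            continue
        results.append(doc)
    return results

march_concerts = apply_facets(
    events, category="concert", start=date(2014, 3, 1), end=date(2014, 3, 31)
)
```

In a checkbox-style interface the user sets these arguments explicitly. In a deep search system something else has to choose them.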
These text search engines are powerful, but in the context of deep search, they’re really just another kind of database. Both text search and deep search use indexes to optimize retrieval, but instead of using SQL to retrieve data, text search uses specially formatted query strings and facet specifications. A SQL database is optimized for row-based or column-based data, while a text search engine is optimized for plain text data, using inverted indexes. Either way, from the developer standpoint, both are low-level data stores, best suited for different use cases.
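For readers who have not seen one, an inverted index can be sketched in a few lines: it maps each token to the documents that contain it, so a query becomes a set intersection. This is a bare-bones illustration with made-up documents. It does no stemming, ranking, or phrase handling.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)

documents = {
    1: "best chicken wings in town",
    2: "chicken soup recipes",
    3: "live jazz in town",
}
index = build_inverted_index(documents)
```

Calling search(index, "chicken town") intersects the two posting sets and leaves only document 1.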
An advanced search system uses a text search engine at its lowest levels, but integrates additional ranking signals. An obvious example of this is Google’s PageRank, which combines text search with keyword relevance, website authority, and many other signals in order to sort results. Where basic text search only knows about individual documents, and statistics about collections of documents, an advanced search system also considers external signals like trustworthiness, popularity and link strength. Amazon, for instance, lets users sort results by average rating or popularity. But this still isn’t deep search, because there’s no deeper understanding of the data or the query, just more powerful controls for sorting results.
I believe deep search has four fundamental requirements:
- A simple search input. This means natural language understanding (NLU) of queries, so that lower levels of the system know which facets to invoke.
- Multi-category search. If you’re only searching for one thing, your search system can be relatively simple. But as soon as a search contains multiple variables with no explicit facets given by a user, you need NLU to know precisely what’s being searched for, and how to search for it. You also need to effectively and automatically integrate multiple data sets into one system.
- Feature engineering for deep data understanding. Contrary to popular belief, big data isn’t enough. Simply having access to tons of data doesn’t automatically mean you know how to get meaningful insights out of it. A good metaphor is that of an iceberg: users can only see the tip, while most of the berg lies hidden below the water. In this metaphor, data is the ice, and feature engineering is how you shape the ice below the water, in order to surface the best results where users will see them.
- Contextual understanding. The more you know about the user, the more knowledge you have with which to tailor search results. This could mean knowing the user’s location, their past search history, and/or explicit preferences. Context is king!
Many of today’s search systems don’t meet any of these requirements. Some implement one or two, but very few meet them all. Siri has device context and does NLU to understand queries, but instead of actually doing the search, it routes it to another application or search engine. Google and Weotta meet all the requirements, but have very different implementations, approaches, and use cases.
How does one build a deep search system? As with simple text search, there are two major stages: indexing and querying. Here’s an overview of both, from a deep search perspective.
Deep search requires a deep understanding of your data: what it is, what it looks like, what it’s good for, and how to transform it into a format that machines can understand. Here are a few examples:
- places have addresses and geographic points
- products have a weight and size
- movies have actors and directors
Once you’ve got your low-level data structure, you transform it into a document structure suitable for text and facet indexing. But deep search also requires higher level knowledge and understanding, which is where feature engineering comes into play. You have to think deeply about what kinds of searches your customers may do, and what level of quality they expect in the results. Then you have to figure out how to translate that into indexable document features.
Here are two examples of this thinking.
A restaurant serves chicken wings. Okay, but are they any good? How much do people like or dislike them? Are they the best in the city? Questions like this could be answered through a twist on menu-based sentiment analysis.
A specific concert may be a one-time event, but the bands have probably played other shows before. How did people like those previous gigs? What are their fans’ general demographics? What’s the venue like? Answering these questions may require combining multiple datasets in order to cross-correlate performers with concerts and venues.
Deep indexing is all about answering these kinds of questions, and converting the answers into values that are usable for ranking and/or filtering search results. This may involve applied data science, linear regression or sentiment analysis. There’s no specific methodology, because the questions and answers depend on the nature of your data and what kind of results you need. But with the proper methods, you can achieve insights that weren’t possible before. For example, with latent semantic analysis you can discover features that aren’t explicit in the data, which allows queries that would be impossible with basic text indexing. Unsurprisingly, you can expect to spend most of your time deep in the data trenches. To quote Pedro Domingos, from his paper A Few Useful Things to Know about Machine Learning:
“First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design.”
“70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.”
A major part of feature engineering is getting more data and better data. You need large, diverse datasets to get the necessary context. In Weotta’s case, that includes geographic info, demographic profiles, POI and location databases, and the social graph. But you also need a deep understanding of how to integrate and correlate this data, which machine learning algorithms to apply, and most important, which questions to ask of it and which can be answered. All of this goes into engineering an integrated system that can do so automatically. “We don’t have better algorithms,” says Google Research Director Peter Norvig. “We just have more data.”
At Weotta, we believe that high-quality data is paramount, so we spend a surprising amount of effort filtering out noisy data to extract meaningful signals. A huge part of any significant feature engineering, in fact, is data cleansing. After all, garbage in, garbage out.
You also need an automated process for continuous learning. As data comes in and is integrated, your system should automatically improve. “Machine learning isn’t a one-shot process of building a data set and running a learner,” says Pedro Domingos, “but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.”
And people are an essential part of this process. You must be able to incorporate human knowledge and expertise into your data pipeline at almost every level; it is the right balance and combination of humans and machines that will determine a deep search system’s true capabilities and ability to adapt to change.
Once you’ve got a deep index powered by deep data, you need to use it effectively. Simple text queries won’t suffice; you need to understand exactly what you’re searching for in order to get the right results. That means query parsing and natural language understanding.
We’ve spent a lot of time at Weotta refining our query parsing to handle queries such as restaurants for my anniversary or concerts happening this weekend for a date. Other search systems have different query parsing abilities: Siri recognizes the word call plus a name, while Google Knowledge Graph can recognize almost any entity in Wikipedia.
Once you’ve parsed the query and know what to search for, the next step is retrieving results. Since we’re doing multi-category search, that means querying multiple indexes. At this point the NLU query parsing becomes essential, because you need to know what kinds of query parameters each index supports, so the system can slice and dice the query intelligently.
But if you’re retrieving different kinds of information, how do you compose them into one set of results? How do you rank and order different kinds of things? These are fundamentally interface design and user experience questions. Google uses different parts of their results page for different kinds of results, such as maps and knowledge graph.
At Weotta, we’ve decided the card analogy makes a lot of sense. On mobile we have one stack, and on web up to five cards in a row. Why? This presentation visually focuses the user on a few results at a time while letting us show multi-category results. That’s how you can do a search like dinner drinks and a movie and get three different kinds of results, all mixed together.
Remember facets from earlier? With deep search, facets are hidden to the user, but they’re still essential to the query engine. Instead of relying on explicit checkboxes, the query parser uses natural language understanding to decide which facets to use based on the query. This decision can also be driven by the nature of the data and the product. At Weotta, when we know a query is about food, we use a facet to restrict the results to restaurants. Google does things differently; while they may know that a query has food words, because their data are so much larger and more diverse, they are unable or unwilling to make a clear decision about what kinds of results to show, so you often end up with a mix. For example, I just tried a search for sushi and along with a list of web pages, I got a ribbon of local restaurants, a map, and a knowledge graph box. Since Weotta is focused on local search and “what to do,” we know you’re looking for sushi restaurants, and that’s what we’ll produce for you. Better yet, with Weotta Deep Search, a user can be even more specific and get relevant results for restaurants that have hamachi sushi.
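As a deliberately naive sketch of the idea that a query parser, rather than a checkbox, picks the facets, the code below infers category facets from query words with a keyword lookup. The vocabulary and function name are hypothetical, and this is not how Weotta or Google implement query understanding. Real NLU goes far beyond matching individual words.

```python
# Hypothetical vocabulary; a real system would use far richer NLU than keyword lookup.
CATEGORY_TERMS = {
    "restaurant": {"sushi", "dinner", "restaurants", "brunch"},
    "bar": {"drinks", "cocktails"},
    "movie": {"movie", "movies", "film"},
}

def infer_category_facets(query):
    """Return categories whose vocabulary matches a query word, in query order."""
    facets = []
    for word in query.lower().split():
        for category, terms in CATEGORY_TERMS.items():
            if word in terms and category not in facets:
                facets.append(category)
    return facets

single = infer_category_facets("sushi")
multi = infer_category_facets("dinner drinks and a movie")
```

The multi-category query yields three facets, each of which could then be sent to its own index.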
Another key to our deep query understanding is context: Who is doing the search? Where are they? What time is it? What’s the weather there right now? What searches have they done in the past? Who are their friends or contacts? What are their stated preferences? What are their implicit preferences?
The answers to these questions could have a significant effect on results. If you know someone is in New York, you may not want to show places or events happening elsewhere. If it’s raining outside, you may want to feature indoor events or nearby places. If you know someone dislikes fast food, you don’t want to show them McDonald’s.
People tend to like what their friends like. It may not be a strong signal, but social proof does matter to almost everyone. Plus, people often do things with their friends and family, so if you take all their preferences into account, you may be able to find more relevant results. In fact, if you use Facebook to sign up for Weotta, you’ll be able to search for places and events your friends like.
A deep search system goes beyond basic text search and advanced search with the following requirements:
- No explicit facets
- Multi-category search
- Deep feature engineering
To implement these, you’ll need to make use of natural language understanding, machine learning, and big data. It’s even more work to implement than you’d think, but the benefits are quite clear: you can do natural language queries with a simpler interface and get more relevant, personalized results.
As we build ever more machines to adapt to human needs, I believe deep search technology will become an integral part of our daily lives in countless ways. For now, you can get a taste of its capabilities with Weotta.
This is a review of the book Instant Pygame for Python Game Development How-to, by Ivan Idris. Packt asked me to review the book, and I agreed because like many developers, I’ve thought about writing my own game, and I’ve been curious about the capabilities of pygame. It’s a short book, ~120 pages, so this is a short review.
The book covers pygame basics like drawing images, rendering text, playing sounds, creating animations, and altering the mouse cursor. The author has helpfully posted some video demos of some of the exercises, which are linked from the book. I think this is a great way to show what’s possible, while also giving the reader a clear idea of what they are creating & what should happen. After the basic intro exercises, I think the best content was how to manipulate pixel arrays with numpy (the author has also written two books on numpy: NumPy Beginner’s Guide & NumPy Cookbook), how to create & use sprites, and how to make your own version of the game of life.
There were 3 chapters whose content puzzled me. When you’ve got such a short book on a specific topic, why bring up matplotlib, profiling, and debugging? These chapters seemed off-topic and just thrown in there randomly. The organization of the book could have been much better too, leading the reader from the basics all the way to a full-fledged game, with each chapter adding to the previous chapters. Instead, the chapters sometimes felt like unrelated low-level examples.
Overall, the book was a quick & easy read that rapidly introduces you to basic pygame functionality, and leads you on to more complex activities. My main takeaway is that pygame provides an easy to use & low-level framework for building simple games, and can be used to create more complex games (but probably not FPS or similar graphically intensive games). The ideal games would probably be puzzle based and/or dialogue heavy, and only require simple interactions from the user. So if you’re interested in building such a game in Python, you should definitely get a copy of Instant Pygame for Python Game Development How-to.
Avogadro Corp: The Singularity Is Closer Than It Appears, by William Hertling, is the first sci-fi book I’ve read with a semi-plausible AI origin story. That’s because the premise isn’t so simple as “increased computing power -> emergent AI”. It’s a much more well defined formula: ever increasing computing power + powerful language processing + never ending stream of training data + goal oriented behavior + deep integration into internet infrastructure -> AI. The AI in the story is called ELOPe, which stands for Email Language Optimization Program, and its function is essentially to improve the quality of emails. WARNING there will be spoilers below, but only enough to describe ELOPe and speculate about how it might be implemented.
What is ELOPe
The idea behind ELOPe is to provide writing suggestions as a feature of a popular web-based email service. These writing suggestions are designed to improve the outcome of your email, whatever that may be. To take an example from the book, if you’re requesting more compute resources for a project, then ELOPe’s job is to offer writing suggestions that are most likely to get your request approved. By taking into account your own past writings, who you’re sending the email to, and what you’re asking for, it can go as far as completely re-writing the email to achieve the optimal outcome.
Taking the existence of ELOPe as a given, the author writes an enjoyable story that is (mostly) technically accurate with plenty of details, without being boring. If you liked Daemon by Daniel Suarez, or you work with any kind of natural language / text-processing technology, you’ll probably enjoy the story. I won’t get into how an email writing suggestion program goes from that to full AI & takes over the world as a benevolent ghost in the wires – for that you need to read the book. What I do want to talk about is how this email optimization system could be implemented.
How ELOPe might work
Let’s start by defining the high-level requirements. ELOPe is an email optimizer, so we have the sender, the receiver, and the email being written as inputs. The output is a re-written email that preserves the “voice” of the sender while using language that will be much more likely to achieve the sender’s desired outcome, given who they’re sending the email to. That means we need the following:
- ability to analyze the email to determine what outcome is desired
- prior knowledge of how the receiver has responded to other emails with similar outcome topics, in order to know what language produced the best outcomes (and what language produced bad outcomes)
- ability to re-write (or generate) an email whose language is consistent with the sender, while also using language optimized to get the best response from the receiver
Determining the desired outcome for an email seems to me like a sophisticated combination of topic modeling and deep linguistic parsing. The goal would be to identify the core reason for the email: what is the sender asking for, and what would be an optimal response?
Being able to do this from a single email is probably impossible, but if you have access to thousands, or even millions of email chains, accurate topic modeling is much more do-able. Nearly every email someone sends will have some similarity to past emails sent by other people in similar situations. So you could create feature vectors for every email chain (using deep semantic parsing), then cluster the chains using feature similarity. Now you have topic clusters, and from that you could create training data for thousands of topic classifiers. Once you have the classifiers, you can run those in parallel to determine the most likely topic(s) of a single email.
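Here’s a rough sketch of that cluster-then-classify pipeline, using only the Python standard library. It makes some big simplifying assumptions: bag-of-words counts stand in for deep semantic parsing, a greedy single-pass grouping stands in for a real clustering algorithm, and nearest-centroid lookup stands in for trained topic classifiers. The function names, the similarity threshold, and the sample emails are all made up for illustration.

```python
import math
import re
from collections import Counter

def featurize(text):
    """Bag-of-words counts; a crude stand-in for deep semantic parsing."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def cluster(chains, threshold=0.25):
    """Greedy clustering: join the most similar centroid, or start a new cluster."""
    centroids, members = [], []
    for text in chains:
        vec = featurize(text)
        scores = [cosine(vec, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(vec)  # Counter.update adds the counts
            members[best].append(text)
        else:
            centroids.append(Counter(vec))
            members.append([text])
    return centroids, members

def classify(text, centroids):
    """Return the index of the topic cluster whose centroid is most similar."""
    vec = featurize(text)
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))

# Hypothetical email chains, reduced to one line each.
chains = [
    "Requesting more compute servers for the search project",
    "Can we get additional compute servers allocated to our project",
    "Lunch on Friday? The new taco place looks great",
    "Anyone up for lunch at the taco place on Friday",
]
centroids, members = cluster(chains)
topic = classify("Need extra compute servers for a new project", centroids)
print(len(members), "clusters; new email assigned to cluster", topic)
```

With millions of chains you’d want proper feature extraction and a real clustering algorithm, but the shape of the pipeline stays the same: vectorize, group, then label new emails against the groups.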
Obviously it would be very difficult to create accurate clusters, and even harder to do so at scale. Language is very fuzzy, humans are inconsistent, and a huge fraction of email is spam. But the core of the necessary technology exists, and can work very well in limited conditions. The ability to parse emails, extract textual features, and cluster & classify feature vectors is functionality that’s available in at least a few modern programming libraries today (e.g. NLTK & scikit-learn for Python). These are areas of software technology that are getting a lot of attention right now, and all signs indicate that attention will only increase over time, so it’s quite likely that the difficulty level will decrease significantly over the next 10 years. Moving on, let’s assume we can do accurate email topic analysis. The next hurdle is outcome analysis.
Once you can determine topics, now you need to learn about outcomes. Two email chains about acquiring compute resources have the same topic, but one chain ends with someone successfully getting access to more compute resources, while the other ends in failure. How do you differentiate between these? This sounds like next-generation sentiment analysis. You need to go deeper than simple failure vs. success, positive vs. negative, since you want to know which email chains within a given topic produced the best responses, and what language they have in common. In other words, you need a language model that weights successful outcome language much higher than failure outcome language. The only way I can think of doing this with a decent level of accuracy is massive amounts of human verified training data. Technically do-able, but very expensive in terms of time and effort.
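To make “weights successful outcome language higher than failure outcome language” a bit more concrete, here’s a toy sketch. It assumes you already have human-labeled emails for a single topic, each marked as having led to success or failure, and it scores words by smoothed log-odds between the two groups. That’s just one simple way to do the weighting, nowhere near the next-generation sentiment analysis ELOPe would really need, and the labeled examples are invented.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def train_outcome_model(labeled_emails):
    """Count words separately for emails that led to success vs. failure."""
    counts = {True: Counter(), False: Counter()}
    for text, succeeded in labeled_emails:
        counts[succeeded].update(tokens(text))
    return counts

def word_weight(word, counts):
    """Add-one smoothed log-odds: positive if the word is more typical of success."""
    vocab = len(set(counts[True]) | set(counts[False]))
    p_success = (counts[True][word] + 1) / (sum(counts[True].values()) + vocab)
    p_failure = (counts[False][word] + 1) / (sum(counts[False].values()) + vocab)
    return math.log(p_success / p_failure)

def score(text, counts):
    """Sum of word weights; higher means the language resembles successful emails."""
    return sum(word_weight(w, counts) for w in tokens(text))

# Hypothetical human-verified training data for the "compute resources" topic.
labeled = [
    ("This will save the team money and cut latency", True),
    ("The extra servers will save money next quarter", True),
    ("I need servers now or the project fails", False),
    ("We need this now, no exceptions", False),
]
model = train_outcome_model(labeled)
persuasive = score("This will save money", model)
demanding = score("I need this now", model)
print(persuasive > demanding)
```

The expensive part isn’t the math, it’s the labels: someone has to decide which chains count as a success.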
What really pushes the bounds of plausibility is that the language model can’t be universal. Everyone has their own likes, dislikes, biases, and preferences. So you need language models that are specific to individuals, or clusters of individuals that respond similarly on the same topic. Since these clusters are topic specific, every individual would belong to many (topic, cluster) pairs. Given N topics and an average of M clusters within each topic, that’s N*M language models that need to be created. And one of the major plot points of the book falls out naturally: ELOPe needs access to huge amounts of high end compute resources.
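A quick sketch of how that count adds up. The topic names and cluster counts below are invented; the point is just that every (topic, cluster) pair is a separate model to train and store.

```python
def model_keys(clusters_per_topic):
    """Enumerate every (topic, cluster) pair that would need its own language model."""
    return [
        (topic, cluster_id)
        for topic, n_clusters in clusters_per_topic.items()
        for cluster_id in range(n_clusters)
    ]

# Hypothetical: N = 3 topics with an average of M = 3 receiver clusters each.
clusters_per_topic = {"compute requests": 4, "budget approvals": 2, "hiring": 3}
keys = model_keys(clusters_per_topic)
print(len(keys))  # 4 + 2 + 3 = 9 models, i.e. N * M with N = 3 and M = 3

# At a made-up but more realistic scale, the count grows quickly.
n_topics, avg_clusters = 10_000, 50
print(n_topics * avg_clusters)  # 500000 language models
```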
This is definitely the least do-able aspect of ELOPe, and I’m ignoring all the implicit conceptual knowledge that would be required to know what an optimal outcome is, but let’s move on 🙂
Assuming that we can do topic & outcome analysis, the final step is using language models to generate more persuasive emails. This is perhaps the simplest part of ELOPe, assuming everything else works well. That’s because natural language generation is the kind of technology that works much better with more data, and it already exists in various forms. Google Translate is a kind of language generator, chatbots have been around for decades, and spammers use software to spin new articles & text based on existing writings. The differences in this case are that every individual would need their own language generator, and it would have to be parameterized with pluggable language models based on the topic, desired outcome, and receiver. But assuming we have good topic & receiver specific outcome analysis, plus hundreds or thousands of emails from the sender to learn from, then generating new emails, or just new phrases within an email, seems almost trivial compared to what I’ve outlined above.
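As a very rough illustration of a sender-specific generator with a pluggable receiver model, here’s a sketch that learns word bigrams from the sender’s past emails (to keep their “voice”), generates a handful of candidate sentences, and keeps the one that scores highest under a receiver-specific word weighting. A bigram Markov chain is about the crudest generator there is, and the sample emails and `receiver_weights` values are entirely hypothetical.

```python
import random
from collections import defaultdict

def train_bigrams(emails):
    """Map each word to the words that followed it in the sender's past emails."""
    followers = defaultdict(list)
    for text in emails:
        words = ["<s>"] + text.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            followers[prev].append(nxt)
    return followers

def generate(followers, rng, max_words=12):
    """Random walk over the bigram table, starting at the sentence marker."""
    word, out = "<s>", []
    while len(out) < max_words:
        word = rng.choice(followers[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

def best_candidate(followers, receiver_weights, n=20, seed=0):
    """Generate n candidates and keep the one the receiver model likes best."""
    rng = random.Random(seed)
    candidates = [generate(followers, rng) for _ in range(n)]
    return max(candidates, key=lambda s: sum(receiver_weights.get(w, 0.0) for w in s.split()))

# Hypothetical past emails from the sender.
sender_emails = [
    "we need more servers for the project",
    "more servers will save money for the team",
    "the project will cut latency for the team",
]
# Hypothetical receiver-specific weights, e.g. from outcome analysis.
receiver_weights = {"save": 1.0, "money": 1.0, "cut": 0.5, "latency": 0.5, "need": -1.0}

followers = train_bigrams(sender_emails)
draft = best_candidate(followers, receiver_weights)
print(draft)
```

Swapping in a different `receiver_weights` dictionary is the “pluggable” part: same sender voice, different optimization target.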
I’m still highly skeptical that strong AI will ever exist. We humans barely understand the mechanisms of our own intelligence, so to think that we can create comparable artificial intelligence smells of hubris. But it can be fun to think about, and the point of sci-fi is to tell stories about possible futures, so I have no doubt various forms of AI will play a strong role in sci-fi stories for years to come.