Avogadro Corp Book Review / AI Speculation

Avogadro Corp: The Singularity Is Closer Than It Appears, by William Hertling, is the first sci-fi book I’ve read with a semi-plausible AI origin story. That’s because the premise isn’t as simple as “increased computing power -> emergent AI”. It’s a much more well-defined formula: ever increasing computing power + powerful language processing + never ending stream of training data + goal oriented behavior + deep integration into internet infrastructure -> AI. The AI in the story is called ELOPe, which stands for Email Language Optimization Program, and its function is essentially to improve the quality of emails. WARNING: there will be spoilers below, but only enough to describe ELOPe and speculate about how it might be implemented.

What is ELOPe?

The idea behind ELOPe is to provide writing suggestions as a feature of a popular web-based email service. These writing suggestions are designed to improve the outcome of your email, whatever that may be. To take an example from the book, if you’re requesting more compute resources for a project, then ELOPe’s job is to offer writing suggestions that are most likely to get your request approved. By taking into account your own past writings, who you’re sending the email to, and what you’re asking for, it can go as far as completely re-writing the email to achieve the optimal outcome.

Using the existence of ELOPe as a given, the author writes an enjoyable story that is (mostly) technically accurate with plenty of details, without being boring. If you liked Daemon by Daniel Suarez, or you work with any kind of natural language / text-processing technology, you’ll probably enjoy the story. I won’t get into how an email writing suggestion program goes from that to full AI & takes over the world as a benevolent ghost in the wires – for that you need to read the book. What I do want to talk about is how this email optimization system could be implemented.

How ELOPe might work

Let’s start by defining the high-level requirements. ELOPe is an email optimizer, so we have the sender, the receiver, and the email being written as inputs. The output is a re-written email that preserves the “voice” of the sender while using language that will be much more likely to achieve the sender’s desired outcome, given who they’re sending the email to. That means we need the following:

  1. ability to analyze the email to determine what outcome is desired
  2. prior knowledge of how the receiver has responded to other emails with similar outcome topics, in order to know what language produced the best outcomes (and what language produced bad outcomes)
  3. ability to re-write (or generate) an email whose language is consistent with the sender, while also using language optimized to get the best response from the receiver
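The three requirements above can be sketched as a toy pipeline. Everything in this sketch is invented for illustration – the book gives no implementation details – and the keyword lookup and phrase table are crude stand-ins for real topic modeling and outcome-weighted language models:

```python
from dataclasses import dataclass

@dataclass
class Draft:
    sender: str
    receiver: str
    text: str

def infer_topic(draft):
    # Requirement 1 stand-in: a keyword lookup in place of real
    # topic modeling and deep linguistic parsing.
    return "resource-request" if "servers" in draft.text else "general"

def best_phrasing(topic, receiver):
    # Requirement 2 stand-in: a hand-built (topic, receiver) phrase
    # table in place of a learned outcome-weighted language model.
    table = {
        ("resource-request", "ops@example.com"):
            "This directly supports our top revenue priority.",
    }
    return table.get((topic, receiver), "")

def optimize(draft):
    # Requirement 3 stand-in: append the winning phrasing rather
    # than re-writing the email in the sender's voice.
    extra = best_phrasing(infer_topic(draft), draft.receiver)
    return draft.text + (" " + extra if extra else "")

draft = Draft("dev@example.com", "ops@example.com",
              "Could we get ten more servers for the project?")
print(optimize(draft))
```

Each stand-in corresponds to one of the sections below, where the real (and much harder) versions are discussed.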

Topic Analysis

Determining the desired outcome for an email seems to me like a sophisticated combination of topic modeling and deep linguistic parsing. The goal would be to identify the core reason for the email: what is the sender asking for, and what would be an optimal response?

Being able to do this from a single email is probably impossible, but if you have access to thousands, or even millions of email chains, accurate topic modeling is much more do-able. Nearly every email someone sends will have some similarity to past emails sent by other people in similar situations. So you could create feature vectors for every email chain (using deep semantic parsing), then cluster the chains using feature similarity. Now you have topic clusters, and from that you could create training data for thousands of topic classifiers. Once you have the classifiers, you can run those in parallel to determine the most likely topic(s) of a single email.
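Here is a minimal sketch of that cluster-then-classify idea using scikit-learn. The example emails are invented, and TF-IDF vectors stand in for the deep semantic features a real system would need:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

emails = [
    "please approve more compute servers for the project",
    "we need more compute servers to hit the deadline",
    "lunch on friday at the new thai place",
    "who wants lunch tomorrow at the thai place",
]

# Step 1: feature vectors for every chain (TF-IDF here, standing
# in for deep semantic parsing).
vec = TfidfVectorizer()
X = vec.fit_transform(emails)

# Step 2: cluster the chains by feature similarity to discover topics.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 3: the clusters become training data for topic classifiers,
# which can then label a single new email.
clf = LogisticRegression().fit(X, labels)
new = vec.transform(["requesting extra compute servers"])
print(clf.predict(new)[0])  # intended: same cluster as the first email
```

At real scale you’d have thousands of clusters and one classifier per topic running in parallel, rather than a single two-class model.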

Obviously it would be very difficult to create accurate clusters, and even harder to do so at scale. Language is very fuzzy, humans are inconsistent, and a huge fraction of email is spam. But the core of the necessary technology exists, and can work very well in limited conditions. Parsing emails, extracting textual features, and clustering & classifying feature vectors are all available in modern programming libraries today (e.g. Python with NLTK & scikit-learn). These are areas of software technology that are getting a lot of attention right now, and all signs indicate that attention will only increase over time, so it’s quite likely that the difficulty level will decrease significantly over the next 10 years. Moving on, let’s assume we can do accurate email topic analysis. The next hurdle is outcome analysis.

Outcome Analysis

Once you can determine topics, now you need to learn about outcomes. Two email chains about acquiring compute resources have the same topic, but one chain ends with someone successfully getting access to more compute resources, while the other ends in failure. How do you differentiate between these? This sounds like next-generation sentiment analysis. You need to go deeper than simple failure vs. success, positive vs. negative, since you want to know which email chains within a given topic produced the best responses, and what language they have in common. In other words, you need a language model that weights successful outcome language much higher than failure outcome language. The only way I can think of doing this with a decent level of accuracy is massive amounts of human verified training data. Technically do-able, but very expensive in terms of time and effort.
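A crude sketch of “weighting successful outcome language higher”: given human-labeled successful and failed chains within one topic, score candidate phrasing by the smoothed log-odds of its words appearing in successes versus failures. All the example text is invented:

```python
from collections import Counter
import math

# Toy human-verified chains within one topic ("resource request").
successes = ["this unblocks the revenue critical launch",
             "this directly supports the launch deadline"]
failures = ["i just want faster builds",
            "it would be nice to have faster machines"]

def counts(docs):
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

succ, fail = counts(successes), counts(failures)

def score(phrase):
    # Log-odds of each word appearing in successful vs failed chains,
    # with add-one smoothing; higher = more "success-flavored" language.
    return sum(math.log((succ[w] + 1) / (fail[w] + 1))
               for w in phrase.split())

print(score("supports the launch") > score("faster builds"))
```

A real system would need phrase- and discourse-level models, not word counts, but the shape of the problem – learn which language correlates with good outcomes, per topic – is the same.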

What really pushes the bounds of plausibility is that the language model can’t be universal. Everyone has their own likes, dislikes, biases, and preferences. So you need language models that are specific to individuals, or clusters of individuals that respond similarly on the same topic. Since these clusters are topic specific, every individual would belong to many (topic, cluster) pairs. Given N topics and an average of M clusters within each topic, that’s N*M language models that need to be created. And one of the major plot points of the book falls out naturally: ELOPe needs access to huge amounts of high end compute resources.
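A back-of-envelope calculation makes the scale concrete. The numbers below are entirely invented (the book never quantifies any of this); the point is only that N*M per-(topic, cluster) models multiply quickly:

```python
# Invented, illustrative numbers only.
n_topics = 10_000            # N: distinct email topics
clusters_per_topic = 50      # M: response-style clusters per topic
model_size_gb = 0.5          # one modest per-cluster language model

models = n_topics * clusters_per_topic
total_tb = models * model_size_gb / 1024
print(models, round(total_tb, 1))  # 500000 models, ~244.1 TB
```

Half a million models and hundreds of terabytes, before counting the compute to train and serve them – which is exactly why ELOPe needs those high end compute resources.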

This is definitely the least do-able aspect of ELOPe, and I’m ignoring all the implicit conceptual knowledge that would be required to know what an optimal outcome is, but let’s move on :)

Language Generation

Assuming that we can do topic & outcome analysis, the final step is using language models to generate more persuasive emails. This is perhaps the simplest part of ELOPe, assuming everything else works well. That’s because natural language generation is the kind of technology that works much better with more data, and it already exists in various forms. Google Translate is a kind of language generator, chatbots have been around for decades, and spammers use software to spin new articles & text based on existing writings. The differences in this case are that every individual would need their own language generator, and it would have to be parameterized with pluggable language models based on the topic, desired outcome, and receiver. But assuming we have good topic & receiver specific outcome analysis, plus hundreds or thousands of emails from the sender to learn from, then generating new emails, or just new phrases within an email, seems almost trivial compared to what I’ve outlined above.
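To make “preserving the sender’s voice” concrete, here’s a toy bigram generator trained on two invented emails from one sender. A real generator would also be parameterized by the topic, receiver, and outcome models discussed above:

```python
import random
from collections import defaultdict

# A sender's past emails stand in for the training corpus.
past_emails = [
    "thanks for the update i will review the numbers today",
    "i will review the proposal and send feedback today",
]

# Build a bigram model of the sender's word-to-word habits.
model = defaultdict(list)
for email in past_emails:
    words = email.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)

def generate(start, length=6, seed=0):
    # Walk the bigram chain, so output reuses the sender's own phrasing.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        choices = model.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)

print(generate("i"))
```

Even this trivial chain only ever emits word sequences the sender has actually used, which is the crude version of keeping their “voice”.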

Final Words

I’m still highly skeptical that strong AI will ever exist. We humans barely understand the mechanisms of our own intelligence, so to think that we can create comparable artificial intelligence smells of hubris. But it can be fun to think about, and the point of sci-fi is to tell stories about possible futures, so I have no doubt various forms of AI will play a strong role in sci-fi stories for years to come.

  • Mika Schiller

Hey Jacob, thanks for all your valuable work in the NLTK community. It’s become an indispensable tool for me. I see you mention NLG in this post. I happen to be doing some work in NLG and I’m using NLTK, however, I’m trying to do sentence realization, but can’t with NLTK because there’s no FUF module in my NLTK folder. I tried importing nltk.fuf and got an error saying there’s no fuf module. I was under the impression that fuf came standard with NLTK. I have version 2.0.4. I’ve been asking around about this and nobody seems to have an answer. Actually there’s frighteningly little on NLG across the Web in general (at least in relation to NLU). Do you know how I can get access to the fuf package? Thnx!

  • http://streamhacker.com/ Jacob Perkins

    Hi Mika,

    FUF is part of NLTK Contrib, which you can find here: https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/fuf

    As for NLG, I think most of the research is around chat bots.

  • Mika Schiller

    Please excuse my ignorance on this, but where exactly do I find this package? I see Steven Bird mentions that they moved it outside of NLTK. Where can I get it?

  • http://streamhacker.com/ Jacob Perkins

    You have to download the source from github, it’s not a package you can install with pip or easy_install.

  • Mika Schiller

    Hey Jacob, I have question about what seems to be an incomplete and incorrectly-functioning module in the fuf package. I’ve asked around on stackoverflow and nltk_users group two days ago and nobody has answered, which I’m thinking might be because barely anybody works with nltk fuf, so I’m hoping you might be able to help me out.

    I’m trying to output the simple sentence “The man eats the meal.” using the NLTK FUF linearize() function here: https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/fuf/linearizer.py

    All I do is pass a unified feature structure input/grammar to the function linearize(). But when I do that I get the output “The man eat the meal” rather than “The man eats the meal”. I took a look at the morphology.py module here: https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/fuf/morphology.py

    and it seems to have all the appropriate functions to morphologize sentences. The default tense for verbs in FUF is the present tense, so “eat” should be automatically converted to “eats”, but it’s not.

I did notice, however, that linearizer.py hasn’t imported any of the functions in morphology.py and I’m thinking that perhaps that is why nothing is being morphologized. I tried adding [number = 'plural'] to the feature structure for the direct object in the input sentence so that it would output “The man eats the meals”, but that doesn’t work either. I have a feeling that the linearizer.py module is incomplete, but first I want to rule out that I’m doing something wrong. I’d like to fix the code myself, but I’m having a tough time figuring it out and don’t want to mess anything up.

    You can test out an example yourself by scrolling down to the test at the bottom of the linearizer.py module that I provided a link to in the second paragraph up above. Please let me know what result you get and please advise or point me to somewhere where I might be able to get an answer to what is going on here. Thanks! Mika

  • http://streamhacker.com/ Jacob Perkins

    Hi Mika,

I can’t really help you because I’m unfamiliar with FUF & the morphology module, and that’s probably the same reason you haven’t gotten any response elsewhere – they’re both unsupported code. Plus, I’m not really sure what you’re trying to do. It’s probably worthwhile for you to outline exactly what you want to accomplish, and conceptually how you would do it yourself. Then look around to see if there’s any code that can help you do what you want, without much modification. If there isn’t, that’s when you should strongly consider implementing the required functionality yourself. Here’s another library that might be helpful: http://www.clips.ua.ac.be/pages/pattern-en

  • Mika Schiller

    Ok, thanks. I’ll buckle down and try to figure it out. What I’m actually working on is an NLG system that converts Facebook analytics data into text.