Ansible is a framework that helps with automating deployments, among other things. It has a feature called Ansible Vault that enables you to encrypt secrets in your ansible files. These vault encrypted secrets can only be decrypted if you provide the correct password. This means you can store things like database passwords and other sensitive settings in your repository, in a secure manner. For password access to your secrets, you are given 3 options:
- Ansible asks you to enter a password every time the secrets are needed
- You provide a file that has the password in it
- You leave everything decrypted until you’re ready to commit your changes, then you encrypt them using option 1 or 2 (and later decrypt when you want to make changes).
Entering a password all the time gets annoying real quick, but having a password file laying around does not seem all that secure. Plus it’s hard to share securely if you’re collaborating with others. Option 3 requires you to not make a mistake and accidentally commit decrypted secrets. What if there was a better way?
Lastpass is a great place to store your passwords, and generate secure ones, but it is annoying to lookup, copy, then paste the password back in ansible, and you need to add
—ask-vault-pass to every ansible command. However, Lastpass has a neat command line utility that you can use to get a password saved in Lastpass. With some minor scripting, you can integrate this with the ansible password file, so that you don’t have a plaintext password file laying around. I learned a lot about how to do this from How to use Ansible Vault with LastPass but decided that simple scripting worked better for me than install a ruby gem.
- Install lastpass-cli
- Create a bash script we’ll call
lpass_vault.sh. This must be located wherever you run
ansiblefrom, and be executable
- Create an entry in your Lastpass account with the Name
"ansible vault". This is what is referenced in the script above.
- Add the following to environment. You could add it to the bottom of
bin/activateif you’re using python virtualenv:
- Then run
lpass loginto ensure lastpass is setup
- Now you can run
ansiblewith vault encrypted secrets, and at worst you’ll be prompted for your lastpass master password.
#!/bin/bash PASSWORD=`lpass show --password "ansible vault"` echo $PASSWORD
export ANSIBLE_VAULT_PASSWORD_FILE=`command -v ./lpass_vault.sh`
This isn’t only more convenient for an individual, it can also be great for teams: you can check vault encrypted secrets into a shared repository, then share the password in Lastpass. Now nothing is exposed in the repository, and the only people that can access the secrets are those with the Lasspass password.
This is part 1 of a series of posts based on a presentation I gave at the Silicon Valley Cyber Security Meetup on behalf of my company, Insight Engines. Some of the ideas are speculative and I do not know if they are used in practice. If you have any experience applying these techniques on logs, please share in the comments below.
Natural language processing is the art of applying software algorithms to human language. However, the techniques operate on text, and there’s a lot of text that is not natural language. These techniques have been applied to code authorship classification, so why not apply them to log analysis?
To process any kind of text, you need to tokenize it. For natural language, this means splitting the text into sentences and words. But for logs, the tokens are different. Some tokens may be words, but other tokens could be symbols, timestamps, numbers, and more.
Another difference is punctuation. For many human languages, punctuation is mostly regular and predictable, although social media & short text writing has been challenging this assumption.
Logs come in a whole variety of formats. If you only have 1 type of log, then you may be able to tokenize it with a regular expression, like apache access logs for example. But when you have multiple types of logs, regular expressions can become overwhelming or even unusable. Many logs are written by humans, and there’s few rules or conventions when it comes to formatting or use of punctuation. A generic tokenizer could be a useful first pass at parsing arbitrary logs.
Tokenizing on whitespace is an obvious thing to try first. Here’s an example log and the result when run through my NLTK tokenization demo.
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644
My preferred NLTK tokenizer is the
WordPunctTokenizer, since it’s fast and the behavior is predictable: split on whitespace and punctuation. But this is a terrible choice for log tokenization.
Out of curiosity, I tried the
TreebankWordTokenizer on the log example. This tokenizer uses a statistical model trained on news text, and it does surprisingly well.
There’s no weird punctuation in the tokens, some is separated out and other punctuation is within tokens. It all looks pretty logical & useful. This was an unexpected result, and indicates that perhaps logs are often closer to natural language text than one might think.
After tokenization, you’ll want to do something with the log tokens. Maybe extract certain features, and then cluster or classify the log. These topics will be covered in a future post.
This article was original published at The New Stack under the title “How Deep Learning Supercharges Natural Language Processing“.
Voice search, intelligent assistants, and chatbots are becoming common features of modern technology. Users and customers are demanding a better, more human experience when interacting with computers. According to Tableau’s business trends report, IDC predicts that by 2019, intelligent assistants will become commonly accessible to enterprise workers, while Gartner predicts that by 2020, 50 percent of analytics queries will involve some form of natural language processing. Chatbots, intelligent assistants, natural language queries, and voice-enabled applications all involve various forms of natural language processing. To fully realize these new user experiences, we will need to build upon the latest methods, some of which I will cover here.
Let’s start with the basics: what is natural language processing? Natural language processing (NLP), is a collection of techniques for helping machines understand human language. For example, one of the essential techniques is tokenization: breaking up text into “tokens,” such as words. Given individual words in sequence, you can start to apply reason to them, and do things like sentiment analysis to determine if a piece of text is positive or negative. But even a task as simple as word identification can be quite tricky. Is the word what’s really one word or two (what + is, or what + was)? What about languages that use characters to represent multi-word concepts, like Kanjii?
Deep learning is an advanced type of machine learning using neural networks. It became popular due to the success of the techniques at solving problems such as image classification (labeling an image based on visual content) and speech recognition (converting sounds into text). Many people thought that deep learning techniques, when applied to natural language, would quickly achieve similar levels of performance. But because of all the idiosyncrasies of natural language, the field has not seen the same kind of breakthrough success with deep learning as other fields, like image processing. However, that appears to be changing. In the past few years, researchers have been applying newer deep learning methods to natural language processing, and I will share some of these recent successes.
Deep learning — through recent improvements to word embeddings, a focus on attention, mobile enablement, and its appearance in the home — is starting to capture natural language processing like it previously captured image processing. In this article, I will cover some recent deep learning-based NLP research successes that have made an impact on the field. Because of these improvements, we will see simpler and more natural user experiences, better software performance, and more powerful home and mobile applications.
Words are essential to every natural language processing system. Traditional NLP looks at words as strings, but deep learning techniques can only process numeric vectors. Word embeddings were invented as a way to transform words into vectors, enabling new kinds of mathematical feature analysis. But the vector representation of words is only as good as the text it was trained on.
The more common word embeddings are trained on Wikipedia, but Wikipedia text may not be representative of whatever text you’re processing. It’s generally written as well structured factual statements, which is nothing like text found on twitter, and both of these are different than restaurant reviews. So vectors trained on Wikipedia might be mathematically misleading if you use those vectors to analyze a different style of text. Text from the Common Crawl provides a more diverse set of text for training a word embedding model. The FastText library provides some great pre-trained English word vectors, along with tools for training your own. Training your own vectors is essential if you’re processing any language other than English.
Character level embeddings have also shown surprising results. This technique tries to learn vectors for individual characters, where words would be represented as a composition of the individual character vectors. In an effort to learn how to predict the next character in reviews, researchers discovered a sentiment neuron, which they could control to produce positive or negative review output. Using the sentiment neuron, they were able to beat the previous top accuracy score on the sentiment treebank. This is quite an impressive result for something discovered as a side effect of other research.
CNNs, RNNs, and Attention
Moving beyond vectors, deep learning requires training neural networks for various tasks. Vectors are the input and output, in between are layers of nodes connected together in a network. The nodes represent functions on the input data, with each function taking the input from the previous layer and producing output for the next layer. The structure of the network and how the nodes are connected very much determines the learning capabilities and performance.
In general, the deeper and more complicated a network, the longer it takes to train. When using large datasets, many networks can only be effectively trained using clusters of graphics processors (GPUs), because GPUs are optimized for the necessary floating point math. This puts some types of deep learning outside the reach of anyone not at large companies or institutions that can afford the expensive GPU clusters necessary for deep learning on big data.
Standard neural networks are feedforward networks, where each node in a layer is forward connected to every node in the next layer. A Recurrent Neural Network (RNN) is a network where the nodes in each layer also connect back to the previous layer. This creates a kind of memory that can be great for learning from sequences, such as words in a sentence.
A Convolutional Neural Networks (CNN) is a type feedforward network, but with more layers, and where the forward connections have been manipulated, or convoluted, to achieve certain properties. CNNs tend to be good at extracting position invariant features, meaning they do not care so much about sequence ordering. Because of this, CNNs can be trained in a more parallel manner, leading to faster training and optimization compared to RNNs.
While CNNs may win in raw speed, both types of neural networks tend to have comparable performance characteristics. In fact, RNNs have a slight edge when it comes to sequence oriented tasks like Part-of-Speech tagging, where you are trying to identify the part of speech (such as “noun” or “verb”) for each word in a sentence. For a detailed performance comparison of CNNs and RNNs applied to NLP see: Comparative Study of CNN and RNN for Natural Language Processing.
The most successful RNN models are the LSTM (Long short-term memory) and GRU (gated recurrent unit). These use attention gates, which act as a kind of short-term memory for the network. However, a newer research paper implies that attention may be all you need. By doing away with recurrence networks and convolution, and keeping only attention mechanisms, these models can be trained in parallel like a CNN, but even faster, and have comparable better performance than RNNs on some sequence learning tasks, such machine translation.
Reducing the training cost while maintaining comparable performance means that smaller companies and individuals can throw more data at their deep learning models, and potentially compete more effectively with larger companies and institutions.
One of the nice properties of neural network models is that the core algorithms and math are mostly the same. Once you have the infrastructure, model definition, and training algorithms all setup, these models are very reusable. “Software 2.0” is the idea that significant components of an application or system can be replaced by neural network models. Instead of writing code, developers:
- Collect training data
- Clean and label the data
- Train a model
- Integrate the model
While the most interesting parts are often steps three and four, most of the work happens in the data preparation steps one and two. Collecting and curating good, useful, clean data can be a significant amount of work, which is why methods like corpus bootstrapping are important for getting to good data faster. In the long run, it is often easier to make better data than it is to design better algorithms.
The past few years have demonstrated that neural networks can achieve much better performance than many alternatives, sometimes even in areas not traditionally touched by machine learning. One of the most recent and interesting advances is in learning data indexing structures. B-tree indexes are a commonly used data structure that provides an efficient way of finding data, assuming the tree is structured well. However, these newly learned indexes significantly outperformed the traditional B-tree indexes in both speed and memory usage. Such low-level data structure performance improvements could have far-reaching impacts if it can be integrated into standard development practices.
As research progresses, and the necessary infrastructure becomes cheaper and more available, deep learning models are likely to be used in more and more parts of the software stack, including mobile applications.
Mobile Machine Learning
Most deep learning requires clusters of expensive GPUs and lots of RAM. This level of compute power is only accessible to those who can afford it, usually in the cloud. But consumers are increasingly using mobile devices, and much of the world does not have reliable and affordable full-time wireless connectivity. Getting machine learning into mobile devices will enable more developers to create all sorts of new applications.
- Apple’s CoreML framework enables a number of NLP capabilities on iOS devices, such as language identification and named entity recognition.
- Baidu developed a CNN library for mobile deep learning that works on both iOS and Android.
- Qualcomm created a Neural Processing Engine for its mobile processors, enabling popular deep learning frameworks to operate on mobile devices.
Expect a lot more of this in the near future, as mobile devices continue to become more powerful and ubiquitous. Marc Andreessen famously said that “software is eating the world,” and now machine learning appears to be eating software. Not only is it in our pocket, it is also in our homes.
Deep Learning in the Home
Alexa and other voice assistants became mainstream in 2017, bringing NLP into millions of homes. Mobile users are already familiar with Siri and Google Assistant, but the popularity of Alexa and Google Home shows how many people have become comfortable having conversations with voice-activated dialogue systems. How much these systems rely on deep learning is somewhat unknown, but it is fairly certain that significant parts of their dialogue systems use deep learning models for core functions such as speech to text, part of speech tagging, natural language generation, and text to speech.
As research advances and these companies collect increasing amounts of data from their users, deep learning capabilities will improve as well, and implementations of “software 2.0” will become pervasive. While a few large companies are creating powerful data moats, there is always room on the edges for highly specialized, domain-specific applications of natural languages, such as cybersecurity, IT operations, and data analytics.
Deep learning has become a core component of modern natural language processing systems.
However, many traditional natural language processing techniques are still quite effective and useful, especially in areas that lack the huge amounts of training data necessary for deep learning. I will cover these traditional statistical techniques in an upcoming article.
My company, Insight Engines, recently announced Series A funding, to make big data easily queryable by everyone. We’re bringing natural language technology to the cybersecurity domain, so you can use plain english search queries to navigate large datasets for security investigations. If you’re also interested in the intersection between NLP and cybersecurity, we’re hiring.
We introduce a model for constructing vector representations of words by composing characters using bidirectional LSTMs
Below are more highlights from Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
our model requires only a single vector per character type and a fixed set of parameters for the compositional model. Despite the compactness of this model and, more importantly, the arbitrary nature of the form–function relationship in language, our “composed” word representations yield state-of-the-art results in language modeling and part-of-speech tagging. Benefits over traditional baselines are particularly pronounced in morphologically rich languages
it is manifestly clear that similarity in form is neither a necessary nor sufficient condition for similarity in function: small orthographic differences may correspond to large semantic or syntactic differences (butter vs. batter), and large orthographic differences may obscure nearly perfect functional correspondence (rich vs. affluent). Thus, any orthographically aware model must be able to capture non-compositional effects in addition to more regular effects due to, e.g., morphological processes. To model the complex form–function relationship, we turn to long short-term memories (LSTMs), which are designed to be able to capture complex non-linear and non-local dynamics in sequences
our character-based model is able to generate similar representations for words that are semantically and syntactically similar, even for words are orthographically distant (e.g., October and January)
The goal of our work is not to overcome existing benchmarks, but show that much of the feature engineering done in the benchmarks can be learnt automatically from the task specific data. More importantly, we wish to show large dimensionality word look tables can be compacted into a lookup table using characters and a compositional model allowing the model scale better with the size of the training data. This is a desirable property of the model as data becomes more abundant in many NLP tasks.
The authors have also released Java code for training neural networks.
word2vec is an algorithm for constructing vector representations of words, also known as word embeddings. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector represenations. Once you map words into vector space, you can then use vector math to find words that have similar semantics.
gensim provides a nice Python implementation of Word2Vec that works perfectly with NLTK corpora. The model takes a list of sentences, and each sentence is expected to be a list of words. This is exactly what is returned by the
sents() method of NLTK corpus readers. So let’s compare the semantics of a couple words in a few different NLTK corpora:
>>> from gensim.models import Word2Vec >>> from nltk.corpus import brown, movie_reviews, treebank >>> b = Word2Vec(brown.sents()) >>> mr = Word2Vec(movie_reviews.sents()) >>> t = Word2Vec(treebank.sents()) >>> b.most_similar('money', topn=5) [('pay', 0.6832243204116821), ('ready', 0.6152011156082153), ('try', 0.5845392942428589), ('care', 0.5826011896133423), ('move', 0.5752171277999878)] >>> mr.most_similar('money', topn=5) [('unstoppable', 0.6900672316551208), ('pain', 0.6289106607437134), ('obtain', 0.62665855884552), ('jail', 0.6140228509902954), ('patients', 0.6089504957199097)] >>> t.most_similar('money', topn=5) [('short-term', 0.9459682106971741), ('-LCB-', 0.9449775218963623), ('rights', 0.9442864656448364), ('interested', 0.9430986642837524), ('national', 0.9396077990531921)] >>> b.most_similar('great', topn=5) [('new', 0.6999611854553223), ('experience', 0.6718623042106628), ('social', 0.6702290177345276), ('group', 0.6684836149215698), ('life', 0.6667487025260925)] >>> mr.most_similar('great', topn=5) [('wonderful', 0.7548679113388062), ('good', 0.6538234949111938), ('strong', 0.6523671746253967), ('phenomenal', 0.6296845078468323), ('fine', 0.5932096242904663)] >>> t.most_similar('great', topn=5) [('won', 0.9452997446060181), ('set', 0.9445616006851196), ('target', 0.9342271089553833), ('received', 0.9333916306495667), ('long', 0.9224691390991211)] >>> b.most_similar('company', topn=5) [('industry', 0.6164317727088928), ('technical', 0.6059585809707642), ('orthodontist', 0.5982754826545715), ('foamed', 0.5929019451141357), ('trail', 0.5763031840324402)] >>> mr.most_similar('company', topn=5) [('colony', 0.6689200401306152), ('temple', 0.6546304225921631), ('arrival', 0.6497283577919006), ('army', 0.6339291334152222), ('planet', 0.6184555292129517)] >>> t.most_similar('company', topn=5) [('panel', 0.7949466705322266), ('Herald', 0.7674347162246704), ('Analysts', 0.7463694214820862), ('amendment', 0.7282689809799194), ('Treasury', 0.719698429107666)]
I hope it’s pretty clear from the above examples that the semantic similarity of words can vary greatly depending on the textual context. In this case, we’re comparing a wide selection of text from the brown corpus with movie reviews and financial news from the treebank corpus.
Note that if you call
most_similar() with a word that was not present in the sentences, you will get a
KeyError exception. This can be a common occurrence with smaller corpora like
NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.
The FreqDist api has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of
fd.inc(tag), you now need to do
fd[tag] += 1.
fd.samples() doesn’t exist anymore. Instead, you can use
fd.most_common(), which is a method of collections.Counter that returns a list that looks like
NLTK 3 has changed many wordnet Synset attributes to methods:
Same goes for the Lemma class. For example,
lemma.antonyms() is now a method.
batch_tag() method is now
tag_sents(). The brill tagger API has changed significantly:
brill.FastBrillTaggerTrainer is now
brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface with
brill.Word as implementations of the interface.
Simplified tags have been replaced with the universal tagset. So
tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the
tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in
treebank tagset is called
en-ptb (PennTreeBank) and the
brown tagset is called
en-brown. These files are simply 2 column, tab separated mappings of source tag to universal tag. The function
nltk.tag.mapping.map_tag(source, target, source tag) is used to perform the mapping.
Chunking & Parse Trees
The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk
chunk_label argument instead of
chunk_node, while in the Tree class, the
node attribute has been replaced with the
The SVM classifiers and scipy based
MaxentClassifier algorithms (like
CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.
NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because dictionary ordering is random in Python 3. This is a deliberate decision to improve security, but you can control it with the
PYTHONHASHSEED environment variable. Just run
$ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.
Python 3 has also removed the separate
unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like
b"raw string", so you may need convert these to normal strings before doing any further string processing.
Here’s a few other Python 3 changes I ran into:
dict.iteritems()doesn’t exist, use
dict.keys()does not produce a list (it returns a view). If you want a list, use
Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.