3D Mutant Mfers

3D Mutant Mfers Contract Review

3DMutantMfers is a new NFT collection by @scott_visuals. He is also the artist and creator of 3DMfers and Cosmo Creatures. For this mint, the team wanted to reward holders of 3DMfers and Cosmo Creatures with free mints, while still keeping public mints very affordable at 0.029 ETH. The idea with the free mint was that you would be able to get one 3DMutantMfer for each 3DMfer or Cosmo Creature you held. Unfortunately, there was a bug in the contract function that caused the free claim to fail for most people. Let’s dig into the contract code.

Public Mint

Let’s start with the public mint function, which works great and is very simple.

  function mint( uint numberOfTokens ) external payable mintCompliance( numberOfTokens ) {
    require( msg.value >= price * numberOfTokens, "Ether value sent is not correct" );

    _mintLoop( msg.sender, numberOfTokens );    
  }

The mint function takes in the number of tokens you want to mint, ensures that you’ve sent at least the required amount of ETH (0.029 * numberOfTokens), then calls _mintLoop with msg.sender (your wallet address).
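As a rough illustration of the payment check, here’s a minimal Python sketch that mirrors the require in mint using integer wei. The names PRICE_WEI and payment_ok are made up for this example, and the price is assumed to be the 0.029 ETH mentioned above.

```python
# Hypothetical model of the payment check in mint(); not contract code.
WEI_PER_ETH = 10**18
PRICE_WEI = 29 * 10**15  # 0.029 ETH, assumed from the price quoted in this review


def payment_ok(value_wei: int, number_of_tokens: int) -> bool:
    """Mirror of: require(msg.value >= price * numberOfTokens)."""
    return value_wei >= PRICE_WEI * number_of_tokens


# Amount needed to mint 3 tokens, shown in ETH
required = PRICE_WEI * 3
print(required / WEI_PER_ETH)
```

Because the comparison is >=, sending more than the required amount also passes, and the mint function as shown does not refund the difference.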

Mint Compliance

Before you can mint, there are a number of conditions that are checked by mintCompliance.

  modifier mintCompliance( uint numberOfTokens ) {
    require( isActive,                            "Sale must be active to mint 3DMutantMfers" );
    require( numberOfTokens <= maxOrder,          "Can only mint 20 tokens at a time" );
    require( numberOfTokens > 0,                  "Token Mint Count must be > 0" );

    uint256 supply = _owners.length;
    require( supply + numberOfTokens <= MAX_SUPPLY, "Purchase would exceed max supply of 3DMutantMfers" );
    
    uint256 mintedCount = addressMintedBalance[msg.sender];
    require(mintedCount + numberOfTokens <= MAX_PER_WALLET, "Max NFT per address exceeded");

    _;
  }

The mintCompliance modifier function is also used by the freeClaimMint function covered below. It checks for the following conditions:

  • Minting is active
  • You are minting at least one token but not more than 20
  • You can’t exceed the total supply of 4444 tokens
  • You can’t mint more than 1000 tokens into 1 wallet

All of these numbers can be changed by the contract owner, with the restriction that total supply cannot be set lower than the number of tokens already minted. The functions that allow these modifications are below:

  function setActive(bool isActive_) external onlyOwner {
    if( isActive != isActive_ )
      isActive = isActive_;
  }

  function setMaxOrder(uint maxOrder_) external onlyOwner {
    if( maxOrder != maxOrder_ )
      maxOrder = maxOrder_;
  }

  function setPrice(uint price_ ) external onlyOwner {
    if( price != price_ )
      price = price_;
  }

  function setMaxSupply(uint maxSupply_ ) external onlyOwner {
    if( MAX_SUPPLY != maxSupply_ ){
      require(maxSupply_ >= _owners.length, "Specified supply is lower than current balance" );
      MAX_SUPPLY = maxSupply_;
    }
  }

  // Update Max Tokens A Wallet can mint
  function setMaxPerWallet(uint maxPerWallet_ ) external onlyOwner {
    if( MAX_PER_WALLET != maxPerWallet_ ){
      MAX_PER_WALLET = maxPerWallet_;
    }
  }
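
To see how the checks and the owner-adjustable limits fit together, here’s a small Python model of mintCompliance. It’s only a sketch: the function name and keyword arguments are made up, and the defaults are the limits described above, which the owner can change.

```python
# Hypothetical model of the mintCompliance checks; defaults come from the
# limits described in this review and can be changed by the contract owner.
def mint_compliance(number_of_tokens, *, is_active, supply, minted_by_wallet,
                    max_order=20, max_supply=4444, max_per_wallet=1000):
    """Return None if the mint is allowed, else the contract's revert message."""
    if not is_active:
        return "Sale must be active to mint 3DMutantMfers"
    if number_of_tokens > max_order:
        return "Can only mint 20 tokens at a time"
    if number_of_tokens <= 0:
        return "Token Mint Count must be > 0"
    if supply + number_of_tokens > max_supply:
        return "Purchase would exceed max supply of 3DMutantMfers"
    if minted_by_wallet + number_of_tokens > max_per_wallet:
        return "Max NFT per address exceeded"
    return None
```

Note that the “20 tokens” message is hard-coded in the contract, so it would be out of date if the owner ever called setMaxOrder with a different value.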

Mint Loop

  function _mintLoop(address _receiver, uint256 numberOfTokens) internal {
    uint256 supply = _owners.length;

    for (uint256 i = 0; i < numberOfTokens; i++) {
      addressMintedBalance[_receiver]++;
      _safeMint( _receiver, supply++, "" );
    }
  }

This function simply iterates through the number of tokens to mint, increments the minted count for your wallet (_receiver), then mints each token. This contract uses ERC721B, which is an implementation of ERC721 optimized to reduce gas when minting multiple tokens. Because this is relatively new code, and not from OpenZeppelin (the de facto standard library), it’s possible there are some issues that haven’t been identified yet. However, the code in each function is relatively simple and very similar to standard ERC721 code.
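
The loop can be modeled in a few lines of Python to show which token IDs get handed out. This is a sketch with made-up names, and it assumes that _safeMint records the new owner by appending to the _owners array, which is not shown in the excerpt above.

```python
# Hypothetical model of _mintLoop: `owners` plays the role of the _owners array.
def mint_loop(owners, minted_balance, receiver, number_of_tokens):
    supply = len(owners)
    new_ids = []
    for _ in range(number_of_tokens):
        minted_balance[receiver] = minted_balance.get(receiver, 0) + 1
        new_ids.append(supply)   # supply++ passes the old value, then increments
        owners.append(receiver)  # assumed stand-in for _safeMint recording the owner
        supply += 1
    return new_ids
```

Under that assumption, the first token minted gets ID 0, and each batch continues from the current length of the owners list.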

Free Claim Mint

Let’s look at the freeClaimMint, which had an issue where most transaction attempts failed with “Warning! Error encountered during contract execution [Out of gas]”.

function freeClaimMint( uint numberOfTokens, bytes memory signature, string[] memory contract1TokenIds, string[] memory contract2TokenIds ) external mintCompliance( numberOfTokens ) {
    
    require(verifySender(signature, contract1TokenIds, contract2TokenIds), "Invalid Access");

    // Check to make sure there are token ids
    require(contract1TokenIds.length > 0 || contract2TokenIds.length > 0, "Empty Token IDs");

    uint totalTokenIds = contract1TokenIds.length + contract2TokenIds.length;
    require(totalTokenIds == numberOfTokens, "Token IDs and Mint Count mismatch");

    // Lets make sure we are not claiming for already claimed tokens of contract 1
    bool isValidTokenIds = true;
    for (uint i = 0; isValidTokenIds && i < contract1TokenIds.length; i++) {
      for (uint j = 0; isValidTokenIds && j < contract1ClaimedTokensCount; j++) {
        string memory contractClaimedToken = contract1ClaimedTokens[j];
        string memory tokenToClaim = contract1TokenIds[i];

        if (keccak256(bytes(tokenToClaim)) == keccak256(bytes(contractClaimedToken))) {
          isValidTokenIds = false;
        }
      } 
    } 
    require(isValidTokenIds, "Cosmo Creatures Token ID passed is already claimed");

    // Lets make sure we are not claiming for already claimed tokens of contract 2
    for (uint i = 0; isValidTokenIds && i < contract2TokenIds.length; i++) {
      for (uint j = 0; isValidTokenIds && j < contract2ClaimedTokensCount; j++) {
        string memory contractClaimedToken = contract2ClaimedTokens[j];
        string memory tokenToClaim = contract2TokenIds[i];

        if (keccak256(bytes(tokenToClaim)) == keccak256(bytes(contractClaimedToken))) {
          isValidTokenIds = false;
        }
      } 
    } 
    require(isValidTokenIds, "3D Mfrs Token ID passed is already claimed");


    for (uint i = 0; i < contract1TokenIds.length; i++) {
      contract1ClaimedTokensCount++;
      contract1ClaimedTokens.push(contract1TokenIds[i]);
    }
    
    for (uint i = 0; i < contract2TokenIds.length; i++) {
      contract2ClaimedTokensCount++;
      contract2ClaimedTokens.push(contract2TokenIds[i]);
    }

    _mintLoop( msg.sender, numberOfTokens );
  }

As you can see, this is quite a complicated function. You can compare it to the BackgroundMfers minting functions to see how much simpler their implementation is for each free claim function. In this case, the intention was to provide a single function that would mint 1 or more tokens depending on how many you owned from 3DMfers and Cosmo Creatures. But for most people that tried this, the transaction failed, losing them gas. The team is very supportive of their community, though, and they very quickly responded by:

  1. announcing the problem
  2. removing the free claim button from their website
  3. refunding gas fees to everyone that had a failed transaction
  4. creating a claim form and then airdropping 3DMutantMfers, at their own expense

Over 1.3 ETH is a somewhat expensive bug to pay for, but it builds a lot of good will, and makes it clear this is a quality team that’s not out for a quick cash grab. So what went wrong? I’m reminded of the Zen of Python: explicit is better than implicit. In the BackgroundMfers minting functions, they made everything very explicit:

  • each contract has its own minting function (mfers, dadmfers1, dadmfers2)
  • you have to select precisely which tokens from the external contract to mint with

For 3DMutantMfers, the team tried to make a very simple user experience, requiring a lot of implicit/hidden complexity that didn’t quite work. Let’s look at the code more.

Verify Signature

The very first line, after mintCompliance, is a call to verifySender, which checks that the function arguments have been signed by a known signerAddress.

  function verifySender(bytes memory signature, string[] memory contract1TokenIds, string[] memory contract2TokenIds) internal view returns (bool) {

    string memory contract1TokensString = "";
    string memory contract2TokensString = "";

    for (uint i = 0; i < contract1TokenIds.length; i++) {
      contract1TokensString = string(abi.encodePacked(contract1TokensString, contract1TokenIds[i], i < contract1TokenIds.length - 1 ? "," : ""));
    }
    
    for (uint i = 0; i < contract2TokenIds.length; i++) {
      contract2TokensString = string(abi.encodePacked(contract2TokensString, contract2TokenIds[i], i < contract2TokenIds.length - 1 ? "," : ""));
    }

    bytes32 hash = ECDSA.toEthSignedMessageHash(keccak256(abi.encodePacked(msg.sender, contract1TokensString, contract2TokensString)));
    return ECDSA.recover(hash, signature) == signerAddress;
  }

This means the website must be generating these signatures, hopefully on a backend server where the private key for the signerAddress is kept secure. There is a function to change the signer address, which could be used in case the key is compromised, or simply for routine key rotation.

  function setSignerAddress(address _newSignerAddress) external onlyOwner {
    signerAddress = _newSignerAddress;
  }
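
To make the signed message more concrete, here’s a Python sketch of the bytes that verifySender builds with abi.encodePacked before hashing. The function name and the wallet address are made up, and the keccak256 hash and ECDSA recovery steps are deliberately left out.

```python
# Hypothetical re-creation of the abi.encodePacked preimage in verifySender.
# Hashing (keccak256) and ECDSA recovery are deliberately left out.
def packed_preimage(sender_hex, contract1_ids, contract2_ids):
    hex_part = sender_hex[2:] if sender_hex.startswith("0x") else sender_hex
    sender = bytes.fromhex(hex_part)
    if len(sender) != 20:
        raise ValueError("an address is 20 bytes")
    s1 = ",".join(contract1_ids)  # same result as the comma-joining loop
    s2 = ",".join(contract2_ids)
    # packed encoding concatenates the address and both strings with no lengths
    return sender + s1.encode() + s2.encode()


wallet = "0x" + "ab" * 20  # made-up address
a = packed_preimage(wallet, ["1", "23"], [])
b = packed_preimage(wallet, ["1", "2"], ["3"])
print(a == b)
```

The two calls produce identical bytes because nothing separates the two token strings, so a single signature would cover both sets of arguments. Whether that matters in practice depends on how the backend issues signatures, which isn’t visible from the contract.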

Token Validation

Next the function tries to validate the following:

  • There is at least 1 external contract token to check (3DMfers or Cosmo Creatures)
  • The number of tokens to mint matches the number of token IDs passed in
  • The tokens to mint haven’t been claimed already

For the last point, this is done for each contract, and I’ve copied the first block below.

    // Lets make sure we are not claiming for already claimed tokens of contract 1
    bool isValidTokenIds = true;
    for (uint i = 0; isValidTokenIds && i < contract1TokenIds.length; i++) {
      for (uint j = 0; isValidTokenIds && j < contract1ClaimedTokensCount; j++) {
        string memory contractClaimedToken = contract1ClaimedTokens[j];
        string memory tokenToClaim = contract1TokenIds[i];

        if (keccak256(bytes(tokenToClaim)) == keccak256(bytes(contractClaimedToken))) {
          isValidTokenIds = false;
        }
      } 
    } 
    require(isValidTokenIds, "Cosmo Creatures Token ID passed is already claimed");

Minting

Finally, there are 2 loops, 1 for each contract, to increment a counter and record which contract tokens have been claimed. I only included 1 loop below for brevity. Once the loops are completed, there’s a call to _mintLoop to finally mint the tokens.

    for (uint i = 0; i < contract2TokenIds.length; i++) {
      contract2ClaimedTokensCount++;
      contract2ClaimedTokens.push(contract2TokenIds[i]);
    }

    _mintLoop( msg.sender, numberOfTokens );

This final part seems simple enough, but it turns out this is where the bug is. Let’s see what we can learn from slither.

Slither Analysis

Slither is a Python tool for static analysis of Solidity contracts. You can use it to get a quick summary of the contract code, and then look for any deeper issues.

$ slither 0x09f589e03381b767939ce118a209d474cc6d52fc --print human-summary

As we’ve seen above, there’s definitely complex code. One less important issue identified is that there’s no check for a zero address when setting the signerAddress. However, signerAddress is only used for the free claim, which shouldn’t be used anyway due to the bug. Almost all other issues are for the 11 dependency contracts, and none are significant. But when you look at the full analysis, slither does identify the likely cause of the out of gas error:

$ slither 0x09f589e03381b767939ce118a209d474cc6d52fc
MutantMfers.freeClaimMint(uint256,bytes,string[],string[]) (MutantMfers.sol#1401-1450) has costly operations inside a loop:
- contract1ClaimedTokensCount ++ (MutantMfers.sol#1440)
MutantMfers.freeClaimMint(uint256,bytes,string[],string[]) (MutantMfers.sol#1401-1450) has costly operations inside a loop:
- contract2ClaimedTokensCount ++ (MutantMfers.sol#1445)

Magmar, the creator of the BackgroundMfers contract, explained the issue in the Discord:

The issue here is out of gas. The problem is the free mint contract has two variable-gas for loops and is screwing with Metamask’s infura gas estimation fees. These functions were likely tested with low quantities but the gas usage will increase as minting increases, costing more and more gas, with a higher likelihood of failing. They will likely work if you gas up your transaction heavily, but you will spend 0.02+ for a free mint.

In other words, for users that had many 3DMfers and/or Cosmo Creatures, their wallet underestimated the gas fee required, causing the transaction to fail. Some people in the Discord said they successfully got free claims by increasing the gas fee, which is a more advanced user behavior. Perhaps if a Counter was used instead, everything would have worked great – lower gas fees and better wallet gas estimation. This is an excellent example of how Solidity and the EVM are different from most other programming languages and environments, where incrementing a counter is simple and very cheap.
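
One way to read Magmar’s point that gas usage “will increase as minting increases” is to count how many claimed-token reads the duplicate check in freeClaimMint performs. The sketch below is a Python model, not a gas calculation. The names scan_reads and mapping_reads are made up, and the mapping variant assumes a per-token lookup like the usedMferIds mapping in the BackgroundMfers contract.

```python
def scan_reads(num_to_claim: int, already_claimed: int) -> int:
    """Reads of the claimed-tokens array when no duplicate is found."""
    # the inner loop walks every previously claimed token for each new token
    return num_to_claim * already_claimed


def mapping_reads(num_to_claim: int) -> int:
    """One lookup per token, independent of how many were claimed before."""
    return num_to_claim


for already_claimed in (0, 100, 1000):
    print(already_claimed, scan_reads(5, already_claimed), mapping_reads(5))
```

Each read in the scan loads a stored string and hashes it, so a claim that looked cheap in early testing would get steadily more expensive as the claimed lists grow.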

Conclusion

So what could have been done differently? As stated above, if contract1ClaimedTokensCount and contract2ClaimedTokensCount were Counters, then maybe everything would have worked as intended. Also, it’s possible the counters were not even necessary. In all the instances that one of the counters is used, the corresponding array length (such as contract2ClaimedTokens.length) could have been used instead. Alternatively (or in addition to) they could have done the following:

  • separate claiming functions for 3DMfers and Cosmo Creatures
  • let users choose how many free tokens to mint, and which tokens to claim with
  • address any user experience changes on the website, such as an option to mint with all tokens from a contract (some people own quite a few Cosmo Creatures, so a “claim with all” button could be helpful)

The user experience would be a little different, but I doubt free minters would complain. And two more things that every contract creator should do:

  1. Run slither on your contract
  2. Test every public function in many ways, with many inputs

Given the situation, what went right?

  • the team quickly notified everyone and removed the free claim button from the website
  • they refunded all the failed transactions
  • they provided a simple form for claiming free tokens, then airdropped them, paying the gas fee themselves
  • the public mint function worked great at a very affordable price
  • art was revealed 24 hours later with ~25% of the collection minted, and many happy owners

Dadmfers Background Mfers Contract Review

dadmfers is the first mfers derivative project, and they just recently released BackgroundMfers, a series of dadmfers inspired banner images. This required a new contract, with some additional complexity because of the number of mint options. The contract provides 5 separate minting functions for getting your background mfers NFTs:

  1. Public mint for anyone
  2. Mfer mint, for anyone that has mfers, at a reduced price
  3. Whitelist mint, free for anyone on their list
  4. Dadmfers v1 mint for free
  5. Dadmfers v2 mint for free

Why so many minting options? It does add complexity, but there are good reasons, such as more favorable pricing for mfers holders, or free mints for dadmfers holders, while still allowing anyone to participate with the public option. There are two dadmfers options, v1 & v2, because the original v1 contract for dadmfers had very high gas fees. This turned a lot of people away, but the team quickly responded with a much more gas optimized contract. However, since the original contract had already been deployed, a new one was needed, because smart contracts cannot be upgraded once deployed (unless you use a proxy contract). So a v2 gas optimized contract was deployed, and everyone that minted on v1 was given free airdrops for v2. Ok, enough backstory, let’s look at the background mfers contract code, available on etherscan.

Public Mint

  function mintPublic(uint256 _mintAmount) public payable mintCompliance(_mintAmount) {
    require(msg.value >= publicCost * _mintAmount, "Not enough eth sent!");
    require(_mintAmount < maxMintAmountPlusOne, "trying to mint too many!");
    _mintLoop(msg.sender, _mintAmount);
  }

This mintPublic function is very simple at first glance, but has a lot of dependencies to go into. It takes a _mintAmount, checks it with mintCompliance, then does 2 more checks before minting. Let’s look at mintCompliance since it’s also used by the other mint functions.

mintCompliance

  modifier mintCompliance(uint256 _mintAmount) {
    require(_mintAmount > 0, "Invalid mint amount!");
    require(supply.current() + _mintAmount < maxSupplyPlusOne, "Max supply exceeded!");
    require (saleIsActive, "Public sale inactive");
    _;
  }

Here we can see 3 requirements:

  1. _mintAmount must be a positive integer
  2. You can’t mint more than is available
  3. Minting must be active

These have some implications:

  • There is a max supply
  • Minting can be de-activated or re-activated

The max supply is defined by uint256 public maxSupplyPlusOne = 10_001 at the top of the contract. However, there’s also the following function at the bottom of the contract, which allows the contract owner to lower the supply.

  function lowerSupply(uint256 newSupply) public onlyOwner {
      if (newSupply < maxSupplyPlusOne) {
          maxSupplyPlusOne = newSupply;
      }
  }

Supply lowering can be a good thing to do if the collection doesn’t mint out in a certain period of time. By lowering the supply, you can preserve the current rarities and NFT values, potentially making the existing NFTs more valuable. Dadmfers v2 also lowered the supply after some time passed and it hadn’t sold out.
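
The “plus one” naming and the one-way behavior of lowerSupply are easy to miss, so here’s a small Python model of both. The class and method names are made up for illustration.

```python
# Hypothetical model of the supply checks; names are made up for illustration.
class Supply:
    def __init__(self, max_supply_plus_one=10_001):
        self.max_supply_plus_one = max_supply_plus_one
        self.minted = 0

    def can_mint(self, amount):
        # strict "<" against max + 1 is the same as "<=" against max
        return amount > 0 and self.minted + amount < self.max_supply_plus_one

    def lower_supply(self, new_supply):
        # mirrors lowerSupply: the cap can only go down, never up
        if new_supply < self.max_supply_plus_one:
            self.max_supply_plus_one = new_supply
```

Since the stored value is the maximum plus one, passing 5_001 would cap the collection at 5,000 tokens. The function as shown doesn’t compare the new value with the number already minted, so setting it too low would simply block further mints.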

The contract owner can also disable or enable minting with the setSale function, which sets the saleIsActive variable.

  function setSale(bool newState) public onlyOwner {
    saleIsActive = newState;
  }

_mintLoop

  function _mintLoop(address _receiver, uint256 _mintAmount) internal {
    for (uint256 i = 0; i < _mintAmount; i++) {
      supply.increment();
      _safeMint(_receiver, supply.current());
    }
  }

This function is called with msg.sender (i.e. your wallet address) and the amount you want to mint. It does a simple for loop to increment the used supply, then mints a token. _safeMint is a standard function in OpenZeppelin’s ERC721, so we won’t go into that here.

Here’s a sample transaction for mintPublic, which transfers 1 token for 0.0169 Ether.

Mfers Mint

  function mintWithMfers(uint256 [] memory nftIds) public payable mintCompliance(nftIds.length) {
    require(msg.value >= mferCost * nftIds.length, "Not enough eth sent!");
    
    for (uint256 i = 0; i < nftIds.length; i++) {
      require(mfersContract.ownerOf(nftIds[i]) == msg.sender, "You must own all the mfers!");
      require(usedMferIds[nftIds[i]] == false, "One of the mfer IDs has already been used!");
      supply.increment();
      _safeMint(msg.sender, supply.current());
      usedMferIds[nftIds[i]] = true;
    }
  }

For this minting function, it expects a list of mfers tokens. These come from the selectors on the website. After checking mintCompliance and amount of eth sent, it loops through the tokens. If any token is not owned by you, or has already been used to mint, this function will fail. But if you are the owner of all the mfers tokens, it will do a mint for each one, and record the mfers token as used.

If you’re wondering why you have to check token ownership in the contract, when it’s already done on the website, that’s because smart contracts can be called directly, without going through the website. For example, you could go to the Write Contract section on etherscan, find the mintWithMfers function, and enter values directly there. This is something that all smart contract developers need to be aware of – you can’t assume people will only interact with the contract through your website.
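
The ownership and reuse checks can be sketched as a Python model. This is only an illustration: owner_of is a plain dict standing in for mfersContract.ownerOf, used_ids stands in for the usedMferIds mapping, and the function name and arguments are made up.

```python
# Hypothetical model of mintWithMfers; `owner_of` stands in for mfersContract.ownerOf.
def mint_with_mfers(sender, nft_ids, owner_of, used_ids, supply):
    """Return the new token IDs, or raise ValueError where the contract would revert."""
    used = set(used_ids)  # work on a copy so a failure changes nothing,
    new_ids = []          # the way a reverted transaction rolls back state
    for nft_id in nft_ids:
        if owner_of.get(nft_id) != sender:
            raise ValueError("You must own all the mfers!")
        if nft_id in used:
            raise ValueError("One of the mfer IDs has already been used!")
        supply += 1
        new_ids.append(supply)
        used.add(nft_id)
    used_ids.update(used)  # commit only after every check has passed
    return new_ids
```

Each check is a single lookup per token, so the cost of a claim depends on how many tokens you pass in, not on how many have been claimed before.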

Let’s dig into require(mfersContract.ownerOf(nftIds[i]) == msg.sender a bit more. mfersContract is defined at the top of the contract as nftInterface mfersContract = nftInterface(0x79FCDEF22feeD20eDDacbB2587640e45491b757f);.

So what’s nftInterface?

interface nftInterface {
    function ownerOf(uint256 tokenId) external view returns (address owner);
    function balanceOf(address owner) external view returns (uint256);
    function totalSupply() external view returns (uint256);
}

This is defined outside the contract, and provides a way to interact with another contract, by defining some functions that the contract should support. So mfersContract is an interface to the smart contract behind mfers (which you can see at this address on etherscan: 0x79FCDEF22feeD20eDDacbB2587640e45491b757f), and the background mfers contract is calling the mfers contract to check token ownership. These 3 functions (ownerOf, balanceOf, and totalSupply) are all standard ERC721 functions.

Here’s a sample transaction for mintWithMfers, which transfers 1 token for 0.0069 Ether.

Dadmfer v1 & v2 Mint

  function mintWithDadmfersV1(uint256 [] memory nftIds) public mintCompliance(nftIds.length) {
    for (uint256 i = 0; i < nftIds.length; i++) {
      require(dadmfersV1Contract.ownerOf(nftIds[i]) == msg.sender, "You must own all the dadmfer V1s!");
      require(usedV1Ids[nftIds[i]] == false, "One of the dadmfer IDs has already been used!");
      supply.increment();
      _safeMint(msg.sender, supply.current());
      usedV1Ids[nftIds[i]] = true;
    }
  }

  function mintWithDadmfersV2(uint256 [] memory nftIds) public mintCompliance(nftIds.length) {
    for (uint256 i = 0; i < nftIds.length; i++) {
      require(dadmfersV2Contract.ownerOf(nftIds[i]) == msg.sender, "You must own all the dadmfer V2s!");
      require(usedV2Ids[nftIds[i]] == false, "One of the dadmfer IDs has already been used!");
      supply.increment();
      _safeMint(msg.sender, supply.current());
      usedV2Ids[nftIds[i]] = true;
    }
  }

Both of these functions are very similar to each other and the mintWithMfers function. The main difference is dadmfersV1Contract vs dadmfersV2Contract. Just like with mfersContract, these are interfaces to the dadmfers contracts. The difference with mintWithMfers is that the initial requirement check for eth sent is gone, because these are free mints.

Here’s an example transaction for mintWithDadmfersV2, getting 3 tokens for 0 Ether.

Slither Analysis

Slither is a Python tool for static analysis of Solidity contracts. You can use it to get a quick summary of the contract code, and then look for any deeper issues.

$ slither 0xc0a5393aA132DE6a66369Fe6e490cAc768991Ea5 --print human-summary

This fits with what we’ve seen above: the BackgroundMfers contract is complex code for minting ERC721 NFTs. The 11 medium issues are very technical to describe, but reduce down to “not a problem”. The contract does not implement onERC721Received, so there are no real reentrancy concerns, and the other issues look more like syntax & style choices in foreignNftsForWallet, which is a read-only function that is not used for minting.

Conclusion

While somewhat complicated, the background mfers contract looks quite safe for minting, and appears to be gas optimized. It allows batch minting to receive multiple tokens, which usually saves on gas fees. And there’s nothing in the minting functions that seems unnecessary. If you have any questions about the project, you can ask in the dadmfers discord; everyone is very friendly. As of publish time, they have not revealed the images yet, but when they do, you’ll be able to see the background mfers on looksrare or opensea.

I didn’t cover the whitelistMint because most people won’t be using that one. There’s also a complicated looking function foreignNftsForWallet that returns all the token IDs owned by a wallet, for a given contract. You can test this for yourself in the Read Contract section on etherscan, if you know a wallet address that owns one or more of mfers, dadmfers v1 or v2. For example, the address 0x4873f1768e1833fa6fb720b183715c7f57ecf953 is the wallet of the contract creator, so if you enter that and input 1, you can see it owns token 908 for dadmfers v2. Use input 0 for dadmfers v1 or input 2 for mfers.

Python Vulnerability Checking Links

Using Lastpass with Ansible Vault

Ansible is a framework that helps with automating deployments, among other things. It has a feature called Ansible Vault that enables you to encrypt secrets in your ansible files. These vault encrypted secrets can only be decrypted if you provide the correct password. This means you can store things like database passwords and other sensitive settings in your repository, in a secure manner. For password access to your secrets, you are given 3 options:

  1. Ansible asks you to enter a password every time the secrets are needed
  2. You provide a file that has the password in it
  3. You leave everything decrypted until you’re ready to commit your changes, then you encrypt them using option 1 or 2 (and later decrypt when you want to make changes).

Entering a password all the time gets annoying real quick, but having a password file lying around does not seem all that secure. Plus it’s hard to share securely if you’re collaborating with others. Option 3 requires you to not make a mistake and accidentally commit decrypted secrets. What if there was a better way?

Lastpass is a great place to store your passwords, and generate secure ones, but it is annoying to look up, copy, then paste the password back in ansible, and you need to add --ask-vault-pass to every ansible command. However, Lastpass has a neat command line utility that you can use to get a password saved in Lastpass. With some minor scripting, you can integrate this with the ansible password file, so that you don’t have a plaintext password file lying around. I learned a lot about how to do this from How to use Ansible Vault with LastPass but decided that simple scripting worked better for me than installing a Ruby gem.

  1. Install lastpass-cli
  2. Create a bash script we’ll call lpass_vault.sh. This must be located wherever you run ansible from, and be executable:

    #!/bin/bash
    # Look up the vault password in the Lastpass entry named "ansible vault"
    PASSWORD=`lpass show --password "ansible vault"`
    # Quote the variable so spaces and special characters survive
    echo "$PASSWORD"

  3. Create an entry in your Lastpass account with the Name "ansible vault". This is what is referenced in the script above.
  4. Add the following to your environment. You could add it to the bottom of bin/activate if you’re using python virtualenv:

    export ANSIBLE_VAULT_PASSWORD_FILE=`command -v ./lpass_vault.sh`

  5. Then run lpass login to ensure lastpass is set up
  6. Now you can run ansible with vault encrypted secrets, and at worst you’ll be prompted for your lastpass master password.

This isn’t only more convenient for an individual, it can also be great for teams: you can check vault encrypted secrets into a shared repository, then share the password in Lastpass. Now nothing is exposed in the repository, and the only people that can access the secrets are those with the Lastpass password.
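
If you’d rather not use bash, the same lookup can be sketched in Python, since Ansible runs an executable vault password file and reads the password from its standard output. The function name vault_password and the injectable run argument are made up for this example. The lpass command line is the same one used in the bash script above.

```python
import subprocess


def vault_password(entry="ansible vault", run=subprocess.run):
    """Return the password stored in the Lastpass entry named `entry`."""
    result = run(
        ["lpass", "show", "--password", entry],
        capture_output=True, text=True, check=True,
    )
    # drop the trailing newline that the command prints
    return result.stdout.strip()
```

A script used as the password file would need a Python shebang line, executable permissions, and a final print(vault_password()) so the password ends up on standard output.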

NLP for Log Analysis – Tokenization

This is part 1 of a series of posts based on a presentation I gave at the Silicon Valley Cyber Security Meetup on behalf of my company, Insight Engines. Some of the ideas are speculative and I do not know if they are used in practice. If you have any experience applying these techniques on logs, please share in the comments below.

Natural language processing is the art of applying software algorithms to human language. However, the techniques operate on text, and there’s a lot of text that is not natural language. These techniques have been applied to code authorship classification, so why not apply them to log analysis?

Tokenization

To process any kind of text, you need to tokenize it. For natural language, this means splitting the text into sentences and words. But for logs, the tokens are different. Some tokens may be words, but other tokens could be symbols, timestamps, numbers, and more.

Another difference is punctuation. For many human languages, punctuation is mostly regular and predictable, although social media & short text writing has been challenging this assumption.

Logs come in a whole variety of formats. If you only have 1 type of log, then you may be able to tokenize it with a regular expression, like apache access logs for example. But when you have multiple types of logs, regular expressions can become overwhelming or even unusable. Many logs are written by humans, and there are few rules or conventions when it comes to formatting or use of punctuation. A generic tokenizer could be a useful first pass at parsing arbitrary logs.
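
For a single, well-defined format, the regular expression approach might look like the following sketch for an access log in Common Log Format. The pattern and the sample line are illustrative assumptions, not taken from any particular server configuration.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Illustrative sample line
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
match = CLF.match(line)
tokens = match.groupdict() if match else {}
print(tokens)
```

The pattern yields clean, named tokens for this one format, but it returns no match at all for a line in any other format, which is the limitation described above.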

Whitespace Tokenizer

Tokenizing on whitespace is an obvious thing to try first. Here’s an example log and the result when run through my NLTK tokenization demo.

Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644

[Image: NLTK Whitespace Tokenizer log example]

As you can see, it does get some tokens, but there’s punctuation in weird places that would have to be cleaned up later.
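
A whitespace split can be tried without NLTK by using Python’s built-in str.split(), which splits on runs of whitespace. This is a stand-in for the demo, not its actual code.

```python
log = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
       "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644")

tokens = log.split()  # split on runs of whitespace
print(len(tokens))
print(tokens[12:])    # the tail shows punctuation stuck to tokens
```

Tokens such as “(User:” and “10101)” keep their brackets, colons, and commas attached, which is the cleanup problem mentioned above.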

Wordpunct Tokenizer

My preferred NLTK tokenizer is the WordPunctTokenizer, since it’s fast and the behavior is predictable: split on whitespace and punctuation. But this is a terrible choice for log tokenization.

[Image: NLTK WordPunct Tokenizer log example]

Logs are filled with punctuation, so this produces far too many tokens.
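
The token explosion is easy to reproduce with a regular expression that approximates “split on whitespace and punctuation”: runs of word characters, or runs of other non-space characters. This is a sketch using only the standard library, not the NLTK demo itself.

```python
import re

log = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
       "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644")

# runs of word characters, or runs of punctuation
tokens = re.findall(r"\w+|[^\w\s]+", log)
print(len(tokens), len(log.split()))
print(tokens[2:7])  # a single timestamp becomes five tokens
```

The timestamp 19:18:40 alone turns into five tokens, and the whole line produces more than twice as many tokens as a plain whitespace split.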

Treebank Tokenizer

Out of curiosity, I tried the TreebankWordTokenizer on the log example. This tokenizer uses regular expressions that follow the Penn Treebank tokenization conventions, which were developed for news text, and it does surprisingly well.

[Image: NLTK Treebank Word Tokenizer log example]

There’s no weird punctuation in the tokens, some is separated out and other punctuation is within tokens. It all looks pretty logical & useful. This was an unexpected result, and indicates that perhaps logs are often closer to natural language text than one might think.

Next Steps

After tokenization, you’ll want to do something with the log tokens. Maybe extract certain features, and then cluster or classify the log. These topics will be covered in a future post.

Recent Advances in Deep Learning for Natural Language Processing

This article was originally published at The New Stack under the title “How Deep Learning Supercharges Natural Language Processing”.

Voice search, intelligent assistants, and chatbots are becoming common features of modern technology. Users and customers are demanding a better, more human experience when interacting with computers. According to Tableau’s business trends report, IDC predicts that by 2019, intelligent assistants will become commonly accessible to enterprise workers, while Gartner predicts that by 2020, 50 percent of analytics queries will involve some form of natural language processing. Chatbots, intelligent assistants, natural language queries, and voice-enabled applications all involve various forms of natural language processing. To fully realize these new user experiences, we will need to build upon the latest methods, some of which I will cover here.

Let’s start with the basics: what is natural language processing? Natural language processing (NLP) is a collection of techniques for helping machines understand human language. For example, one of the essential techniques is tokenization: breaking up text into “tokens,” such as words. Given individual words in sequence, you can start to apply reason to them, and do things like sentiment analysis to determine if a piece of text is positive or negative. But even a task as simple as word identification can be quite tricky. Is the word what’s really one word or two (what + is, or what + was)? What about languages that use characters to represent multi-word concepts, like Kanji?

Deep learning is an advanced type of machine learning using neural networks. It became popular due to the success of the techniques at solving problems such as image classification (labeling an image based on visual content) and speech recognition (converting sounds into text). Many people thought that deep learning techniques, when applied to natural language, would quickly achieve similar levels of performance. But because of all the idiosyncrasies of natural language, the field has not seen the same kind of breakthrough success with deep learning as other fields, like image processing. However, that appears to be changing. In the past few years, researchers have been applying newer deep learning methods to natural language processing, and I will share some of these recent successes.

Deep learning — through recent improvements to word embeddings, a focus on attention, mobile enablement, and its appearance in the home — is starting to capture natural language processing like it previously captured image processing. In this article, I will cover some recent deep learning-based NLP research successes that have made an impact on the field. Because of these improvements, we will see simpler and more natural user experiences, better software performance, and more powerful home and mobile applications.

Word Embeddings

Words are essential to every natural language processing system. Traditional NLP looks at words as strings, but deep learning techniques can only process numeric vectors. Word embeddings were invented as a way to transform words into vectors, enabling new kinds of mathematical feature analysis. But the vector representation of words is only as good as the text it was trained on.
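As a rough illustration of the mathematical analysis that vectors enable, the sketch below compares words by cosine similarity. The three-dimensional vectors are made up for this example; real embeddings are learned from text and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors (an assumption for illustration, not trained embeddings).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
food = cosine_similarity(toy_vectors["king"], toy_vectors["pizza"])
print(f"king~queen: {royal:.3f}  king~pizza: {food:.3f}")
```

With trained vectors, the same comparison is what lets a model treat related words as close together, which is also why vectors trained on the wrong kind of text can mislead.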

The more common word embeddings are trained on Wikipedia, but Wikipedia text may not be representative of whatever text you’re processing. It’s generally written as well-structured factual statements, which is nothing like text found on Twitter, and both of these are different from restaurant reviews. So vectors trained on Wikipedia might be mathematically misleading if you use those vectors to analyze a different style of text. Text from the Common Crawl provides a more diverse set of text for training a word embedding model. The FastText library provides some great pre-trained English word vectors, along with tools for training your own. Training your own vectors is essential if you’re processing any language other than English.

Character level embeddings have also shown surprising results. This technique tries to learn vectors for individual characters, where words would be represented as a composition of the individual character vectors. In an effort to learn how to predict the next character in reviews, researchers discovered a sentiment neuron, which they could control to produce positive or negative review output. Using the sentiment neuron, they were able to beat the previous top accuracy score on the sentiment treebank. This is quite an impressive result for something discovered as a side effect of other research.

CNNs, RNNs, and Attention

Moving beyond vectors, deep learning requires training neural networks for various tasks. Vectors are the input and output; in between are layers of nodes connected together in a network. The nodes represent functions on the input data, with each function taking the input from the previous layer and producing output for the next layer. The structure of the network and how the nodes are connected very much determines the learning capabilities and performance.

In general, the deeper and more complicated a network, the longer it takes to train. When using large datasets, many networks can only be effectively trained using clusters of graphics processors (GPUs), because GPUs are optimized for the necessary floating point math. This puts some types of deep learning outside the reach of anyone not at large companies or institutions that can afford the expensive GPU clusters necessary for deep learning on big data.

Standard neural networks are feedforward networks, where each node in a layer is forward connected to every node in the next layer. A Recurrent Neural Network (RNN) is a network where the nodes in each layer also connect back to the previous layer. This creates a kind of memory that can be great for learning from sequences, such as words in a sentence.
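The difference between the two can be sketched with a single node and scalar weights. This is a deliberately tiny illustration with hypothetical weight values, not a trainable network: the feedforward step sees only the current input, while the recurrent step also sees the state left behind by earlier inputs.

```python
import math

def feedforward_step(x, w_in):
    """Output depends only on the current input."""
    return math.tanh(w_in * x)

def recurrent_step(x, h_prev, w_in, w_rec):
    """Output depends on the current input and the previous hidden state."""
    return math.tanh(w_in * x + w_rec * h_prev)

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Feed a sequence through one recurrent node, returning every hidden state."""
    h = 0.0  # the "memory" starts empty
    states = []
    for x in inputs:
        h = recurrent_step(x, h, w_in, w_rec)
        states.append(h)
    return states

# The same two inputs in a different order leave a different final state.
print(run_rnn([1.0, 0.0]))
print(run_rnn([0.0, 1.0]))
```

Because the hidden state carries forward, the order of the inputs changes the result, which is the property that makes recurrent networks a natural fit for words in a sentence.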

A Convolutional Neural Network (CNN) is a type of feedforward network, but with more layers, and where the forward connections have been manipulated, or convoluted, to achieve certain properties. CNNs tend to be good at extracting position-invariant features, meaning they do not care so much about sequence ordering. Because of this, CNNs can be trained in a more parallel manner, leading to faster training and optimization compared to RNNs.

While CNNs may win in raw speed, both types of neural networks tend to have comparable performance characteristics. In fact, RNNs have a slight edge when it comes to sequence-oriented tasks like part-of-speech tagging, where you are trying to identify the part of speech (such as “noun” or “verb”) for each word in a sentence. For a detailed performance comparison of CNNs and RNNs applied to NLP, see: Comparative Study of CNN and RNN for Natural Language Processing.

The most successful RNN models are the LSTM (long short-term memory) and GRU (gated recurrent unit). These use gates that control which information is kept or forgotten, acting as a kind of short-term memory for the network, and they are often paired with attention mechanisms. However, a newer research paper implies that attention may be all you need. By doing away with recurrent networks and convolution, and keeping only attention mechanisms, these models can be trained in parallel like a CNN, but even faster, and have comparable or better performance than RNNs on some sequence learning tasks, such as machine translation.
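For a sense of what an attention mechanism computes, here is a minimal pure-Python sketch of scaled dot-product attention for a single query. The function names and toy inputs are assumptions for illustration; real models use learned projections, many attention heads, and matrix libraries rather than lists.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy example: the query matches the first key more closely than the second.
output, weights = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, output)
```

Each query is scored against all keys independently, with no dependence on a previous step, which is why this computation parallelizes so much better than a recurrent loop.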

Reducing the training cost while maintaining comparable performance means that smaller companies and individuals can throw more data at their deep learning models, and potentially compete more effectively with larger companies and institutions.

Software 2.0

One of the nice properties of neural network models is that the core algorithms and math are mostly the same. Once you have the infrastructure, model definition, and training algorithms all set up, these models are very reusable. “Software 2.0” is the idea that significant components of an application or system can be replaced by neural network models. Instead of writing code, developers:

  1. Collect training data
  2. Clean and label the data
  3. Train a model
  4. Integrate the model

While the most interesting parts are often steps three and four, most of the work happens in the data preparation steps one and two. Collecting and curating good, useful, clean data can be a significant amount of work, which is why methods like corpus bootstrapping are important for getting to good data faster. In the long run, it is often easier to make better data than it is to design better algorithms.
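The four steps can be sketched end to end in a few lines. Everything here is hypothetical: the log-like examples, the labels, and the word-count "model" standing in for a neural network. The point is only the shape of the workflow, where the behavior comes from the data rather than from hand-written rules.

```python
from collections import Counter

def collect():
    """Step 1: gather raw labeled examples (made-up data for illustration)."""
    return [
        ("  Login FAILED for user root ", "alert"),
        ("backup completed successfully", "ok"),
        ("Disk failed on node 3", "alert"),
        ("User login successful", "ok"),
        ("", "ok"),  # a junk record that cleaning should drop
    ]

def clean(raw):
    """Step 2: normalize text, tokenize, and drop empty records."""
    return [(text.strip().lower().split(), label)
            for text, label in raw if text.strip()]

def train(examples):
    """Step 3: 'train' by counting how often each word appears per label."""
    model = {}
    for tokens, label in examples:
        model.setdefault(label, Counter()).update(tokens)
    return model

def predict(model, text):
    """Step 4: the integration point an application would call."""
    tokens = text.lower().split()
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))

model = train(clean(collect()))
print(predict(model, "Disk FAILED"))
```

Changing what predict returns means changing the examples in collect or the rules in clean, not editing the prediction code, which is why so much of the effort lands in steps one and two.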

The past few years have demonstrated that neural networks can achieve much better performance than many alternatives, sometimes even in areas not traditionally touched by machine learning. One of the most recent and interesting advances is in learning data indexing structures. B-tree indexes are a commonly used data structure that provides an efficient way of finding data, assuming the tree is structured well. In recent research, however, learned indexes significantly outperformed traditional B-tree indexes in both speed and memory usage. Such low-level data structure performance improvements could have far-reaching impacts if they can be integrated into standard development practices.
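The core idea behind a learned index can be shown with a toy example: fit a model that predicts where a key sits in a sorted array, record the model's worst-case error, and then search only within that error window. This sketch uses a single least-squares line and is an assumption-laden simplification, not the method from the research itself, which uses hierarchies of more capable models.

```python
import bisect

def fit_linear_index(keys):
    """Fit position ~ slope * key + intercept over a sorted list of keys."""
    n = len(keys)
    mean_key = sum(keys) / n
    mean_pos = (n - 1) / 2
    var = sum((k - mean_key) ** 2 for k in keys)
    cov = sum((k - mean_key) * (p - mean_pos) for p, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_pos - slope * mean_key
    # Worst-case prediction error, rounded up, bounds the search window.
    worst = max(abs(p - (slope * k + intercept)) for p, k in enumerate(keys))
    return slope, intercept, int(worst) + 1

def lookup(keys, model, key):
    """Predict a position, then binary-search only the window around it."""
    slope, intercept, max_err = model
    guess = int(slope * key + intercept)
    lo = min(max(0, guess - max_err), len(keys))
    hi = max(lo, min(len(keys), guess + max_err + 1))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None

keys = list(range(0, 1000, 7))  # sorted, evenly spaced toy keys
model = fit_linear_index(keys)
print(lookup(keys, model, 70), lookup(keys, model, 71))
```

When the keys follow a predictable distribution, the window stays small and the model itself is just a few numbers, which hints at where the speed and memory savings could come from.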

As research progresses, and the necessary infrastructure becomes cheaper and more available, deep learning models are likely to be used in more and more parts of the software stack, including mobile applications.

Mobile Machine Learning

Most deep learning requires clusters of expensive GPUs and lots of RAM. This level of compute power is only accessible to those who can afford it, usually in the cloud. But consumers are increasingly using mobile devices, and much of the world does not have reliable and affordable full-time wireless connectivity. Getting machine learning into mobile devices will enable more developers to create all sorts of new applications.

  • Apple’s CoreML framework enables a number of NLP capabilities on iOS devices, such as language identification and named entity recognition.
  • Baidu developed a CNN library for mobile deep learning that works on both iOS and Android.
  • Qualcomm created a Neural Processing Engine for its mobile processors, enabling popular deep learning frameworks to operate on mobile devices.

Expect a lot more of this in the near future, as mobile devices continue to become more powerful and ubiquitous. Marc Andreessen famously said that “software is eating the world,” and now machine learning appears to be eating software. Not only is it in our pocket, it is also in our homes.

Deep Learning in the Home

Alexa and other voice assistants became mainstream in 2017, bringing NLP into millions of homes. Mobile users are already familiar with Siri and Google Assistant, but the popularity of Alexa and Google Home shows how many people have become comfortable having conversations with voice-activated dialogue systems. How much these systems rely on deep learning is somewhat unknown, but it is fairly certain that significant parts of their dialogue systems use deep learning models for core functions such as speech-to-text, part-of-speech tagging, natural language generation, and text-to-speech.

As research advances and these companies collect increasing amounts of data from their users, deep learning capabilities will improve as well, and implementations of “software 2.0” will become pervasive. While a few large companies are creating powerful data moats, there is always room on the edges for highly specialized, domain-specific applications of natural languages, such as cybersecurity, IT operations, and data analytics.

Deep learning has become a core component of modern natural language processing systems.

However, many traditional natural language processing techniques are still quite effective and useful, especially in areas that lack the huge amounts of training data necessary for deep learning. I will cover these traditional statistical techniques in an upcoming article.

Insight Engines Series A

My company, Insight Engines, recently announced Series A funding, to make big data easily queryable by everyone. We’re bringing natural language technology to the cybersecurity domain, so you can use plain English search queries to navigate large datasets for security investigations. If you’re also interested in the intersection between NLP and cybersecurity, we’re hiring.
