Static Analysis of Erlang Code with Dialyzer

Dialyzer is a tool that does static analysis of your erlang code. It’s great for identifying type errors and unreachable code. Here’s how to use it from the command line.

dialyzer -r PATH/TO/APP -I PATH/TO/INCLUDE

Pretty simple! PATH/TO/APP should be an erlang application directory containing your ebin/ and/or src/ directories. PATH/TO/INCLUDE should be a path to a directory that contains any .hrl files that need to be included. The -I is optional if you have no include files. You can have as many -r and -I options as you need. If you add -q, then dialyzer runs more quietly, succeeding silently or reporting any errors found.

If you have a test/ directory with Common Test suites, then you’ll want to add “-I /usr/lib/erlang/lib/test_server*/include/” and “-I /usr/lib/erlang/lib/common_test*/include/”. I’ve actually set this up in my Makefile to run as make check. It’s been great for catching bad return types, misspellings, and wrong function parameters.

Programming as Design

Some say programming is engineering, others call it an art. A few might (mistakenly) think it’s a science. But both the art and engineering can be encapsulated under the umbrella of design. The best design is functional art, and a huge part of the artistic beauty of a product is a result of carefully engineered functionality. Products that are not carefully designed and engineered generally suck to use, and that applies to everything from cameras to software APIs.

Design Principles

There are 4 major principles of graphic design.

  1. alignment
  2. proximity
  3. contrast
  4. consistency / repetition

Alignment

The principle of alignment is that everything on a page should be connected to something else on the page. The goal of alignment in graphic design is to create visual associations, often using a grid based layout. In software, we can apply this principle to the connectedness of data, such as the object inheritance hierarchy and relational data structures. Ideally, all your objects fit nicely into a well-defined hierarchy and your data structures relate to each other in an intuitive fashion. Of course, the real world of programming is never as clean as you’d like, but keep this principle in mind whenever you create a new object, add a new dependency, or modify relational structures.

  • Does the object cleanly fit within the existing hierarchy? If not, do you need to change the new object, or re-align the hierarchy?
  • How does this data structure relate to that other data structure? What will happen if the relations change? Will you need to re-align the relational structures?

Keeping your objects and data structures neatly aligned will result in easier to understand relations and hierarchies.

Proximity

The principle of proximity is that related items should be grouped together. Grouping things together is simple way to show relatedness. In software development, that generally means putting related functions into the same module, and related modules into the same package. Helper functions should be located near the functions that call them. Basically, group blocks of code in a logically consistent manner. And if possible, put documentation and tests close to the code too (python’s doctest module provides a great way to do that). One of the major benefits of following this principle is that it reduces the amount of time you’ll spend searching thru and understanding your own code. If all your code is organized in logical groups, and related functions are near each other in the same file, then it’s much easier to find a particular block of code.

Contrast

The principle of contrast is that if two things are not the same, then make them very different. The goal with contrast is to make different things distinctive from each other. Naming is great place to apply this principle. Names are all you have to distinguish between objects, functions, variables, and modules, so make sure that your names are distinctive and descriptive. Good names can tell you exactly what something is, and even imply its properties and behavior. Use different naming styles for different types of things. Private variables could be prefixed with an underscore, like _private, versus public variables like public, and CamelCase class names, as in MyClassName. Having a distinctive naming style lets you know at a glance whether something is a class, variable, or function, making your code much more readable. Whatever naming style you choose, use it consistently.

Consistency

The principle of consistency, or repetition, is that you repeat design elements. Repetition helps patterns become internalized and instantly recognizable. For programming, that means keeping a consistent code style, with consistent naming practices. Also, try to use well known standard conventions and protocols, shared libraries, and design patterns. Your code should make sense, or at least be readable, to those familiar with the language and domain. You’re not just writing code for yourself, you might be writing code for other programmers, maybe your manager, but most importantly, you’re writing code for your future self. It always sucks coming back to code you haven’t touched in months and not knowing what the hell is going on. Consistent style and software design can save you from that headache.

Programming as Design

If all this seems like obvious common sense to you, then great! But common sense isn’t always so common. The point of this article is make you aware that everyday software programming is filled with design choices. Naming a variable is a design choice. Creating a new module is a design choice. The layout of your working directory is a design choice. Be conscious of these choices and use the above principles and to inform your decisions. The choices you make communicate how the software works and how the code fits together. Make every choice deliberate and justifiable. Use refactoring to improve the design without affecting the functionality.

How to Train a NLTK Chunker

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.

Training a Chunker

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

import nltk.chunk

def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]

Chunker Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', nltk.tag.accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', nltk.tag.accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', nltk.tag.accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', nltk.tag.accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
Accuracy for Trained Chunker
Accuracy for Trained Chunker

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.

Conclusion

Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.

How to Fix Erlang Out of Memory Crashes When Using Mnesia

If you’re getting erlang out of memory crashes when using mnesia, chances are you’re doing it wrong, for various values of it. These out of memory crashes look something like this:

Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 999999999 bytes of memory (of type "heap")

Possible Causes

  1. You’re doing it wrong
  2. Someone else is doing it wrong
  3. You don’t have enough RAM

While it’s possible that the crash is due to not having enough RAM, or that some other program or process is using too much RAM for itself, chances are it’s your fault.

One of the reasons these crashes can catch you by surprise is that the erlang VM is using a lot more memory than you might think. Erlang is a functional language with single assignment and no shared memory. A major consequence is that when you change a variable or send a message to another process, a new copy of the variable is created. So an operation as simple as dict:update_counter(“foo”, 1, Dict1) consumes twice the memory of Dict1 since Dict1 is copied to create the return value. And anything you do with ets, dets, or mnesia will result in at least 2 copies of every term: 1 copy for your process, and 1 copy for each table. This is because mnesia uses ets and/or dets for storage, which both use 1 process per table. That means every table operation results in a message pass, sending your term to the table or vice-versa. So that’s why erlang may be running out of memory. Here’s how to fix it.

Use Dirty Operations

If you’re doing anything in a transaction, try to figure out how to do it dirty, or at least move as many operations as possible out of the transaction. Mnesia transactions are separate processes with their own temporary ets tables. That means there’s the original term(s) that must be passed in to the transaction or read from other tables, any updated copies that your code creates, copies of terms that are written to the temporary ets table, the final copies of terms that are written to actual table(s) at the end of the transaction, and copies of any terms that are returned from the transaction process. Here’s an example to illustrate:

example() ->
    T = function() ->
        Var1 = mnesia:read(example_table, "foo"),
        Var2 = update(Var2), % a user-defined function to update Var1
        ok = mnesia:write(Var2),
        Var2
    end,
    {atomic, Var2} = mnesia:transaction(T),
    Var2.

First off, we already have a copy of Var1 in example_table. It gets sent to the transaction process when you do mnesia:read, creating a second copy. Var1 is then updated, resulting in Var2, which I’ll assume has the same memory footprint of Var1. So now we have 3 copies. Var2 is then written to a temporary ets table because mnesia:write is called within a transaction, creating a fourth copy. The transaction ends, sending Var2 back to the original process, and also overwriting Var1 with Var2 in example_table. That’s 2 more copies, resulting in a total of 6 copies. Let’s compare that to a dirty operation.

example() ->
    Var1 = mnesia:dirty_read(example_table, "foo"),
    Var2 = update(Var1),
    ok = mnesia:dirty_write(Var2),
    Var2.

Doing it dirty results in only 4 copies: the original Var1 in example_table, the copy sent to your process, the updated Var2, and the copy sent to mnesia to be written. Dirty operations like this will generally have 2/3 the memory footprint of operations done in a transaction.

Reduce Record Size

Figuring out how to reduce your record size by using different data structures can create huge gains by drastically reducing the memory footprint of each operation, and possibly removing the need to use transaction. For example, let’s say you’re storing a large record in mnesia, and using transactions to update it. If the size of the record grows by 1 byte, then each transactional operation like the above will require an additional 5 bytes of memory, or dirty operations will require an additional 3 bytes. For multi-megabyte records, this adds up very quickly. The solution is to figure how to break that record up into many small records. Mnesia can use any term as a key, so for example, if you’re storing a record with a dict in mnesia such as {dict_record, “foo”, Dict}, you can split that up into many records like [{tuple_record, {“foo”, Key1}, Val1}]. Each of these small records can be accessed independently, which could eliminate the need to use transactions, or at least drastically reduce the memory footprint of each transaction.

Iterate in Batches

Instead of getting a whole bunch of records from mnesia all at once, using mnesia:dirty_match_object or mnesia:dirty_select, iterate over the records in batches. This is analagous to using lists operations on mnesia tables. The match_object methods may return a huge number of records, and all those records have to be sent from the table process to your process, doubling the amount of memory required. By iteratively doing operations on batches of records, you’re only accessing a portion at a time, reducing the amount of memory being used at once. Here’s some code examples that only access 1 record at a time. Note that if the table changes during iteration, the behavior is undefined. You could also use the select operations to process records in batches of NObjects at a time.

Dirty Mnesia Foldl

dirty_foldl(F, Acc0, Table) ->
    dirty_foldl(F, Acc0, Table, mnesia:dirty_first(Table)).

dirty_foldl(_, Acc, _, '$end_of_table') ->
    Acc;
dirty_foldl(F, Acc, Table, Key) ->
    Acc2 = lists:foldl(F, Acc, mnesia:dirty_read(Table, Key)),
    dirty_foldl(F, Acc2, Table, mnesia:dirty_next(Table, Key)).

Dirty Mnesia Foreach

dirty_foreach(F, Table) ->
    dirty_foreach(F, Table, mnesia:dirty_first(Table)).

dirty_foreach(_, _, '$end_of_table') ->
    ok;
dirty_foreach(F, Table, Key) ->
    lists:foreach(F, mnesia:dirty_read(Table, Key)),
    dirty_foreach(F, Table, mnesia:dirty_next(Table, Key)).

Conclusion

  1. It’s probably your fault
  2. Do as little as possible inside transactions
  3. Use dirty operations instead of transactions
  4. Reduce record size
  5. Iterate in small batches

How to Eliminate Mnesia Overload Events

If you’re using mnesia disc_copies tables and doing a lot of writes all at once, you’ve probably run into the following message

=ERROR REPORT==== 10-Dec-2008::18:07:19 ===
Mnesia(node@host): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

This warning event can get really annoying, especially when they start happening every second. But you can eliminate them, or at least drastically reduce their occurance.

Synchronous Writes

The first thing to do is make sure to use sync_transaction or sync_dirty. Doing synchronous writes will slow down your writes in a good way, since the functions won’t return until your record(s) have been written to the transaction log. The alternative, which is the default, is to do asynchronous writes, which can fill transaction log far faster than it gets dumped, causing the above error report.

Mnesia Application Configuration

If synchronous writes aren’t enough, the next trick is to modify 2 obscure configuration parameters. The mnesia_overload event generally occurs when the transaction log needs to be dumped, but the previous transaction log dump hasn’t finished yet. Tweaking these parameters will make the transaction log dump less often, and the disc_copies tables dump to disk more often. NOTE: these parameters must be set before mnesia is started; changing them at runtime has no effect. You can set them thru the command line or in a config file.

dc_dump_limit

This variable controls how often disc_copies tables are dumped from memory. The default value is 4, which means if the size of the log is greater than the size of table / 4, then a dump occurs. To make table dumps happen more often, increase the value. I’ve found setting this to 40 works well for my purposes.

dump_log_write_threshold

This variable defines the maximum number of writes to the transaction log before a new dump is performed. The default value is 100, so a new transaction log dump is performed after every 100 writes. If you’re doing hundreds or thousands of writes in a short period of time, then there’s no way mnesia can keep up. I set this value to 50000, which is a huge increase, but I have enough RAM to handle it. If you’re worried that this high value means the transaction log will rarely get dumped when there’s very few writes occuring, there’s also a dump_log_time_threshold configuration variable, which by default dumps the log every 3 minutes.

How it Works

I might be wrong on the theory since I didn’t actually write or design mnesia, but here’s my understanding of what’s happening. Each mnesia activity is recorded to a single transaction log. This transaction log then gets dumped to table logs, which in turn are dumped to the table file on disk. By increasing the dump_log_write_threshold, transaction log dumps happen much less often, giving each dump more time to complete before the next dump is triggered. And increasing dc_dump_limit helps ensure that the table log is also dumped to disk before the next transaction dump occurs.

Part of Speech Tagging with NLTK Part 3 – Brill Tagger

In regexp and affix pos tagging, I showed how to produce a Python NLTK part-of-speech tagger using Ngram pos tagging in combination with Affix and Regex pos tagging, with accuracy approaching 90%. In part 3, I’ll use the brill tagger to get the accuracy up to and over 90%.

NLTK Brill Tagger

The BrillTagger is different than the previous part of speech taggers. For one, it’s not a SequentialBackoffTagger, though it does use an initial pos tagger, which in our case will be the raubt_tagger from part 2. The brill tagger uses the initial pos tagger to produce initial part of speech tags, then corrects those pos tags based on brill transformational rules. These rules are learned by training the brill tagger with the FastBrillTaggerTrainer and rules templates. Here’s an example, with templates copied from the demo() function in nltk.tag.brill.py. Refer to ngram part of speech tagging for the backoff_tagger function and the train_sents, and regexp part of speech tagging for the word_patterns.

import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
    nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

templates = [
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
    brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
    brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
    brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)

NLTK Brill Tagger Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful, as increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or 2. So here’s how the braubt_tagger fares against the other NLTK part of speech taggers.

Conclusion

There’s certainly more you can do for part-of-speech tagging with nltk & python, but the brill tagger based braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your pos tagger to be accurate, you need to train it on a corpus similar to the text you’ll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn’t assume that a pos tagger trained on them will be accurate on a different corpus. For example, a pos tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a pos tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.

If you’d like to try to push NLTK part of speech tagging accuracy even higher, see part 4, where I compare the brill tagger to classifier based pos taggers, and nltk.tag.pos_tag.

Unit Testing with Erlang's Common Test Framework

One of the first things people look for when getting started with Erlang is a unit testing framework, and EUnit tends to be the framework of choice. But I always had trouble getting EUnit to play nice with my code since it does parse transforms, which screws up the handling of include files and record definitions. And because Erlang has pattern matching, there’s really no reason for assert macros. So I looked around for alternatives and found that a testing framework called common_test has been included since Erlang/OTP-R12B. common_test (and test_server), are much more heavy duty than EUnit, but don’t let that scare you away. Once you’ve set everything up, writing and running unit tests is quite painless.

Directory Setup

I’m going to assume an OTP compliant directory setup, specifically:

  1. a top level directory we’ll call project/
  2. a lib/ directory containing your applications at project/lib/
  3. application directories inside lib/, such as project/lib/app1/
  4. code files are in app1/src/ and beam files are in app1/ebin/

So we end up with a directory structure like this:

project/
    lib/
        app1/
            src/
            ebin/

Test Suites

Inside the app1/ directory, create a directory called test/. This is where your test suites will go. Generally, you’ll have 1 test suite per code module, so if you have app1/src/module1.erl, then you’ll create app1/test/module1_SUITE.erl for all your module1 unit tests. Each test suite should look something like this: (unfortunately, wordpress doesn’t do syntax highlighting for erlang, so it looks kinda crappy)

-module(module1_SUITE).

% easier than exporting by name
-compile(export_all).

% required for common_test to work
-include("ct.hrl").

%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% common test callbacks %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Specify a list of all unit test functions
all() -> [test1, test2].

% required, but can just return Config. this is a suite level setup function.
init_per_suite(Config) ->
    % do custom per suite setup here
    Config.

% required, but can just return Config. this is a suite level tear down function.
end_per_suite(Config) ->
    % do custom per suite cleanup here
    Config.

% optional, can do function level setup for all functions,
% or for individual functions by matching on TestCase.
init_per_testcase(TestCase, Config) ->
    % do custom test case setup here
    Config.

% optional, can do function level tear down for all functions,
% or for individual functions by matching on TestCase.
end_per_testcase(TestCase, Config) ->
    % do custom test case cleanup here
    Config.

%%%%%%%%%%%%%%%%
%% test cases %%
%%%%%%%%%%%%%%%%

test1(Config) ->
    % write standard erlang code to test whatever you want
    % use pattern matching to specify expected return values
    ok.

test2(Config) -> ok.

Test Specification

Now the we have a test suite at project/app1/test/module1_SUITE.erl, we can make a test specification so common_test knows where to find the test suites, and which suites to run. Something I found out that hard way is that common_test requires absolute paths in its test specifications. So instead of creating a file called test.spec, we’ll create a file called test.spec.in, and use make to generate the test.spec file with absolute paths.

test.spec.in

{logdir, "@PATH@/log"}.
{alias, app1, "@PATH@/lib/app1"}.
{suites, app1, [module1_SUITE]}.

Makefile

src:
    erl -pa lib/*/ebin -make

test.spec: test.spec.in
    cat test.spec.in | sed -e "s,@PATH@,$(PWD)," > $(PWD)/test.spec

test: test.spec src
    run_test -pa $(PWD)/lib/*/ebin -spec test.spec

Running the Tests

As you can see above, I also use the Makefile for running the tests with the command make test. For this command to work, run_test must be installed in your PATH. To do so, you need to run /usr/lib/erlang/lib/common_test-VERSION/install.sh (where VERSION is whatever version number you currently have). See the common_test installation instructions for more information. I’m also assuming you have an Emakefile for compiling the code in lib/app1/src/ with the make src command.

Final Thoughts

So there you have it, an example test suite, a test specification, and a Makefile for running the tests. The final file and directory structure should look something like this:

project/
    Emakefile
    Makefile
    test.spec.in
    lib/
        app1/
            src/
                module1.erl
            ebin/
            test/
                module1_SUITE.erl

Now all you need to do is write your unit tests in the form of test suites and add those suites to test.spec.in. There’s a lot more you can get out of common_test, such as code coverage analysis, HTML logging, and large scale testing. I’ll be covering some of those topics in the future, but for now I’ll end with some parting thoughts from the Common Test User’s Guide:

It’s not possible to prove that a program is correct by testing. On the contrary, it has been formally proven that it is impossible to prove programs in general by testing.

There are many kinds of test suites. Some concentrate on calling every function or command… Some other do the same, but uses all kinds of illegal parameters.

Aim for finding bugs. Write whatever test that has the highest probability of finding a bug, now or in the future. Concentrate more on the critical parts. Bugs in critical subsystems are a lot more expensive than others.

Aim for functionality testing rather than implementation details. Implementation details change quite often, and the test suites should be long lived.

Aim for testing everything once, no less, no more

Part of Speech Tagging with NLTK Part 2 – Regexp and Affix Taggers

Following up on Part of Speech Tagging with NLTK – Ngram Taggers, I test the accuracy of adding an Affix Tagger and a Regexp Tagger to the SequentialBackoffTagger chain.

NLTK Affix Tagger

The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for word. I tried inserting the affix tagger into every possible position of the ubt_tagger to see which method increased accuracy the most. As you’ll see in the results, the aubt_tagger had the highest accuracy.

ubta_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
ubat_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
uabt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
aubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])

NLTK Regexp Tagger

The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book, others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the regexp tagger at the beginning and end of the pos tagger chain.

word_patterns = [
	(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
	(r'.*ould$', 'MD'),
	(r'.*ing$', 'VBG'),
	(r'.*ed$', 'VBD'),
	(r'.*ness$', 'NN'),
	(r'.*ment$', 'NN'),
	(r'.*ful$', 'JJ'),
	(r'.*ious$', 'JJ'),
	(r'.*ble$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*ive$', 'JJ'),
	(r'.*ic$', 'JJ'),
	(r'.*est$', 'JJ'),
	(r'^a$', 'PREP'),
]

aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
    backoff=nltk.tag.RegexpTagger(word_patterns))

NLTK Affix and Regexp Tagging Accuracy

Conclusion

As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part of Speech Tagging with NLTK – Brill Tagger I discuss the results of using the BrillTagger to push the accuracy even higher.

Part of Speech Tagging with NLTK Part 1 – Ngram Taggers

Part of speech tagging is the process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.

Training and Test Sentences

NLTK has a data package that includes 3 part of speech tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on the pos tagging accuracy, so for real world usage, you need to train on a corpus that is very representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose these categories primarily because they have a higher occurance of the word food than other categories.

import nltk.corpus, nltk.tag, itertools
brown_review_sents = nltk.corpus.brown.tagged_sents(categories=['reviews'])
brown_lore_sents = nltk.corpus.brown.tagged_sents(categories=['lore'])
brown_romance_sents = nltk.corpus.brown.tagged_sents(categories=['romance'])

brown_train = list(itertools.chain(brown_review_sents[:1000], brown_lore_sents[:1000], brown_romance_sents[:1000]))
brown_test = list(itertools.chain(brown_review_sents[1000:2000], brown_lore_sents[1000:2000], brown_romance_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])

(Updated 4/15/2010 for new brown categories. Also note that the best way to use conll2000 is with conll2000.tagged_sents('train.txt') and conll2000.tagged_sents('test.txt'), but changing that above may change the accuracy.)

NLTK Ngram Taggers

I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these pos taggers, I created a utility method for creating a chain of backoff taggers.

def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
	if not backoff:
		backoff = tagger_classes[0](tagged_sents)
		del tagger_classes[0]

	for cls in tagger_classes:
		tagger = cls(tagged_sents, backoff=backoff)
		backoff = tagger

	return backoff

ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])

Tagger Accuracy Testing

To test the accuracy of a pos tagger, we can compare it to the test sentences using the nltk.tag.accuracy function.

nltk.tag.accuracy(tagger, test_sents)

Ngram Tagging Accuracy

Ngram Tagging Accuracy
Ngram Tagging Accuracy

Conclusion

The ubt_tagger and utb_taggers are extremely close to each other, but the ubt_tagger is the slight favorite (note that the backoff sequence is in reverse order, so for the ubt_tagger, the trigram tagger backsoff to the bigram tagger, which backsoff to the unigram tagger). In Part of Speech Tagging with NLTK – Regexp and Affix Tagging, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.