Avogadro Corp Book Review / AI Speculation
Avogadro Corp: The Singularity Is Closer Than It Appears, by William Hertling, is the first sci-fi book I've read with a semi-plausible AI origin story. That's because the premise isn't as simple as "increased computing power -> emergent AI". It's a much more well-defined formula: ever-increasing computing power + powerful language processing + never-ending stream of training data + goal-oriented behavior + deep integration into internet infrastructure -> AI. The AI in the story is called ELOPe, which stands for Email Language Optimization Program, and its function is essentially to improve the quality of emails. WARNING: there will be spoilers below, but only enough to describe ELOPe and speculate about how it might be implemented.
What is ELOPe
The idea behind ELOPe is to provide writing suggestions as a feature of a popular web-based email service. These writing suggestions are designed to improve the outcome of your email, whatever that may be. To take an example from the book, if you're requesting more compute resources for a project, then ELOPe's job is to offer writing suggestions that are most likely to get your request approved. By taking into account your own past writings, who you're sending the email to, and what you're asking for, it can go as far as completely re-writing the email to achieve the optimal outcome.
Using the existence of ELOPe as a given, the author writes an enjoyable story that is (mostly) technically accurate with plenty of details, without being boring. If you liked Daemon by Daniel Suarez, or you work with any kind of natural language / text-processing technology, you'll probably enjoy the story. I won't get into how an email writing suggestion program goes from that to full AI & takes over the world as a benevolent ghost in the wires - for that you need to read the book. What I do want to talk about is how this email optimization system could be implemented.
How ELOPe might work
Let's start by defining the high-level requirements. ELOPe is an email optimizer, so we have the sender, the receiver, and the email being written as inputs. The output is a re-written email that preserves the "voice" of the sender while using language that will be much more likely to achieve the sender's desired outcome, given who they're sending the email to. That means we need the following:
- ability to analyze the email to determine what outcome is desired
- prior knowledge of how the receiver has responded to other emails with similar outcome topics, in order to know what language produced the best outcomes (and what language produced bad outcomes)
- ability to re-write (or generate) an email whose language is consistent with the sender, while also using language optimized to get the best response from the receiver
Topic Analysis
Determining the desired outcome for an email seems to me like a sophisticated combination of topic modeling and deep linguistic parsing. The goal would be to identify the core reason for the email: what is the sender asking for, and what would be an optimal response?
Being able to do this from a single email is probably impossible, but if you have access to thousands, or even millions of email chains, accurate topic modeling is much more do-able. Nearly every email someone sends will have some similarity to past emails sent by other people in similar situations. So you could create feature vectors for every email chain (using deep semantic parsing), then cluster the chains using feature similarity. Now you have topic clusters, and from that you could create training data for thousands of topic classifiers. Once you have the classifiers, you can run those in parallel to determine the most likely topic(s) of a single email.
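Here's a toy-scale sketch of the clustering step. It is only an illustration, not how ELOPe is described in the book: it swaps deep semantic parsing for plain bag-of-words counts, and it uses a greedy similarity threshold instead of a real clustering algorithm. The function names, the 0.3 threshold, and the sample emails are all made up for the example.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words counts; a crude stand-in for deep semantic features."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, [member indexes])
    for i, text in enumerate(texts):
        vec = features(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


emails = [
    "Can I get more servers for my project?",
    "Requesting more servers for the search project",
    "Lunch on Friday?",
    "Are we still on for lunch Friday?",
]
groups = cluster(emails)  # indexes of emails grouped by rough topic
```

Each resulting group could then be treated as the positive training set for one topic classifier, with everything outside the group as negatives.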
Obviously it would be very difficult to create accurate clusters, and even harder to do so at scale. Language is very fuzzy, humans are inconsistent, and a huge fraction of email is spam. But the core of the necessary technology exists, and can work very well in limited conditions. The ability to parse emails, extract textual features, and cluster & classify feature vectors is functionality that's available in at least a few modern programming libraries today (e.g. Python with NLTK & scikit-learn). These are areas of software technology that are getting a lot of attention right now, and all signs indicate that attention will only increase over time, so it's quite likely that the difficulty level will decrease significantly over the next 10 years. Moving on, let's assume we can do accurate email topic analysis. The next hurdle is outcome analysis.
Outcome Analysis
Once you can determine topics, now you need to learn about outcomes. Two email chains about acquiring compute resources have the same topic, but one chain ends with someone successfully getting access to more compute resources, while the other ends in failure. How do you differentiate between these? This sounds like next-generation sentiment analysis. You need to go deeper than simple failure vs. success, positive vs. negative, since you want to know which email chains within a given topic produced the best responses, and what language they have in common. In other words, you need a language model that weights successful outcome language much higher than failure outcome language. The only way I can think of doing this with a decent level of accuracy is massive amounts of human verified training data. Technically do-able, but very expensive in terms of time and effort.
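One simple way to make "weight successful outcome language higher" concrete is a smoothed log-odds score per word, computed from human-labeled emails. This is a minimal sketch under that assumption: the labeled examples and function names are invented, and a real system would need far more data and much richer features than single words.

```python
import math
from collections import Counter


def word_counts(texts):
    """Count lowercase whitespace-separated tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def outcome_weights(successes, failures):
    """Add-one smoothed log-odds of each word in successful vs failed emails."""
    good, bad = word_counts(successes), word_counts(failures)
    vocab = set(good) | set(bad)
    good_total = sum(good.values()) + len(vocab)
    bad_total = sum(bad.values()) + len(vocab)
    return {
        w: math.log((good[w] + 1) / good_total) - math.log((bad[w] + 1) / bad_total)
        for w in vocab
    }


# Hypothetical hand-labeled emails on the "compute resources" topic.
successes = ["this will cut costs by half", "approval will cut our costs"]
failures = ["we need servers now", "we need more servers immediately"]
weights = outcome_weights(successes, failures)
```

Words with a positive weight showed up more often in the successful emails, and words with a negative weight showed up more often in the failures.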
What really pushes the bounds of plausibility is that the language model can't be universal. Everyone has their own likes, dislikes, biases, and preferences. So you need language models that are specific to individuals, or clusters of individuals that respond similarly on the same topic. Since these clusters are topic specific, every individual would belong to many (topic, cluster) pairs. Given N topics and an average of M clusters within each topic, that's N*M language models that need to be created. And one of the major plot points of the book falls out naturally: ELOPe needs access to huge amounts of high end compute resources.
This is definitely the least do-able aspect of ELOPe, and I'm ignoring all the implicit conceptual knowledge that would be required to know what an optimal outcome is, but let's move on.
Language Generation
Assuming that we can do topic & outcome analysis, the final step is using language models to generate more persuasive emails. This is perhaps the simplest part of ELOPe, assuming everything else works well. That's because natural language generation is the kind of technology that works much better with more data, and it already exists in various forms. Google translate is a kind of language generator, chatbots have been around for decades, and spammers use software to spin new articles & text based on existing writings. The differences in this case are that every individual would need their own language generator, and it would have to be parameterized with pluggable language models based on the topic, desired outcome, and receiver. But assuming we have good topic & receiver specific outcome analysis, plus hundreds or thousands of emails from the sender to learn from, then generating new emails, or just new phrases within an email, seems almost trivial compared to what I've outlined above.
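To give a feel for how small the core of a text generator can be, here is a bigram (Markov chain) sketch trained on a sender's past emails. It is a deliberately crude stand-in: it has no notion of topic, outcome, or receiver, and the sample sentences and function names are invented for illustration.

```python
import random
from collections import defaultdict


def train_bigrams(texts):
    """Map each word to the list of words that followed it in the sender's text."""
    model = defaultdict(list)
    for text in texts:
        words = ["<s>"] + text.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            model[prev].append(nxt)
    return model


def generate(model, rng, max_words=20):
    """Random walk over the bigram table from the start marker to the end marker."""
    word, out = "<s>", []
    while len(out) < max_words:
        word = rng.choice(model[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)


sent_mail = [
    "thanks for the quick reply",
    "thanks for the update on the project",
    "please send the update by friday",
]
model = train_bigrams(sent_mail)
sentence = generate(model, random.Random(0))  # seeded so runs are repeatable
```

Plugging in an outcome model would amount to biasing the choice of the next word, rather than picking uniformly from what the sender has written before.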
Final Words
I'm still highly skeptical that strong AI will ever exist. We humans barely understand the mechanisms of our own intelligence, so to think that we can create comparable artificial intelligence smells of hubris. But it can be fun to think about, and the point of sci-fi is to tell stories about possible futures, so I have no doubt various forms of AI will play a strong role in sci-fi stories for years to come.
NLTK 2 Release Highlights
NLTK 2.0.1, a.k.a. NLTK 2, was recently released, and what follows are my favorite changes, new features, and highlights from the ChangeLog.
New Classifiers
The SVMClassifier adds support vector machine classification thru SVMLight with PySVMLight. This is a much needed addition to the set of supported classification algorithms. But even more interesting...
The SklearnClassifier provides a general interface to text classification with scikit-learn. While scikit-learn is still pre-1.0, it is rapidly becoming one of the most popular machine learning toolkits, and provides more advanced feature extraction methods for classification.
Github
NLTK has moved development and hosting to github, replacing google code and SVN. The primary motivation is to make new development easier, and already a Python 3 branch is under active development. I think this is great, since github makes forking & pull requests quite easy, and it's become the de-facto "social coding" site.
Sphinx
Coinciding with the github move, the documentation was updated to use Sphinx, the same documentation generator used by Python and many other projects. While I personally like Sphinx and restructured text (which I used to write this post), I'm not thrilled with the results. The new documentation structure and NLTK homepage seem much less approachable. While it works great if you know exactly what you're looking for, I worry that new/interested users will have a harder time getting started.
New Corpora
Since the 0.9.9 release, a number of new corpora and corpus readers have been added:
ChangeLog Highlights
And here's a few final highlights:
- The HunposTagger, which wraps hunpos.
- The StanfordTagger plus 2 subclasses for NER and POS tagging with the Stanford POS Tagger.
- The SnowballStemmer, which supports 13 different languages. You can try it out at my online stemming demo.
The Future
I think NLTK's ideal role is to be a standard interface between corpora and NLP algorithms. There are many different corpus formats, and every algorithm has its own data structure requirements, so providing common abstract interfaces to connect these together is very powerful. It allows you to test the same algorithm on disparate corpora, or try multiple algorithms on a single corpus. This is what NLTK already does best, and I hope that becomes even more true in the future.
Upcoming Talks
At the end of February and the beginning of March, I'll be giving 3 talks in the SF Bay Area and one in St Louis, MO. In chronological order...
How Weotta uses MongoDB
Grant and I will be helping 10gen celebrate the opening of their new San Francisco office on Tuesday, February 21, by talking about How Weotta uses MongoDB. We'll cover some of our favorite features of MongoDB and how we use it for local place & events search. Then we'll finish with a preview of Weotta's upcoming MongoDB powered local search APIs.
NLTK Jam Session at NICAR 2012
On Thursday, February 23, in St Louis, MO, I'll be demonstrating how to use NLTK as part of the NewsCamp workshop at NICAR 2012. This will be a version of my PyCon NLTK Tutorial with a focus on news text and corpora like treebank.
Corpus Bootstrapping with NLTK at Strata 2012
As part of the Strata 2012 Deep Data program, I'll talk about Corpus Bootstrapping with NLTK on Tuesday, February 28. The premise of this talk is that while there's plenty of great algorithms and methods for natural language processing, most of them require a training corpus, and chances are the training corpus you really need doesn't exist. So how can you quickly create a quality corpus at minimal cost? I'll cover specific real-world examples to answer this question.
NLTK Tutorial at PyCon 2012
Introduction to NLTK will be a 3 hour tutorial at PyCon on Thursday, March 8th. You'll get to know NLTK in depth, learn about corpus organization, and train your own models manually & with nltk-trainer. My goal is that you'll walk out with at least one new NLP superpower that you can put to use immediately.
Bay Area NLP Meetup
This Thursday, June 7 2011, will be the first meeting of the Bay Area NLP group, at Chomp HQ in San Francisco, where I will be giving a talk on NLTK titled "NLTK: the Good, the Bad, and the Awesome". I'll be sharing some of the things I've learned using NLTK, operating text-processing.com, and doing random consulting on natural language processing. I'll also explain why NLTK-Trainer exists and how awesome it is for training NLP models. So if you're in the area and have some time Thursday evening, come by and say hi.
Update on 07/10/2011: slides are online from my talk: NLTK: the Good, the Bad, and the Awesome.
Interview and Article about NLTK and Text-Processing
I recently did an interview with Zoltan Varju (@zoltanvarju) about Python, NLTK, and my demos & APIs at text-processing.com, which you can read here. There's even a bit about Erlang & functional programming, as well as some insight into what I've been working on at Weotta. And last week, the text-processing.com API got a write up (and a nice traffic boost) from Garrett Wilkin (@garrettwilkin) on programmableweb.com.
Analyzing Tagged Corpora and NLTK Part of Speech Taggers
NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.
Analyze a Tagged Corpus
You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here's the tag counts for the treebank corpus:
$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags
Tag Count
======= =========
# 16
$ 724
'' 694
, 4886
-LRB- 120
-NONE- 6592
-RRB- 126
. 3874
: 563
CC 2265
CD 3546
DT 8165
EX 88
FW 4
IN 9857
JJ 5834
JJR 381
JJS 182
LS 13
MD 927
NN 13166
NNP 9410
NNPS 244
NNS 6047
PDT 27
POS 824
PRP 1716
PRP$ 766
RB 2822
RBR 136
RBS 35
RP 216
SYM 1
TO 2179
UH 3
VB 2554
VBD 3043
VBG 1460
VBN 2134
VBP 1321
VBZ 2125
WDT 445
WP 241
WP$ 14
WRB 178
`` 712
======= =========
By default, analyze_tagged_corpus.py sorts by tags, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:
$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags
Tag Count
======= =========
7416
# 16
$ 724
'' 694
( 120
) 126
, 4886
. 3874
: 563
ADJ 6397
ADV 2993
CNJ 2265
DET 8192
EX 88
FW 4
L 13
MOD 927
N 19213
NP 9654
NUM 3546
P 9857
PRO 2698
S 1
TO 2179
UH 3
V 6000
VD 3043
VG 1460
VN 2134
WH 878
`` 712
======= =========
Analyze Tagger Coverage
You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here's the results for the treebank corpus using NLTK's default part-of-speech tagger:
$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Tag Found
======= =========
# 16
$ 724
'' 694
, 4887
-LRB- 120
-NONE- 6591
-RRB- 126
. 3874
: 563
CC 2271
CD 3547
DT 8170
EX 88
FW 4
IN 9880
JJ 5803
JJR 386
JJS 185
LS 12
MD 927
NN 13166
NNP 9427
NNPS 246
NNS 6055
PDT 21
POS 824
PRP 1716
PRP$ 766
RB 2800
RBR 130
RBS 33
RP 213
SYM 1
TO 2180
UH 3
VB 2562
VBD 3035
VBG 1458
VBN 2145
VBP 1318
VBZ 2124
WDT 440
WP 241
WP$ 14
WRB 178
`` 712
======= =========
If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:
$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Accuracy: 0.995689
Unknown words: 440
Tag Found Actual Precision Recall
======= ========= ========== ============= ==========
# 16 16 1.0 1.0
$ 724 724 1.0 1.0
'' 694 694 1.0 1.0
, 4887 4886 1.0 1.0
-LRB- 120 120 1.0 1.0
-NONE- 6591 6592 1.0 1.0
-RRB- 126 126 1.0 1.0
. 3874 3874 1.0 1.0
: 563 563 1.0 1.0
CC 2271 2265 1.0 1.0
CD 3547 3546 0.99895833333 0.99895833333
DT 8170 8165 1.0 1.0
EX 88 88 1.0 1.0
FW 4 4 1.0 1.0
IN 9880 9857 0.99130434782 0.95798319327
JJ 5803 5834 0.99134948096 0.97892938496
JJR 386 381 1.0 0.91489361702
JJS 185 182 0.96666666666 1.0
LS 12 13 1.0 0.85714285714
MD 927 927 1.0 1.0
NN 13166 13166 0.99166034874 0.98791540785
NNP 9427 9410 0.99477911646 0.99398073836
NNPS 246 244 0.99029126213 0.95327102803
NNS 6055 6047 0.99515235457 0.99722414989
PDT 21 27 1.0 0.66666666666
POS 824 824 1.0 1.0
PRP 1716 1716 1.0 1.0
PRP$ 766 766 1.0 1.0
RB 2800 2822 0.99305555555 0.975
RBR 130 136 1.0 0.875
RBS 33 35 1.0 0.5
RP 213 216 1.0 1.0
SYM 1 1 1.0 1.0
TO 2180 2179 1.0 1.0
UH 3 3 1.0 1.0
VB 2562 2554 0.99142857142 1.0
VBD 3035 3043 0.990234375 0.98065764023
VBG 1458 1460 0.99650349650 0.99824868651
VBN 2145 2134 0.98852223816 0.99566473988
VBP 1318 1321 0.99305555555 0.98281786941
VBZ 2124 2125 0.99373040752 0.990625
WDT 440 445 1.0 0.83333333333
WP 241 241 1.0 1.0
WP$ 14 14 1.0 1.0
WRB 178 178 1.0 1.0
`` 712 712 1.0 1.0
======= ========= ========== ============= ==========
These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question "for each word that was given this tag, was it correct?", while Recall answers the question "for all words that should have gotten this tag, did they get it?". If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.
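To make the two questions concrete, here is a small sketch that computes Found, Actual, Precision, and Recall per tag from two aligned tag sequences. It is a token-level illustration with made-up data, and not necessarily how analyze_tagger_coverage.py arrives at its numbers; tag_metrics is a hypothetical helper.

```python
from collections import Counter


def tag_metrics(gold, predicted):
    """Per-tag (found, actual, precision, recall) from aligned tag sequences."""
    found, actual, correct = Counter(predicted), Counter(gold), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            correct[g] += 1
    metrics = {}
    for tag in set(found) | set(actual):
        # Precision: of the words given this tag, how many were right?
        precision = correct[tag] / found[tag] if found[tag] else None
        # Recall: of the words that should have this tag, how many got it?
        recall = correct[tag] / actual[tag] if actual[tag] else None
        metrics[tag] = (found[tag], actual[tag], precision, recall)
    return metrics


# One PDT word is mistakenly tagged DT.
gold = ["PDT", "DT", "NN", "PDT", "DT", "NN", "PDT", "JJ"]
predicted = ["PDT", "DT", "NN", "DT", "DT", "NN", "PDT", "JJ"]
metrics = tag_metrics(gold, predicted)
```

In this toy data PDT keeps perfect Precision but loses Recall, while DT keeps perfect Recall but loses Precision, which is the same pattern as PDT and JJS in the output above.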
Training Part of Speech Taggers with NLTK Trainer
NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.
Training Sequential Backoff Taggers
The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:
a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger
For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 - Ngram Taggers, you could do the following:
python train_tagger.py treebank --sequential ubt
You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).
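If the backoff idea is new to you, here is a toy re-implementation of it in plain Python. It is not NLTK's code: NgramBackoffTagger is a hypothetical class, the training data is two made-up sentences, and the context (previous tags plus the current word) is a simplification. As I understand it, ubt builds the chain in the order shown, with each tagger using the previous one as its backoff.

```python
from collections import Counter, defaultdict


class NgramBackoffTagger:
    """Toy n-gram tagger: context is the previous n-1 tags plus the current word."""

    def __init__(self, n, train_sents, backoff=None):
        self.n, self.backoff = n, backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            tags = [tag for _, tag in sent]
            for i, (word, tag) in enumerate(sent):
                counts[self.context(word, tags[:i])][tag] += 1
        # Keep only the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def context(self, word, history):
        prev = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev, word)

    def tag_word(self, word, history):
        tag = self.table.get(self.context(word, history))
        if tag is None and self.backoff is not None:
            return self.backoff.tag_word(word, history)  # defer to the next tagger
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.tag_word(word, history))
        return list(zip(words, history))


train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
unigram = NgramBackoffTagger(1, train)
bigram = NgramBackoffTagger(2, train, backoff=unigram)
trigram = NgramBackoffTagger(3, train, backoff=bigram)  # same order as "ubt"
tagged = trigram.tag(["cat", "barks"])
```

Here "cat" never starts a training sentence, so the trigram and bigram taggers have no matching context and the unigram tagger supplies the tag. A word that none of the taggers has seen comes back tagged as None.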
Training Affix Taggers
The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more --affix N options, where N should be a positive number for prefixes, and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3 character suffix, and another that uses a 2 character prefix, specify the --affix argument twice:
python train_tagger.py treebank --sequential aubt --affix -3 --affix 2
The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
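The sign convention for N is easy to mix up, so here is a tiny sketch of what each value selects. The affix function is a hypothetical helper, and the min_stem_length cut-off is an assumption modeled on how NLTK's AffixTagger skips words that are too short.

```python
def affix(word, length, min_stem_length=2):
    """Return a prefix (length > 0) or suffix (length < 0), or None if too short."""
    if len(word) < min_stem_length + abs(length):
        return None
    return word[:length] if length > 0 else word[length:]


suffix = affix("running", -3)   # what --affix -3 would look at
prefix = affix("running", 2)    # what --affix 2 would look at
too_short = affix("run", -3)    # no stem left, so no affix context
```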
Training Brill Taggers
To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 - Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:
python train_tagger.py treebank --sequential aubt --brill
The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.
Training Classifier Based Taggers
Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don't want this tagger to back off to a sequential backoff tagger, be sure to specify --sequential ''. Here's an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 - Classifier Taggers:
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
If you do want to back off to a sequential tagger, be sure to specify a cutoff probability, like so:
python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.
Phonetic Feature Options
You can also include phonetic algorithm features using the following arguments:
--metaphone: Use metaphone feature
--double-metaphone: Use double metaphone feature
--soundex: Use soundex feature
--nysiis: Use NYSIIS feature
--caverphone: Use caverphone feature
These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).
I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.
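To show how a phonetic code can collapse spelling variants into one feature value, here is a compact Soundex sketch. It is written from the standard description of the algorithm, not copied from nltk-trainer or advas, and phonetic_features is a hypothetical helper showing how the code might be attached to a word's feature dict.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Standard 4-character Soundex code, e.g. a letter followed by 3 digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def phonetic_features(word):
    """Hypothetical feature dict fragment for a classifier based tagger."""
    return {"soundex": soundex(word)}


same_code = soundex("color") == soundex("colour")
```

Both spellings end up with the same feature value, so a classifier sees them as the same evidence even if it has only been trained on one of them.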
Python Text Processing with NLTK Cookbook Chapter 2 Errata
It has come to my attention that there are two errors in Chapter 2, Replacing and Correcting Words of Python Text Processing with NLTK Cookbook. My thanks to the reader who went out of their way to verify my mistakes and send in corrections.
In Lemmatizing words with WordNet, on page 29, under How it works..., I said that "cooking" is not a noun and does not have a lemma. In fact, cooking is a noun, and as such is its own lemma. Of course, "cooking" is also a verb, and the verb form has the lemma "cook".
In Removing repeating characters, on page 35, under How it works..., I explained the repeat_regexp match groups incorrectly. The actual match grouping of the word "looooove" is (looo)(o)o(ve) because the pattern matching is greedy. The end result is still correct.
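You can check the greedy grouping yourself with Python's re module. The pattern below is an assumption about the general shape of repeat_regexp (any prefix, a character, that same character again, any suffix), not a quote from the book, so treat it as a sketch.

```python
import re

# Assumed pattern: group 1 is greedy, so it grabs as many characters as it can
# while still leaving one repeated pair for (\w)\2 to match.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")

word = "looooove"
groups = repeat_regexp.match(word).groups()

# One substitution pass drops a single repeated character.
one_pass = repeat_regexp.sub(r"\1\2\3", word)
```

Because the first group is greedy, the repeated pair it leaves for the backreference is the last one in the word, and each pass shortens the word by exactly one character.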
NLTK Default Tagger CoNLL2000 Tag Coverage
Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.
NLTK Default Tagger Performance on CoNLL2000
The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.
| Tag | Found | Actual | Precision | Recall |
| # | 46 | 47 | 1 | 1 |
| $ | 2122 | 2134 | 1 | 0.6 |
| ' | 1811 | 1809 | 1 | 1 |
| ( | 0 | 351 | None | 0 |
| ) | 0 | 358 | None | 0 |
| , | 13160 | 13160 | 1 | 1 |
| -LRB- | 351 | 0 | 0 | None |
| -NONE- | 59 | 0 | 0 | None |
| -RRB- | 358 | 0 | 0 | None |
| . | 10800 | 10802 | 1 | 1 |
| : | 1288 | 1285 | 0.7143 | 1 |
| CC | 6589 | 6586 | 0.6875 | 0.7333 |
| CD | 10325 | 10233 | 0.972 | 0.9919 |
| DT | 22301 | 22355 | 0.7826 | 1 |
| EX | 229 | 254 | 1 | 1 |
| FW | 1 | 42 | 1 | 0.0455 |
| IN | 27798 | 27835 | 0.7315 | 0.7899 |
| JJ | 15370 | 16049 | 0.7372 | 0.7303 |
| JJR | 1114 | 1055 | 0.5412 | 0.575 |
| JJS | 611 | 451 | 0.6912 | 0.7966 |
| LS | 13 | 0 | 0 | None |
| MD | 2616 | 2637 | 0.7143 | 0.75 |
| NN | 38023 | 36789 | 0.7345 | 0.8441 |
| NNP | 24967 | 24690 | 0.8752 | 0.9421 |
| NNPS | 589 | 550 | 0.4553 | 0.3684 |
| NNS | 17068 | 16653 | 0.8572 | 0.9527 |
| PDT | 24 | 65 | 0.6667 | 1 |
| POS | 2224 | 2203 | 0.6667 | 1 |
| PRP | 4620 | 4634 | 0.8438 | 0.7941 |
| PRP$ | 2292 | 2302 | 0.6364 | 1 |
| RB | 7681 | 7961 | 0.8076 | 0.8582 |
| RBR | 288 | 392 | 0.5 | 0.3684 |
| RBS | 90 | 240 | 0.5 | 0.1667 |
| RP | 634 | 95 | 0.1176 | 1 |
| SYM | 0 | 6 | None | 0 |
| TO | 6257 | 6259 | 1 | 0.75 |
| UH | 2 | 17 | 1 | 0.1111 |
| VB | 6681 | 7286 | 0.9042 | 0.8313 |
| VBD | 8501 | 8424 | 0.7521 | 0.8605 |
| VBG | 3730 | 4000 | 0.8493 | 0.8603 |
| VBN | 5763 | 5867 | 0.8164 | 0.8721 |
| VBP | 3232 | 3407 | 0.6754 | 0.6638 |
| VBZ | 5224 | 5561 | 0.7273 | 0.6906 |
| WDT | 1156 | 1157 | 0.6 | 0.5 |
| WP | 637 | 639 | 1 | 1 |
| WP$ | 38 | 39 | 1 | 1 |
| WRB | 566 | 571 | 0.9 | 0.75 |
| `` | 1855 | 1854 | 0.6667 | 1 |
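The effect behind the CC row is easy to reproduce with sets of token positions. This is a made-up example with a hypothetical tagger, only meant to show that matching counts say nothing about which tokens got the tag.

```python
# Token positions tagged CC in the gold data vs. by a hypothetical tagger.
actual_cc = {1, 4, 7, 10}
found_cc = {1, 4, 12, 15}

true_positives = actual_cc & found_cc
precision = len(true_positives) / len(found_cc)  # share of found tags that are right
recall = len(true_positives) / len(actual_cc)    # share of actual tags that were found
```

Found and Actual are both 4 here, yet Precision and Recall are only 0.5, because two false positives exactly offset two false negatives.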
Unknown Words in CoNLL2000
The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here's a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words combined with a "-". You might think this can be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the "-" would be incorrect.
Missing Symbols and Rare Tags
The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the more rare tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportion. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.
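A quick sanity check before trusting a tagger on a new corpus is to compare the tags it produces with the tags the corpus actually uses. Here is a minimal sketch using a handful of tags from the table above; compare_tagsets is a hypothetical helper, and the two lists are hand-picked rather than complete.

```python
def compare_tagsets(found_tags, actual_tags):
    """Report tags that appear on only one side."""
    found_tags, actual_tags = set(found_tags), set(actual_tags)
    return {
        "found_only": sorted(found_tags - actual_tags),
        "actual_only": sorted(actual_tags - found_tags),
    }


# Tags with a non-zero Found count vs. tags with a non-zero Actual count.
found = ["-LRB-", "-RRB-", "-NONE-", "LS", "NN"]
actual = ["(", ")", "SYM", "NN"]
diff = compare_tagsets(found, actual)
```

Anything in found_only is a tag the corpus never uses, and anything in actual_only is a tag the tagger never produces, so both lists point at a tagset mismatch.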
NLTK Default Tagger Treebank Tag Coverage
For some research I'm doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I've added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger's results against the actual tags.
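The counting half of that is simple enough to sketch in a few lines. This is not the script's actual code: tag_coverage is a hypothetical helper, and toy_tagger is a stand-in that tags by surface rules so the example runs without any trained model.

```python
from collections import Counter


def tag_coverage(tagger, sents):
    """Count how often a tagger produces each tag over untagged sentences."""
    found = Counter()
    for words in sents:
        found.update(tag for _, tag in tagger(words))
    return found


def toy_tagger(words):
    """Hypothetical stand-in for a real tagger: tags by simple surface rules."""
    return [
        (w, "CD" if w.isdigit() else "NNP" if w[0].isupper() else "NN")
        for w in words
    ]


coverage = tag_coverage(toy_tagger, [["Pierre", "is", "61"], ["markets", "fell"]])
```

Any callable that takes a list of words and returns (word, tag) pairs would work in place of toy_tagger, including the tag method of an NLTK tagger.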
NLTK Default Tagger Performance on Treebank
Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I've explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn't have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).
| Tag | Found | Actual | Precision | Recall |
| # | 16 | 16 | 1 | 1 |
| $ | 724 | 724 | 1 | 1 |
| ' | 694 | 694 | 1 | 1 |
| , | 4887 | 4886 | 1 | 1 |
| -LRB- | 120 | 120 | 1 | 1 |
| -NONE- | 6591 | 6592 | 1 | 1 |
| -RRB- | 126 | 126 | 1 | 1 |
| . | 3874 | 3874 | 1 | 1 |
| : | 563 | 563 | 1 | 1 |
| CC | 2271 | 2265 | 1 | 1 |
| CD | 3547 | 3546 | 0.999 | 0.999 |
| DT | 8170 | 8165 | 1 | 1 |
| EX | 88 | 88 | 1 | 1 |
| FW | 4 | 4 | 1 | 1 |
| IN | 9880 | 9857 | 0.9913 | 0.958 |
| JJ | 5803 | 5834 | 0.9913 | 0.9789 |
| JJR | 386 | 381 | 1 | 0.9149 |
| JJS | 185 | 182 | 0.9667 | 1 |
| LS | 12 | 13 | 1 | 0.8571 |
| MD | 927 | 927 | 1 | 1 |
| NN | 13166 | 13166 | 0.9917 | 0.9879 |
| NNP | 9427 | 9410 | 0.9948 | 0.994 |
| NNPS | 246 | 244 | 0.9903 | 0.9533 |
| NNS | 6055 | 6047 | 0.9952 | 0.9972 |
| PDT | 21 | 27 | 1 | 0.6667 |
| POS | 824 | 824 | 1 | 1 |
| PRP | 1716 | 1716 | 1 | 1 |
| PRP$ | 766 | 766 | 1 | 1 |
| RB | 2800 | 2822 | 0.9931 | 0.975 |
| RBR | 130 | 136 | 1 | 0.875 |
| RBS | 33 | 35 | 1 | 0.5 |
| RP | 213 | 216 | 1 | 1 |
| SYM | 1 | 1 | 1 | 1 |
| TO | 2180 | 2179 | 1 | 1 |
| UH | 3 | 3 | 1 | 1 |
| VB | 2562 | 2554 | 0.9914 | 1 |
| VBD | 3035 | 3043 | 0.9902 | 0.9807 |
| VBG | 1458 | 1460 | 0.9965 | 0.9982 |
| VBN | 2145 | 2134 | 0.9885 | 0.9957 |
| VBP | 1318 | 1321 | 0.9931 | 0.9828 |
| VBZ | 2124 | 2125 | 0.9937 | 0.9906 |
| WDT | 440 | 445 | 1 | 0.8333 |
| WP | 241 | 241 | 1 | 1 |
| WP$ | 14 | 14 | 1 | 1 |
| WRB | 178 | 178 | 1 | 1 |
| `` | 712 | 712 | 1 | 1 |
Unknown Words in Treebank
Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it's not that bad, since it's only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.




