# Fuzzy String Matching in Python

Fuzzy matching is a general term for finding strings that are almost equal, or mostly the same. Of course almost and mostly are ambiguous terms themselves, so you’ll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create doctests for this, like so:

```def fuzzy_match(s1, s2):
'''
>>> fuzzy_match('Happy Days', ' happy days ')
True
False
'''
# TODO: fuzzy matching code
return s1 == s2
```

Once you’ve got a good set of test cases, then it’s much easier to tailor your fuzzy matching code to get the best results.

## Normalization

The first step before doing any string matching is normalization. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While `'Happy Days' != ' happy days '`, with simple normalization you can get `'Happy Days'.lower() == ' happy days '.strip()`.

The most basic normalization you can do is to lowercase and strip whitespace. But chances are you’ll want to more. For example, here’s a simple normalization function that also removes all punctuation in a string.

```import string

def normalize(s):
for p in string.punctuation:
s = s.replace(p, '')

return s.lower().strip()
```

Using this `normalize` function, we can make the above fuzzy matching function pass our simple tests.

```def fuzzy_match(s1, s2):
'''
>>> fuzzy_match('Happy Days', ' happy days ')
True
False
'''
return normalize(s1) == normalize(s2)
```

## Regular Expressions

Beyond just stripping whitespace from the ends of strings, it’s also a good idea replace all whitespace occurrences with a single space character. The regex function for doing this is `re.sub('\s+', s, ' ')`. This will replace every occurrence of one or more spaces, newlines, tabs, etc, essentially eliminating the significance of whitespace for matching.

You may also be able to use regular expressions for partial fuzzy matching. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a simple regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you’ll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.

## Edit Distance

The edit distance (aka Levenshtein distance) is the number of single character edits it would take to transform one string into another. Thefore, the smaller the edit distance, the more similar two strings are.

If you want to do edit distance calculations, checkout the standalone editdist module. Its `distance` function takes 2 strings and returns the Levenshtein edit distance. It’s also implemented in C, and so is quite fast.

## Fuzzywuzzy

Fuzzywuzzy is a great all-purpose library for fuzzy string matching, built (in part) on top of Python’s difflib. It has a number of different fuzzy matching functions, and it’s definitely worth experimenting with all of them. I’ve personally found `ratio` and `token_set_ratio` to be the most useful.

## NLTK

If you want to do some custom fuzzy string matching, then NLTK is a great library to use. There’s word tokenizers, stemmers, and it even has its own edit distance implementation. Here’s a way you could combine all 3 to create a fuzzy string matching function.

```from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

def normalize(s):
words = tokenize.wordpunct_tokenize(s.lower().strip())
return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
```

## Phonetics

Finally, an interesting and perhaps non-obvious way to compare strings is with phonetic algorithms. The idea is that 2 strings that sound same may be the same (or at least similar enough). One of the most well known phonetic algorithms is Soundex, with a python soundex algorithm here. Another is Double Metaphone, with a python metaphone module here. You can also find code for these and other phonetic algorithms in the nltk-trainer phonetics module (copied from a now defunct sourceforge project called advas). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that’s probably pushing the bounds of fuzziness a bit too far.

# NLTK Overview at SF Python

On September 14, 2011, I’ll be giving a 20 minute overview of NLTK for the San Francisco Python Meetup Group. Since it’s only 20 minutes, I can’t get into too much detail, but I plan to quickly cover the basics of:

I’ll also be soliciting feedback for a NLTK Tutorial at PyCON 2012. So if you’ll be at the meetup and are interested in attending a NLTK tutorial, come find me and tell me what you’d want to learn.

Updated 9/15/2011: Slides from the talk are online – NLTK in 20 minutes

# Python 3 Web Development Review

The problem with Python 3 Web Development Beginner’s Guide, by Michel Anders, is one of expectations (disclaimer: I received a free eBook from Packt for review). Let’s start with the title… First we have Python 3 Web Development. This immediately sets the wrong expectations because:

1. There’s almost as much jQuery & Javascript as there is Python.
2. Most of the Python code is not Python 3 specific, and the code that is could easily be translate to Python 2.
3. Much of the Python code either uses CherryPy or is for generating HTML. This is not immediately obvious, but becomes apparent in Chapter 3 (which is available as a free PDF download: Chapter 3 – Tasklist I Persistence).

Second, this book is also supposed to be a Beginner’s Guide, but that is definitely not the case. To really grasp what’s going on, you need to already know the basics of HTML, jQuery interaction, and how HTTP works. Chapter 1 is an excellent introduction to HTTP and web application development, but the book as a whole is not beginner material. I think that anything that uses Python metaclasses automatically becomes at least intermediate level, if not expert, and the main thrust of Chapter 7 is refactoring all your straightforward database code to use complicated metaclasses.

However, if you mentally rewrite the title to be “Web Framework Development from scratch using CherryPy and jQuery”, then you’ve got the right idea. The book steps you through web app development with CherryPy, database models with sqlite3, and plenty of HTML and jQuery for interface generation and interaction. While creating example applications, you slowly build up a re-usable framework. It’s an interesting approach, but unfortunately it gets muddied up with inline HTML rendering. I never thought a language as simple and elegant as Python could be reduced to the ugliness of common PHP, but generating HTML with string interpolation inside the same functions that are accessing the database gets pretty close. I kept expecting the author to introduce template rendering, which is a major part of most modern web development frameworks, but it never happened, despite the plethora of excellent Python templating libraries.

While reading this book, I often had the recurring thought “I’m so glad I use Django“. If your aim is rapid application development, this is not the book for you. However, if you’re interested in creating your own web development framework, or would at least like to understand how a framework like Django could be created, then buy a copy Python 3 Web Development.

# PyCon NLTK Tutorial Suggestions

PyCon 2012 just released a CFP, and NLTK shows up 3 times in the suggested topics. While I’ve never done this before, I know stuff about Text Processing with NLTK so I’m going to submit a tutorial abstract. But I want your feedback: what exactly should this tutorial cover? If you could attend a 3 hour class on NLTK, what knowledge & skills would you like to come away with? Here are a few specific topics I could cover:

• part-of-speech tagging & chunking
• text classification
• creating a custom corpus and corpus reader
• training custom models (manually and/or with nltk-trainer)
• bootstrapping a custom corpus for text classification

Or I could do a high-level survey of many NLTK modules and corpora. Please let me know what you think in the comments, if you plan on going to PyCon 2012, and if you’d want to attend a tutorial on NLTK. You can also contact me directly if you prefer.

## Co-Hosting

If you’ve done this kind of thing before, have some teaching and/or speaking experience, and you feel you could add value (maybe you’re a computational linguist or NLP’er and/or have used NLTK professionally), I’d be happy to work with a co-host. Contact me if you’re interested, or leave a note in the comments.

# Python Testing Cookbook Review

Python Testing Cookbook, by Greg L Turnquist (@gregturn), goes far beyond Unit Testing, but overall it’s a mixed bag. Here’s a breakdown for each chapter (disclaimer: I received a free eBook from Packt for review):

1. Basic introduction to testing with unittest, which is great if you’re just starting with Python and testing.
2. Good coverage of nose. I was pleasantly surprised at how easy it is to write nose plugins.
3. Deep coverage of using doctest and writing testable docstrings. You can download a free PDF of Chapter 3 here.
4. BDD with a cool nose plugin, and how to use mock or mockito for testing with mock objects. I wish the author had expressed an opinion in favor of either mock or mockito, but he didn’t, so I will: use Fudge. Chapter 4 also covers the Lettuce DSL, which I think is pretty neat, but since every step requires writing a function handler, I’m not convinced it’s actually easier or better than writing doctests or unit tests.
5. Acceptance testing with Pyccuracy and Robot Framework, which both give you a way to use Selenium from Python. I thought the DSLs used seemed too “magic”, but I that’s probably because I didn’t know the command words, and they weren’t highlighted or adequately explained.
6. How to install and use Jenkins and TeamCity, and how to display XML reports produced using NoseXUnit. This is a very useful chapter for anyone thinking about or setting up continuous integration.
7. This chapter is supposed to be about test coverage, and does introduce coverage, but the examples get needlessly complicated. Previous chapters used a simple shopping cart example, but this chapter uses network events, which really distracts from the tests. The author also writes unittests that just print the results intead of actually testing results with assertions.
8. More network event complexity while trying to demonstrate smoke testing and load testing. This chapter would have made a lot more sense in a book about network programming and how to test network events. Pyro is used with very little explanation, and MySQL and SQLlite are brought in too, increasing the complexity even more.
9. This chapter is filled with useful advice, but there’s no actual code examples. Instead, the advice is shoehorned into the cookbook format, which I felt detracted from the otherwise great content.

Throughout the book, the author presents a kind of “main script” that he updates at the end of many of the chapters. However, the script also contains stub functions that are never touched and barely explained, making their existance completely unnecessary. There’s also far too many `import *`, which can make it difficult to understand the code. But I did learn enough new things that I think Python Testing Cookbook is worth buying and reading. Leaving out Chapters 7 and 8, I think the book is a great resource if you’re just getting started with testing, you want to do continuous integration, and/or you want to get non-programmers involved in the testing process. There’s also a blog about the book, which has links to other reviews.

# Programming Collective Intelligence Review

Programming Collective Intelligence is a great conceptual introduction to many common machine learning algorithms and techniques. It covers classification algorithms such as Naive Bayes and Neural Networks, and algorithmic optimization approaches like Genetic Programming. The book also manages to pick interesting example applications, such as stock price prediction and topic identification.

There are two chapters in particular that stand out to me. First is Chapter 6, which covers Naive Bayes classification. What stood out was that the algorithm presented is an online learner, which means it can be updated as data comes in, unlike the NLTK NaiveBayesClassifier, which can be trained only once. Another thing that caught my attention was Fisher’s method, which is not implemented in NLTK, but could be with a little work. Apparently Fisher’s method is great for spam filtering, and is used by the SpamBayes Outlook plugin (which is also written in Python).

Second, I found Chapter 9, which covers Support Vector Machines and Kernel Methods, to be quite intuitive. It explains the idea by starting with examples of linear classification and its shortfalls. But then the examples show that by scaling the data in a particular way first, linear classification suddenly becomes possible. And the kernel trick is simply a neat and efficient way to reduce the amount of calculation necessary to train a classifier on scaled data.

The final chapter summarizes all the key algorithms, and for many it includes commentary on their strengths and weaknesses. This seems like valuable reference material, especially for when you have a new data set to learn from, and you’re not sure which algorithms will help get the results you’re looking for. Overall, I found Programming Collective Intelligence to be an enjoyable read on my Kindle 3, and highly recommend it to anyone getting started with machine learning and Python, as well as anyone interested in a general survey of machine learning algorithms.

# Bay Area NLP Meetup

This Thursday, June 7 2011, will be the first meeting of the Bay Area NLP group, at Chomp HQ in San Francisco, where I will be giving a talk on NLTK titled “NLTK: the Good, the Bad, and the Awesome”. I’ll be sharing some of the things I’ve learned using NLTK, operating text-processing.com, and doing random consulting on natural language processing. I’ll also explain why NLTK-Trainer exists and how awesome it is for training NLP models. So if you’re in the area and have some time Thursday evening, come by and say hi.

Update on 07/10/2011: slides are online from my talk: NLTK: the Good, the Bad, and the Awesome.

# Upcoming Python Book Reviews

#### Programming Collective Intelligence

I recently finished reading Programming Collective Intellegince and will be posting a review soon. The TL;DR review is: get it if want an great introduction to machine learning with Python. It covers a lot of complex algorithms in a simple way, and provides some great example use cases.

#### Python Testing Cookbook

Testing is something nearly every developer can do more of, and this Python Testing Cookbook looks to be full of techniques for integrating testing at various levels of a project. As a preview, you can download a PDF of Chapter 3 – Creating Testable Documentation with doctest.

#### Python 3 Web Development Beginner’s Guide

I haven’t used Python 3 yet, so Python 3 Web Development Beginner’s Guide is a good excuse to do so. I also haven’t done any web development outside of Django in a few years, and I’m interested to see how it compares to doing it from scratch. As a preview, you can download a PDF of Chapter 3 – Tasklist I Persistence.

#### Kindle 3

I’m reading all of these on a Kindle 3, which has worked out surprisingly well. It’s obviously not good for copy & pasting code snippets, but that’s generally a bad idea anyway. And if don’t want to type code in yourself, you can always download it from the publisher’s site.

# Analyzing Tagged Corpora and NLTK Part of Speech Taggers

NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.

## Analyze a Tagged Corpus

You can get part-of-speech tag statistics on a tagged corpus using `analyze_tagged_corpus.py`. Here’s the tag counts for the treebank corpus:

```\$ python analyze_tagged_corpus.py treebank
100676 total words
12408 unique words
46 tags

Tag      Count
=======  =========
#               16
\$              724
''             694
,             4886
-LRB-          120
-NONE-        6592
-RRB-          126
.             3874
:              563
CC            2265
CD            3546
DT            8165
EX              88
FW               4
IN            9857
JJ            5834
JJR            381
JJS            182
LS              13
MD             927
NN           13166
NNP           9410
NNPS           244
NNS           6047
PDT             27
POS            824
PRP           1716
PRP\$           766
RB            2822
RBR            136
RBS             35
RP             216
SYM              1
TO            2179
UH               3
VB            2554
VBD           3043
VBG           1460
VBN           2134
VBP           1321
VBZ           2125
WDT            445
WP             241
WP\$             14
WRB            178
``             712
=======  =========```

By default, `analyze_tagged_corpus.py` sorts by tags, but you can sort by the highest count using `--sort count --reverse`. You can also see counts for simplified tags using `--simplify_tags`:

```\$ python analyze_tagged_corpus.py treebank --simplify_tags
100676 total words
12408 unique words
31 tags

Tag      Count
=======  =========
7416
#               16
\$              724
''             694
(              120
)              126
,             4886
.             3874
:              563
CNJ           2265
DET           8192
EX              88
FW               4
L               13
MOD            927
N            19213
NP            9654
NUM           3546
P             9857
PRO           2698
S                1
TO            2179
UH               3
V             6000
VD            3043
VG            1460
VN            2134
WH             878
``             712
=======  =========```

## Analyze Tagger Coverage

You can analyze the coverage of a part-of-speech tagger against any corpus using `analyze_tagger_coverage.py`. Here’s the results for the treebank corpus using NLTK’s default part-of-speech tagger:

```\$ python analyze_tagger_coverage.py treebank
analyzing tag coverage of treebank with ClassifierBasedPOSTagger

Tag      Found
=======  =========
#               16
\$              724
''             694
,             4887
-LRB-          120
-NONE-        6591
-RRB-          126
.             3874
:              563
CC            2271
CD            3547
DT            8170
EX              88
FW               4
IN            9880
JJ            5803
JJR            386
JJS            185
LS              12
MD             927
NN           13166
NNP           9427
NNPS           246
NNS           6055
PDT             21
POS            824
PRP           1716
PRP\$           766
RB            2800
RBR            130
RBS             33
RP             213
SYM              1
TO            2180
UH               3
VB            2562
VBD           3035
VBG           1458
VBN           2145
VBP           1318
VBZ           2124
WDT            440
WP             241
WP\$             14
WRB            178
``             712
=======  =========```

If you want to analyze the coverage of your own pickled tagger, use `--tagger PATH/TO/TAGGER.pickle`. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag by using the `--metrics` argument with a corpus that provides a `tagged_sents` method, like treebank:

```\$ python analyze_tagger_coverage.py treebank --metrics
analyzing tag coverage of treebank with ClassifierBasedPOSTagger

Accuracy: 0.995689
Unknown words: 440

Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               16          16  1.0            1.0
\$              724         724  1.0            1.0
''             694         694  1.0            1.0
,             4887        4886  1.0            1.0
-LRB-          120         120  1.0            1.0
-NONE-        6591        6592  1.0            1.0
-RRB-          126         126  1.0            1.0
.             3874        3874  1.0            1.0
:              563         563  1.0            1.0
CC            2271        2265  1.0            1.0
CD            3547        3546  0.99895833333  0.99895833333
DT            8170        8165  1.0            1.0
EX              88          88  1.0            1.0
FW               4           4  1.0            1.0
IN            9880        9857  0.99130434782  0.95798319327
JJ            5803        5834  0.99134948096  0.97892938496
JJR            386         381  1.0            0.91489361702
JJS            185         182  0.96666666666  1.0
LS              12          13  1.0            0.85714285714
MD             927         927  1.0            1.0
NN           13166       13166  0.99166034874  0.98791540785
NNP           9427        9410  0.99477911646  0.99398073836
NNPS           246         244  0.99029126213  0.95327102803
NNS           6055        6047  0.99515235457  0.99722414989
PDT             21          27  1.0            0.66666666666
POS            824         824  1.0            1.0
PRP           1716        1716  1.0            1.0
PRP\$           766         766  1.0            1.0
RB            2800        2822  0.99305555555  0.975
RBR            130         136  1.0            0.875
RBS             33          35  1.0            0.5
RP             213         216  1.0            1.0
SYM              1           1  1.0            1.0
TO            2180        2179  1.0            1.0
UH               3           3  1.0            1.0
VB            2562        2554  0.99142857142  1.0
VBD           3035        3043  0.990234375    0.98065764023
VBG           1458        1460  0.99650349650  0.99824868651
VBN           2145        2134  0.98852223816  0.99566473988
VBP           1318        1321  0.99305555555  0.98281786941
VBZ           2124        2125  0.99373040752  0.990625
WDT            440         445  1.0            0.83333333333
WP             241         241  1.0            1.0
WP\$             14          14  1.0            1.0
WRB            178         178  1.0            1.0
``             712         712  1.0            1.0
=======  =========  ==========  =============  ==========```

These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question “for each word that was given this tag, was it correct?”, while Recall answers the question “for all words that should have gotten this tag, did they get it?”. If you look at `PDT`, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the `PDT` tag was correct, but 6 out of the 27 words that should have gotten `PDT` were mistakenly given a different tag. Or if you look at `JJS`, you can see that Precision is 96.6% because it gave `JJS` to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten `JJS` got it.

# Training Part of Speech Taggers with NLTK Trainer

NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using `train_tagger.py`.

## Training Sequential Backoff Taggers

The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the `--sequential` argument, which accepts any combination of the following letters:

 `a`: AffixTagger `u`: UnigramTagger `b`: BigramTagger `t`: TrigramTagger

For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:

`python train_tagger.py treebank --sequential ubt`

You can rearrange `ubt` any way you want to change the order of the taggers (though `ubt` is generally the most accurate order).

## Training Affix Taggers

The `--sequential` argument also recognizes the letter `a`, which will insert an AffixTagger into the backoff chain. If you do not specify the `--affix` argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more `--affix N` options, where `N` should be a positive number for prefixes, and a negative number for suffixes. For example, to train an `aubt` tagger with 2 AffixTaggers, one that uses a 3 character suffix, and another that uses a 2 character prefix, specify the `--affix` argument twice:

`python train_tagger.py treebank --sequential aubt --affix -3 --affix 2`

The order of the `--affix` arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.

## Training Brill Taggers

To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 – Brill Tagger (using FastBrillTaggerTrainer), use the `--brill` argument:

`python train_tagger.py treebank --sequential aubt --brill`

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the `--max_rules` and `--min_score` arguments. You can also change the rule template bounds, which defaults to 1, using the `--template_bounds` argument.

## Training Classifier Based Taggers

Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don’t want this tagger to backoff to a sequential backoff tagger, be sure to specify `--sequential ''`. Here’s an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:

`python train_tagger.py treebank --sequential '' --classifier NaiveBayes`

If you do want to backoff to a sequential tagger, be sure to specify a cutoff probability, like so:

`python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4`

Any of the NLTK classification algorithms can be used for the `--classifier` argument, such as `Maxent` or `MEGAM`, and every algorithm other than `NaiveBayes` has specific training options that can be customized.

## Phonetic Feature Options

You can also include phonetic algorithm features using the following arguments:

 `--metaphone`: Use metaphone feature `--double-metaphone`: Use double metaphone feature `--soundex`: Use soundex feature `--nysiis`: Use NYSIIS feature `--caverphone`: Use caverphone feature

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The `--double-metaphone` algorithm comes from metaphone.py, while all the other phonetic algorithm have been copied from the advas project (which appears to be abandoned).

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.

A tagger trained with any of these phonetic features will be an instance of `nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger`, which means `nltk_trainer` must be included in your `PYTHONPATH` in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using `python setup.py install`.