# Fuzzy String Matching in Python

Fuzzy matching is a general term for finding strings that are almost equal, or mostly the same. Of course almost and mostly are ambiguous terms themselves, so you’ll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create doctests for this, like so:

```python
def fuzzy_match(s1, s2):
    '''
    >>> fuzzy_match('Happy Days', ' happy days ')
    True
    >>> fuzzy_match('happy days', 'sad days')
    False
    '''
    # TODO: fuzzy matching code
    return s1 == s2
```

Once you’ve got a good set of test cases, then it’s much easier to tailor your fuzzy matching code to get the best results.

## Normalization

The first step before doing any string matching is normalization. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While `'Happy Days' != ' happy days '`, with simple normalization you can get `'Happy Days'.lower() == ' happy days '.strip()`.

The most basic normalization you can do is to lowercase and strip whitespace. But chances are you’ll want to do more. For example, here’s a simple normalization function that also removes all punctuation in a string.

```python
import string

def normalize(s):
    for p in string.punctuation:
        s = s.replace(p, '')

    return s.lower().strip()
```

Using this `normalize` function, we can make the above fuzzy matching function pass our simple tests.

```python
def fuzzy_match(s1, s2):
    '''
    >>> fuzzy_match('Happy Days', ' happy days ')
    True
    >>> fuzzy_match('happy days', 'sad days')
    False
    '''
    return normalize(s1) == normalize(s2)
```

## Regular Expressions

Beyond just stripping whitespace from the ends of strings, it’s also a good idea to replace every run of whitespace with a single space character. The regex call for doing this is `re.sub(r'\s+', ' ', s)`. This will replace every occurrence of one or more spaces, newlines, tabs, etc, essentially eliminating the significance of whitespace for matching.
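For example, a quick sketch of a whitespace-squashing helper (the function name here is my own):

```python
import re

def squash_whitespace(s):
    # collapse every run of whitespace (spaces, tabs, newlines) to one space,
    # then strip whatever is left at the ends
    return re.sub(r'\s+', ' ', s).strip()

squash_whitespace('Happy\t  Days \n')  # → 'Happy Days'
```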

You may also be able to use regular expressions for partial fuzzy matching. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a simple regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you’ll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.
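As a sketch of the splitting idea (the separator set here is an illustrative choice, not a general rule):

```python
import re

def title_parts(s):
    # split on a few common separators (':', '(', '-') to isolate
    # component parts of a title for further matching
    return [part.strip() for part in re.split(r'[:(\-]', s) if part.strip()]

title_parts('Star Wars: Episode IV - A New Hope')
# → ['Star Wars', 'Episode IV', 'A New Hope']
```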

## Edit Distance

The edit distance (aka Levenshtein distance) is the number of single character edits it would take to transform one string into another. Therefore, the smaller the edit distance, the more similar two strings are.

If you want to do edit distance calculations, check out the standalone editdist module. Its `distance` function takes two strings and returns the Levenshtein edit distance. It’s also implemented in C, and so is quite fast.
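If you’d rather avoid a C extension, the calculation itself is a short dynamic programming routine. Here’s a pure Python sketch of what such a `distance` function computes (the function name is mine, not the editdist API):

```python
def edit_distance(s1, s2):
    # classic dynamic programming: prev holds distances for the previous row
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

edit_distance('kitten', 'sitting')  # → 3
```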

## Fuzzywuzzy

Fuzzywuzzy is a great all-purpose library for fuzzy string matching, built (in part) on top of Python’s difflib. It has a number of different fuzzy matching functions, and it’s definitely worth experimenting with all of them. I’ve personally found `ratio` and `token_set_ratio` to be the most useful.
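Since fuzzywuzzy builds on difflib, you can get a feel for what `ratio` computes with just the standard library. This is a rough approximation, not fuzzywuzzy’s exact implementation:

```python
from difflib import SequenceMatcher

def simple_ratio(s1, s2):
    # SequenceMatcher.ratio() returns a float in [0, 1]; scale it to 0-100
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

simple_ratio('new york mets', 'new york meats')  # → 96
```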

## NLTK

If you want to do some custom fuzzy string matching, then NLTK is a great library to use. There are word tokenizers, stemmers, and it even has its own edit distance implementation. Here’s a way you could combine all three to create a fuzzy string matching function.

```python
from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

def normalize(s):
    words = tokenize.wordpunct_tokenize(s.lower().strip())
    return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
    return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
```

## Phonetics

Finally, an interesting and perhaps non-obvious way to compare strings is with phonetic algorithms. The idea is that two strings that sound the same may be the same (or at least similar enough). One of the most well known phonetic algorithms is Soundex, with a python soundex algorithm here. Another is Double Metaphone, with a python metaphone module here. You can also find code for these and other phonetic algorithms in the nltk-trainer phonetics module (copied from a now defunct sourceforge project called advas). Using any of these algorithms, you get an encoded string, and then if two encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that’s probably pushing the bounds of fuzziness a bit too far.
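To give a feel for how these encodings work, here’s a minimal American Soundex sketch (a simplified illustration of the algorithm, not the code from the modules linked above):

```python
def soundex(word):
    # map consonants to their Soundex digits
    codes = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    first = word[0].upper()
    prev = codes.get(word[0], '')
    digits = []
    for ch in word[1:]:
        code = codes.get(ch)
        if code:
            if code != prev:
                digits.append(code)
            prev = code
        elif ch in 'hw':
            pass  # h/w are ignored: same-coded letters around them collapse
        else:
            prev = ''  # vowels separate repeated codes, so reset
    # pad with zeros and truncate to the standard letter-plus-3-digits form
    return (first + ''.join(digits) + '000')[:4]

soundex('Robert') == soundex('Rupert')  # → True (both encode to 'R163')
```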

• David Avraamides

You should also look at difflib in the standard library. get_close_matches and SequenceMatcher do some similar things.

• Yes, I mentioned difflib in the context of fuzzywuzzy, but I think I need to do some of my own experiments with SequenceMatcher.

• Steven Vereecken

FYI: the new regex module, intended to eventually replace the current re module (http://pypi.python.org/pypi/regex), has some interesting additional features for “fuzzy matching”, allowing you to specify the maximum number of “errors” that is allowed for a certain (sub)pattern. I haven’t played with it much yet, but it looks very useful.

• Thanks, I had no idea there was a new regex module in the works.


• The regex code is wrong, change it with this:

re.sub('\s+', ' ', s)

• Thanks for catching that, corrected now.

• Hi, Jacob!

Something that’s very useful for fuzzy matching of non-english strings is unicode de-normalization, such that fuzzy_match(u"ação", u"acao") == True. For that, you need a deaccentuation function, such as the following:

from curses.ascii import isascii
from unicodedata import normalize

deaccentuate = lambda t: filter(isascii, normalize('NFD', t).encode('utf-8'))

>>> deaccentuate(u"ação!")
'acao!'

In case you need it back in an unicode string, modify it slightly so:

deaccentuate = lambda t: u"".join(filter(isascii, normalize('NFD', t).encode('utf-8')))

>>> deaccentuate(u"ação!")
u'acao!'

I have been using this to compare Portuguese language strings for a long time.

Cheers,
Augusto Herrmann

• Yes, I forgot to mention that, but now I’m thinking I could do a whole article on just string normalization. I actually use unicodedata.normalize('NFKD', s), but have started looking at using unidecode: http://pypi.python.org/pypi/Unidecode/

• Augusto Herrmann

I wasn’t aware of the Unidecode package, thanks for pointing it out!

The tradeoff is that using the function above is simple, as unicodedata and curses.ascii are both part of the standard library, so there are no added dependencies. On the other hand, the library might cover some corner case that the code above does not. Have you spotted one so far?

• I’m not sure it has better coverage, but it seems simpler, and you can do more with less code.

• Christopher Stoll

You might be able to replace these techniques (normalization, regex, etc.) with a single dynamic programming algorithm, no? I guess it probably depends upon what you are trying to accomplish though.

• I’ve never heard of anyone doing that, and I’m not sure how well it would work, because you can’t necessarily break a string down into pieces and match those pieces to match the whole string. For larger strings that might make sense, but the smaller the string, the more exact you should be.

• Andrew Sanchez

After using the algorithms provided in your excellent phonetics module to get the Soundex code for a word, how can I use that code to find other words with a matching code? Ideally, I’d like to find matching words using WordNet. There are some sites like http://www.wolframalpha.com/input/?i=soundex+Fuzzy that allow me to do exactly what I want to do, but I’d like to do this programmatically in Python so that I can get phonetic codes for lists of words, and then generate lists of matching words for each code. Further, I’m thinking that by using NLTK and WordNet, I can then filter those lists of matching words by POS. Any suggestions for a starting point? Thanks for your wonderful articles!

• Hi Andrew,

I didn’t write the phonetics module, so credit is due to the original authors.

It sounds like you need to create a search index. If you indexed all the words in WordNet by the soundex, then looking up words for a given soundex would be trivial. You could do this with a database like SQLite, or maybe a text indexing library like https://whoosh.readthedocs.org/en/latest/.

• Michael

Isn’t the regex a bit of an overkill? Why not just s.replace(" ", "")? Is it only for the other whitespace characters?

• The regex replaces multiple spaces in a row with a single space. You don’t want to get rid of single spaces, because then you won’t be able to separate words.