StreamHacker Weotta be Hacking

31Oct/1112

Fuzzy String Matching in Python

Fuzzy matching is a general term for finding strings that are almost equal, or mostly the same. Of course almost and mostly are ambiguous terms themselves, so you'll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create doctests for this, like so:

def fuzzy_match(s1, s2):
	'''
	>>> fuzzy_match('Happy Days', ' happy days ')
	True
	>>> fuzzy_match('happy days', 'sad days')
	False
	'''
	# TODO: fuzzy matching code
	return s1 == s2

Once you've got a good set of test cases, then it's much easier to tailor your fuzzy matching code to get the best results.

Normalization

The first step before doing any string matching is normalization. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While 'Happy Days' != ' happy days ', with simple normalization you can get 'Happy Days'.lower() == ' happy days '.strip().

The most basic normalization you can do is to lowercase and strip whitespace. But chances are you'll want to more. For example, here's a simple normalization function that also removes all punctuation in a string.

import string

def normalize(s):
	for p in string.punctuation:
		s = s.replace(p, '')

	return s.lower().strip()

Using this normalize function, we can make the above fuzzy matching function pass our simple tests.

def fuzzy_match(s1, s2):
	'''
	>>> fuzzy_match('Happy Days', ' happy days ')
	True
	>>> fuzzy_match('happy days', 'sad days')
	False
	'''
	return normalize(s1) == normalize(s2)

If you want to get more advanced, keep reading...

Regular Expressions

Beyond just stripping whitespace from the ends of strings, it's also a good idea replace all whitespace occurrences with a single space character. The regex function for doing this is re.sub('\s+', s, ' '). This will replace every occurrence of one or more spaces, newlines, tabs, etc, essentially eliminating the significance of whitespace for matching.

You may also be able to use regular expressions for partial fuzzy matching. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a simple regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you'll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.

Edit Distance

The edit distance (aka Levenshtein distance) is the number of single character edits it would take to transform one string into another. Thefore, the smaller the edit distance, the more similar two strings are.

If you want to do edit distance calculations, checkout the standalone editdist module. Its distance function takes 2 strings and returns the Levenshtein edit distance. It's also implemented in C, and so is quite fast.

Fuzzywuzzy

Fuzzywuzzy is a great all-purpose library for fuzzy string matching, built (in part) on top of Python's difflib. It has a number of different fuzzy matching functions, and it's definitely worth experimenting with all of them. I've personally found ratio and token_set_ratio to be the most useful.

NLTK

If you want to do some custom fuzzy string matching, then NLTK is a great library to use. There's word tokenizers, stemmers, and it even has its own edit distance implementation. Here's a way you could combine all 3 to create a fuzzy string matching function.

from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()

def normalize(s):
	words = tokenize.wordpunct_tokenize(s.lower().strip())
	return ' '.join([stemmer.stem(w) for w in words])

def fuzzy_match(s1, s2, max_dist=3):
	return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist

Phonetics

Finally, an interesting and perhaps non-obvious way to compare strings is with phonetic algorithms. The idea is that 2 strings that sound same may be the same (or at least similar enough). One of the most well known phonetic algorithms is Soundex, with a python soundex algorithm here. Another is Double Metaphone, with a python metaphone module here. You can also find code for these and other phonetic algorithms in the nltk-trainer phonetics module (copied from a now defunct sourceforge project called advas). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that's probably pushing the bounds of fuzziness a bit too far.

   
%d bloggers like this: