Fuzzy matching is a general term for finding strings that are almost equal, or mostly the same. Of course almost and mostly are ambiguous terms themselves, so you'll have to determine what they really mean for your specific needs. The best way to do this is to come up with a list of test cases before you start writing any fuzzy matching code. These test cases should be pairs of strings that either should fuzzy match, or not. I like to create doctests for this, like so:
def fuzzy_match(s1, s2): ''' >>> fuzzy_match('Happy Days', ' happy days ') True >>> fuzzy_match('happy days', 'sad days') False ''' # TODO: fuzzy matching code return s1 == s2
Once you've got a good set of test cases, then it's much easier to tailor your fuzzy matching code to get the best results.
The first step before doing any string matching is normalization. The goal with normalization is to transform your strings into a normal form, which in some cases may be all you need to do. While
'Happy Days' != ' happy days ', with simple normalization you can get
'Happy Days'.lower() == ' happy days '.strip().
The most basic normalization you can do is to lowercase and strip whitespace. But chances are you'll want to more. For example, here's a simple normalization function that also removes all punctuation in a string.
import string def normalize(s): for p in string.punctuation: s = s.replace(p, '') return s.lower().strip()
normalize function, we can make the above fuzzy matching function pass our simple tests.
def fuzzy_match(s1, s2): ''' >>> fuzzy_match('Happy Days', ' happy days ') True >>> fuzzy_match('happy days', 'sad days') False ''' return normalize(s1) == normalize(s2)
If you want to get more advanced, keep reading...
Beyond just stripping whitespace from the ends of strings, it's also a good idea replace all whitespace occurrences with a single space character. The regex function for doing this is
re.sub('\s+', s, ' '). This will replace every occurrence of one or more spaces, newlines, tabs, etc, essentially eliminating the significance of whitespace for matching.
You may also be able to use regular expressions for partial fuzzy matching. Maybe you can use regular expressions to identify significant parts of a string, or perhaps split a string into component parts for further matching. If you think you can create a simple regular expression to help with fuzzy matching, do it, because chances are, any other code you write to do fuzzy matching will be more complicated, less straightforward, and probably slower. You can also use more complicated regular expressions to handle specific edge cases. But beware of any expression that takes puzzling out every time you look at it, because you'll probably be revisiting this code a number of times to tweak it for handling new cases, and tweaking complicated regular expressions is a sure way to induce headaches and eyeball-bleeding.
The edit distance (aka Levenshtein distance) is the number of single character edits it would take to transform one string into another. Thefore, the smaller the edit distance, the more similar two strings are.
If you want to do edit distance calculations, checkout the standalone editdist module. Its
distance function takes 2 strings and returns the Levenshtein edit distance. It's also implemented in C, and so is quite fast.
Fuzzywuzzy is a great all-purpose library for fuzzy string matching, built (in part) on top of Python's difflib. It has a number of different fuzzy matching functions, and it's definitely worth experimenting with all of them. I've personally found
token_set_ratio to be the most useful.
If you want to do some custom fuzzy string matching, then NLTK is a great library to use. There's word tokenizers, stemmers, and it even has its own edit distance implementation. Here's a way you could combine all 3 to create a fuzzy string matching function.
from nltk import metrics, stem, tokenize stemmer = stem.PorterStemmer() def normalize(s): words = tokenize.wordpunct_tokenize(s.lower().strip()) return ' '.join([stemmer.stem(w) for w in words]) def fuzzy_match(s1, s2, max_dist=3): return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
Finally, an interesting and perhaps non-obvious way to compare strings is with phonetic algorithms. The idea is that 2 strings that sound same may be the same (or at least similar enough). One of the most well known phonetic algorithms is Soundex, with a python soundex algorithm here. Another is Double Metaphone, with a python metaphone module here. You can also find code for these and other phonetic algorithms in the nltk-trainer phonetics module (copied from a now defunct sourceforge project called advas). Using any of these algorithms, you get an encoded string, and then if 2 encodings compare equal, the original strings match. Theoretically, you could even do fuzzy matching on the phonetic encodings, but that's probably pushing the bounds of fuzziness a bit too far.
NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using
Training Sequential Backoff Taggers
The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the
--sequential argument, which accepts any combination of the following letters:
For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 - Ngram Taggers, you could do the following:
python train_tagger.py treebank --sequential ubt
You can rearrange
ubt any way you want to change the order of the taggers (though
ubt is generally the most accurate order).
Training Affix Taggers
--sequential argument also recognizes the letter
a, which will insert an AffixTagger into the backoff chain. If you do not specify the
--affix argument, then it will include one AffixTagger with a 3-character suffix. However, you can change this by specifying one or more
--affix N options, where
N should be a positive number for prefixes, and a negative number for suffixes. For example, to train an
aubt tagger with 2 AffixTaggers, one that uses a 3 character suffix, and another that uses a 2 character prefix, specify the
--affix argument twice:
python train_tagger.py treebank --sequential aubt --affix -3 --affix 2
The order of the
--affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.
Training Brill Taggers
python train_tagger.py treebank --sequential aubt --brill
The default training options are a maximum of 200 rules with a minimum score of 2, but you can change that with the
--min_score arguments. You can also change the rule template bounds, which defaults to 1, using the
Training Classifier Based Taggers
Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don't want this tagger to backoff to a sequential backoff tagger, be sure to specify
--sequential ''. Here's an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 - Classifier Taggers:
python train_tagger.py treebank --sequential '' --classifier NaiveBayes
If you do want to backoff to a sequential tagger, be sure to specify a cutoff probability, like so:
python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4
Any of the NLTK classification algorithms can be used for the
--classifier argument, such as
MEGAM, and every algorithm other than
NaiveBayes has specific training options that can be customized.
Phonetic Feature Options
You can also include phonetic algorithm features using the following arguments:
||Use metaphone feature|
||Use double metaphone feature|
||Use soundex feature|
||Use NYSIIS feature|
||Use caverphone feature|
These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The
--double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithm have been copied from the advas project (which appears to be abandoned).
I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
A tagger trained with any of these phonetic features will be an instance of
nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means
nltk_trainer must be included in your
PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using
python setup.py install.