Looking Up Words by Sound Similarity
Credit: Greg Jorgensen, Scott David Daniels
Problem
You need to look up words (most often people’s surnames) by sound, rather than by spelling, so that likely spelling mistakes don’t spoil the search.
Solution
The Soundex algorithm (by Odell and Russell, made popular by Knuth) transforms each surname into a signature that is more representative of how that surname is likely to sound when pronounced than of how it’s spelled:
def soundex(name, len=4):
""" soundex module conforming to Odell-Russell algorithm """
# digits holds the soundex values for the alphabet
soundex_digits = '01230120022455012623010202'
sndx = ''
fc = ''
# Translate letters in name to soundex digits
for c in name.upper( ):
if c.isalpha( ):
if not fc: fc = c # Remember first letter
d = soundex_digits[ord(c)-ord('A')]
# Duplicate consecutive soundex digits are skipped
if not sndx or (d != sndx[-1]):
sndx += d
# Replace first digit with first letter
sndx = fc + sndx[1:]
# Remove all 0s from the soundex code
sndx = sndx.replace('0', '')
# Return soundex code truncated or 0-padded to len characters
return (sndx + (len * '0'))[:len]Discussion
The common approach to avoiding confusion when a name’s spelling induces lookup errors is the Soundex algorithm, by Odell and Russell, as reported by Knuth. The algorithm is designed for English-language surnames. If you have a significant number of non-English surnames, you might want to alter the values in digits to improve your matches. For example, ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access