A few months ago, I needed to synchronize our groupware’s address book with our employee database. Since the address book provided only minimal data about employees, and the employee database didn’t contain any of the exact fields in the address book, I had to synchronize them using nothing more than first and last names.
This was not as easy as it sounds. It quickly became apparent that only a small portion of the more than 3,500 records would match directly. A simple SQL join between the imported tables would not work; the data was too inconsistent. After matching a few names manually, I became increasingly obsessed with the problem of matching names, for example, identifying that “Bill Gates” was the same person as “William Gates III.”
The problem was tenacious: whenever I added processing to catch misspellings or hyphenations, new issues came up. The script quickly grew into a module, Lingua::EN::MatchNames, and then a second, Lingua::EN::Nickname, in response to the bizarre and arbitrary conventions for shortening first names. Many forms of a first name have little or no similarity to one another (Peggy = Margaret = Midge), and several can follow an almost endless mutation path (Peggy > Margaret > Martha > Mary > Maryanne > Anna > Roseanne > Rosalyn > Linda > Melinda > …).
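To make the nickname problem concrete, here is a minimal sketch in Python rather than Perl. It is not the API or the algorithm of Lingua::EN::MatchNames or Lingua::EN::Nickname. The table NICKNAME_ROOTS, the SUFFIXES set, and the functions roots, split_name, and same_person are all hypothetical names invented for illustration, and the table holds only the examples mentioned above. The sketch treats two first names as equivalent when they share a formal root, and it deliberately looks only one hop deep so that a chain like the Peggy-to-Melinda path does not end up matching everyone with everyone.

```python
import re

# Hypothetical nickname table: each short form maps to the formal
# names it may stand for. A real table would be far larger.
NICKNAME_ROOTS = {
    "bill": {"william"},
    "peggy": {"margaret"},
    "midge": {"margaret"},
}

# Hypothetical list of generational suffixes to ignore when comparing.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def roots(first_name):
    """Return the name itself plus any formal names it is a nickname for."""
    key = first_name.lower()
    return NICKNAME_ROOTS.get(key, set()) | {key}


def split_name(full_name):
    """Split a full name into (first, last), dropping trailing suffixes."""
    tokens = [t for t in re.split(r"[\s,.]+", full_name.strip()) if t]
    while len(tokens) > 2 and tokens[-1].lower() in SUFFIXES:
        tokens.pop()
    return tokens[0], tokens[-1]


def same_person(a, b):
    """Guess whether two full names refer to the same person.

    Last names must match exactly (ignoring case); first names match
    if they share at least one root. Only one hop is followed.
    """
    first_a, last_a = split_name(a)
    first_b, last_b = split_name(b)
    if last_a.lower() != last_b.lower():
        return False
    return bool(roots(first_a) & roots(first_b))
```

With this table, same_person("Bill Gates", "William Gates III") is true, because the suffix is dropped and Bill resolves to William. A made-up pair such as "Peggy Smith" and "Midge Smith" also matches through the shared root Margaret, while "Peggy Smith" and "Mary Smith" does not. The sketch ignores misspellings, hyphenations, and empty input, and it returns a plain yes or no rather than a certainty score.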
When the initial versions were complete, my script was able to match the vast majority of the records on its own (with greater than 85% certainty per match), and most of the rest ...