Possible Solutions

This remains an unsolved problem in the general case, but people have tried a number of ideas that work in certain circumstances. Some of these approaches will be dead ends, but others, when further developed, seem to have the potential to work on a wide range of data sets.

Matching on Multiple Fields

In Chapter 7, "Data Finds Data," Jeff Jonas describes a hypothetical employee who could be discovered to also be a shoplifter through a combination of his name and his address. In that case, a combination of a name and an address is sufficient evidence to suggest that two different records in fact represent the same person. Jeff would also be quick to point out that he's come across cases where a "Patrick Smith" and a "Patricia Smith" shared an address and both went by "Pat Smith," so if you're not careful it's easy to get trapped in a maze of exceptions to otherwise obvious rules.

This does illustrate the basic and most common approach to matching items in data sets: choose a set of fields and create a set of fixed rules that tell you whether two records match. For example, "do two people have the same name and the same address?" or "do two films have the same name and the same release year?"
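
As a rough illustration of the fixed-rule idea, here is a minimal Python sketch that treats two records as the same person only when both name and address agree. The Person record, the normalize helper, and the sample addresses are hypothetical, invented for this example; real data would need far more careful cleaning than lowercasing and collapsing whitespace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout, assumed for illustration only.
    name: str
    address: str


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not block a match."""
    return " ".join(value.lower().split())


def same_person(a: Person, b: Person) -> bool:
    """Fixed rule: records match only if both name and address agree."""
    return (normalize(a.name) == normalize(b.name)
            and normalize(a.address) == normalize(b.address))


# Made-up sample records.
employee = Person("Pat Smith", "12 Elm Street")
shoplifter = Person("pat  smith", "12 elm street")
neighbor = Person("Pat Smith", "98 Oak Avenue")

print(same_person(employee, shoplifter))  # True
print(same_person(employee, neighbor))    # False
```

Note that a rule like this cannot tell Patrick from Patricia when both are recorded as "Pat Smith" at the same address: it would merge them, which is exactly the kind of exception described above.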

This approach will work in many cases, but it has a few drawbacks. First of all, it requires the developer to identify the fields and rules by which things match. This can be incredibly tedious, since when they realize ...
