Chapter 10. The Big Score

So far we’ve talked about matching data and all its challenges, including making sure it’s the same type of data or compatible, that the representations of the data are the same (is it “IA” or “Iowa”?), and so on. But now what do we do with it all? How do we decide whether our “fuzzy” match is actually pretty damned precise?

We build something called a score. Now, “scoring” sounds all fancy and is the subject of doctoral-level theses in all kinds of industries, but for our purposes it is a simple thing. From the Glossary:

Score

A simple number that implies the strength of a given match based on the sum of the matching attributes (columns).

That’s it.

The rest of this chapter will show you how to make that simple sentence a reality, including which columns to choose and which to ignore, how to tell what a “good” score (match) is, and how to tune the scoring process and (usually) make it simpler. I am a simple guy. Simpler is better.

What Will We Want?

For each dataset there will be a different number of attributes you will match upon, and more importantly, there will be a fairly small set—usually around five, typically not more than ten—that can comfortably be used to confidently call a match with some acceptable measure of accuracy, that are “material” to the results, in other words.

You then tune that score and run filtering queries where the results are either above a given x (probable matches) or below it (probable nonmatches). Note the x. You ...

Get Fuzzy Data Matching with SQL now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.