Chapter 3. Text Matching
As we saw in Chapter 2, once our data is cleansed and consistently formatted, we can find matching entities by checking for exact matches between their data attributes. If the data is of high quality, and if the attribute values are nonrepetitive, then checking for equivalence is straightforward. However, this is rarely the case with real-world data.
We can increase our likelihood of matching all relevant records by using approximate (often referred to as fuzzy) matching techniques. For numerical values, we can set a tolerance on how close the values need to be. For example, a date of birth might be matched if it’s within a few days of another, or a location might be matched if its coordinates are within a certain distance of another's. For textual data, we can look for similarities and differences between strings that could arise accidentally, such as typing or transcription errors.
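As a minimal sketch of tolerance-based matching for numerical values, the following Python compares dates of birth and coordinates. The function names, the three-day and one-kilometer default tolerances, and the use of the haversine formula with a mean Earth radius of 6,371 km are all illustrative assumptions, not values taken from this chapter.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def dates_match(d1: date, d2: date, tolerance_days: int = 3) -> bool:
    """Treat two dates as matching if they differ by at most tolerance_days.

    The default of 3 days is an arbitrary illustrative choice.
    """
    return abs((d1 - d2).days) <= tolerance_days


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0  # assumed mean Earth radius
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def locations_match(p1: tuple, p2: tuple, tolerance_km: float = 1.0) -> bool:
    """Treat two (lat, lon) points as matching if they lie within tolerance_km."""
    return haversine_km(p1[0], p1[1], p2[0], p2[1]) <= tolerance_km


if __name__ == "__main__":
    # Two days apart: within the default tolerance.
    print(dates_match(date(1980, 5, 1), date(1980, 5, 3)))
    # Central London versus central Paris: far outside a 1 km tolerance.
    print(locations_match((51.5074, -0.1278), (48.8566, 2.3522)))
```

Widening either tolerance catches more true matches but also admits more false ones, so in practice the thresholds would need tuning against the data at hand.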
Of course, by accepting nonexact matches as equivalent, we open up the possibility of matching records incorrectly.
In this chapter, we will introduce some frequently used text matching techniques and then apply them to our sample problem to see whether they improve our entity resolution performance.
Edit Distance Matching
For matching text, one of the most useful approximate matching techniques is to measure the edit distance between two strings. The edit distance is the minimum number of operations needed to transform one string into the other. This metric can therefore be used to assess the likelihood that two strings do actually describe ...
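To make the idea of edit distance concrete, here is a minimal Python sketch. It assumes the Levenshtein variant, in which the permitted operations are single-character insertions, deletions, and substitutions, each costing 1. The excerpt above does not specify which operations are counted, so that choice and the function name edit_distance are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed variant).

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b.
    """
    # Keep b as the shorter string so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("Smith", "Smyth"))     # one substitution: 1
    print(edit_distance("kitten", "sitting"))  # classic example: 3
```

A small distance relative to the string lengths suggests that two values may be variants of the same underlying text, although the threshold for treating them as a match is a judgment call.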