Hands-On Entity Resolution

by Michael Shearer
February 2024
Intermediate to advanced
198 pages
4h 41m
English
O'Reilly Media, Inc.

Chapter 3. Text Matching

As we saw in Chapter 2, once our data is cleansed and consistently formatted, we can find matching entities by checking for exact matches between their data attributes. If the data is of high quality, and if the attribute values are nonrepetitive, then checking for equivalence is straightforward. However, this is rarely the case with real-world data.

We can increase our likelihood of matching all relevant records by using approximate (often referred to as fuzzy) matching techniques. For numerical values, we can set a tolerance on how close the values need to be. For example, a date of birth might be matched if it’s within a few days, or a location might be matched if its coordinates are within a certain distance of each other. For textual data, we can look for similarities and differences between strings that could arise accidentally.
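As a rough illustration of tolerance-based matching on numerical values, the sketch below compares two dates of birth and two coordinate pairs. The function names, the 3-day and 1 km default tolerances, and the use of the haversine great-circle formula are assumptions made for this example, not values or methods taken from the book.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def dates_match(d1: date, d2: date, tolerance_days: int = 3) -> bool:
    """Return True if two dates are within tolerance_days of each other (assumed default: 3)."""
    return abs((d1 - d2).days) <= tolerance_days


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    earth_radius_km = 6371.0  # mean Earth radius, an approximation
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def locations_match(p1: tuple, p2: tuple, tolerance_km: float = 1.0) -> bool:
    """Return True if two (lat, lon) points lie within tolerance_km (assumed default: 1 km)."""
    return haversine_km(p1[0], p1[1], p2[0], p2[1]) <= tolerance_km


if __name__ == "__main__":
    print(dates_match(date(1980, 5, 1), date(1980, 5, 3)))
    print(locations_match((51.5000, -0.1200), (51.5010, -0.1200)))
```

A wider tolerance catches more genuine matches but also admits more false ones, so the thresholds would normally be tuned against the data at hand.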

Of course, by accepting nonexact matches as equivalent, we open up the possibility of matching records incorrectly.

In this chapter, we will introduce some frequently used text matching techniques and then apply them to our sample problem to see whether they can improve our entity resolution performance.

Edit Distance Matching

For matching text, one of the most useful approximate matching techniques is to measure the edit distance between two strings. The edit distance is the minimum number of operations needed to transform one string into the other. This metric can therefore be used to assess the likelihood that two strings do actually describe ...
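To make the idea concrete, here is a minimal sketch of one common edit distance, the Levenshtein distance, which assumes the permitted operations are single-character insertions, deletions, and substitutions. The preview does not confirm which operation set the chapter uses, so treat that choice, and the function name `levenshtein`, as assumptions for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string as b so each row stays as small as possible.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; row 0 is the cost of inserting j chars.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting the first i chars of a to reach an empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Michael", "Micheal"))
```

Under this definition a swap of two adjacent characters, as in "Michael" and "Micheal", counts as two edits. Variants that treat a transposition as a single operation would score it as one.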

Publisher Resources

ISBN: 9781098148478