Chapter 1. Introduction to Entity Resolution
All around the world vast quantities of data are being collected and stored, and more data is being added every day. This data records the world we live in and the changing attributes and characteristics of the people, places, and things around us.
Within this global ecosystem of data processing, organizations independently collect overlapping sets of information about the same real-world entity. And each organization has its own approach to organizing and cataloging the data it holds.
Companies and institutions seek to derive valuable insights from this raw data. Advanced analytical techniques have been developed to discern patterns in the data, extract meaning, and even attempt to predict the future. The performance of these algorithms depends on the quality and richness of the data fed into them. By combining data from more than one organization, often a richer, more complete dataset can be created, from which more valuable conclusions can be drawn.
This book will guide you through how to join these heterogeneous datasets to create richer sets of data about the world in which we live. This process of joining datasets is known by a variety of names, including name matching, fuzzy matching, record linkage, entity reconciliation, and entity resolution. In this book we will use the term entity resolution to describe the overall process of resolving, that is, joining together, data that refers to real-world entities.
What Is Entity Resolution?
Entity resolution is a key analytic technique to identify data records that refer to the same real-world entity. This matching process enables the removal of duplicate entries within a single source and the joining of disparate data sources when common unique identifiers are not available.
Entity resolution enables enterprises to build rich and comprehensive data assets, to reveal relationships, and to construct networks for marketing and risk management purposes. It is often a key prerequisite to harness the full potential of machine learning and AI.
For example, healthcare providers often need to join records from across different practices or historical archives held on different platforms. In financial services, customer databases need to be reconciled to offer the most relevant products and services or to enable fraud detection. To enhance resilience or provide transparency on environmental and social issues, corporations need to join supply chain records with sources of risk intelligence.
Why Is Entity Resolution Needed?
In everyday life as individuals, we are assigned a lot of numbers: my healthcare provider identifies me by one number, my employer by another, my national government by yet another, and so on. When I sign up for services, I’m often assigned a number (or sometimes more than one) by my bank, my chosen retailer, or an online provider. Why all these numbers? Back in a simpler time, when services were delivered in a local community, customers were known personally and interactions were conducted face to face, so it was obvious who you were dealing with. Exchanges were often discrete transactions, with no need to refer to any prior business and no need to keep records associated with individual customers.
As more and more services began to be provided remotely and offered on a wider regional or even national basis, a means of identifying who was who became necessary. Names were clearly insufficiently unique, so names were often combined with location to create a composite identifier: Mrs. Jones became Mrs. Jones from Bromley as opposed to Mrs. Jones from Harrow. As records migrated from paper to electronic form, the assignment of a unique machine-readable number began the era of numeric, and alphanumeric, identifiers that surround us today.
Within the confines of their own domain these identifiers usually work well. I identify myself with my unique number and it’s clear that I’m the same returning individual. This identifier allows a common context to be quickly established between two parties and reduces the possibility of misunderstanding. Across domains, however, these identifiers typically have nothing in common: they vary in length and format and are assigned according to different schemes. There is no mechanism to translate between them or to recognize that, individually and collectively, they refer to me and not to another individual.
However, when business is depersonalized, and I don’t know the person I’m dealing with and they don’t know me, what happens if I register for the same service more than once? Perhaps I’ve forgotten to identify with my unique number or a new application is being submitted on my behalf. A second number will be created that also identifies me. This duplication makes it more difficult for the service provider to offer a personalized service as they must now join together two different records to understand fully who I am and what my needs might be.
Within larger organizations, the problem of matching up customer records becomes even more challenging. Different functions or business lines may maintain their own records that are specifically tailored to their purpose but were designed independently of each other. A common problem is how to construct a comprehensive (or 360 degree) view of a customer. Customers may have interacted with different parts of an organization over many years. They may have done so in different contexts—as an individual, as part of a joint household, or perhaps in an official capacity associated with a company or other legal entity. In the course of these different interactions, the same person may have been assigned a multiplicity of identifiers in various systems.
This situation commonly arises due to (often historic) mergers and acquisitions, where overlapping sets of customers are to be amalgamated and treated consistently as a single population. How do we match up a customer from one domain with one from another?
This challenge of joining records also occurs when bringing together datasets supplied by different organizations. Because there is typically no universally adopted standard or common key between enterprises, especially with respect to individuals, the joining of their data is a commonly overlooked and nontrivial exercise.
Main Challenges of Entity Resolution
If our assigned unique identifiers are all different and don’t match up, how can we identify that two records refer to the same entity? Our best approach is to compare individual attributes of those entities, such as their name, and if they share enough similarities, make our best judgment that they are a match. This sounds simple enough, right? Let’s delve into some of the reasons why that isn’t as straightforward as it sounds.
Lack of Unique Names
First, there is the challenge of recognizing uniqueness between names or labels. The repeated assignment of the same name to different real-world entities presents an obvious challenge in differentiating who is who. Perhaps you searched the internet for your own name. Chances are, unless your name is particularly uncommon, you will have found plenty of doppelgangers with exactly the same name as yourself.
Inconsistent Naming Conventions
Names are recorded in a variety of ways and data structures. Sometimes names are described in full, but often abbreviations are present or less significant parts of the name are omitted. For example, my name might be expressed, entirely correctly, as any of the variations in Table 1-1.
| Name |
|---|
| Michael Shearer |
| Michael William Shearer |
| Michael William Robert Shearer |
| Michael W R Shearer |
| M W R Shearer |
| M W Shearer |
None of these names exactly match each other but all refer to the same person, the same real-world entity. Titles, nicknames, shortened forms, or accented characters all frustrate the process of finding an exact match. Double-barreled or hyphenated last names add further permutations.
In an international context, naming practices vary enormously across the globe. Given names may appear at the start or at the end of a full name, and family names may or may not be present at all. Family names may also vary according to the sex and marital status of the individual. Names may be written in a variety of alphabets or character sets and may be translated differently between languages.1
Data Capture Inconsistencies
The process of capturing and recording names or labels usually reflects the data standards of the acquirer. At the most basic level, some data acquisition processes will employ uppercase characters only, others lowercase, while many will permit mixed case with initial letters capitalized.
A name may be heard only in conversation, without the opportunity to clarify the correct spelling, or may be incorrectly transcribed in a hurry. Names or labels are often mistyped during manual rekeying or accidentally omitted. Sometimes different conventions are used that can easily be misinterpreted if the original context is lost. For example, even a simple name can be recorded as “First name, Last name,” or as “Last name, First name,” or even transposed completely into the wrong fields.
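To make this concrete, here is a minimal Python sketch of how such ordering conventions might be reconciled before records are compared. The function name and splitting rules are illustrative assumptions, not a general-purpose name parser:

```python
def normalize_name(raw_name: str) -> tuple[str, str]:
    """Illustrative helper: split a raw name string into (first name, last name),
    handling the "Last name, First name" convention when a comma is present."""
    raw_name = raw_name.strip()
    if "," in raw_name:
        # Assume "Last, First" ordering when a comma is present
        last, first = (part.strip() for part in raw_name.split(",", 1))
    else:
        # Otherwise assume "First ... Last" and take the final word as the last name
        parts = raw_name.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    return first.title(), last.title()

print(normalize_name("SHEARER, MICHAEL"))  # ('Michael', 'Shearer')
print(normalize_name("michael shearer"))   # ('Michael', 'Shearer')
```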
International data capture can lead to inconsistencies in transliteration between one script and another, or to transcription errors when captured verbally.
Worked Example
Let’s consider a simple fictitious example to illustrate how these challenges might manifest themselves. To begin with, imagine the only information we have is the name, as shown in Table 1-2.
| Name |
|---|
| Michael Shearer |
| Micheal William Shearer |
Is it likely that a “Michael Shearer” refers to the same entity as a “Micheal William Shearer”? Absent any other information, there is a fair chance that both refer to the same person. The second record adds a middle name, but otherwise the two are nearly identical, and a comparison of the last names alone would produce an exact match. Notice that I slipped in a common misspelling of my first name. Did you spot it?
What if we add another attribute—can that help improve our matching accuracy? If you can’t remember your membership number, a service provider will often ask for a date of birth to help identify you (they also do this for security reasons). Date of birth is a particularly helpful attribute because it doesn’t change and has a large number of potential values (known as high cardinality). Also, the composite structure of individual values for day, month, and year may give us clues to the likelihood of a match when an exact equivalence isn’t established. For example, consider Table 1-3.
| Name | Date of birth |
|---|---|
| Michael Shearer | 1/4/1970 |
| Micheal William Shearer | 14 January 1970 |
At first glance the date of birth is not equivalent between the two records, so we might be tempted to discount the match; two individuals born 10 days apart are unlikely to be the same person! However, there is only a single-digit difference between the two values: the first lacks the leading digit 1 in the day subfield. Could this be a typo? It’s hard to tell. And if the records come from different sources, we would also have to consider whether the date format is consistent: is 1/4/1970 in the UK format of DD/MM/YYYY or the US format of MM/DD/YYYY?
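A short sketch using only the Python standard library shows how the same string yields different dates depending on the format we assume to be in force:

```python
from datetime import datetime

raw = "1/4/1970"

# The same string parses to two different dates depending on the assumed format
uk_reading = datetime.strptime(raw, "%d/%m/%Y")   # UK reading: 1 April 1970
us_reading = datetime.strptime(raw, "%m/%d/%Y")   # US reading: 4 January 1970
other_record = datetime.strptime("14 January 1970", "%d %B %Y")

print(uk_reading.date())    # 1970-04-01
print(us_reading.date())    # 1970-01-04 - a single digit away from the other record
print(other_record.date())  # 1970-01-14
```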
What if we add a place of birth? Again, this attribute shouldn’t change but it can be expressed at different levels of granularity or with different punctuation. Table 1-4 shows the enriched records.
| Name | Date of birth | Place of birth |
|---|---|---|
| Michael Shearer | 1/4/1970 | Stow-on-the-Wold |
| Micheal William Shearer | 14 January 1970 | Stow on the Wold |
Here there is no exact match on the place of birth between either record, although both could be factually correct.
Therefore, place of birth, which may be recorded at different levels of specificity, doesn’t help us as much as we thought it might. What about something more personal, like a phone number? Of course, many of us change our phone number during our lives, but with the ability to keep a cherished and well-socialized mobile phone number when switching between providers, this number is a stickier attribute that we can use. Even here, however, we have challenges. Individuals may possess more than one number (a work and a personal number, for example), or the number may be recorded in a variety of formats, with or without spaces or hyphens, and with or without an international dialing prefix.
Table 1-5 shows our complete records.
| Name | Date of birth | Place of birth | Mobile number |
|---|---|---|---|
| Michael Shearer | 1/4/1970 | Stow-on-the-Wold | 07700 900999 |
| Micheal William Shearer | 14 January 1970 | Stow on the Wold | 0770-090-0999 |
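One common way to cope with these formatting differences is to reduce each number to digits only before comparison. The sketch below is illustrative; the helper name and the UK-centric prefix handling are assumptions, not a complete phone-parsing solution:

```python
import re

def normalize_phone(raw: str, country_prefix: str = "44") -> str:
    """Reduce a phone number to digits only and convert a leading
    international prefix to the domestic form (illustrative sketch)."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, hyphens, brackets, and "+"
    if digits.startswith("00" + country_prefix):
        digits = "0" + digits[2 + len(country_prefix):]
    elif digits.startswith(country_prefix):
        digits = "0" + digits[len(country_prefix):]
    return digits

print(normalize_phone("07700 900999"))     # 07700900999
print(normalize_phone("0770-090-0999"))    # 07700900999
print(normalize_phone("+44 7700 900999"))  # 07700900999
```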
As you can see, this resolution challenge is quickly becoming quite complicated.
Deliberate Obfuscation
The vast majority of data inconsistencies that frustrate the matching process arise through inattentive but well-meaning data capture processes. However, for some uses we must consider the scenario where data has been maliciously obfuscated to disguise the true identity of the entity and prevent associations that might reveal a criminal intent or association.
Match Permutations
If I asked you to match your name against a simple table of, say, 30 names, you could probably do so within a few seconds. A longer list might take minutes but it is still a practical task. However, if I asked you to compare a list of 100 names with a second list of 100 names, the task becomes a lot more laborious and prone to error.
Not only does the number of potential matches expand to 10,000 (100 × 100), but if you want to do so in one pass through the second table you have to hold all 100 names from the first table in your head—not easy!
Similarly, if I asked you to deduplicate a list of 100 names in a single list, you’d actually have to compare:
- The first name against the remaining 99, then
- The second name against the remaining 98 and so on.
In fact, you’d have 4,950 comparisons to make. At one per second that’s about 80 minutes of work just to compare two short lists. For much larger datasets, the number of potential combinations becomes impractical, even for the most performant hardware.
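The arithmetic here is simply n × m comparisons between two lists and n(n - 1)/2 comparisons within a single list, which we can check directly:

```python
from math import comb

n = 100
pairs_between_lists = n * n     # comparing two lists of 100 names: 10,000
pairs_within_list = comb(n, 2)  # deduplicating one list of 100: n * (n - 1) / 2 = 4,950

print(pairs_between_lists, pairs_within_list)  # 10000 4950
print(pairs_within_list / 60)                  # ~82.5 minutes at one comparison per second
print(comb(1_000_000, 2))                      # ~5 x 10^11 pairs for a million records
```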
Blind Matching?
So far we have assumed that the sets of data we seek to match are fully transparent to us—that the values of the attributes are readily available, in full, and have not been obscured or masked in any way. In some cases this ideal is not possible due to privacy constraints or geopolitical factors that prevent data from moving across borders. How can we find matches without being able to see the data? This feels like magic, but as we will see in Chapter 10, there are cryptographic techniques that enable matching to still take place without requiring full exposure of the list to be matched against.
The Entity Resolution Process
To overcome the challenges mentioned, the basic entity resolution process is divided into four sequential steps:
- Data standardization
- Record blocking
- Attribute comparison
- Match classification
After match classification, additional postprocessing steps may be required:
- Clustering
- Canonicalization
Let’s describe each of these steps briefly in turn.
Data Standardization
Before we can compare records we need to ensure that we have consistent data structures so that we can test for equivalence between attributes. We also need to ensure that the formatting of those attributes is consistent. This processing step usually involves splitting fields and removing null values and extraneous characters. It is often bespoke to the source dataset.
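As a minimal illustration, the pandas sketch below standardizes the two records from our worked example. The column names, the month-first date reading, and the cleaning rules are assumptions made for this sketch rather than a general recipe:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Michael Shearer", "Micheal William Shearer"],
    "Date of birth": ["1/4/1970", "14 January 1970"],
    "Place of birth": ["Stow-on-the-Wold", "Stow on the Wold"],
    "Mobile number": ["07700 900999", "0770-090-0999"],
})

# Split the full name into first and last name components
df["First name"] = df["Name"].str.split().str[0]
df["Last name"] = df["Name"].str.split().str[-1]

# Parse the mixed date formats into a single representation (pandas >= 2.0);
# here we assume the month-first reading of 1/4/1970
df["Date of birth"] = pd.to_datetime(df["Date of birth"], format="mixed")

# Remove extraneous characters from place of birth and mobile number
df["Place of birth"] = df["Place of birth"].str.replace("-", " ")
df["Mobile number"] = df["Mobile number"].str.replace(r"\D", "", regex=True)

print(df[["First name", "Last name", "Date of birth", "Place of birth", "Mobile number"]])
```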
Record Blocking
To overcome the challenge of impractically high volumes of record comparisons, a process called blocking is often used. Instead of comparing every record with every other record, only subsets of record pairs, preselected based on ready equivalence between certain attributes, are compared in their entirety. This filtering approach concentrates the resolution process on those records with the highest propensity to match.
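A minimal sketch of the idea, using last name as the blocking key (the key choice and data are illustrative assumptions):

```python
from itertools import combinations

import pandas as pd

records = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Last name": ["Shearer", "Shearer", "Smith", "Jones"],
})

# Only records sharing a blocking key are paired for detailed comparison
candidate_pairs = []
for _, block in records.groupby("Last name"):
    candidate_pairs.extend(combinations(block["id"], 2))

print(candidate_pairs)  # [(1, 2)] - only the two Shearer records go forward
```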
Attribute Comparison
Next comes the process of comparing individual attributes between the pairs of records selected by the blocking process. The degree of equivalence can be established by an exact match between attributes or by a similarity function. This step produces a set of equivalence measures for each record pair.
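As an illustration, the sketch below uses a simple similarity ratio from the Python standard library as a stand-in for more specialized string comparison functions:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1, where 1.0 is an exact match."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity("Shearer", "Shearer"))  # 1.0   - exact match
print(similarity("Michael", "Micheal"))  # ~0.86 - close, but not an exact match
```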
Match Classification
The final step in the basic entity resolution process is to conclude whether the collective similarity between individual attributes is sufficient to declare two records a match, i.e., to resolve that they refer to the same real-world entity. This judgment can be made according to a set of manually defined rules or can be based on a machine learning probabilistic approach.
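For example, a manually defined rule might look like the sketch below; the specific rule is an illustrative assumption:

```python
def is_match(comparisons: dict[str, bool]) -> bool:
    """Illustrative manual rule: the last name must match,
    plus at least two of the other attributes."""
    others = [matched for attribute, matched in comparisons.items() if attribute != "Last name"]
    return comparisons["Last name"] and sum(others) >= 2

print(is_match({
    "First name": False,
    "Last name": True,
    "Date of birth": False,
    "Place of birth": True,
    "Mobile number": True,
}))  # True
```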
Clustering
Once our match classification is complete, we may group our records into connected clusters via their matching pairs. The inclusion of a record pair in a cluster may be determined by an additional match confidence threshold. Records without pairs above this threshold will form standalone clusters. If our matching rules tolerate approximate rather than exact equivalence, then our clusters may be intransitive; i.e., record A may be paired with record B, and record B with record C, but record C may not be paired with record A. As a result, clusters may be highly interconnected or more loosely coupled.
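A minimal sketch of this grouping, assuming the networkx library is available; matched pairs become edges and connected components become clusters:

```python
import networkx as nx

# A-B and B-C are matched pairs, but A and C were never matched directly;
# connected components still place all three in one cluster
matched_pairs = [("A", "B"), ("B", "C"), ("D", "E")]

graph = nx.Graph(matched_pairs)
graph.add_node("F")  # a record with no matching pair forms a standalone cluster

print(list(nx.connected_components(graph)))  # [{'A', 'B', 'C'}, {'D', 'E'}, {'F'}]
```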
Canonicalization
Post resolution there may be a need to determine which attribute values should be used to represent an entity. If approximate matching techniques have been used to determine equivalence, or if an additional variable attribute is present in the pair or cluster but has not been used in the matching process, then there may be a need to decide which value is the most representative. The resulting canonical attribute values are then used to describe the resolved entity in onward calculations.
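A minimal sketch in which the most frequently occurring value per attribute is chosen as canonical; the “most common value wins” rule is an illustrative assumption:

```python
from collections import Counter

cluster = [
    {"Last name": "Shearer", "Place of birth": "Stow on the Wold"},
    {"Last name": "Shearer", "Place of birth": "Stow-on-the-Wold"},
    {"Last name": "Sherer",  "Place of birth": "Stow on the Wold"},
]

# Pick the most common value of each attribute across the cluster
canonical = {
    attribute: Counter(record[attribute] for record in cluster).most_common(1)[0][0]
    for attribute in cluster[0]
}
print(canonical)  # {'Last name': 'Shearer', 'Place of birth': 'Stow on the Wold'}
```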
Worked Example
Returning to our simple example, let’s apply the steps to our data. First, let’s standardize our data, splitting the name attribute, standardizing the date of birth, and removing the extra characters in the place of birth and mobile number fields. Table 1-6 shows our cleansed records.
| First name | Last name | Date of birth | Place of birth | Mobile number |
|---|---|---|---|---|
| Michael | Shearer | 1/4/1970 | Stow on the Wold | 07700 900999 |
| Micheal | Shearer | 1/14/1970 | Stow on the Wold | 07700 900999 |
In this simple example, we have only one pair to consider, so we don’t need to apply blocking. We’ll return to this in Chapter 5.
Next we’ll compare the individual attributes for exact matches. Table 1-7 shows the comparison between each attribute as either a “Match” or a “No match.”
| Attribute | Value record 1 | Value record 2 | Comparison |
|---|---|---|---|
| First name | Michael | Micheal | No match |
| Last name | Shearer | Shearer | Match |
| Date of birth | 1/4/1970 | 1/14/1970 | No match |
| Place of birth | Stow on the Wold | Stow on the Wold | Match |
| Mobile number | 07700 900999 | 07700 900999 | Match |
Finally, we apply step 4 to determine whether we have an overall match. A simple rule might be if the majority of the attributes match, then we conclude the overall record is a match, as in this case.
Alternatively, we might consider various combinations of matching attributes to be sufficient for us to declare a match. In our example, to declare a match we could look for either:
- Name match AND (date of birth OR place of birth match), or
- Name match AND mobile number match
We can take this approach a step further and assign a relative weighting to each of our attribute comparisons; a mobile number match is worth perhaps twice as much as a date of birth match, and so on. Combining these weighted scores produces an overall match score that can be considered against a given confidence threshold.
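A sketch of this weighted scoring; the weights and threshold here are purely illustrative assumptions:

```python
weights = {
    "First name": 1.0,
    "Last name": 1.0,
    "Date of birth": 1.0,
    "Place of birth": 1.0,
    "Mobile number": 2.0,  # a mobile number match counts double
}
comparisons = {
    "First name": False,
    "Last name": True,
    "Date of birth": False,
    "Place of birth": True,
    "Mobile number": True,
}

# Sum the weights of the matching attributes and compare against a threshold
score = sum(weights[attribute] for attribute, matched in comparisons.items() if matched)
threshold = 0.5 * sum(weights.values())  # e.g., more than half the available weight

print(score, threshold, score >= threshold)  # 4.0 3.0 True
```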
We will look more at different approaches to determine these relative weightings, using statistical techniques and machine learning, in Chapter 4.
As we have seen, different attributes may be stronger or weaker in helping us determine whether we have a match. Earlier, we considered the likelihood of finding a match for a name that is quite common versus one that is found more infrequently. For example, in a UK context, a match on a last name of Smith is likely to be less informative than a match on Shearer: there are fewer Shearers than Smiths, so a coincidental match on Shearer is inherently less likely to begin with (it has a lower prior probability) and is therefore stronger evidence when it occurs.
This probabilistic approach works particularly well when some of the values of a categorical attribute (one with a finite set of values) are significantly more common than others. If we consider a city attribute as part of an address match in a UK dataset, then London is likely to occur much more frequently than, say, Bath, and therefore may be weighted less.
Note that we haven’t been able to determine which date of birth is definitively correct, so we are left with a canonicalization challenge.
Measuring Performance
Statistical approaches may help us to decide how to evaluate and combine all the clues that comparing individual attributes gives us, but how do we decide whether the combination is good enough or not? How do we set the confidence threshold to declare a match? This depends on what is important to us and how we propose to use our newly found matches.
Do we care more about being sure we spot every potential match, accepting that in the process we may declare a few matches that turn out to be false? That priority is measured by recall. Or do we not want to waste our time on incorrect matches, accepting that we may miss a few true matches along the way? That priority is measured by precision.
When comparing two records, there are four different scenarios that can arise. Table 1-8 lists the different combinations of match decision and ground truth.
| You decide | Ground truth | Instance of |
|---|---|---|
| Match | Match | True positive (TP) |
| Match | Not match | False positive (FP) |
| Not match | Match | False negative (FN) |
| Not match | Not match | True negative (TN) |
If our recall is high, then we are producing relatively few false negatives; i.e., we rarely overlook a true match. If our precision is high, then when we declare a match we nearly always get it right.
At one extreme, imagine we declare every candidate pair a match; we would have zero false negatives and our recall would be perfect (1.0): we’d never overlook a match. Of course, our precision would be very poor, as we’d incorrectly declare lots of nonmatches as matches. At the other extreme, imagine we declare a match only in the ideal case, when every attribute is exactly equivalent; then we would never declare a match in error and our precision would be perfect (1.0), at the expense of our recall, which would be very poor as a lot of good matches would pass us by.
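In terms of the counts in Table 1-8, precision is TP / (TP + FP) and recall is TP / (TP + FN). A quick sketch with made-up counts:

```python
# Illustrative counts, not real results
tp, fp, fn = 90, 10, 30

precision = tp / (tp + fp)  # of the matches we declared, how many were right
recall = tp / (tp + fn)     # of the true matches, how many did we find

print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.90 recall=0.75
```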
Ideally, of course, we’d like high recall and precision simultaneously—our matches are both correct and comprehensive—but this is tricky to achieve! Chapter 6 describes this process in more detail.
Getting Started
So, how can we solve these challenges?
Hopefully this chapter has given you a good understanding of what entity resolution is, why it is needed, and the main steps in the process. Subsequent chapters will guide you, hands-on, through a set of worked real-world examples based on publicly available data.
Fortunately, in addition to commercial options, there are several open‑source Python libraries that do much of the hard work for us. These frameworks provide the scaffolding upon which we can construct a bespoke matching process that suits our data and context.
Before we begin, we’ll take a short detour in the next chapter to set up our analytic environment and review some of the foundational Python data science libraries we will use, and then we’ll consider the first step in our entity resolution process—standardizing our data ready for matching.
1 For further details on global naming conventions, see this guide.