Chapter 4. Probabilistic Matching
In Chapter 3, we explored how to use approximate matching techniques to measure the degree of similarity between attribute values. We set a threshold above which we declared equivalence and then combined these matching features, with equal weight, to conclude that two records referred to the same entity when both were a match. We evaluated our performance against exact matches only.
In this chapter, we will examine how to use probability-based techniques to calculate the optimum weighting for each equivalent attribute when calculating the likelihood of an overall entity match. This probability-based approach allows us to declare a match when the most statistically significant attributes are equivalent (either exactly or approximately), even when the less significant attributes are insufficiently similar. It also allows us to grade our confidence in the declaration of a match and apply appropriate match thresholds. The model that will be introduced in this section is known as the Fellegi-Sunter (FS) model.
We will also introduce a probabilistic entity resolution framework, Splink, that we will use to help us calculate these metrics and resolve our entities together.
Sample Problem
Let’s return to our exact match results from the end of Chapter 2. Opening the Chapter4.ipynb notebook we reload the standardized datasets from the Wikipedia and TheyWorkForYou websites. As in Chapter 3, we start by calculating the Cartesian, or cross, product of the two datasets as:
cross = df_w.merge(df_t, how='cross', suffixes=('_w', '_t'))
This gives us our total population of 650 × 650 = 422,500 record pairs—a pair for every name combination between the Wikipedia and TheyWorkForYou datasets.
Throughout this chapter, we will be using exact matches between the Firstname, Lastname, and Constituency fields of each of these record pairs multiple times. Thus, it’s more efficient to calculate these matches once and store them as additional feature columns:
cross['Fmatch'] = (cross['Firstname_w']==cross['Firstname_t'])
cross['Lmatch'] = (cross['Lastname_w']==cross['Lastname_t'])
cross['Cmatch'] = (cross['Constituency_w']==cross['Constituency_t'])
We also calculate the total number of matching columns, which we will use later:
cross['Tmatch'] = sum([cross['Fmatch'],cross['Lmatch'],cross['Cmatch']])
Based on our exploration of the data in Chapter 2, we know that within our total population of 422,500 combinations we have 637 record pairs that have an exact match on constituency and either first name or last name. This is our match population:
match = cross[cross['Cmatch'] & (cross['Fmatch'] | cross['Lmatch'])]
The remainder, our notmatch population, is extracted as the inverse:
notmatch = cross[(~cross['Cmatch']) | (~cross['Fmatch'] & ~cross['Lmatch'])]
These combinations are summarized in Table 4-1.
Match/not match population | Constituency match | First name match | Last name match |
---|---|---|---|
Not match | No | No | No |
Not match | No | No | Yes |
Not match | No | Yes | No |
Not match | No | Yes | Yes |
Not match | Yes | No | No |
Match | Yes | No | Yes |
Match | Yes | Yes | No |
Match | Yes | Yes | Yes |
We will now examine how well first name and last name equivalence, both individually and together, can predict whether an individual record belongs in either the match or notmatch population.
Single Attribute Match Probability
Let’s begin by considering whether first name equivalence alone is a good indicator that two entities within a record pair refer to the same person. We will examine both the match and notmatch populations and establish, within each of those subsets, how many first names match and how many do not.
Naming Convention
As we work through various subsets of these populations, it’s helpful to adopt a standard naming convention so that we can see at a glance how each population of records was selected. As we select records we add the selection criteria to the population name, right to left, e.g., first_match should be read as first selecting those records that are part of the match population and then, within that subset, further selecting only those rows where the first names are equivalent.
First Name Match Probability
Starting with the match population, we can select those records where the first names are equivalent to give us our first_match population:
first_match = match[match['Fmatch']]
len(first_match)
632
Repeating this for the three other combinations of match/not match, and first name equivalence or not, we can draw up a population map, as shown in Figure 4-1.
Therefore, based on first name equivalence only, we have:
Now we can calculate some probability values. First, the probability that a record pair whose first names are equivalent is actually a true positive match can be calculated as the number of pairs within the match population whose first names match divided by the number of pairs whose first names match across both the match and notmatch populations:

P(match | first) = first_match / (first_match + first_notmatch)
From this we can see that, at only about 23%, first name equivalence alone isn’t a great predictor of a match between two records. This value is a conditional probability; that is, it is the probability of a true positive match conditional on the first name being a match. This can be written as P(match | first), where the pipe character (|) is read as “given that.”
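This counting can be sketched in pandas on a tiny invented record-pair table (the Fmatch and is_match columns below are toy data for illustration, not the chapter’s datasets):

```python
import pandas as pd

# Toy record-pair table: Fmatch flags first-name equivalence,
# is_match flags known true matches (both invented for illustration).
pairs = pd.DataFrame({
    'Fmatch':   [True, True, True, False, True, False],
    'is_match': [True, True, False, True, False, False],
})

# Pairs whose first names match, within and across the populations
first_match = pairs[pairs['is_match'] & pairs['Fmatch']]
first_all = pairs[pairs['Fmatch']]

# P(match | first): share of first-name-equivalent pairs that are true matches
p = len(first_match) / len(first_all)
```

With this toy data, two of the four first-name-equivalent pairs are true matches, so P(match | first) works out to 0.5.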
Last Name Match Probability
Applying the same calculations to the last name, we can draw a second population map, as shown in Figure 4-2.
As for first name, the probability that a pair of records whose last names are equivalent is actually a match can be calculated as the number of pairs within the match population whose last names match divided by the number of pairs whose last names match across both the match and notmatch populations:

P(match | last) = last_match / (last_match + last_notmatch)
For these records last name equivalence is clearly a better predictor of a true match than first name, which instinctively makes sense.
Again, this can be written as P(match | last).
Multiple Attribute Match Probability
Now if we consider both first name and last name equivalence we can further subdivide our population map. Starting with our first name map and further subdividing each first name category into last name equivalence, and not, we can view our population as shown in Figure 4-3.
Extending our calculation to both first name and last name exact matches, we can calculate the probability of a true positive match given both first name and last name equivalence, P(match | first, last).
If the first name matches but last name doesn’t, what is the probability that it’s a match?
If the first name doesn’t match but the last name does, what is the probability that it’s a match?
As we expected, if either first name or last name isn’t an exact match, then the probability of a true positive match is low, but a last name match gives us more confidence than a first name one.
If neither first name nor last name matches, the probability that it is a match is zero.
This is not surprising given that we defined true positive matches as records with an exact match on constituency and either first name or last name.
In conclusion, we can use these probabilities to inform our decision making on whether we are likely to have a true positive match or not. In this example, we would place more weight on a last name match than a first name one. This is an improvement on our method in Chapter 3, where we gave them the same weighting (and required them both to be equivalent) to declare a match.
But wait, we have a problem. In the preceding example, we started with a known population of matches that we used to compute the probabilities that first name and last name equivalence equate to a match. However, in most situations we don’t have a known match population; otherwise we wouldn’t need to perform matching in the first place! How do we overcome this? To do so, we need to reframe our calculation a little and then employ some clever estimation techniques.
Probabilistic Models
In the previous section, we learned that some attributes are more informative than others; that is, they have more predictive power to help us decide whether a match is likely to be correct. In this section, we examine how to calculate these contributions and how to combine them to assess the overall likelihood of a match.
We start with a little statistical theory (using first name equivalence as an example) before we generalize to model what we can deploy at scale.
Bayes’ Theorem
Bayes’ theorem, named after Thomas Bayes, states that the conditional probability of an event, based on the occurrence of another event, is equal to the probability of the second event given the first event, multiplied by the probability of the first event.
Consider the probability that two records chosen at random are a true positive match, P(match), multiplied by the probability that within those matches the first names match, P(first | match):

P(first | match) × P(match)

Equally, we could calculate the same value in the reverse order, starting with the probability that the first name matches, P(first), multiplied by the probability that records within this population are a true positive match:

P(match | first) × P(first)

Equating these probabilities, we have:

P(first | match) × P(match) = P(match | first) × P(first)

Rearranging, we can calculate:

P(match | first) = P(first | match) × P(match) / P(first)
We can calculate P(first) as the sum of the probabilities across both the match and notmatch populations:

P(first) = P(first | match) × P(match) + P(first | notmatch) × P(notmatch)

Substituting in the preceding equation, we have:

P(match | first) = P(first | match) × P(match) / (P(first | match) × P(match) + P(first | notmatch) × P(notmatch))
Alternatively, we can rearrange this as:
If we can estimate the values in this equation, we can determine the probability that if a first name is equivalent, then the record pair really is a match.
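As a quick sanity check, this calculation can be sketched in Python. The three constants are illustrative assumptions, chosen to be roughly the magnitude of the values seen later in this chapter:

```python
# Illustrative values only (assumptions, not computed results):
m_first = 0.9921   # P(first | match)
u_first = 0.0049   # P(first | notmatch)
lam = 0.0015       # P(match), the prior probability of a random pair matching

# P(first), marginalized over the match and notmatch populations
p_first = m_first * lam + u_first * (1 - lam)

# Bayes' theorem: P(match | first)
p_match_given_first = m_first * lam / p_first
```

With these values the result lands near 0.23, consistent with the roughly 23% figure calculated earlier in the chapter.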
Let’s examine these values in a little more detail, simplifying the notation as we go along.
m Value
The conditional probability that an attribute will be equivalent within the overall match population is known as the m value. Using our Firstname example, we can denote this as:

m = P(first | match)
In a perfect dataset, all the first names within the match population would be exactly equivalent and the m value would be 1. This value can therefore be thought of as a measure of data quality, i.e., how much variability there is in how an attribute has been captured across the datasets. A higher value indicates a better-quality attribute.
u Value
The conditional probability that an attribute will be equivalent within the overall notmatch population is known as the u value. Again, using our Firstname example, we can denote this as:

u = P(first | notmatch)
This value reflects how much commonality there is in this attribute across the datasets. A lower value indicates a less common, more distinguishing attribute that, if found to be equivalent in a particular case, would lead us to question whether it belongs in the notmatch population and is really a match. Conversely, a higher u value tells us that this particular attribute is not as valuable for determining overall matches.
A good example of u value is a month of birth attribute, which, assuming the population is equally distributed across the year, will have a u value of 1/12.
Lambda (λ) Value
The lambda value, λ, also known as the prior, is the probability that two randomly chosen records match.
In contrast with the m and u values, the λ value is a record-level value not associated with any particular attribute. This value is a measure of how much duplication there is in the dataset overall and is the starting point for our probability calculations.
The inverse, the likelihood that two randomly chosen records are not a match, can be written as follows:

P(notmatch) = 1 − λ
Bayes Factor
Substituting these compact notations into the preceding equation, we have:

P(match | first) = λ × m / (λ × m + (1 − λ) × u)

The ratio m/u is also known as the Bayes factor, in this case of the Firstname parameter. The Bayes factor, as a combination of both the m and u values, gives a measure of the significance we should attach to the fact that the Firstname values were equivalent.
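A short sketch shows the Bayes factor at work: multiplying the prior odds by m/u and converting the resulting odds back to a probability gives the same answer as the formula above. The m, u, and λ values here are illustrative assumptions:

```python
# Illustrative m, u, and lambda values (assumptions for this sketch)
m, u, lam = 0.9921, 0.0049, 0.0015

bayes_factor = m / u                 # significance of a Firstname match
prior_odds = lam / (1 - lam)         # odds form of the prior
posterior_odds = prior_odds * bayes_factor

# Convert posterior odds back to a probability
p = posterior_odds / (1 + posterior_odds)
```

Algebraically, posterior_odds/(1 + posterior_odds) reduces to λm / (λm + (1 − λ)u), so the odds form and the probability form are interchangeable.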
Fellegi-Sunter Model
The Fellegi-Sunter model, named after Ivan P. Fellegi and Alan B. Sunter,1 describes how we can extend our simple Bayesian approach, combining the contribution of multiple attributes, to calculate the overall likelihood of a match. It relies on the simplifying assumption of conditional independence between attributes, also known as naive Bayes.
Using the FS model, we can combine the Bayes factors associated with each attribute in our record by simply multiplying them together. Taking our Firstname example and extending it to consider when the Lastname is also equivalent, we have:

P(match | first, last) = λ × m_first × m_last / (λ × m_first × m_last + (1 − λ) × u_first × u_last)
When an attribute isn’t equivalent, the Bayes factor is calculated as the inverse ratio, (1 − m)/(1 − u). Therefore, when the Firstname is equivalent but the Lastname is not, we calculate the probability of an overall match as:

P(match | first, not last) = λ × m_first × (1 − m_last) / (λ × m_first × (1 − m_last) + (1 − λ) × u_first × (1 − u_last))
Once we can calculate the m and u values for each attribute, and the value for the overall dataset, we can easily calculate the probabilities for each record pair. We simply determine the equivalence of each attribute (either exact or approximate as appropriate), select the appropriate Bayes factors, and multiply them together using the preceding formula to calculate the overall probability for that record pair.
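This combination step can be sketched as a small helper that walks the attributes, multiplies in the appropriate Bayes factor for each, and converts the final odds to a probability. The per-attribute m/u values and λ below are illustrative assumptions, not the chapter’s computed parameters:

```python
# Illustrative per-attribute (m, u) values and prior lambda (assumptions)
params = {'first': (0.9921, 0.0049), 'last': (0.9938, 0.0008)}
lam = 0.0015

def fs_probability(pattern, params, lam):
    """Fellegi-Sunter combination under naive Bayes.

    pattern maps attribute name -> True (equivalent) or False (not).
    Starts from the prior odds and multiplies in each attribute's
    Bayes factor: m/u when equivalent, (1-m)/(1-u) when not.
    """
    odds = lam / (1 - lam)
    for attr, agree in pattern.items():
        m, u = params[attr]
        odds *= (m / u) if agree else ((1 - m) / (1 - u))
    return odds / (1 + odds)

p_both = fs_probability({'first': True, 'last': True}, params, lam)
p_first_only = fs_probability({'first': True, 'last': False}, params, lam)
```

With these assumed values, agreement on both names pushes the probability above 0.99, while a last name mismatch pulls it below 1%, mirroring the asymmetry discussed earlier.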
For our simple example, our Bayes factors are therefore calculated as shown in Table 4-2.
Firstname equivalence | Lastname equivalence | Firstname Bayes factor | Lastname Bayes factor | Combined Bayes factor
---|---|---|---|---
No | No | (1 − m_first)/(1 − u_first) | (1 − m_last)/(1 − u_last) | (1 − m_first)(1 − m_last) / ((1 − u_first)(1 − u_last))
No | Yes | (1 − m_first)/(1 − u_first) | m_last/u_last | m_last(1 − m_first) / (u_last(1 − u_first))
Yes | No | m_first/u_first | (1 − m_last)/(1 − u_last) | m_first(1 − m_last) / (u_first(1 − u_last))
Yes | Yes | m_first/u_first | m_last/u_last | (m_first × m_last) / (u_first × u_last)
Match Weight
To make overall match calculations more intuitive, the logarithm of the Bayes factors is sometimes used so that they can be added together rather than multiplied. In this way it is easy to visualize the relative contribution of each attribute to the overall score.
For our simple first name and last name equivalence example, the logarithmic match weight might be calculated (using base 2) as:

match weight = log2(λ / (1 − λ)) + log2(m_first / u_first) + log2(m_last / u_last)

We can calculate probability from the match weight as:

probability = 2^match weight / (1 + 2^match weight)
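The match weight arithmetic can be sketched as follows, using the Firstname and Lastname Bayes factors reported in Table 4-3 and an illustrative λ:

```python
import math

lam = 0.0015                          # illustrative prior (assumption)
bf_first, bf_last = 203.97, 1201.19   # Bayes factors as in Table 4-3

# Additive form: prior weight plus one log2 Bayes factor per attribute
prior_weight = math.log2(lam / (1 - lam))
weight = prior_weight + math.log2(bf_first) + math.log2(bf_last)

# Convert the overall match weight back to a probability
prob = 2 ** weight / (1 + 2 ** weight)
```

The large negative prior weight is more than offset by the two positive attribute weights, leaving a probability close to 1 when both names agree.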
Now that we understand how to combine our individual attribute probabilities, or match weights, let’s consider how to estimate our λ value and our m and u values for each attribute when we don’t have a known match population. One technique that we can use is called the expectation-maximization (EM) algorithm.
Expectation-Maximization Algorithm
The expectation-maximization algorithm uses an iterative approach to approximate the λ, m, and u values. Let’s see a simplified form of this in action applied to our sample problem.
Iteration 1
For the first iteration we make the opening assumption that record pairs where the majority of the feature columns are equivalent are matches:
it1_match = cross[cross['Tmatch']>=2]
it1_notmatch = cross[cross['Tmatch']<2]
len(it1_match)
637
This gives us a pseudo match population, it1_match, of 637 records. In addition to the 628 perfect matches we found in Chapter 2, we also have 9 matches where either Firstname or Lastname (but not both) doesn’t match, as we see in Figure 4-4:
it1_match[~it1_match['Fmatch'] | ~it1_match['Lmatch']][
    ['Constituency_w','Firstname_w','Firstname_t',
     'Lastname_w','Lastname_t']]
Our initial λ value is therefore:

λ = 637 / 422,500 ≈ 0.0015

Our initial prior match weight is therefore log2(λ / (1 − λ)) ≈ −9.37.
As a starting point, it’s therefore extremely unlikely that two records are a match. Now let’s calculate our m and u values so that we can update our probability on a per-record basis.
As we have pseudo match and notmatch populations, it’s straightforward to calculate our m and u values as the proportion of each population with an equivalent attribute. For Firstname, Lastname, and Constituency we use:
mfi1 = len(it1_match[it1_match['Fmatch']])/len(it1_match)
mli1 = len(it1_match[it1_match['Lmatch']])/len(it1_match)
mci1 = len(it1_match[it1_match['Cmatch']])/len(it1_match)
ufi1 = len(it1_notmatch[it1_notmatch['Fmatch']])/len(it1_notmatch)
uli1 = len(it1_notmatch[it1_notmatch['Lmatch']])/len(it1_notmatch)
uci1 = len(it1_notmatch[it1_notmatch['Cmatch']])/len(it1_notmatch)
Table 4-3 shows these values and the resulting match weight values per attribute.
Attribute | m value | u value | Match Bayes factor | Match weight | Not match Bayes factor | Not match weight
---|---|---|---|---|---|---
Firstname | 0.9921 | 0.0049 | 203.97 | 7.67 | 0.0079 | –6.98
Lastname | 0.9937 | 0.0008 | 1201.19 | 10.23 | 0.0063 | –7.31
Constituency | 1.0 | 0.0 | ∞ | ∞ | 0 | –∞
There are no record pairs where the constituency is equivalent in the notmatch population, so its u value is 0; therefore its match weight is mathematically infinity and its notmatch weight is negative infinity.
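As a cross-check, the Firstname row of Table 4-3 can be reproduced from its m and u values alone; the sketch below uses the more precise values reported later in the chapter’s Splink settings:

```python
import math

# Firstname m and u values from the first EM iteration
# (higher-precision figures from the Splink settings shown later)
m, u = 0.9921188, 0.0048643

match_bf = m / u                     # Bayes factor when Firstname matches
match_weight = math.log2(match_bf)
notmatch_bf = (1 - m) / (1 - u)      # Bayes factor when it does not
notmatch_weight = math.log2(notmatch_bf)
```

These reproduce the Firstname row of Table 4-3: a match Bayes factor near 204 (weight about 7.67) and a not match weight about –6.98.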
Now we can use these values in the Fellegi-Sunter model to calculate the match probability for every record pair in the full population. We use a helper function to calculate these probabilities based on the values of the Constituency, Firstname, and Lastname match features:
def match_prb(Fmatch, Lmatch, Cmatch, mf1, ml1, mc1, uf1, ul1, uc1, lmbda):
    if (Fmatch==1):
        mf = mf1
        uf = uf1
    else:
        mf = (1-mf1)
        uf = (1-uf1)
    if (Lmatch==1):
        ml = ml1
        ul = ul1
    else:
        ml = (1-ml1)
        ul = (1-ul1)
    if (Cmatch==1):
        mc = mc1
        uc = uc1
    else:
        mc = (1-mc1)
        uc = (1-uc1)
    prob = (lmbda * ml * mf * mc) / \
           (lmbda * ml * mf * mc + (1-lmbda) * ul * uf * uc)
    return(prob)
We apply this function to the whole population with:
cross['prob'] = cross.apply(lambda x: match_prb(
    x.Fmatch, x.Lmatch, x.Cmatch,
    mfi1, mli1, mci1,
    ufi1, uli1, uci1,
    lmbda), axis=1)
Once we’ve calculated these values we can iterate again, resegmenting our population into match and notmatch populations based on the calculated match probabilities.
Iteration 2
For illustration purposes, we use an overall match probability of greater than 0.99 to define our new assumed match population and assign any record with an equal or lower probability to our notmatch population:
it2_match = cross[cross['prob']>0.99]
it2_notmatch = cross[cross['prob']<=0.99]
len(it2_match)
633
Applying this 0.99 threshold gives us a slightly reduced match population of 633. Let’s see why. If we select the records just below the threshold we can see:
it2_notmatch[it2_notmatch['prob']>0.9][
    ['Constituency_w','Lastname_w','Lastname_t','prob']]
As we see in Figure 4-5, if the Lastname isn’t equivalent, the new match probability falls just below our 0.99 threshold. Using these new match and notmatch populations we can revise our λ, m, and u values and iterate again, recalculating the match probabilities for each record pair.
In this case, our λ doesn’t materially change:

λ = 633 / 422,500 ≈ 0.0015
Only the Lastname values change slightly, as shown in Table 4-4.
Attribute | m value | u value | Match Bayes factor | Match weight | Not match Bayes factor | Not match weight
---|---|---|---|---|---|---
Firstname | 0.9921 | 0.0049 | 203.97 | 7.67 | 0.0079 | –6.98
Lastname | 1.0 | 0.0008 | 1208.79 | 10.24 | 0 | –∞
Constituency | 1.0 | 0.0 | ∞ | ∞ | 0 | –∞
Iteration 3
In this simple example, this next iteration doesn’t change the match population, which remains at 633, because the EM algorithm has already converged.
This gives us our final parameter values of:
This instinctively feels right. We know that a match will always have an equivalent constituency and either first name or last name will match, with last name slightly more likely to be equivalent than first name (five out of nine versus four out of nine in the preceding sample).
Similarly, we know the constituency will never be the same in a notmatch record pair, and it’s very unlikely that either the first name or last name will accidentally match either (with first name slightly more likely).
We can turn these estimated values into match probabilities using the equations in the previous section:
As expected, these probabilities match the values we calculated using the probability maps in Figure 4-3 when we knew the match and notmatch populations upfront.
In conclusion, we are now able to estimate match probabilities for all the different permutations of attribute equivalence without having to know the match population in advance. This probabilistic approach is both powerful and scalable to large datasets with multiple attributes. To help us apply these techniques more easily, we introduce a performant and easy-to-use open source library, Splink, in the next section.
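Before moving to Splink, the whole simplified EM loop can be sketched end to end on toy data. The comparison vectors below are invented for illustration; in practice they would be derived from real record pairs, and the m/u values are clamped away from 0 and 1 to keep the probabilities well defined:

```python
import pandas as pd

# Toy comparison vectors: 1 = attribute equivalent in the pair, 0 = not
# (invented data; columns play the role of Fmatch/Lmatch)
pairs = pd.DataFrame({
    'f': [1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
    'l': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

EPS = 1e-6  # clamp m/u away from 0 and 1 to avoid degenerate probabilities

def em(pairs, threshold=0.5, max_iter=20):
    # Opening assumption: pairs where both features agree are matches
    label = (pairs['f'] + pairs['l']) >= 2
    for _ in range(max_iter):
        lam = label.mean()                            # lambda estimate
        m = pairs[label].mean().clip(EPS, 1 - EPS)    # m value per attribute
        u = pairs[~label].mean().clip(EPS, 1 - EPS)   # u value per attribute
        # P(pattern | match) and P(pattern | notmatch) under naive Bayes
        pm = (m ** pairs * (1 - m) ** (1 - pairs)).prod(axis=1)
        pu = (u ** pairs * (1 - u) ** (1 - pairs)).prod(axis=1)
        prob = lam * pm / (lam * pm + (1 - lam) * pu)
        new_label = prob > threshold                  # resegment the pairs
        if new_label.equals(label):
            break                                     # converged
        label = new_label
    return label, lam

label, lam = em(pairs)
```

On this toy data the initial segmentation is already stable, so the loop converges immediately with two matched pairs and λ = 0.2; on real data, several relabelling passes would typically be needed, as in the three iterations above.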
Introducing Splink
Splink is a Python package for probabilistic entity resolution. Splink implements the Fellegi-Sunter model and includes a variety of interactive outputs to help users understand models and diagnose linkage problems.
Splink supports a number of backends to perform the matching calculations. To begin with, we will use DuckDB, an in-process SQL database management system, that we can run locally on our laptop.
Configuring Splink
To import Splink into our notebook, we use:
import splink
Splink requires a unique ID column in each dataset, so we need to create these by copying their respective DataFrame indexes:

df_w['unique_id'] = df_w.index
df_t['unique_id'] = df_t.index
Splink also needs the same columns to be present in both datasets. Therefore, we need to create blank columns where these are present in only one set of records and then remove unnecessary columns:
df_w['Flink'] = None
df_t['Notes'] = None
df_w = df_w[['Firstname','Lastname','Constituency','Flink','Notes',
             'unique_id']]
df_t = df_t[['Firstname','Lastname','Constituency','Flink','Notes',
             'unique_id']]
Our next step is to configure Splink settings:
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb import comparison_library as cl

settings = {
    "link_type": "link_only",
    "comparisons": [
        cl.exact_match("Firstname"),
        cl.exact_match("Lastname"),
        cl.exact_match("Constituency"),
    ],
}

linker = DuckDBLinker([df_w, df_t], settings)
Splink supports both deduplication of records within a single dataset and linking between one or more separate datasets. Here we set link_type to link_only to tell Splink that we want only to match, not deduplicate, between our two datasets. We also tell Splink the comparisons we wish to use, in this case exact matches across our three attributes. Lastly, we instantiate the linker with these settings and our source DataFrames.
To help us understand our datasets, Splink provides a visualization of the distribution of the columns to be matched:
linker.profile_columns(['Firstname','Lastname','Constituency'])
The graphs we see in Figure 4-6 show us the combined population across both datasets.
Starting with the distribution of first names, we can see from the bottom right of the graph that, within the population of 352 distinct names, approximately 35% occur only twice, most probably once in each dataset. Then, moving right to left, we see a gradual increase in frequency up to the most popular name, with 32 occurrences. Looking at the top 10 values by value count, we see that John is the most popular name, followed by Andrew, David, etc. This tells us that Firstname is a reasonable attribute to match on, but used alone, it will result in some false positives.
For last name the pattern is more stark, with a larger population of 574 distinct names, of which nearly 80% occur only twice. Looking at the top 10 values, the most common last names, Smith and Jones, occur 18 times, almost half as common as the most popular first name. As expected, this tells us that Lastname is a richer attribute than Firstname and therefore its equivalence is a better predictor of matching entities.
As expected, constituencies are uniquely paired across the two datasets, so all values appear exactly twice.
For the purposes of this simple example, we’re going to ask Splink to calculate all the parameters of the model using the expectation-maximization algorithm we introduced earlier. The initial 'True' argument is the blocking rule; it tells Splink to compare all the records across both datasets without blocking (we’ll see this in the next chapter). We also tell Splink to recalculate the u values at each iteration by setting fix_u_probabilities to False. Setting fix_probability_two_random_records_match to False means the λ value (the overall match probability between the two datasets) will be recalculated at each iteration. Finally, we tell Splink to use the updated λ value when calculating probabilities for record pairs:
em_session = linker.estimate_parameters_using_expectation_maximisation(
    'True',
    fix_u_probabilities=False,
    fix_probability_two_random_records_match=False,
    populate_probability_two_random_records_match_from_trained_values=True)
Splink Performance
The EM model converges after three iterations. Splink produces an interactive chart showing the iterative progression of the relative match weight values:
em_session.match_weights_interactive_history_chart()
Figure 4-7 shows the final match weights that Splink has calculated after the third iteration. First, we have the prior (starting) match weight, which is a measure of how likely it is that two records chosen at random match. If you hover over the match weight bars you can see the calculated match weight value together with the underlying m and u parameters. These are calculated as follows:
For illustration purposes, Splink approximates the Constituency not-exact-match weight as negative infinity and displays it in a different color. This is because there are no cases where the Firstname and Lastname attributes match but the Constituency does not.
We can see the discrete values Splink has calculated using:
linker.save_settings_to_json("Chapter4_Splink_Settings.json", overwrite=True)
{'link_type': 'link_only',
 'comparisons': [
   {'output_column_name': 'Firstname',
    'comparison_levels': [
      {'sql_condition': '"Firstname_l" IS NULL OR "Firstname_r" IS NULL',
       'label_for_charts': 'Null',
       'is_null_level': True},
      {'sql_condition': '"Firstname_l" = "Firstname_r"',
       'label_for_charts': 'Exact match',
       'm_probability': 0.992118804074688,
       'u_probability': 0.004864290128404288},
      {'sql_condition': 'ELSE',
       'label_for_charts': 'All other comparisons',
       'm_probability': 0.007881195925311958,
       'u_probability': 0.9951357098715956}],
    'comparison_description': 'Exact match vs. anything else'},
   {'output_column_name': 'Lastname',
    'comparison_levels': [
      {'sql_condition': '"Lastname_l" IS NULL OR "Lastname_r" IS NULL',
       'label_for_charts': 'Null',
       'is_null_level': True},
      {'sql_condition': '"Lastname_l" = "Lastname_r"',
       'label_for_charts': 'Exact match',
       'm_probability': 0.9937726043638647,
       'u_probability': 0.00082730840955421},
      {'sql_condition': 'ELSE',
       'label_for_charts': 'All other comparisons',
       'm_probability': 0.006227395636135347,
       'u_probability': 0.9991726915904457}],
    'comparison_description': 'Exact match vs. anything else'},
   {'output_column_name': 'Constituency',
    'comparison_levels': [
      {'sql_condition': '"Constituency_l" IS NULL OR "Constituency_r" IS NULL',
       'label_for_charts': 'Null',
       'is_null_level': True},
      {'sql_condition': '"Constituency_l" = "Constituency_r"',
       'label_for_charts': 'Exact match',
       'm_probability': 0.9999999403661186,
       'u_probability': 3.092071473132138e-05},
      {'sql_condition': 'ELSE',
       'label_for_charts': 'All other comparisons',
       'm_probability': 5.963388147277392e-08,
       'u_probability': 0.9999690792852688}],
    'comparison_description': 'Exact match vs. anything else'}],
 'retain_intermediate_calculation_columns': True,
 'retain_matching_columns': True,
 'sql_dialect': 'duckdb',
 'linker_uid': 'adm20und',
 'probability_two_random_records_match': 0.0015075875293170335}
The m and u probabilities match those we calculated manually using the expectation-maximization algorithm earlier in the chapter.
Finally, as before, we apply a threshold match probability and select the record pair above the threshold:
pres = linker.predict(threshold_match_probability=0.99).as_pandas_dataframe()
len(pres)
633
Analysis of these predictions shows that all 633 are true positives, leaving the 13 by-election true negatives and 4 false negatives. We can view the 4 false negatives with:
m_outer = match.merge(
    pres,
    left_on=['Constituency_t'],
    right_on=['Constituency_l'],
    how='outer')
m_outer[m_outer['Constituency_t']!=m_outer['Constituency_l']][
    ['Constituency_w','Lastname_w','Lastname_t']]
The output, shown in Figure 4-8, shows that the mismatch on Lastname is the reason these entities fall below the match threshold.
In comparison to the unweighted results in Chapter 3, Splink declares a match for “Liz Truss” versus “Elizabeth Truss,” but does not match “Anne Marie Morris” to “Anne Morris,” nor “Martin Docherty-Hughes” to “Martin Docherty.” This is because it is more heavily influenced by a mismatch on Lastname, which is statistically a better negative predictor, than a mismatch on Firstname.
Summary
To recap, we took two sets of records and combined them into a composite dataset containing every record pair combination. We then computed exact match features between equivalent fields and then combined those features, weighted according to how often they occurred in both the matching and nonmatching populations, to determine the overall likelihood of a match.
We saw how to use probability theory to calculate the match weights using the iterative expectation-maximization algorithm when we don’t have known match populations.
Finally, we introduced the probabilistic entity resolution framework Splink, which greatly simplified the calculations when combining multiple attributes and helped us visualize and understand our match results.
Now that we have worked through a small-scale example, we will apply the techniques of approximate and probabilistic matching on a larger scale.