# Chapter 4. Probabilistic Matching

In Chapter 3, we explored how to use approximate matching techniques to measure the degree of similarity between attribute values. We set a threshold above which we declared equivalence and then combined these matching features, with equal weight, to conclude that two records referred to the same entity when both were a match. We evaluated our performance against exact matches only.

In this chapter, we will examine how to use probability-based techniques to calculate the optimum weighting for each equivalent attribute in calculating the likelihood of an overall entity match. This probability-based approach allows us to declare a match when the most statistically significant attributes are equivalent (either exact or approximate) but those with less significance are insufficiently similar. It also allows us to grade our confidence in the declaration of a match and apply appropriate match thresholds. The model that will be introduced in this section is known as the Fellegi-Sunter (FS) model.

We will also introduce a probabilistic entity resolution framework, Splink, that we will use to help us calculate these metrics and resolve our entities together.

# Sample Problem

Let’s return to our exact match results from the end of Chapter 2. Opening the *Chapter4.ipynb* notebook we reload the standardized datasets from the Wikipedia and TheyWorkForYou websites. As in Chapter 3, we start by calculating the Cartesian, or cross, product of the two datasets as:

cross = df_w.merge(df_t, how='cross', suffixes=('_w', '_t'))

This gives us our total population of 650 × 650 = 422,500 record pairs—a pair for every name combination between the Wikipedia and TheyWorkForYou datasets.

Throughout this chapter, we will be using exact matches between the `Firstname`

, `Lastname`

, and `Constituency`

fields of each of these record pairs multiple times. Thus, it’s more efficient to calculate these matches once and store them as additional feature columns:

cross['Fmatch']= (cross['Firstname_w']==cross['Firstname_t']) cross['Lmatch']= (cross['Lastname_w']==cross['Lastname_t']) cross['Cmatch']= (cross['Constituency_w']==cross['Constituency_t'])

We also calculate the total number of matching columns, which we will use later:

cross['Tmatch'] = sum([cross['Fmatch'],cross['Lmatch'],cross['Cmatch']])

Based on our exploration of the data in Chapter 2, we know that within our total population of 422,500 combinations we have 637 record pairs that have an exact match on constituency and either first name or last name. This is our `match`

population:

match = cross[cross['Cmatch'] & (cross['Fmatch'] | cross['Lmatch'])]

The remainder, our `notmatch`

population, is extracted as the inverse:

notmatch = cross[(~cross['Cmatch']) | (~cross['Fmatch'] & ~cross['Lmatch'])]

These combinations are summarized in Table 4-1.

Match/not match population | Constituency match | First name match | Last name match |
---|---|---|---|

Not match | No | No | No |

Not match | No | No | Yes |

Not match | No | Yes | No |

Not match | No | Yes | Yes |

Not match | Yes | No | No |

Match | Yes | No | Yes |

Match | Yes | Yes | No |

Match | Yes | Yes | Yes |

We will now examine how well first name and last name equivalence, both individually and together, can predict whether an individual record belongs in either the `match`

or `notmatch`

population.

# Single Attribute Match Probability

Let’s begin by considering whether first name equivalence alone is a good indicator that two entities within a record pair refer to the same person. We will examine both the `match`

and `notmatch`

populations and establish, within each of those subsets, how many first names match and how many do not.

# Naming Convention

As we work through various subsets of these populations, it’s helpful to adopt a standard naming convention so that we can see at a glance how each population of records was selected. As we select records we add the selection criteria to the population name, right to left, e.g., `first_match`

should be read as first selecting those records that are part of the `match`

population and within that subset of the population further selecting only those rows where the first names are equivalent.

## First Name Match Probability

Starting with the `match`

population we can select those records where the first names are equivalent to give us our `first_match`

population:

first_match = match[match['Fmatch']]len(first_match)632

Repeating this for the three other combinations of match/not match, and first name equivalence or not, we can draw up a population map, as shown in Figure 4-1.

Therefore, based on first name equivalence only, we have:

$True\phantom{\rule{0.166667em}{0ex}}positive\phantom{\rule{0.166667em}{0ex}}matches\phantom{\rule{0.166667em}{0ex}}\left(TP\right)=632$

$False\phantom{\rule{0.166667em}{0ex}}positive\phantom{\rule{0.166667em}{0ex}}matches\phantom{\rule{0.166667em}{0ex}}\left(FP\right)=2052$

$True\phantom{\rule{0.166667em}{0ex}}negative\phantom{\rule{0.166667em}{0ex}}matches\phantom{\rule{0.166667em}{0ex}}\left(TN\right)=419811$

$False\phantom{\rule{0.166667em}{0ex}}negative\phantom{\rule{0.166667em}{0ex}}matches\phantom{\rule{0.166667em}{0ex}}\left(FN\right)=5$

Now we can calculate some probability values. First, the probability that a record pair whose first names are equivalent is actually a true positive match can be calculated as the number of pairs within the `match`

population whose first names match divided by the number of pairs whose first names match across both the `match`

and `notmatch`

populations:

$prob\_match\_first=\frac{len(first\_match)}{\left(len\right(first\_match)+len(first\_notmatch\left)\right)}=\frac{632}{(632+2052)}\approx 0.2355$

From this we can see that, at only about 23%, first name equivalence alone isn’t a great predictor of a match between two records. This value is a conditional probability, that is, it is the probability of a true positive match conditional on the first name being a match. This can be written as:

$P\left(match\right|first)$

where the pipe character (|) is read as “given that.”

## Last Name Match Probability

Applying the same calculations to the last name, we can draw a second population map, as shown in Figure 4-2.

As for first name, the probability that a pair of records whose last names are equivalent is actually a match can be calculated as the number of pairs within the `match`

population whose last names match divided by the number of pairs whose last names match across both the `match`

and `notmatch`

populations.

$prob\_match\_last=\frac{len(last\_match)}{\left(len\right(last\_match)+len(last\_notmatch\left)\right)}=\frac{633}{(633+349)}\approx 0.6446$

For these records last name equivalence is clearly a better predictor of a true match than first name, which instinctively makes sense.

Again, this can be written as:

$P\left(match\right|last)$

# Multiple Attribute Match Probability

Now if we consider both first name and last name equivalence we can further subdivide our population map. Starting with our first name map and further subdividing each first name category into last name equivalence, and not, we can view our population as shown in Figure 4-3.

Extending our calculation to both first name and last name exact matches, we can calculate the probability of a true positive match given both first name and last name equivalence as:

$prob\_match\_last\_first=\frac{len(last\_first\_match)}{\left(len\right(last\_first\_match)+len(last\_first\_notmatch)}\phantom{\rule{4.pt}{0ex}}=\frac{628}{(628+0)}=1.0$

If the first name matches but last name doesn’t, what is the probability that it’s a match?

$prob\_match\_notlast\_first=\frac{len(notlast\_first\_match)}{\left(len\right(notlast\_first\_match)+len(notlast\_first\_notmatch\left)\right)}=\frac{4}{(4+2052)}\approx 0.0019$

If the first name doesn’t match but the last name does, what is the probability that it’s a match?

$prob\_match\_last\_notfirst=\frac{len(last\_notfirst\_match)}{\left(len\right(last\_notfirst\_match)+len(last\_notfirst\_notmatch\left)\right)}=\frac{5}{(5+349)}\approx 0.0141$

As we expected, if either first name or last name isn’t an exact match, then the probability of a true positive match is low, but a last name match gives us more confidence than a first name one.

If neither first name nor last name matches, the probability that it is a match is:

$prob\_match\_notlast\_notfirst=$

$\frac{len(notlast\_notfirst\_match)}{\left(len\right(notlast\_notfirst\_match)+len(notlast\_notfirst\_notmatch\left)\right)}=\frac{0}{(0+419462)}=0$

This is not surprising given that we defined true positive matches as records with an exact match on constituency and either first name or last name.

In conclusion, we can use these probabilities to inform our decision making on whether we are likely to have a true positive match or not. In this example, we would place more weight on a last name match than a first name one. This is an improvement on our method in Chapter 3, where we gave them the same weighting (and required them both to be equivalent) to declare a match.

But wait, we have a problem. In the preceding example, we started with a known population of matches that we used to compute the probabilities that first name and last name equivalence equate to a match. However, in most situations we don’t have a known `match`

population; otherwise we wouldn’t need to perform matching in the first place! How do we overcome this? To do so, we need to reframe our calculation a little and then employ some clever estimation techniques.

# Probabilistic Models

In the previous section, we learned that some attributes are more informative than others; that is, they have more predictive power to help us decide whether a match is likely to be correct. In this section, we examine how to calculate these contributions and how to combine them to assess the overall likelihood of a match.

We start with a little statistical theory (using first name equivalence as an example) before we generalize to model what we can deploy at scale.

## Bayes’ Theorem

*Bayes’ theorem*, named after Thomas Bayes, states that the conditional probability of an event, based on the occurrence of another event, is equal to the probability of the second event given the first event, multiplied by the probability of the first event.

Consider the probability that two records chosen at random are a true positive match, P(match), multiplied by the probability that within those matches the first names match, P (first|match):

$P\left(first\right|match)\times P(match)$

Equally, we could calculate the same value in the reverse order, starting with the probability that the first name matches, P(first) multiplied by the probability that records within this population are a true positive match:

$P\left(match\right|first)\times P(first)$

Equating these probabilities, we have:

$P\left(match\right|first)\times P(first)=P(first\left|match\right)\times P\left(match\right)$

Rearranging we can calculate:

$P\left(match\right|first)=\frac{P\left(first\right|match)\times P(match)}{P\left(first\right)}$

We can calculate P(first) as the sum of the probabilities across both the `match`

and `notmatch`

populations:

$$\begin{array}{rl}P(first)& =(P(first|match)\times P(match)\\ & \text{}+P(first|notmatch)\times P(notmatch))\end{array}$$

Substituting in the preceding equation, we have:

$P\left(match\right|first)=\frac{P\left(first\right|match)\times P(match)}{P\left(first\right|match)\times P(match)+P(first\left|notmatch\right)\times P\left(notmatch\right)}$

Alternatively, we can rearrange this as:

$P\left(match\right|first)=1-{(1+\frac{P\left(first\right|match)}{P\left(first\right|notmatch)}\times \frac{P\left(match\right)}{P\left(notmatch\right)})}^{-1}$

If we can estimate the values in this equation, we can determine the probability that if a first name is equivalent, then the record pair really is a match.

Let’s examine these values in a little more detail, simplifying the notation as we go along.

## m Value

The conditional probability that an attribute will be equivalent within the overall `match`

population is known as the *m value*. Using our `Firstname`

example, we can denote this as:

${m}_{f}=P\left(first\right|match)$

In a perfect dataset, all the first names within the `match`

population would be exactly equivalent and the *m* value would be 1. This value can therefore be thought of as a measure of data quality, i.e., how much variability there is in how an attribute has been captured across the datasets. A higher value indicates a better-quality attribute.

## u Value

The conditional probability that an attribute will be equivalent within the overall `notmatch`

population is known as the *u value*. Again, using our `Firstname`

example, we can denote this as:

${u}_{f}=P\left(first\right|notmatch)$

This value reflects how much commonality there is in this attribute across the datasets. A lower value indicates a less common, more distinguishing attribute that, if found to be equivalent in a particular case, would lead us to question whether it belongs in the `notmatch`

population and is really a match. Conversely, a higher *u* value tells us that this particular attribute is not as valuable for determining overall matches.

A good example of *u* value is a month of birth attribute, which, assuming the population is equally distributed across the year, will have a *u* value of $\frac{1}{12}$.

## Lambda ($\lambda $) Value

The *lambda value*, $\lambda $, also known as the prior, is the probability that two randomly chosen records match.

$\lambda =P\left(match\right)$

In contrast with the *m* and *u* values, the $\lambda $ value is a record-level value not associated with any particular attribute. This value is a measure of how much duplication there is in the dataset overall and is the starting point for our probability calculations.

The inverse, the likelihood that two randomly chosen records are not a match, can be written as follows:

$1-\lambda =P\left(notmatch\right)$

## Bayes Factor

Substituting these compact notations can result in the following:

$P\left(match\right|first)=1-{(1+\frac{{m}_{f}}{{u}_{f}}\times \frac{\lambda}{(1-\lambda )})}^{-1}$

The ratio $\frac{{m}_{f}}{{u}_{f}}$ is also known as the *Bayes factor*, in this case of the `Firstname`

parameter. The Bayes factor, as a combination of both the *m* and *u* values, gives a measure of the significance we should attach to the fact that the `Firstname`

values were equivalent.

## Fellegi-Sunter Model

The *Fellegi-Sunter model*, named after Ivan P. Fellegi and Alan B. Sunter,^{1} describes how we can extend our simple Bayesian approach, combining the contribution of multiple attributes, to calculate the overall likelihood of a match. It relies on the simplifying assumption of conditional independence between attributes, also known as *naive Bayes*.

Using the FS model, we can combine the Bayes factors associated with each attribute in our record by simply multiplying them together. Taking our `Firstname`

example and extending it to consider when the `Lastname`

is also equivalent we have:

$P\left(match\right|last\left|first\right)=1-{(1+\frac{{m}_{f}}{{u}_{f}}\times \frac{{m}_{l}}{{u}_{l}}\times \frac{\lambda}{(1-\lambda )})}^{-1}$

When an attribute isn’t equivalent, the Bayes factor is calculated as the inverse, $\frac{(1-{m}_{l})}{(1-{u}_{l})}$. Therefore when the `Firstname`

is equivalent but the `Lastname`

is not, we calculate the probability of an overall match as:

$P\left(match\right|notlast\left|first\right)=1-{(1+\frac{{m}_{f}}{{u}_{f}}\times \frac{(1-{m}_{l})}{(1-{u}_{l})}\times \frac{\lambda}{(1-\lambda )})}^{-1}$

Once we can calculate the *m* and *u* values for each attribute, and the $\lambda $ value for the overall dataset, we can easily calculate the probabilities for each record pair. We simply determine the equivalence of each attribute (either exact or approximate as appropriate), select the appropriate Bayes factors, and multiply them together using the preceding formula to calculate the overall probability for that record pair.

For our simple example, our Bayes factors are therefore calculated as shown in Table 4-2.

`Firstname` equivalence |
`Lastname` equivalence |
`Firstname` Bayes factor |
`Lastname` Bayes factor |
Combined Bayes factor |
---|---|---|---|---|

No | No | $\frac{(1-{m}_{f})}{(1-{u}_{f})}$ | $\frac{(1-{m}_{l})}{(1-{u}_{l})}$ | $\frac{(1-{m}_{f})}{(1-{u}_{f})}\times \frac{(1-{m}_{l})}{(1-{u}_{l})}$ |

No | Yes | $\frac{(1-{m}_{f})}{(1-{u}_{f})}$ | $\frac{{m}_{l}}{{u}_{l}}$ | $\frac{(1-{m}_{f})}{(1-{u}_{f})}\times \frac{{m}_{l}}{{u}_{l}}$ |

Yes | No | $\frac{{m}_{f}}{{u}_{f}}$ | $\frac{(1-{m}_{l})}{(1-{u}_{l})}$ | $\frac{{m}_{f}}{{u}_{f}}\times \frac{(1-{m}_{l})}{(1-{u}_{l})}$ |

Yes | Yes | $\frac{{m}_{f}}{{u}_{f}}$ | $\frac{{m}_{l}}{{u}_{l}}$ | $\frac{{m}_{f}}{{u}_{f}}\times \frac{{m}_{l}}{{u}_{l}}$ |

## Match Weight

To make overall match calculations more intuitive, the logarithm of the Bayes factors is sometimes used so that they can be added together rather than multiplied. In this way it is easy to visualize the relative contribution of each attribute to the overall score.

For our simple first name and last name equivalence example, the logarithmic match weight might be calculated (using base 2) as:

$MatchWeight=lo{g}_{2}\frac{{m}_{f}}{{u}_{f}}+lo{g}_{2}\frac{{m}_{l}}{{u}_{l}}+lo{g}_{2}\frac{\lambda}{(1-\lambda )}$

We can calculate probability from the match weight as:

$Probability=1-{(1+{2}^{MatchWeight})}^{-1}$

Now that we understand how to combine our individual attribute probabilities, or match weights, together let’s consider how to estimate our $\lambda $ value and our *m* and *u* values for each attribute when we don’t have a known `match`

population. One technique that we can use is called the *expectation-maximization (EM) algorithm*.

# Expectation-Maximization Algorithm

The expectation-maximization algorithm uses an iterative approach to approximating the $\lambda $ and *m* and *u* values. Let’s see a simplified form of this in action applied to our sample problem.

## Iteration 1

For the first iteration we make the opening assumption that record pairs where the majority of the feature columns are equivalent are matches:

it1_match = cross[cross['Tmatch']>=2] it1_notmatch = cross[cross['Tmatch']<2]len(it1_match)637

This gives us a pseudo `match`

population, `it1_match`

, of 637 records. In addition to the 628 perfect matches we found in Chapter 2, we also have 9 matches where either `Firstname`

or `Lastname`

(but not both) doesn’t match, as we see in Figure 4-4:

it1_match[~it1_match['Fmatch'] | ~it1_match['Lmatch']] [['Constituency_w','Firstname_w','Firstname_t', 'Lastname_w','Lastname_t']]

Our initial $\lambda $ value is therefore:

${\lambda}_{1}=\frac{637}{650\times 650}\approx 0.0015$

$(1-{\lambda}_{1})=(1-0.0015)\approx 0.9985$

Our initial prior match weight is therefore $lo{g}_{2}\frac{{\lambda}_{1}}{(1-{\lambda}_{1})}\approx -9.371$.

As a starting point, it’s therefore extremely unlikely that two records are a match. Now let’s calculate our *m* and *u* values so that we can update our probability on a per-record basis.

As we have a pseudo `match`

and `notmatch`

population, it’s straightforward to calculate our *m* and *u* values as the proportion of each population with an equivalent attribute. For `Firstname`

, `Lastname`

, and `Constituency`

we use:

mfi1 = len(it1_match[it1_match['Fmatch']])/len(it1_match) mli1 = len(it1_match[it1_match['Lmatch']])/len(it1_match) mci1 = len(it1_match[it1_match['Cmatch']])/len(it1_match) ufi1 = len(it1_notmatch[it1_notmatch['Fmatch']])/len(it1_notmatch) uli1 = len(it1_notmatch[it1_notmatch['Lmatch']])/len(it1_notmatch) uci1 = len(it1_notmatch[it1_notmatch['Cmatch']])/len(it1_notmatch)

Table 4-3 shows these values and the resulting match weight values per attribute.

Attribute | m value |
u value |
Match Bayes factor | Match weight | Not match Bayes factor | Not match weight |
---|---|---|---|---|---|---|

`Firstname` |
0.9921 | 0.0049 | 203.97 | 7.67 | 0.0079 | –6.98 |

`Lastname` |
0.9937 | 0.0008 | 1201.19 | 10.23 | 0.0063 | –7.31 |

`Constituency` |
1.0 | 0.0 | $\infty $ | $\infty $ | 0 | $-\infty $ |

There are no record pairs where the constituency is equivalent in the `notmatch`

population, so its *u* value is 0 and therefore its `match`

weight is mathematically infinity and its `notmatch`

weight is negative infinity.

Now we can use these values in the Fellegi-Sunter model to calculate the match probability for every record pair in the full population. We use a helper function to calculate these probabilities based on the values of the `Constituency`

, `Firstname`

, and `Lastname`

match features:

def match_prb(Fmatch,Lmatch,Cmatch,mf1,ml1,mc1,uf1,ul1,uc1, lmbda): if (Fmatch==1): mf = mf1 uf = uf1 else: mf = (1-mf1) uf = (1-uf1) if (Lmatch==1): ml = ml1 ul = ul1 else: ml = (1-ml1) ul = (1-ul1) if (Cmatch==1): mc = mc1 uc = uc1 else: mc = (1-mc1) uc = (1-uc1) prob = (lmbda * ml * mf * mc) / (lmbda * ml * mf * mc + (1-lmbda) * ul * uf * uc) return(prob)

We apply this function to the whole population with:

cross['prob'] = cross.apply(lambda x: match_prb( x.Fmatch,x.Lmatch,x.Cmatch, mfi1,mli1,mci1, ufi1,uli1,uci1, lmbda), axis=1)

Once we’ve calculated these values we can iterate again, resegmenting our population into `match`

and `notmatch`

populations based on the calculated match probabilities.

## Iteration 2

For illustration purposes, we use an overall match probability of greater than 0.99 to define our new assumed `match`

population and assign any record with an equal or lower probability to our `notmatch`

population:

it2_match = cross[cross['prob']>0.99] it2_notmatch = cross[cross['prob']<=0.99]len(it2_match)633

Applying this 0.99 threshold gives us a slightly reduced `match`

population of 633. Let’s see why. If we select the records just below the threshold we can see:

it2_notmatch[it2_notmatch['prob']>0.9] [['Constituency_w', 'Lastname_w','Lastname_t','prob']]

As we see in Figure 4-5, if the `Lastname`

isn’t equivalent, the new match probability falls just below our 0.99 threshold. Using these new `match`

and `notmatch`

populations we can revise our $\lambda $, *m*, and *u* values and iterate again, recalculating the match probabilities for each record pair.

In this case, our $\lambda $ doesn’t materially change:

${\lambda}_{2}=\frac{633}{650\times 650}\approx 0.0015$

Only the `Lastname`

values change slightly, as shown in Table 4-4.

Attribute | m value |
u value |
Match Bayes factor | Match weight | Not match Bayes factor | Not match weight |
---|---|---|---|---|---|---|

`Firstname` |
0.9921 | 0.0049 | 203.97 | 7.67 | 0.0079 | –6.98 |

`Lastname` |
1.0 |
0.0008 |
1208.79 |
10.24 |
0 |
$-\infty $ |

`Constituency` |
1.0 | 0.0 | $\infty $ | $\infty $ | 0 | $-\infty $ |

## Iteration 3

In this simple example, this next iteration doesn’t change the `match`

population, which remains at 633, because the EM algorithm has already converged.

This gives us our final parameter values of:

$\lambda \approx 0.0015$

${m}_{f}=P\left(first\right|match)\approx 0.9921$

${m}_{l}=P\left(last\right|match)\approx 1.0$

${m}_{c}=P\left(constituency\right|match)\approx 1.0$

${u}_{f}=P\left(first\right|notmatch)\approx 0.0049$

${u}_{l}=P\left(last\right|notmatch)\approx 0.0008$

${u}_{c}=P\left(constituency\right|notmatch)\approx 0$

This instinctively feels right. We know that a match will always have an equivalent constituency and either first name or last name will match, with last name slightly more likely to be equivalent than first name (five out of nine versus four out of nine in the preceding sample).

Similarly, we know the constituency will never be the same in a `notmatch`

record pair and it’s very unlikely that either the first name or last name will accidentally match either (with first name slightly more likely).

We can turn these estimated values into match probabilities using the equations in the previous section:

$P\left(match\right|last\left|first\right)=1-{(1+\frac{{m}_{f}}{{u}_{f}}\times \frac{{m}_{l}}{{u}_{l}}\times \frac{\lambda}{(1-\lambda )})}^{-1}=1.0$

$P\left(match\right|notlast\left|first\right)=1-{(1+\frac{{m}_{f}}{{u}_{f}}\times \frac{(1-{m}_{l})}{(1-{u}_{l})}\times \frac{\lambda}{(1-\lambda )})}^{-1}\approx 0.0019$

$P\left(match\right|notfirst\left|last\right)=1-{(1+\frac{(1-{m}_{f})}{(1-{u}_{f})}\times \frac{{m}_{l}}{{u}_{l}}\times \frac{\lambda}{(1-\lambda )})}^{-1}\approx 0.0141$

$P\left(match\right|notfirst\left|notlast\right)=1-{(1+\frac{(1-{m}_{f})}{(1-{u}_{f})}\times \frac{(1-{m}_{l})}{(1-{u}_{l})}\times \frac{\lambda}{(1-\lambda )})}^{-1}=0$

As expected, these probabilities match the values we calculated using the probability maps in Figure 4-3 when we knew the `match`

and `notmatch`

population upfront.

In conclusion, we are now able to estimate match probabilities for all the different permutations of attribute equivalence without having to know the `match`

population in advance. This probabilistic approach is both powerful and scalable to large datasets with multiple attributes. To help us apply these techniques more easily, we introduce a performant and easy-to-use open source library, Splink, in the next section.

# Introducing Splink

Splink is a Python package for probabilistic entity resolution. Splink implements the Fellegi-Sunter model and includes a variety of interactive outputs to help users understand models and diagnose linkage problems.

Splink supports a number of backends to perform the matching calculations. To begin with, we will use DuckDB, an in-process SQL database management system, that we can run locally on our laptop.

## Configuring Splink

To import Splink into our notebook, we use:

import splink

Splink requires a unique ID column in each dataset, so we need to create these by copying their respective DataFrame indexes:

df_w['unique_id'] = df_w.index df_t['unique_id'] = df_t.index

Splink also needs the same columns to be present in both datasets. Therefore, we need to create blank columns where these are present in only one set of records and then remove unnecessary columns:

df_w['Flink'] = None df_t['Notes'] = None df_w = df_w[['Firstname','Lastname','Constituency','Flink','Notes', 'unique_id']] df_t = df_t[['Firstname','Lastname','Constituency','Flink','Notes', 'unique_id']]

Our next step is to configure Splink settings:

from splink.duckdb.linker import DuckDBLinker from splink.duckdb import comparison_library as cl settings = { "link_type": "link_only", "comparisons": [ cl.exact_match("Firstname"), cl.exact_match("Lastname"), cl.exact_match("Constituency"), ], } linker = DuckDBLinker([df_w, df_t], settings)

Splink supports both deduplication of records within a single dataset and linking between one or more separate datasets. Here we set `link_type`

to `link_only`

to tell Splink that we want only to match, not deduplicate, between our two datasets. We also tell Splink the comparisons we wish to use, in this case exact matches across our three attributes. Lastly, we instantiate the linker with these settings and our source DataFrames.

To help us understand our datasets, Splink provides a visualization of the distribution of the columns to be matched:

linker.profile_columns(['Firstname','Lastname','Constituency'])

The graphs we see in Figure 4-6 show us the combined population across both datasets.

Starting with distribution of first names we can see from the bottom right of the graph that, within the population of 352 distinct names, approximately 35% occur only twice, most probably once in each dataset. Then, moving right to left, we see a gradual increase in frequency to the most popular name, with 32 occurrences. Looking at the top 10 values by value count we see that John is the most popular name, followed by Andrew, David, etc. This tells us that `Firstname`

is a reasonable attribute to match on, but used alone, it will result in some false positives.

For last name the pattern is more stark, with a larger population of 574 distinct names, of which nearly 80% occur only twice. Looking at the top 10 values, the most common last names, Smith and Jones, occur 18 times, almost half as common as the most popular first name. As expected, this tells us that `Lastname`

is a richer attribute than `Firstname`

and therefore its equivalence is a better predictor of matching entities.

As expected, constituencies are uniquely paired across the two datasets, so all values appear exactly twice.

For the purposes of this simple example, we’re going to ask Splink to calculate all the parameters of the model using the expectation-maximization algorithm we introduced earlier. The initial `True`

parameter tells Splink to compare all the records across both datasets without blocking (we’ll see this in the next chapter). We also tell Splink to recalculate the *u* values at each iteration by setting `fix_u_probabilities`

to `False`

. Setting the `fix_probability_two_random_records_match`

to `False`

means the $\lambda $ value (the overall match probability between the two datasets) will be recalculated at each iteration. Finally, we tell Splink to use the updated $\lambda $ value when calculating probabilities for record pairs.:

em_session = linker.estimate_parameters_using_expectation_maximisation( 'True', fix_u_probabilities=False, fix_probability_two_random_records_match=False, populate_probability_two_random_records_match_from_trained_values =True)

## Splink Performance

The EM model converges after three iterations. Splink produces an interactive chart showing the iterative progression of the relative match weights values:

em_session.match_weights_interactive_history_chart()

Figure 4-7 shows the final match weights that Splink has calculated after the third iteration. First, we have the prior (starting) match weight, which is a measure of how likely it is that two records chosen at random match. If you hover over the match weight bars you can see the calculated match weight value together with the underlying *m* and *u* parameters. These are calculated as follows:

$Prior\left(starting\right)\phantom{\rule{4pt}{0ex}}match\phantom{\rule{4pt}{0ex}}weight=lo{g}_{2}\frac{\lambda}{(1-\lambda )}\approx -9.38$

$First\phantom{\rule{4pt}{0ex}}name\phantom{\rule{4pt}{0ex}}match\phantom{\rule{4pt}{0ex}}weight\phantom{\rule{4pt}{0ex}}\left(exact\phantom{\rule{4pt}{0ex}}match\right)=lo{g}_{2}\frac{{m}_{f}}{{u}_{f}}\approx 7.67$

$First\phantom{\rule{4pt}{0ex}}name\phantom{\rule{4pt}{0ex}}match\phantom{\rule{4pt}{0ex}}weight\phantom{\rule{4pt}{0ex}}\left(not\phantom{\rule{4pt}{0ex}}exact\phantom{\rule{4pt}{0ex}}match\right)=lo{g}_{2}\frac{(1-{m}_{f})}{(1-{u}_{f})}\approx -6.98$

$Last\phantom{\rule{4pt}{0ex}}name\phantom{\rule{4pt}{0ex}}match\phantom{\rule{4pt}{0ex}}weight\phantom{\rule{4pt}{0ex}}\left(exact\phantom{\rule{4pt}{0ex}}match\right)=lo{g}_{2}\frac{{m}_{l}}{{u}_{l}}\approx 10.23$

$Last\phantom{\rule{4pt}{0ex}}name\phantom{\rule{4pt}{0ex}}match\phantom{\rule{4pt}{0ex}}weight\phantom{\rule{4pt}{0ex}}\left(not\phantom{\rule{4pt}{0ex}}exact\phantom{\rule{4pt}{0ex}}match\right)=lo{g}_{2}\frac{(1-{m}_{l})}{(1-{u}_{l})}\approx -7.32$

$Constituency\phantom{\rule{4pt}{0ex}}match\phantom{\rule{4pt}{0ex}}weight\phantom{\rule{4pt}{0ex}}\left(exact\phantom{\rule{4pt}{0ex}}match\right)=lo{g}_{2}\frac{{m}_{c}}{{u}_{c}}\approx 14.98$

For illustration purposes, Splink approximates the `Constituency`

not exact match weight as negative infinity and displays it in a different color. This is because there are no cases where the `Firstname`

and `Lastname`

attributes match but the `Constituency`

does not.

We can see the discrete values Splink has calculated using:

linker.save_settings_to_json("Chapter4_Splink_Settings.json", overwrite=True)

{'link_type': 'link_only', 'comparisons': [{'output_column_name': 'Firstname', 'comparison_levels': [{'sql_condition': '"Firstname_l" IS NULL OR "Firstname_r" IS NULL', 'label_for_charts': 'Null', 'is_null_level': True}, {'sql_condition': '"Firstname_l" = "Firstname_r"', 'label_for_charts': 'Exact match', 'm_probability': 0.992118804074688, 'u_probability': 0.004864290128404288}, {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons', 'm_probability': 0.007881195925311958, 'u_probability': 0.9951357098715956}], 'comparison_description': 'Exact match vs. anything else'}, {'output_column_name': 'Lastname', 'comparison_levels': [{'sql_condition': '"Lastname_l" IS NULL OR "Lastname_r" IS NULL', 'label_for_charts': 'Null', 'is_null_level': True}, {'sql_condition': '"Lastname_l" = "Lastname_r"', 'label_for_charts': 'Exact match', 'm_probability': 0.9937726043638647, 'u_probability': 0.00082730840955421}, {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons', 'm_probability': 0.006227395636135347, 'u_probability': 0.9991726915904457}], 'comparison_description': 'Exact match vs. anything else'}, {'output_column_name': 'Constituency', 'comparison_levels': [{'sql_condition': '"Constituency_l" IS NULL OR "Constituency_r" IS NULL', 'label_for_charts': 'Null', 'is_null_level': True}, {'sql_condition': '"Constituency_l" = "Constituency_r"', 'label_for_charts': 'Exact match', 'm_probability': 0.9999999403661186, 'u_probability': 3.092071473132138e-05}, {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons', 'm_probability': 5.963388147277392e-08, 'u_probability': 0.9999690792852688}], 'comparison_description': 'Exact match vs. anything else'}], 'retain_intermediate_calculation_columns': True, 'retain_matching_columns': True, 'sql_dialect': 'duckdb', 'linker_uid': 'adm20und', 'probability_two_random_records_match': 0.0015075875293170335}

The *m* and *u* probabilities match those we calculated manually using the expectation-maximization algorithm earlier in the chapter.

Finally, as before, we apply a threshold match probability and select the record pair above the threshold:

pres = linker.predict(threshold_match_probability = 0.99).as_pandas_dataframe()len(pres)633

Analysis of these predictions shows that all 633 are true positives, leaving the 13 by-election true negatives and 4 false negatives. We can view the 4 false negatives with:

m_outer = match.merge( pres, left_on=['Constituency_t'], right_on=['Constituency_l'], how='outer') m_outer[m_outer['Constituency_t']!=m_outer['Constituency_l']] [['Constituency_w','Lastname_w','Lastname_t']]

The output, shown in Figure 4-8, shows that the mismatch on `Lastname`

is the reason these entities fall below the match threshold.

In comparison to the unweighted results in Chapter 3, Splink declares a match for “Liz Truss” versus “Elizabeth Truss,” but does not match “Anne Marie Morris” to “Anne Morris,” nor “Martin Docherty-Hughes” to “Martin Docherty.” This is because it is more heavily influenced by a mismatch on `Lastname`

, which is statistically a better negative predictor, than a mismatch on `Firstname`

.

# Summary

To recap, we took two sets of records and combined them into a composite dataset containing every record pair combination. We then computed exact match features between equivalent fields and then combined those features, weighted according to how often they occurred in both the matching and nonmatching populations, to determine the overall likelihood of a match.

We saw how to use probability theory to calculate the match weights using the iterative expectation-maximization algorithm when we don’t have known `match`

populations.

Finally, we introduced the probabilistic entity resolution framework Splink, which greatly simplified the calculations when combining multiple attributes and helped us visualize and understand our match results.

Now that we have worked through a small-scale example, we will apply the techniques of approximate and probabilistic matching on a larger scale.

^{1} The original paper is available online.

Get *Hands-On Entity Resolution* now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.