Chapter 4. Probabilistic Matching
In Chapter 3, we explored how to use approximate matching techniques to measure the degree of similarity between attribute values. We set a threshold above which we declared equivalence and then combined these matching features, with equal weight, to conclude that two records referred to the same entity when both were a match. We evaluated our performance against exact matches only.
In this chapter, we will examine how to use probability-based techniques to calculate the optimum weighting for each equivalent attribute when calculating the likelihood of an overall entity match. This probability-based approach allows us to declare a match when the most statistically significant attributes are equivalent (either exactly or approximately), even when the less significant attributes are insufficiently similar. It also allows us to grade our confidence in the declaration of a match and apply appropriate match thresholds. The model that will be introduced in this section is known as the Fellegi-Sunter (FS) model.
We will also introduce a probabilistic entity resolution framework, Splink, that we will use to help us calculate these metrics and resolve our entities together.
Sample Problem
Let’s return to our exact match results from the end of Chapter 2. Opening the Chapter4.ipynb notebook we reload the standardized datasets from the Wikipedia and TheyWorkForYou websites. As in Chapter 3, we start by calculating the Cartesian, or cross, product of the two datasets as:
cross = df_w.merge(df_t, how='cross', suffixes=('_w', '_t'))
This gives us our total population of 650 × 650 = 422,500 record pairs—a pair for every name combination between the Wikipedia and TheyWorkForYou datasets.
Throughout this chapter, we will be using exact matches between the Firstname, Lastname, and Constituency fields of each of these record pairs multiple times. Thus, it’s more efficient to calculate these matches once and store them as additional feature columns:
cross['Fmatch'] = (cross['Firstname_w']==cross['Firstname_t'])
cross['Lmatch'] = (cross['Lastname_w']==cross['Lastname_t'])
cross['Cmatch'] = (cross['Constituency_w']==cross['Constituency_t'])
We also calculate the total number of matching columns, which we will use later:
cross['Tmatch'] = sum([cross['Fmatch'],cross['Lmatch'],cross['Cmatch']])
Based on our exploration of the data in Chapter 2, we know that within our total population of 422,500 combinations we have 637 record pairs that have an exact match on constituency and either first name or last name. This is our match population:
match = cross[cross['Cmatch'] & (cross['Fmatch'] | cross['Lmatch'])]
The remainder, our notmatch population, is extracted as the inverse:
notmatch = cross[(~cross['Cmatch']) | (~cross['Fmatch'] & ~cross['Lmatch'])]
These combinations are summarized in Table 4-1.
Match/not match population | Constituency match | First name match | Last name match |
---|---|---|---|
Not match | No | No | No |
Not match | No | No | Yes |
Not match | No | Yes | No |
Not match | No | Yes | Yes |
Not match | Yes | No | No |
Match | Yes | No | Yes |
Match | Yes | Yes | No |
Match | Yes | Yes | Yes |
We will now examine how well first name and last name equivalence, both individually and together, can predict whether an individual record belongs in either the match or notmatch population.
Single Attribute Match Probability
Let’s begin by considering whether first name equivalence alone is a good indicator that two entities within a record pair refer to the same person. We will examine both the match and notmatch populations and establish, within each of those subsets, how many first names match and how many do not.
Naming Convention
As we work through various subsets of these populations, it’s helpful to adopt a standard naming convention so that we can see at a glance how each population of records was selected. As we select records we add the selection criteria to the population name, right to left, e.g., first_match should be read as first selecting those records that are part of the match population and then, within that subset, further selecting only those rows where the first names are equivalent.
First Name Match Probability
Starting with the match population, we can select those records where the first names are equivalent to give us our first_match population:
first_match = match[match['Fmatch']]
len(first_match)
632
Repeating this for the three other combinations of match/not match, and first name equivalence or not, we can draw up a population map, as shown in Figure 4-1.
Therefore, based on first name equivalence only, we have:
Now we can calculate some probability values. First, the probability that a record pair whose first names are equivalent is actually a true positive match can be calculated as the number of pairs within the match population whose first names match divided by the number of pairs whose first names match across both the match and notmatch populations:

P(match | first) = first_match / (first_match + first_notmatch)
From this we can see that, at only about 23%, first name equivalence alone isn’t a great predictor of a match between two records. This value is a conditional probability; that is, it is the probability of a true positive match conditional on the first name being a match. This can be written as P(match | first), where the pipe character (|) is read as “given that.”
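This counting can be sketched in pandas on a tiny invented record-pair table (the Fmatch and is_match columns below are toy data for illustration, not the chapter’s datasets):

```python
import pandas as pd

# Toy record-pair table: Fmatch flags first-name equivalence,
# is_match flags known true matches (both invented for illustration).
pairs = pd.DataFrame({
    'Fmatch':   [True, True, True, False, True, False],
    'is_match': [True, True, False, True, False, False],
})

# Pairs whose first names match, within and across the populations
first_match = pairs[pairs['is_match'] & pairs['Fmatch']]
first_all = pairs[pairs['Fmatch']]

# P(match | first): share of first-name-equivalent pairs that are true matches
p = len(first_match) / len(first_all)
```

With this toy data, two of the four first-name-equivalent pairs are true matches, so P(match | first) works out to 0.5.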
Last Name Match Probability
Applying the same calculations to the last name, we can draw a second population map, as shown in Figure 4-2.
As for first name, the probability that a pair of records whose last names are equivalent is actually a match can be calculated as the number of pairs within the match population whose last names match divided by the number of pairs whose last names match across both the match and notmatch populations:

P(match | last) = last_match / (last_match + last_notmatch)
For these records last name equivalence is clearly a better predictor of a true match than first name, which instinctively makes sense.
Again, this can be written as P(match | last).
Multiple Attribute Match Probability
Now if we consider both first name and last name equivalence we can further subdivide our population map. Starting with our first name map and further subdividing each first name category into last name equivalence, and not, we can view our population as shown in Figure 4-3.
Extending our calculation to both first name and last name exact matches, we can calculate the probability of a true positive match given both first name and last name equivalence, P(match | first, last).
If the first name matches but last name doesn’t, what is the probability that it’s a match?
If the first name doesn’t match but the last name does, what is the probability that it’s a match?
As we expected, if either first name or last name isn’t an exact match, then the probability of a true positive match is low, but a last name match gives us more confidence than a first name one.
If neither first name nor last name matches, the probability that it is a match is zero.
This is not surprising given that we defined true positive matches as records with an exact match on constituency and either first name or last name.
In conclusion, we can use these probabilities to inform our decision making on whether we are likely to have a true positive match or not. In this example, we would place more weight on a last name match than a first name one. This is an improvement on our method in Chapter 3, where we gave them the same weighting (and required them both to be equivalent) to declare a match.
But wait, we have a problem. In the preceding example, we started with a known population of matches that we used to compute the probabilities that first name and last name equivalence equate to a match. However, in most situations we don’t have a known match population; otherwise we wouldn’t need to perform matching in the first place! How do we overcome this? To do so, we need to reframe our calculation a little and then employ some clever estimation techniques.
Probabilistic Models
In the previous section, we learned that some attributes are more informative than others; that is, they have more predictive power to help us decide whether a match is likely to be correct. In this section, we examine how to calculate these contributions and how to combine them to assess the overall likelihood of a match.
We start with a little statistical theory (using first name equivalence as an example) before we generalize to model what we can deploy at scale.
Bayes’ Theorem
Bayes’ theorem, named after Thomas Bayes, states that the conditional probability of an event, based on the occurrence of another event, is equal to the probability of the second event given the first event, multiplied by the probability of the first event.
Consider the probability that two records chosen at random are a true positive match, P(match), multiplied by the probability that within those matches the first names match, P(first | match):

P(first | match) × P(match)

Equally, we could calculate the same value in the reverse order, starting with the probability that the first name matches, P(first), multiplied by the probability that records within this population are a true positive match:

P(match | first) × P(first)

Equating these probabilities, we have:

P(first | match) × P(match) = P(match | first) × P(first)

Rearranging, we can calculate:

P(match | first) = P(first | match) × P(match) / P(first)
We can calculate P(first) as the sum of the probabilities across both the match and notmatch populations:

P(first) = P(first | match) × P(match) + P(first | notmatch) × P(notmatch)

Substituting in the preceding equation, we have:

P(match | first) = P(first | match) × P(match) / (P(first | match) × P(match) + P(first | notmatch) × P(notmatch))
Alternatively, we can rearrange this as:
If we can estimate the values in this equation, we can determine the probability that if a first name is equivalent, then the record pair really is a match.
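As a quick sanity check, this calculation can be sketched in Python. The three constants are illustrative assumptions, chosen to be roughly the magnitude of the values seen later in this chapter:

```python
# Illustrative values only (assumptions, not computed results):
m_first = 0.9921   # P(first | match)
u_first = 0.0049   # P(first | notmatch)
lam = 0.0015       # P(match), the prior probability of a random pair matching

# P(first), marginalized over the match and notmatch populations
p_first = m_first * lam + u_first * (1 - lam)

# Bayes' theorem: P(match | first)
p_match_given_first = m_first * lam / p_first
```

With these values the result lands near 0.23, consistent with the roughly 23% figure calculated earlier in the chapter.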
Let’s examine these values in a little more detail, simplifying the notation as we go along.
m Value
The conditional probability that an attribute will be equivalent within the overall match population is known as the m value. Using our Firstname example, we can denote this as:

m = P(first | match)
In a perfect dataset, all the first names within the match population would be exactly equivalent and the m value would be 1. This value can therefore be thought of as a measure of data quality, i.e., how much variability there is in how an attribute has been captured across the datasets. A higher value indicates a better-quality attribute.
u Value
The conditional probability that an attribute will be equivalent within the overall notmatch population is known as the u value. Again, using our Firstname example, we can denote this as:

u = P(first | notmatch)
This value reflects how much commonality there is in this attribute across the datasets. A lower value indicates a less common, more distinguishing attribute that, if found to be equivalent in a particular case, would lead us to question whether it belongs in the notmatch population and is really a match. Conversely, a higher u value tells us that this particular attribute is not as valuable for determining overall matches.
A good example of u value is a month of birth attribute, which, assuming the population is equally distributed across the year, will have a u value of 1/12.
Lambda (λ) Value
The lambda value, λ, also known as the prior, is the probability that two randomly chosen records match.
In contrast with the m and u values, the λ value is a record-level value not associated with any particular attribute. This value is a measure of how much duplication there is in the dataset overall and is the starting point for our probability calculations.
The inverse, the likelihood that two randomly chosen records are not a match, can be written as follows:

P(notmatch) = 1 − λ
Bayes Factor
Substituting these compact notations into the preceding equation, we have:

P(match | first) = λ × m / (λ × m + (1 − λ) × u)

The ratio m/u is also known as the Bayes factor, in this case of the Firstname parameter. The Bayes factor, as a combination of both the m and u values, gives a measure of the significance we should attach to the fact that the Firstname values were equivalent.
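A short sketch shows the Bayes factor at work: multiplying the prior odds by m/u and converting the resulting odds back to a probability gives the same answer as the formula above. The m, u, and λ values here are illustrative assumptions:

```python
# Illustrative m, u, and lambda values (assumptions for this sketch)
m, u, lam = 0.9921, 0.0049, 0.0015

bayes_factor = m / u                 # significance of a Firstname match
prior_odds = lam / (1 - lam)         # odds form of the prior
posterior_odds = prior_odds * bayes_factor

# Convert posterior odds back to a probability
p = posterior_odds / (1 + posterior_odds)
```

Algebraically, posterior_odds/(1 + posterior_odds) reduces to λm / (λm + (1 − λ)u), so the odds form and the probability form are interchangeable.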
Fellegi-Sunter Model
The Fellegi-Sunter model, named after Ivan P. Fellegi and Alan B. Sunter,1 describes how we can extend our simple Bayesian approach, combining the contribution of multiple attributes, to calculate the overall likelihood of a match. It relies on the simplifying assumption of conditional independence between attributes, also known as naive Bayes.
Using the FS model, we can combine the Bayes factors associated with each attribute in our record by simply multiplying them together. Taking our Firstname example and extending it to consider when the Lastname is also equivalent, we have:

P(match | first, last) = λ × m_first × m_last / (λ × m_first × m_last + (1 − λ) × u_first × u_last)
When an attribute isn’t equivalent, the Bayes factor is calculated as the inverse ratio, (1 − m)/(1 − u). Therefore, when the Firstname is equivalent but the Lastname is not, we calculate the probability of an overall match as:

P(match | first, not last) = λ × m_first × (1 − m_last) / (λ × m_first × (1 − m_last) + (1 − λ) × u_first × (1 − u_last))
Once we can calculate the m and u values for each attribute, and the value for the overall dataset, we can easily calculate the probabilities for each record pair. We simply determine the equivalence of each attribute (either exact or approximate as appropriate), select the appropriate Bayes factors, and multiply them together using the preceding formula to calculate the overall probability for that record pair.
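This combination step can be sketched as a small helper that walks the attributes, multiplies in the appropriate Bayes factor for each, and converts the final odds to a probability. The per-attribute m/u values and λ below are illustrative assumptions, not the chapter’s computed parameters:

```python
# Illustrative per-attribute (m, u) values and prior lambda (assumptions)
params = {'first': (0.9921, 0.0049), 'last': (0.9938, 0.0008)}
lam = 0.0015

def fs_probability(pattern, params, lam):
    """Fellegi-Sunter combination under naive Bayes.

    pattern maps attribute name -> True (equivalent) or False (not).
    Starts from the prior odds and multiplies in each attribute's
    Bayes factor: m/u when equivalent, (1-m)/(1-u) when not.
    """
    odds = lam / (1 - lam)
    for attr, agree in pattern.items():
        m, u = params[attr]
        odds *= (m / u) if agree else ((1 - m) / (1 - u))
    return odds / (1 + odds)

p_both = fs_probability({'first': True, 'last': True}, params, lam)
p_first_only = fs_probability({'first': True, 'last': False}, params, lam)
```

With these assumed values, agreement on both names pushes the probability above 0.99, while a last name mismatch pulls it below 1%, mirroring the asymmetry discussed earlier.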
For our simple example, our Bayes factors are therefore calculated as shown in Table 4-2.
Firstname equivalence | Lastname equivalence | Firstname Bayes factor | Lastname Bayes factor | Combined Bayes factor
---|---|---|---|---
No | No | (1 − m_first)/(1 − u_first) | (1 − m_last)/(1 − u_last) | (1 − m_first)(1 − m_last) / ((1 − u_first)(1 − u_last))
No | Yes | (1 − m_first)/(1 − u_first) | m_last/u_last | m_last(1 − m_first) / (u_last(1 − u_first))
Yes | No | m_first/u_first | (1 − m_last)/(1 − u_last) | m_first(1 − m_last) / (u_first(1 − u_last))
Yes | Yes | m_first/u_first | m_last/u_last | (m_first × m_last) / (u_first × u_last)
Match Weight
To make overall match calculations more intuitive, the logarithm of the Bayes factors is sometimes used so that they can be added together rather than multiplied. In this way it is easy to visualize the relative contribution of each attribute to the overall score.
For our simple first name and last name equivalence example, the logarithmic match weight might be calculated (using base 2) as:

match weight = log2(λ / (1 − λ)) + log2(m_first / u_first) + log2(m_last / u_last)

We can calculate probability from the match weight as:

probability = 2^match weight / (1 + 2^match weight)
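The match weight arithmetic can be sketched as follows, using the Firstname and Lastname Bayes factors reported in Table 4-3 and an illustrative λ:

```python
import math

lam = 0.0015                          # illustrative prior (assumption)
bf_first, bf_last = 203.97, 1201.19   # Bayes factors as in Table 4-3

# Additive form: prior weight plus one log2 Bayes factor per attribute
prior_weight = math.log2(lam / (1 - lam))
weight = prior_weight + math.log2(bf_first) + math.log2(bf_last)

# Convert the overall match weight back to a probability
prob = 2 ** weight / (1 + 2 ** weight)
```

The large negative prior weight is more than offset by the two positive attribute weights, leaving a probability close to 1 when both names agree.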
Now that we understand how to combine our individual attribute probabilities, or match weights, let’s consider how to estimate our λ value and our m and u values for each attribute when we don’t have a known match population. One technique that we can use is called the expectation-maximization (EM) algorithm.
Expectation-Maximization Algorithm
The expectation-maximization algorithm uses an iterative approach to approximate the λ, m, and u values. Let’s see a simplified form of this in action applied to our sample problem.
Iteration 1
For the first iteration we make the opening assumption that record pairs where the majority of the feature columns are equivalent are matches:
it1_match = cross[cross['Tmatch']>=2]
it1_notmatch = cross[cross['Tmatch']<2]
len(it1_match)
637
This gives us a pseudo match population, it1_match, of 637 records. In addition to the 628 perfect matches we found in Chapter 2, we also have 9 matches where either Firstname or Lastname (but not both) doesn’t match, as we see in Figure 4-4:
it1_match[~it1_match['Fmatch'] | ~it1_match['Lmatch']][
    ['Constituency_w','Firstname_w','Firstname_t',
     'Lastname_w','Lastname_t']]
Our initial λ value is therefore:

λ = 637 / 422,500 ≈ 0.0015

Our initial prior match weight is therefore log2(λ / (1 − λ)) ≈ −9.37.
As a starting point, it’s therefore extremely unlikely that two records are a match. Now let’s calculate our m and u values so that we can update our probability on a per-record basis.
As we have pseudo match and notmatch populations, it’s straightforward to calculate our m and u values as the proportion of each population with an equivalent attribute. For Firstname, Lastname, and Constituency we use:
mfi1 = len(it1_match[it1_match['Fmatch']])/len(it1_match)
mli1 = len(it1_match[it1_match['Lmatch']])/len(it1_match)
mci1 = len(it1_match[it1_match['Cmatch']])/len(it1_match)
ufi1 = len(it1_notmatch[it1_notmatch['Fmatch']])/len(it1_notmatch)
uli1 = len(it1_notmatch[it1_notmatch['Lmatch']])/len(it1_notmatch)
uci1 = len(it1_notmatch[it1_notmatch['Cmatch']])/len(it1_notmatch)
Table 4-3 shows these values and the resulting match weight values per attribute.
Attribute | m value | u value | Match Bayes factor | Match weight | Not match Bayes factor | Not match weight
---|---|---|---|---|---|---
Firstname | 0.9921 | 0.0049 | 203.97 | 7.67 | 0.0079 | –6.98
Lastname | 0.9937 | 0.0008 | 1201.19 | 10.23 | 0.0063 | –7.31
Constituency | 1.0 | 0.0 | ∞ | ∞ | 0 | –∞
There are no record pairs where the constituency is equivalent in the notmatch population, so its u value is 0; therefore its match weight is mathematically infinity and its notmatch weight is negative infinity.
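As a cross-check, the Firstname row of Table 4-3 can be reproduced from its m and u values alone; the sketch below uses the more precise values reported later in the chapter’s Splink settings:

```python
import math

# Firstname m and u values from the first EM iteration
# (higher-precision figures from the Splink settings shown later)
m, u = 0.9921188, 0.0048643

match_bf = m / u                     # Bayes factor when Firstname matches
match_weight = math.log2(match_bf)
notmatch_bf = (1 - m) / (1 - u)      # Bayes factor when it does not
notmatch_weight = math.log2(notmatch_bf)
```

These reproduce the Firstname row of Table 4-3: a match Bayes factor near 204 (weight about 7.67) and a not match weight about –6.98.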
Now we can use these values in the Fellegi-Sunter model to calculate the match probability for every record pair in the full population. We use a helper function to calculate these probabilities based on the values of the Constituency, Firstname, and Lastname match features:
def match_prb(Fmatch, Lmatch, Cmatch, mf1, ml1, mc1, uf1, ul1, uc1, lmbda):
    if (Fmatch==1):
        mf = mf1
        uf = uf1
    else:
        mf = (1-mf1)
        uf = (1-uf1)
    if (Lmatch==1):
        ml = ml1
        ul = ul1
    else:
        ml = (1-ml1)
        ul = (1-ul1)
    if (Cmatch==1):
        mc = mc1
        uc = uc1
    else:
        mc = (1-mc1)
        uc = (1-uc1)
    prob = (lmbda * ml * mf * mc) / \
           (lmbda * ml * mf * mc + (1-lmbda) * ul * uf * uc)
    return(prob)
We apply this function to the whole population with:
cross['prob'] = cross.apply(lambda x: match_prb(
    x.Fmatch, x.Lmatch, x.Cmatch,
    mfi1, mli1, mci1,
    ufi1, uli1, uci1,
    lmbda), axis=1)
Once we’ve calculated these values we can iterate again, resegmenting our population into match and notmatch populations based on the calculated match probabilities.
Iteration 2
For illustration purposes, we use an overall match probability of greater than 0.99 to define our new assumed match population and assign any record with an equal or lower probability to our notmatch population:
it2_match = cross[cross['prob']>0.99]
it2_notmatch = cross[cross['prob']<=0.99]
len(it2_match)
633
Applying this 0.99 threshold gives us a slightly reduced match population of 633. Let’s see why. If we select the records just below the threshold we can see:
it2_notmatch[it2_notmatch['prob']>0.9][
    ['Constituency_w','Lastname_w','Lastname_t','prob']]
As we see in Figure 4-5, if the Lastname isn’t equivalent, the new match probability falls just below our 0.99 threshold. Using these new match and notmatch populations we can revise our λ, m, and u values and iterate again, recalculating the match probabilities for each record pair.
In this case, our λ doesn’t materially change:

λ = 633 / 422,500 ≈ 0.0015
Only the Lastname values change slightly, as shown in Table 4-4.
Attribute | m value | u value | Match Bayes factor | Match weight | Not match Bayes factor | Not match weight
---|---|---|---|---|---|---
Firstname | 0.9921 | 0.0049 | 203.97 | 7.67 | 0.0079 | –6.98
Lastname | 1.0 | 0.0008 | 1208.79 | 10.24 | 0 | –∞
Constituency | 1.0 | 0.0 | ∞ | ∞ | 0 | –∞
Iteration 3
In this simple example, this next iteration doesn’t change the match population, which remains at 633, because the EM algorithm has already converged.
This gives us our final parameter values of:
This instinctively feels right. We know that a match will always have an equivalent constituency and either first name or last name will match, with last name slightly more likely to be equivalent than first name (five out of nine versus four out of nine in the preceding sample).
Similarly, we know the constituency will never be the same in a notmatch record pair, and it’s very unlikely that either the first name or last name will accidentally match either (with first name slightly more likely).
We can turn these estimated values into match probabilities using the equations in the previous section:
As expected, these probabilities match the values we calculated using the probability maps in Figure 4-3 when we knew the match and notmatch populations upfront.
In conclusion, we are now able to estimate match probabilities for all the different permutations of attribute equivalence without having to know the match population in advance. This probabilistic approach is both powerful and scalable to large datasets with multiple attributes. To help us apply these techniques more easily, we introduce a performant and easy-to-use open source library, Splink, in the next section.
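Before moving to Splink, the whole simplified EM loop can be sketched end to end on toy data. The comparison vectors below are invented for illustration; in practice they would be derived from real record pairs, and the m/u values are clamped away from 0 and 1 to keep the probabilities well defined:

```python
import pandas as pd

# Toy comparison vectors: 1 = attribute equivalent in the pair, 0 = not
# (invented data; columns play the role of Fmatch/Lmatch)
pairs = pd.DataFrame({
    'f': [1, 1, 1, 0, 1, 0, 0, 1, 0, 0],
    'l': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

EPS = 1e-6  # clamp m/u away from 0 and 1 to avoid degenerate probabilities

def em(pairs, threshold=0.5, max_iter=20):
    # Opening assumption: pairs where both features agree are matches
    label = (pairs['f'] + pairs['l']) >= 2
    for _ in range(max_iter):
        lam = label.mean()                            # lambda estimate
        m = pairs[label].mean().clip(EPS, 1 - EPS)    # m value per attribute
        u = pairs[~label].mean().clip(EPS, 1 - EPS)   # u value per attribute
        # P(pattern | match) and P(pattern | notmatch) under naive Bayes
        pm = (m ** pairs * (1 - m) ** (1 - pairs)).prod(axis=1)
        pu = (u ** pairs * (1 - u) ** (1 - pairs)).prod(axis=1)
        prob = lam * pm / (lam * pm + (1 - lam) * pu)
        new_label = prob > threshold                  # resegment the pairs
        if new_label.equals(label):
            break                                     # converged
        label = new_label
    return label, lam

label, lam = em(pairs)
```

On this toy data the initial segmentation is already stable, so the loop converges immediately with two matched pairs and λ = 0.2; on real data, several relabelling passes would typically be needed, as in the three iterations above.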
Introducing Splink
Splink is a Python package for probabilistic entity resolution. Splink implements the Fellegi-Sunter model and includes a variety of interactive outputs to help users understand models and diagnose linkage problems.
Splink supports a number of backends to perform the matching calculations. To begin with, we will use DuckDB, an in-process SQL database management system, that we can run locally on our laptop.
Configuring Splink
To import Splink into our notebook, we use:
import splink
Splink requires a unique ID column in each dataset, so we need to create these by copying their respective DataFrame indexes:

df_w['unique_id'] = df_w.index
df_t['unique_id'] = df_t.index
Splink also needs the same columns to be present in both datasets. Therefore, we need to create blank columns where these are present in only one set of records and then remove unnecessary columns:
df_w['Flink'] = None
df_t['Notes'] = None
df_w = df_w[['Firstname','Lastname','Constituency','Flink','Notes',
             'unique_id']]
df_t = df_t[['Firstname','Lastname','Constituency','Flink','Notes',
             'unique_id']]
Our next step is to configure Splink settings:
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb import comparison_library as cl

settings = {
    "link_type": "link_only",
    "comparisons": [
        cl.exact_match("Firstname"),
        cl.exact_match("Lastname"),
        cl.exact_match("Constituency"),
    ],
}

linker = DuckDBLinker([df_w, df_t], settings)
Splink supports both deduplication of records within a single dataset and linking between one or more separate datasets. Here we set link_type to link_only to tell Splink that we want only to match, not deduplicate, between our two datasets. We also tell Splink the comparisons we wish to use, in this case exact matches across our three attributes. Lastly, we instantiate the linker with these settings and our source DataFrames.
To help us understand our datasets, Splink provides a visualization of the distribution of the columns to be matched:
linker.profile_columns(['Firstname','Lastname','Constituency'])
The graphs we see in Figure 4-6 show us the combined population across both datasets.
Starting with the distribution of first names, we can see from the bottom right of the graph that, within the population of 352 distinct names, approximately 35% occur only twice, most probably once in each dataset. Then, moving right to left, we see a gradual increase in frequency up to the most popular name, with 32 occurrences. Looking at the top 10 values by value count, we see that John is the most popular name, followed by Andrew, David, etc. This tells us that Firstname is a reasonable attribute to match on, but used alone, it will result in some false positives.
For last name the pattern is more stark, with a larger population of 574 distinct names, of which nearly 80% occur only twice. Looking at the top 10 values, the most common last names, Smith and Jones, occur 18 times, almost half as common as the most popular first name. As expected, this tells us that Lastname is a richer attribute than Firstname and therefore its equivalence is a better predictor of matching entities.
As expected, constituencies are uniquely paired across the two datasets, so all values appear exactly twice.
For the purposes of this simple example, we’re going to ask Splink to calculate all the parameters of the model using the expectation-maximization algorithm we introduced earlier. The initial 'True' argument is the blocking rule; it tells Splink to compare all the records across both datasets without blocking (we’ll see this in the next chapter). We also tell Splink to recalculate the u values at each iteration by setting fix_u_probabilities to False. Setting fix_probability_two_random_records_match to False means the λ value (the overall match probability between the two datasets) will be recalculated at each iteration. Finally, we tell Splink to use the updated λ value when calculating probabilities for record pairs:
em_session = linker.estimate_parameters_using_expectation_maximisation(
    'True',
    fix_u_probabilities=False,
    fix_probability_two_random_records_match=False,
    populate_probability_two_random_records_match_from_trained_values=True)
Splink Performance
The EM model converges after three iterations. Splink produces an interactive chart showing the iterative progression of the relative match weight values:
em_session.match_weights_interactive_history_chart()
Figure 4-7 shows the final match weights that Splink has calculated after the third iteration. First, we have the prior (starting) match weight, which is a measure of how likely it is that two records chosen at random match. If you hover over the match weight bars you can see the calculated match weight value together with the underlying m and u parameters. These are calculated as follows:
For illustration purposes, Splink approximates the Constituency not-exact-match weight as negative infinity and displays it in a different color. This is because there are no cases where the Firstname and Lastname attributes match but the Constituency does not.
We can see the discrete values Splink has calculated using:
linker.save_settings_to_json("Chapter4_Splink_Settings.json", overwrite=True)
{'link_type': 'link_only',
 'comparisons': [
   {'output_column_name': 'Firstname',
    'comparison_levels': [
      {'sql_condition': '"Firstname_l" IS NULL OR "Firstname_r" IS NULL',
       'label_for_charts': 'Null',
       'is_null_level': True},
      {'sql_condition': '"Firstname_l" = "Firstname_r"',
       'label_for_charts': 'Exact match',
       'm_probability': 0.992118804074688,
       'u_probability': 0.004864290128404288},
      {'sql_condition': 'ELSE',
       'label_for_charts': 'All other comparisons',
       'm_probability': 0.007881195925311958,
       'u_probability': 0.9951357098715956}],
    'comparison_description': 'Exact match vs. anything else'},
   {'output_column_name': 'Lastname',
    'comparison_levels': [
      {'sql_condition': '"Lastname_l" IS NULL OR "Lastname_r" IS NULL',
       'label_for_charts': 'Null',
       'is_null_level': True},
      {'sql_condition': '"Lastname_l" = "Lastname_r"',
       'label_for_charts': 'Exact match',
       'm_probability': 0.9937726043638647,
       'u_probability': 0.00082730840955421},
      {'sql_condition': 'ELSE',
       'label_for_charts': 'All other comparisons',
       'm_probability': 0.006227395636135347,
       'u_probability': 0.9991726915904457}],
    'comparison_description': 'Exact match vs. anything else'},
   {'output_column_name': 'Constituency',
    'comparison_levels': [
      {'sql_condition': '"Constituency_l" IS NULL OR "Constituency_r" IS NULL',
       'label_for_charts': 'Null',
       'is_null_level': True},
      {'sql_condition': '"Constituency_l" = "Constituency_r"',
       'label_for_charts': 'Exact match',
       'm_probability': 0.9999999403661186,
       'u_probability': 3.092071473132138e-05},
      {'sql_condition': 'ELSE',
       'label_for_charts': 'All other comparisons',
       'm_probability': 5.963388147277392e-08,
       'u_probability': 0.9999690792852688}],
    'comparison_description': 'Exact match vs. anything else'}],
 'retain_intermediate_calculation_columns': True,
 'retain_matching_columns': True,
 'sql_dialect': 'duckdb',
 'linker_uid': 'adm20und',
 'probability_two_random_records_match': 0.0015075875293170335}
The m and u probabilities match those we calculated manually using the expectation-maximization algorithm earlier in the chapter.
Finally, as before, we apply a threshold match probability and select the record pair above the threshold:
pres = linker.predict(threshold_match_probability=0.99).as_pandas_dataframe()
len(pres)
633
Analysis of these predictions shows that all 633 are true positives, leaving the 13 by-election true negatives and 4 false negatives. We can view the 4 false negatives with:
m_outer = match.merge(
    pres,
    left_on=['Constituency_t'],
    right_on=['Constituency_l'],
    how='outer')
m_outer[m_outer['Constituency_t']!=m_outer['Constituency_l']][
    ['Constituency_w','Lastname_w','Lastname_t']]
The output, shown in Figure 4-8, shows that the mismatch on Lastname is the reason these entities fall below the match threshold.
In comparison to the unweighted results in Chapter 3, Splink declares a match for “Liz Truss” versus “Elizabeth Truss,” but does not match “Anne Marie Morris” to “Anne Morris,” nor “Martin Docherty-Hughes” to “Martin Docherty.” This is because it is more heavily influenced by a mismatch on Lastname, which is statistically a better negative predictor, than a mismatch on Firstname.
Summary
To recap, we took two sets of records and combined them into a composite dataset containing every record pair combination. We then computed exact match features between equivalent fields and then combined those features, weighted according to how often they occurred in both the matching and nonmatching populations, to determine the overall likelihood of a match.
We saw how to use probability theory to calculate the match weights using the iterative expectation-maximization algorithm when we don’t have known match populations.
Finally, we introduced the probabilistic entity resolution framework Splink, which greatly simplified the calculations when combining multiple attributes and helped us visualize and understand our match results.
Now that we have worked through a small-scale example, we will apply the techniques of approximate and probabilistic matching on a larger scale.