Chapter 6. Email

Everyone hates email, but now that use of Social Security numbers for data matching is (rightly) becoming scarce, I lean on email addresses fairly heavily. Why? While everyone hates email, everyone tends to have at least one email address, and with few exceptions they tend to be unique (I have an aunt and uncle who share an email address, and that happens, but it is rare). Along with mobile phone numbers, if you have a match on email, your likelihood of a false positive is small.

What Makes a Valid Email Address?

One thing that can throw off matching on email addresses is that Gmail allows optional periods on the recipient side, but their presence or absence makes no difference to Google! Hence, to Gmail, but not to most software:

mortimer.snedley@gmail.com = mortimersnedley@gmail.com

How do you handle this? Remove the periods, of course:

SELECT
    COUNT(*) [Exact Email Matches]
FROM crm.NormalizedCustomer C
INNER JOIN dbo.PotentialMatches P ON
    REPLACE(C.Email, '.', '') = REPLACE(P.email, '.', '');
Exact Email Matches
39

If you read all the requests for comments (RFCs) on email addresses (and there are a lot of them), you will come to the conclusion that most attempts to validate email addresses on data entry are misguided. Email addresses can contain almost any character, and any format you think they require is probably mistaken. However, valid email addresses do need to contain two specific characters: an @ to separate the delivery address from ...

Get Fuzzy Data Matching with SQL now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.