Chapter 4. Location, Location, Location
Addresses differ around the world, and while I have worked in Canada and England, I will stick with what I really know and discuss only United States addresses and their components like ZIP code. However, most of the techniques presented here are probably applicable elsewhere, perhaps with some tuning to account for differences in postal code formats and so on.
What Makes an Address?
Addresses are composed of many parts:
- Street number
-
“123”
- Street name
-
“Main”
- Street type
- Box, suite, lot, or apartment number
-
Perhaps “Floor” and other variants.
- City
-
Sometimes called locale in schemas (or even just l in LDAP).
- County
-
Are you “data quality mature”? In your system’s user interface is county a cascading drop-down list based on the state chosen or, better, the ZIP code?
- State, province, or state/province abbreviation
- ZIP or postal code
-
What about “+4” for the United States? Does your organization consistently enter and check that for data quality?
- Country
-
Is it from a constrained drop-down list? Good. If it is a freeform text field that is hand-entered, then probably Not Good.
- Latitude and longitude
-
Unlikely, or it is getting autopopulated by a background process and still could be wrong (hint: rural addresses, P.O. boxes, etc.). This isn’t useful for address matching, so we will drop it from our discussion.
In the United Kingdom and other Commonwealth countries, perhaps you have a “House Name” field (“Snedley House”). Maybe there are other administrative units between city and county or county and state or province. There may also be “Attention” or “Attn” in one of the lines to denote who to deliver to in a large company.
Likely, many of these attributes are combined; the first three may be in a Street1 or Address1 column and the box or suite in Street2 or Address2, as in the following CSV file:
LastName
,
FirstName
,
Street1
,
Street2
,...
Snedley
,
Mortimer
,
443
Arroyo
Rd
,
Box
22
,...
In addition, you may be capturing multiple addresses per customer for home, billing, shipping, etc. You will need to know if you are supposed to match the incoming data against all of them or simply against some “primary” address.
You are going to start seeing some common themes appear. Certain concerns that were true for names will remain true for addresses, phone numbers, and other fields. The techniques used become “muscle memory” over time because they are so common.
Street Address
Let’s cut to the chase. Most people get their street number correct. The rest? Is it “Mountain View” or “Mountainview”? Both streets exist in various locales. USPS and certain other entities have views about which spelling to use, which can be surprising to the locals. Is it “Rd.” or “Rd” or even “Road”? Is it “US-50 Highway West,” “US 50 Hwy W,” or “Hwy 50 W”? (I’ve seen all three.) It’s a problem.
My answer is going to seem a bit like “nuke it from space,” but I have a reason. We are collecting lots of attributes to possibly (note the “possibly”) match against. If we get a bit fuzzy on this one, there are others that will help us refine our match later. Depending on the data (I always do some EDA first to determine the “shape” of the data on both sides), I may parse out the first “word” and assume that’s a street number and then take the first x characters of the next “word.” But most often I just do the following quick and dirty trick:
SELECT
TOP
10
/*
The following "10" is obviously tunable to better fit your needs.
*/
LEFT
(
Address1
,
10
)
[
Only
Compare
1
st
Part
]
FROM
crm
.
NormalizedCustomer
ORDER
BY
LEFT
(
Address1
,
10
);
Only Compare 1st Part | |
---|---|
0 | 1 Central |
1 | 1 Century |
2 | 1 Commerce |
3 | 1 Garfield |
4 | 1 Huntwood |
5 | 1 Midway R |
6 | 1 N Harlem |
7 | 1 N San Sa |
8 | 1 Rancho D |
9 | 1 S Pine S |
It’s so crude. I feel ashamed. Really.
But you would be surprised how often this is good enough! And since it is rare to have a 10-digit street number, it usually catches the number and some part of the street name, as we can see by looking at the other “end” of the data:
SELECT
TOP
10
LEFT
(
Address1
,
10
)
[
Part
of
Address
to
Compare
]
FROM
crm
.
NormalizedCustomer
ORDER
BY
LEFT
(
Address1
,
10
)
DESC
;
-- Let's see what's at the "bottom"
Part of Address to Compare | |
---|---|
0 | 99586 Main |
1 | 9939 N 14t |
2 | 99385 Char |
3 | 993 Washin |
4 | 992 Civic |
5 | 99 Tank Fa |
6 | 99 5th Ave |
7 | 98839 Hawt |
8 | 987 Main S |
9 | 985 E 6th |
Box, Suite, Lot, or Apartment Number
Do you need them? Are you getting good enough matches without them? Then ignore them. Remember, unless you are the size of Amazon, your customer base may not be large enough to care. Sure, you may have two customers in the same apartment complex, but then, do they have the same names? My opinion is to ignore this if you can. If you can’t, then you may have to attempt some sort of normalization to compare against on both sides and hope for the best. We have no Address2 field in our sample data, so a SQL variable comes to the rescue to show what I mean:
DECLARE
@
Street2
VARCHAR
(
50
)
=
'Suite 500'
DECLARE
@
IncomingStreet2
VARCHAR
(
50
)
=
'Ste 500'
SELECT
@
Street2
[
Production
Street2
]
,
@
IncomingStreet2
[
Incoming
Data
]
,
/*
The following shows the kind of normalization you can do to attempt to always be comparing "apples to apples" on both sides. You can add as many REPLACEs as you need. Ugh. Pay attention to the "column names" (variables) in each clause to keep clear what's going on.
*/
REPLACE
(
REPLACE
(
REPLACE
(
@
Street2
,
'Suite'
,
'Ste'
)
,
'P.O. Box'
,
'Box'
)
,
'Apartment'
,
'Apt'
)
[
Normalized
Street2
]
,
REPLACE
(
REPLACE
(
REPLACE
(
@
IncomingStreet2
,
'Suite'
,
'Ste'
)
,
'P.O. Box'
,
'Box'
)
,
'Apartment'
,
'Apt'
)
[
Normalized
Incoming
]
,
CASE
REPLACE
(
REPLACE
(
REPLACE
(
@
Street2
,
'Suite'
,
'Ste'
)
,
'P.O. Box'
,
'Box'
)
,
'Apartment'
,
'Apt'
)
WHEN
REPLACE
(
REPLACE
(
REPLACE
(
@
IncomingStreet2
,
'Suite'
,
'Ste'
)
,
'P.O. Box'
,
'Box'
)
,
'Apartment'
,
'Apt'
)
THEN
'Match!'
ELSE
'No Match!'
END
[
Do
They
Match
?
]
;
Production Street2 | Incoming Data | Normalized Street2 | Normalized Incoming | Do They Match? |
---|---|---|---|---|
Suite 500 | Ste 500 | Ste 500 | Ste 500 | Match! |
Don’t Overdo It!
The astute observer will realize that the LEFT(Address1, 10)
trick in the street address discussion attempts to circumvent the whole similar problem we just saw of “Suite versus Ste” and for a much, much larger set of words:
-
Boulevard versus Blvd
-
Circle versus Cir
-
Court versus Ct
-
Highway versus Hwy (Note that “Highway” can be all over in an address, not just a suffix. Consider “Business US-50 Highway South”.)
-
North versus N (You didn’t think we had to worry about just road types, did you?)
-
Northwest versus NW (Remember, a lot of directionals can be in the string or at the end, even past the road type.)
-
Parkway versus Pkwy
-
Place versus Pl (Or is that “Plaza”? Now you have two “Pl” results. Is that OK?)
-
Road versus Rd
-
South versus S (Beware changing all “South” to “S” unless you change “123 South St” to “123 S St,” which is an entirely different thing!)
-
Street versus St
…and so on and so on. The list is very large if you want to be thorough. Obviously, punctuation differences are simply normalized out, so we don’t worry about “Rd versus Rd.” (“Rd” plus period). Also, everything mentioned in “City” applies to roads, when the road name is “Saint George Ave” (or is that “St. George Avenue”?).
Note if you do this type of normalization on Address1 or Address2 type fields, you might want to do it before you normalize spaces out! Let’s look at some of these problems. First, our sample data:
SELECT
Address1
FROM
crm
.
NormalizedCustomer
WHERE
Address1
LIKE
'%Road%'
OR
Address1
LIKE
'%South%'
OR
Address1
LIKE
'%State%'
ORDER
BY
Address1
;
Address1 | |
---|---|
0 | 1 State Route 27 |
1 | 1844 Southern Blvd |
2 | 3 State Route 35 S |
3 | 3273 State St |
4 | 33 State St |
5 | 366 South Dr |
6 | 51120 State Route 18 |
7 | 5384 Southwyck Blvd |
8 | 6 S Broadway St |
9 | 72 Southern Blvd |
10 | 721 Interstate 45 S |
11 | 73 Southern Blvd |
12 | 73 State Road 434 E |
13 | 83 County Road 437 #8581 |
14 | 85092 Southern Blvd |
15 | 868 State St #38 |
16 | 8728 S Broad St |
17 | 9 State Highway 57 #22 |
18 | 92 Broadway |
A naive “change all” algorithm trying to normalize things like “9 State Highway 57 #22” to “9 St Hwy 57 #22” causes us pain:
SELECT
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
Address1
,
'Highway'
,
'Hwy'
),
'Road'
,
'Rd'
),
'South'
,
'S'
),
'State'
,
'St'
)
[
Naive
Normalization
-
Ooops
!
]
FROM
crm
.
NormalizedCustomer
WHERE
Address1
LIKE
'%Road%'
OR
Address1
LIKE
'%South%'
OR
Address1
LIKE
'%State%'
ORDER
BY
Address1
;
Naive Normalization - Ooops! | |
---|---|
0 | 1 St Route 27 |
1 | 1844 Sern Blvd |
2 | 3 St Route 35 S |
3 | 3273 St St |
4 | 33 St St |
5 | 366 S Dr |
6 | 51120 St Route 18 |
7 | 5384 Swyck Blvd |
8 | 6 S BRdway St |
9 | 72 Sern Blvd |
10 | 721 InterSt 45 S |
11 | 73 Sern Blvd |
12 | 73 St Rd 434 E |
13 | 83 County Rd 437 #8581 |
14 | 85092 Sern Blvd |
15 | 868 St St #38 |
16 | 8728 S BRd St |
17 | 9 St Hwy 57 #22 |
18 | 92 BRdway |
¡No bueno! Sure, we got our “9 St Hwy 57 #22,” but we also got “73 Southern Blvd” changed to “73 Sern Blvd” and “6 S Broadway St” changed to “6 S BRdway St.” Doh! “Spaces,” you think, “I need to check and make sure there are always surrounding spaces!” Good idea! Sorta.... First, notice how hard the problem actually is, since you can’t use pattern matching or you end up where you were earlier, so you have to do something like the following:
SELECT
/*
We're going to change the highly indented style for
this one - I know you can follow along. I'm so proud
of you!
*/
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
REPLACE
(
/*
For each keyword, always start with the one
surrounded by spaces on both sides.
*/
Address1
,
' Highway '
,
' Hwy '
),
' Highway'
,
' Hwy'
),
'Highway '
,
'Hwy '
),
' Road '
,
' Rd '
),
' Road'
,
' Rd'
),
'Road '
,
'Rd '
),
' South '
,
' S '
),
' South'
,
' S'
),
'South '
,
'S '
),
' State '
,
' St '
),
' State'
,
' St'
),
'State '
,
'St '
)
[
Better
?
Still
Ooops
!
]
FROM
crm
.
NormalizedCustomer
WHERE
Address1
LIKE
'%Road%'
OR
Address1
LIKE
'%South%'
OR
Address1
LIKE
'%State%'
ORDER
BY
Address1
;
Better? Still Ooops! | |
---|---|
0 | 1 St Route 27 |
1 | 1844 Sern Blvd |
2 | 3 St Route 35 S |
3 | 3273 St St |
4 | 33 St St |
5 | 366 S Dr |
6 | 51120 St Route 18 |
7 | 5384 Swyck Blvd |
8 | 6 S Broadway St |
9 | 72 Sern Blvd |
10 | 721 InterSt 45 S |
11 | 73 Sern Blvd |
12 | 73 St Rd 434 E |
13 | 83 County Rd 437 #8581 |
14 | 85092 Sern Blvd |
15 | 868 St St #38 |
16 | 8728 S BRd St |
17 | 9 St Hwy 57 #22 |
18 | 92 Broadway |
That fixed “Broadway” becoming “BRdway,” but we see “Broad” still becoming “BRd,” and “State St” becoming “St St.”
The net of all this is again why I simply take some number of characters off the left of the street before or after normalizing spaces and punctuation out and move on. Going down this rathole becomes a much larger problem than is easily conquered in pure SQL.
If you need more normalization beyond this, there are third-party address cleansing services; you may want to explore engaging one. There are also open source geolocation libraries such as GeoPy. You may want to explore both approaches, but be aware their utility can be limited if many of your addresses are like “PO Box 12.” Geolocation is not really useful in those cases, since they will typically then just use some “city center” as the location. Nor can many address-cleansing solutions do much in terms of making sure such a box number is valid at a given post office.
City
Is it “Saint Charles” or “St. Charles” or “St Charles”? Worse, are you sure you have the gender right on that “Saint”? It’s “Saint Louis” but “Sainte Genevieve,” which may get rendered “Sainte” or “Ste.” or “Ste” or even the wrong but common “St.” Obviously again normalizing the name—removing spaces and punctuation—helps in the compare, but there isn’t much you can do sometimes because:
Sainte <> Ste
If I have the ZIP or postal code, then I actually consider the city name noisy data and usually drop it from the comparisons. It is rare in most cases that matching on city name adds much value, unless you are missing a postal code (that informal “fishbowl drawing” scenario again—did they just dash down “NYC” or “LA” and move on?).
County
The first question is, where are you getting this data from in your customer system? Are you sure it’s correct? Is it constrained in the user interface to allow only those counties that are actually in a given state, for example? Or are users just typing in a text field? (Uh-oh.) What about data you imported into your system? Do you have an idea of how good the data quality on county was in that dataset? Are you sure?
Better, do you pay for a geolocation service that fills that information in for you? That is the gold standard. But if not—if county is human-entered or coming from the imported data—I consider it highly suspect and, like city, often not useful if you have something better to match against, like ZIP or postal code.
Otherwise, if you have to match on it, it is like any other name—normalize it and hope for the best, realizing there’s a St. Louis County, so everything we talked about with things like “Saint” in city is true for county, too (and, as previously mentioned in street addresses, in street names like “Saint George Ave”). Also, unless the field is required in the data entry form, it is quite common for this attribute to be empty for a good percentage of the rows in a dataset, so that lowers its overall usefulness, too.
State or State Abbreviation
You will probably have the state abbreviation. If the incoming data has “California” instead of “CA,” my first recommendation is to find a reference (they exist online at the USPS and other places) and create a small mapping table like ref.PostalAbbreviations in our samples. Then use that mapping table to cross-reference between the two or to transform the incoming data to have just the abbreviations. I would do the latter, adding a column “to the right” of the existing columns, perhaps, during one of the first ETLs.
State, especially normalized down to two-character abbreviations, doesn’t have the same issues as most names—no spaces or punctuation, no variants in spelling (just misspellings or mistakes like using “AR” when you meant “AZ”). It can be a useful match, but again, if you have ZIP or postal codes, why bother? Oh, and don’t forget military APO and FPO codes, if you have deployed armed services members in your customer base.
ZIP or Postal Code
A ZIP+4 nine-digit ZIP code plus a street number will just about be the best match you can get. Unfortunately, ZIP+4 is rare (or else is “80301-0000”), and likely you have only five-digit ZIP codes either in your production data or in the data you are matching against. After making sure the ZIP code is five characters long, you will find ZIP codes useful for matching. A five-digit ZIP code and a street number may not be unique, but it and those first 10 characters of the street address will most likely be. Observe the difference between the following two. With EDA we’ve already determined the incoming data has ZIP+4. While our production data does not, we will try to match on both the street address and just the ZIP code minus the “+4.”
In the following example, we’ll do a left outer join on the postal code and address. The WHERE
clause is simply to limit our result set to a manageable number and make sure the addresses are not NULL
:
SELECT
COUNT
(
*
)
[
#
Matched
]
FROM
crm
.
NormalizedCustomer
C
LEFT
OUTER
JOIN
dbo
.
PotentialMatches
P
ON
LEFT
(
C
.
PostalCode
,
5
)
=
LEFT
(
P
.
zip
,
5
)
AND
C
.
Address1
=
P
.
address
WHERE
C
.
LastName
=
P
.
last_name
AND
C
.
FirstName
=
P
.
first_name
AND
C
.
LastName
=
'Snedley'
AND
C
.
Address1
IS
NOT
NULL
AND
P
.
address
IS
NOT
NULL
;
# Matched |
---|
0 |
Remembering our discussions on street addresses, let’s just take the first 10 characters of the address instead of trying to match on it exactly:
SELECT
COUNT
(
*
)
[
#
Matched
]
FROM
crm
.
NormalizedCustomer
C
LEFT
OUTER
JOIN
dbo
.
PotentialMatches
P
ON
LEFT
(
C
.
PostalCode
,
5
)
=
LEFT
(
P
.
zip
,
5
)
AND
LEFT
(
C
.
Address1
,
10
)
=
LEFT
(
P
.
address
,
10
)
WHERE
C
.
LastName
=
P
.
last_name
AND
C
.
FirstName
=
P
.
first_name
AND
C
.
LastName
=
'Snedley'
AND
C
.
Address1
IS
NOT
NULL
AND
P
.
address
IS
NOT
NULL
;
# Matched |
---|
10 |
But remember what I said about data sometimes being noisy and not needed? Check it out:
SELECT
COUNT
(
*
)
[
Same
#
Matched
]
FROM
crm
.
NormalizedCustomer
C
LEFT
OUTER
JOIN
dbo
.
PotentialMatches
P
ON
LEFT
(
C
.
PostalCode
,
5
)
=
LEFT
(
P
.
zip
,
5
)
/*
Do we even NEED address at all? Let's comment it out
and see what happens to our row count.
AND LEFT(C.Address1, 10) = LEFT(P.address, 10)
*/
WHERE
C
.
LastName
=
P
.
last_name
AND
C
.
FirstName
=
P
.
first_name
AND
C
.
LastName
=
'Snedley'
AND
C
.
Address1
IS
NOT
NULL
AND
P
.
address
IS
NOT
NULL
;
Same # Matched |
---|
10 |
For the purposes of this test case and this data, perhaps street address adds no value whatsoever and can be dropped from the comparisons (and ultimately, the score). However, in general having some amount of data from the street address along with the postal code is of value.
Country
Unless you are working for a multinational conglomerate and, more importantly, with international data, then country is even less useful than county, usually, and is typically incorrect or defaulted to something like “USA” (if the system is in the United States), even if the postal code is set to something that is obviously not a valid United States ZIP code. And for that matter, is it “United States,” “USA,” or “US”? One system I worked on had all three in the data, even though the field had never been exposed to user input! Let that sink in for a while. (Hint: Various automated address quality vendors, data imports, and other programmatic inputs were the culprits.)
You are going to be a sad panda if you have to use country to match on in your data. As always, apply all the techniques you’ve learned so far: removing punctuation and spaces and perhaps normalizing everything to a two-character or three-character ISO country code (use cross-reference tables), as shown in the following example. Of course, doing that for all the countries in the world, the variants on their spelling and abbreviations will be difficult. I suggest dividing it up by country and letting people familiar with the vagaries of each do the work and then perhaps roll all the results together afterward, if that would even be meaningful.
SELECT
TOP
10
Country
,
CASE
WHEN
Country
IN
(
'United States'
,
'U.S.'
,
'U.S.A.'
,
'US'
,
'USA'
)
THEN
'US'
ELSE
Country
END
[
Normalized
Country
]
FROM
crm
.
NormalizedCustomer
;
Country | Normalized Country | |
---|---|---|
0 | U.S.A. | US |
1 | US | US |
2 | U.S. | US |
3 | United States | US |
4 | USA | US |
5 | U.S.A. | US |
6 | US | US |
7 | U.S. | US |
8 | United States | US |
9 | USA | US |
Final Thoughts on Locations
You should leave this chapter understanding that when it comes to data matching, more attributes to match against do not necessarily lead to better results! If there is “junk data” anywhere in your production data, I am willing to bet it is in addresses. If you have postal code, do you need city? Especially if it adds a cross-check problem, such as my former coworker claiming to live in Kansas City when he did not.
Similarly, sometimes comparing whole strings is problematic. Are you sure you have the “+4” part of the ZIP code on both sides of the comparison? Are you sure they’re accurate? If not, instead of writing a lot of logic like this:
If there are nonzero “+4” parts to both postal codes, then compare them; otherwise, only use five digits.
just truncate both sides to five digits and move on! The same is true for street addresses: don’t go down the route (pun) of trying to normalize out all the various street types if you can just truncate to some fairly accurate “prefix” that makes that problem disappear entirely.
Finally, be prepared to discover that many attributes in your address data are just wrong. For businesses without a large international presence, country is often wrong and also all over the place in spelling. Latitude and longitude, if filled in by a geolocation process, may still be incorrect, especially for many rural addresses where the calculated location ends up being a pin in the center of the town with their post office. What does your geolocation software do with APO/FPO military addresses? Do you even know?
In fact, anything filled in by an automated process may be suspect—country, latitude, longitude, county, the “+4” on ZIP codes—depending on the quality of the data going into it and also how the process handles all these issues. That is why I ignore all these attributes as much as possible. Even in terms of data cleansing these processes can “get in the way.” If they normalize “New York” and “NYC” to “New York City,” that is great for your production data but leaves us a problem if we want to match on city and the incoming data has “NYC.” Again, this is why I often simply resort to the first ten characters of the normalized address and the five-digit ZIP code.
It may seem strange to think “less is more” when it comes to data, but in the case of fuzzy matching, it often is! You will understand why when you finish Chapter 10, on scoring.
Get Fuzzy Data Matching with SQL now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.