
Fuzzy Data Matching with SQL

Name: Fuzzy Data Matching with SQL
Author: Jim Lehmer
ISBN: 9781098152277


October 2023

Intermediate to advanced

282 pages

6h 32m

English

O'Reilly Media, Inc.


Preface
What Problems Are We Trying to Solve?; What Will We Cover?; Part I: Review; Part II: Various Data Problems; Part III: Bringing It Together; Appendix; Who Is This Book For?; Why SQL?; Warning! Opinions Ahead!; Typographical Conventions Used in This Book; Additional Information on the Book’s Conventions; The Data “Model”; Environment Layout; Customer Table; “Normalized” View; Meet the Snedleys; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. Review
1. A SELECT Review
Simple SELECT Statements; Common Table Expressions; In CASE of Emergency; Joins; A Diversion into NULL Values; OUTER JOINs; Finding the Most Current Value; Final Thoughts on SELECT
2. Function Junction
Aggregate Functions; MAX; MIN; COUNT; SUM; AVG; Conversion Functions; CAST and CONVERT; COALESCE; TRY_CONVERT; Cryptographic Functions: HASHBYTES; Date and Time Functions; GETDATE; DATEADD; DATEDIFF; DATEPART; ISDATE; Logical Functions: IIF; String Functions; CHARINDEX and PATINDEX; LEN; LEFT, RIGHT, and SUBSTRING; LTRIM, RTRIM, and TRIM; LOWER and UPPER; REPLACE and TRANSLATE; REVERSE; STRING_AGG; System Functions; ISNULL; ISNUMERIC; Final Thoughts on Functions
II. Various Data Problems
3. Names, Names, Names
What’s in a Name?; Last Names; Punctuation; Suffixes; First Names; Middle Name; Nicknames; Company Name; Full Name; “Person-Like Entities”; Final Thoughts on Names
4. Location, Location, Location
What Makes an Address?; Street Address; Box, Suite, Lot, or Apartment Number; Don’t Overdo It!; City; County; State or State Abbreviation; ZIP or Postal Code; Country; Final Thoughts on Locations
5. Dates, Dates, Dates
Time Is Relative; Final Thoughts on Dates
6. Email
What Makes a Valid Email Address?; Final Thoughts on Email
7. Phone Numbers
What Makes a “Phone Number”?; One Final Note on Tax IDs; Final Thoughts on Phone Numbers (and Tax IDs)

8. Bad Characters
Data Representations; Invisible Whitespace; COLLATE; Cleaning Up the Input Data; Final Thoughts on Bad Characters
9. Orthogonal Data
A Common Problem, A Common Solution, A New Common Problem; Lather, Rinse, Repeat; Final Thoughts on Orthogonal Data
III. Bringing It Together
10. The Big Score
What Will We Want?; Tuning Scores; Eliminating Duplicates; Duplicate Data; Duplicated Data; Final Thoughts on Scoring
11. Data Quality, or GIGO
Sneaking Data Quality In; Impossible Data; Simply Wrong; Semantically Wrong; ETL Your Way to Success; Final Thoughts on Data Quality
12. Tying It All Together
Approach; What’s the Score?; First Pass: Naive Matching; Second Pass: Normalizing Relations; Impossible Data; Now Let’s Normalize; Third Pass: Score!; What About Tuning?; Final Thoughts on Practical Matters
13. Code Is Data, Too!
Working with XML Data; Working with JSON Data; Extracting Data from HTML; Code-Generating Code; Impact Analysis: The Second Case Study; Gather Together Every Code “Artifact” You Can; Import Artifacts into SQL; And Now, for My Next Trick; Final Thoughts on Code As Data; Final Thoughts on All of It
Appendix. The Data “Model”
Customer Table; NormalizedCustomer View; PotentialMatches Table; CustomerCountByState View; PostalAbbreviations Table
Glossary
Index
About the Author

Content preview from Fuzzy Data Matching with SQL

Chapter 4. Location, Location, Location

Addresses differ around the world, and while I have worked in Canada and England, I will stick with what I really know and discuss only United States addresses and their components, such as ZIP codes. However, most of the techniques presented here are probably applicable elsewhere, perhaps with some tuning to account for differences in postal code formats and so on.

What Makes an Address?

Addresses are composed of many parts:

Street number: “123”
Street name: “Main”
Street type: “St” versus “Blvd” versus “Rd” versus “Hwy”
Box, suite, lot, or apartment number: Perhaps “Floor” and other variants.
City: Sometimes called locale in schemas (or even just l in LDAP).
County: Are you “data quality mature”? In your system’s user interface, is county a cascading drop-down list based on the state chosen or, better, the ZIP code?
State, province, or state/province abbreviation: Probably the latter.
ZIP or postal code: What about “+4” for the United States? Does your organization consistently enter and check that for data quality?
Country: Is it from a constrained drop-down list? Good. If it is a freeform text field that is hand-entered, then probably Not Good.
Latitude and longitude: Unlikely, or it is getting autopopulated by a background process and still could be wrong (hint: rural addresses, P.O. boxes, etc.). This isn’t useful for address matching, so we will drop it from our discussion.
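
As a rough illustration of the components listed above, here is a minimal Python sketch. It is not taken from the book, which works in SQL. The `Address` field names, the `split_zip` helper, and the sample values are all hypothetical, and the pattern assumes United States ZIP codes written as five digits with an optional hyphenated "+4" suffix.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed format: five digits, optionally followed by "-" and four more digits.
ZIP_PATTERN = re.compile(r"(\d{5})(?:-(\d{4}))?")


@dataclass
class Address:
    """One field per address component (field names are illustrative only)."""
    street_number: str
    street_name: str
    street_type: str
    unit: Optional[str]      # box, suite, lot, or apartment number
    city: str
    county: Optional[str]
    state: str               # two-letter abbreviation assumed
    postal_code: str
    country: str = "US"


def split_zip(postal_code: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (zip5, plus4) if the value looks like a US ZIP code, else None."""
    match = ZIP_PATTERN.fullmatch(postal_code.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


# Made-up sample record, used only to exercise the helper.
sample = Address("123", "Main", "St", None, "Anytown", None, "NE", "68102-1234")
zip_parts = split_zip(sample.postal_code)
```

Keeping the "+4" as a separate, optional value lets a comparison fall back to the five-digit ZIP code when one side of a match lacks the suffix.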

In the United Kingdom and other Commonwealth countries, ...


Publisher Resources

ISBN: 9781098152260
