Hands-On Entity Resolution

Name: Hands-On Entity Resolution
Author: Michael Shearer
ISBN: 9781098148485

February 2024

Intermediate to advanced

198 pages

4h 41m

English

O'Reilly Media, Inc.

Preface
  Who Should Read This Book
  Why I Wrote This Book
  Navigating This Book
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Introduction to Entity Resolution
  What Is Entity Resolution?
  Why Is Entity Resolution Needed?
  Main Challenges of Entity Resolution
  Lack of Unique Names
  Inconsistent Naming Conventions
  Data Capture Inconsistencies
  Worked Example
  Deliberate Obfuscation
  Match Permutations
  Blind Matching?
  The Entity Resolution Process
  Data Standardization
  Record Blocking
  Attribute Comparison
  Match Classification
  Clustering
  Canonicalization
  Worked Example
  Measuring Performance
  Getting Started
2. Data Standardization
  Sample Problem
  Environment Setup
  Acquiring Data
  Wikipedia Data
  TheyWorkForYou Data
  Cleansing Data
  Wikipedia
  TheyWorkForYou
  Attribute Comparison
  Constituency
  Measuring Performance
  Sample Calculation
  Summary
3. Text Matching
  Edit Distance Matching
  Levenshtein Distance
  Jaro Similarity
  Jaro-Winkler Similarity
  Phonetic Matching
  Metaphone
  Match Rating Approach
  Comparing the Techniques
  Sample Problem
  Full Similarity Comparison
  Measuring Performance
  Summary
4. Probabilistic Matching
  Sample Problem
  Single Attribute Match Probability
  First Name Match Probability
  Last Name Match Probability
  Multiple Attribute Match Probability
  Probabilistic Models
  Bayes’ Theorem
  m Value
  u Value
  Lambda (λ) Value
  Bayes Factor
  Fellegi-Sunter Model
  Match Weight
  Expectation-Maximization Algorithm
  Iteration 1
  Iteration 2
  Iteration 3
  Introducing Splink
  Configuring Splink
  Splink Performance
  Summary
5. Record Blocking
  Sample Problem
  Data Acquisition
  Wikipedia Data
  UK Companies House Data
  Data Standardization
  Wikipedia Data
  UK Companies House Data
  Record Blocking and Attribute Comparison
  Record Blocking with Splink
  Attribute Comparison
  Match Classification
  Measuring Performance
  Summary
6. Company Matching
  Sample Problem
  Data Acquisition
  Data Standardization
  Companies House Data
  Maritime and Coastguard Agency Data
  Record Blocking and Attribute Comparison
  Match Classification
  Measuring Performance
  Matching New Entities
  Summary
7. Clustering
  Simple Exact Match Clustering
  Approximate Match Clustering
  Sample Problem
  Data Acquisition
  Data Standardization
  Record Blocking and Attribute Comparison
  Data Analysis
  Expectation-Maximization Blocking Rules
  Match Classification and Clustering
  Cluster Visualization
  Cluster Analysis
  Summary
8. Scaling Up on Google Cloud
  Google Cloud Setup
  Setting Up Project Storage
  Creating a Dataproc Cluster
  Configuring a Dataproc Cluster
  Entity Resolution on Spark
  Measuring Performance
  Tidy Up!
  Summary
9. Cloud Entity Resolution Services
  Introduction to BigQuery
  Enterprise Knowledge Graph API
  Schema Mapping
  Reconciliation Job
  Result Processing
  Entity Reconciliation Python Client
  Measuring Performance
  Summary

10. Privacy-Preserving Record Linkage
  An Introduction to Private Set Intersection
  How PSI Works
  PSI Protocol Based on ECDH
  Bloom Filters
  Golomb-Coded Sets
  Example: Using the PSI Process
  Environment Setup
  Server Code
  Client Code
  Full MCA and Companies House Sample Example
  Summary
11. Further Considerations
  Data Considerations
  Unstructured Data
  Data Quality
  Temporal Equivalence
  Attribute Comparison
  Set Matching
  Geocoding Location Matching
  Aggregating Comparisons
  Post Processing
  Graphical Representation
  Real-Time Considerations
  Performance Evaluation
  Pairwise Approach
  Cluster-Based Approach
  Future of Entity Resolution
Index
About the Author

Content preview from Hands-On Entity Resolution

Chapter 2. Data Standardization

As we discussed in Chapter 1, before we can successfully match or deduplicate data sources, we need to ensure our data is presented in a consistent manner and that any anomalies are removed or corrected. We will use the term data standardization to cover both the transformation of datasets into consistent formats and the cleansing of data to remove unhelpful extra characters that would otherwise interfere with the matching process.
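
To make the two halves of that definition concrete, here is a minimal sketch in Python. It is not the book's own code: the function name `standardize` and the specific rules (strip accents, replace punctuation with spaces, collapse whitespace, lowercase) are illustrative assumptions about what a consistent format might mean for a name field.

```python
import re
import unicodedata


def standardize(value: str) -> str:
    """Return a lowercase, accent-free, punctuation-free form of value.

    Hypothetical helper for illustration; the rules are assumptions.
    """
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace anything that is not a word character or whitespace
    # (commas, apostrophes, hyphens, and so on) with a space.
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(no_punct.lower().split())


cleaned = standardize("  Siân   O'Neill-Smith ")
```

With these rules, differently formatted spellings of the same name reduce to the same string, which is what an exact comparison needs. Which characters count as "unhelpful" depends on the data, so the rules should be adjusted per attribute.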

In this chapter, we will get hands-on and work through a real-world example of this process. We will create our working environment, acquire the data we need, cleanse that data, and then perform a simple entity resolution exercise that supports some basic analysis. We will conclude by examining the performance of our data matching process and considering how we might improve it.
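
The shape of that workflow (cleanse, match exactly, then measure) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the records are invented, the helper names are hypothetical, and precision and recall are used as one common way to score a matcher, since this preview does not show the chapter's own calculation.

```python
def key(name: str) -> str:
    """Very light cleansing: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def exact_matches(left, right):
    """Pair record IDs whose cleansed names are identical."""
    index = {}
    for rid, name in right:
        index.setdefault(key(name), []).append(rid)
    return {
        (lid, rid)
        for lid, name in left
        for rid in index.get(key(name), [])
    }


def precision_recall(predicted, truth):
    """Score predicted pairs against a set of known true pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented records: (record ID, name) from two hypothetical sources.
source_a = [("a1", "Jane  Doe"), ("a2", "John Smith"), ("a3", "Alex Jones")]
source_b = [("b1", "jane doe"), ("b2", "Jon Smith"), ("b3", "Alex Jones")]
# Assumed ground truth: each aN refers to the same person as bN.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

predicted = exact_matches(source_a, source_b)
precision, recall = precision_recall(predicted, truth)
```

In this toy data, cleansing lets "Jane  Doe" match "jane doe", but the exact comparison still misses "John Smith" against "Jon Smith". Every predicted pair is correct, yet one true pair goes unfound, which is the kind of gap that motivates looking for improvements.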

First, let’s introduce our example and why we need entity resolution to solve it.

Sample Problem

Let’s work through an example problem to illustrate some of the common challenges we see in resolving entities between data sources and why data cleansing is an essential first step. As we are constrained to use openly available public sources of data, the example is slightly contrived but hopefully illustrates the need for entity resolution.

Let’s imagine we are researching factors that may influence whether members of the House of Commons, the lower house of the Parliament of the United Kingdom (UK), are reelected. We surmise that politicians ...

Publisher Resources

ISBN: 9781098148478
