Hands-On Entity Resolution

Author: Michael Shearer
ISBN: 9781098148485

February 2024

Intermediate to advanced

198 pages

4h 41m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book; Why I Wrote This Book; Navigating This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction to Entity Resolution
What Is Entity Resolution?; Why Is Entity Resolution Needed?; Main Challenges of Entity Resolution; Lack of Unique Names; Inconsistent Naming Conventions; Data Capture Inconsistencies; Worked Example; Deliberate Obfuscation; Match Permutations; Blind Matching?; The Entity Resolution Process; Data Standardization; Record Blocking; Attribute Comparison; Match Classification; Clustering; Canonicalization; Worked Example; Measuring Performance; Getting Started
2. Data Standardization
Sample Problem; Environment Setup; Acquiring Data; Wikipedia Data; TheyWorkForYou Data; Cleansing Data; Wikipedia; TheyWorkForYou; Attribute Comparison; Constituency; Measuring Performance; Sample Calculation; Summary
3. Text Matching
Edit Distance Matching; Levenshtein Distance; Jaro Similarity; Jaro-Winkler Similarity; Phonetic Matching; Metaphone; Match Rating Approach; Comparing the Techniques; Sample Problem; Full Similarity Comparison; Measuring Performance; Summary
4. Probabilistic Matching
Sample Problem; Single Attribute Match Probability; First Name Match Probability; Last Name Match Probability; Multiple Attribute Match Probability; Probabilistic Models; Bayes’ Theorem; m Value; u Value; Lambda (λ) Value; Bayes Factor; Fellegi-Sunter Model; Match Weight; Expectation-Maximization Algorithm; Iteration 1; Iteration 2; Iteration 3; Introducing Splink; Configuring Splink; Splink Performance; Summary
5. Record Blocking
Sample Problem; Data Acquisition; Wikipedia Data; UK Companies House Data; Data Standardization; Wikipedia Data; UK Companies House Data; Record Blocking and Attribute Comparison; Record Blocking with Splink; Attribute Comparison; Match Classification; Measuring Performance; Summary
6. Company Matching
Sample Problem; Data Acquisition; Data Standardization; Companies House Data; Maritime and Coastguard Agency Data; Record Blocking and Attribute Comparison; Match Classification; Measuring Performance; Matching New Entities; Summary
7. Clustering
Simple Exact Match Clustering; Approximate Match Clustering; Sample Problem; Data Acquisition; Data Standardization; Record Blocking and Attribute Comparison; Data Analysis; Expectation-Maximization Blocking Rules; Match Classification and Clustering; Cluster Visualization; Cluster Analysis; Summary
8. Scaling Up on Google Cloud
Google Cloud Setup; Setting Up Project Storage; Creating a Dataproc Cluster; Configuring a Dataproc Cluster; Entity Resolution on Spark; Measuring Performance; Tidy Up!; Summary
9. Cloud Entity Resolution Services
Introduction to BigQuery; Enterprise Knowledge Graph API; Schema Mapping; Reconciliation Job; Result Processing; Entity Reconciliation Python Client; Measuring Performance; Summary
10. Privacy-Preserving Record Linkage
An Introduction to Private Set Intersection; How PSI Works; PSI Protocol Based on ECDH; Bloom Filters; Golomb-Coded Sets; Example: Using the PSI Process; Environment Setup; Server Code; Client Code; Full MCA and Companies House Sample Example; Summary
11. Further Considerations
Data Considerations; Unstructured Data; Data Quality; Temporal Equivalence; Attribute Comparison; Set Matching; Geocoding Location Matching; Aggregating Comparisons; Post Processing; Graphical Representation; Real-Time Considerations; Performance Evaluation; Pairwise Approach; Cluster-Based Approach; Future of Entity Resolution
Index
About the Author

Content preview from Hands-On Entity Resolution

Chapter 5. Record Blocking

In Chapter 4, we introduced probabilistic matching techniques to allow us to combine exact equivalence on individual attributes into a weighted composite score. That score allowed us to calculate the overall probability that two records refer to the same entity.
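
As a reminder of how such a composite score can be formed, here is a minimal sketch of a Fellegi-Sunter style calculation. The m, u, and prior (λ) values are made up for illustration and are not taken from the book's datasets, and the function name `match_probability` is hypothetical. Each attribute contributes a Bayes factor (m/u when the attribute agrees, (1-m)/(1-u) when it does not), and the log2 weights are summed with the prior odds.

```python
import math

# Illustrative (made-up) parameters for two attributes.
# m = P(attribute agrees | records match)
# u = P(attribute agrees | records do not match)
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "last_name": {"m": 0.95, "u": 0.005},
}
PRIOR = 0.001  # lambda: assumed prior probability that a random pair is a match


def match_probability(agreement, params=PARAMS, prior=PRIOR):
    """Combine per-attribute exact agreement into an overall match probability."""
    # Start from the prior, expressed as a log2 odds weight.
    weight = math.log2(prior / (1 - prior))
    for attr, agrees in agreement.items():
        m, u = params[attr]["m"], params[attr]["u"]
        # Bayes factor: m/u on agreement, (1-m)/(1-u) on disagreement.
        bayes_factor = m / u if agrees else (1 - m) / (1 - u)
        weight += math.log2(bayes_factor)
    odds = 2 ** weight
    return odds / (1 + odds)


p_both = match_probability({"first_name": True, "last_name": True})
p_first_only = match_probability({"first_name": True, "last_name": False})
```

With these assumed values, agreement on both names yields a high probability, while agreement on the first name alone leaves the pair very unlikely to be a match.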

So far, we have resolved only small datasets, where we could exhaustively compare every record with every other to find all possible matches. However, in most entity resolution scenarios, we will be dealing with larger datasets where this approach isn’t practical or affordable.
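
To see why exhaustive comparison stops being practical, consider how the number of record pairs grows. The sketch below uses illustrative dataset sizes chosen for the arithmetic only, not figures from the book; the helper names are hypothetical.

```python
def link_pair_count(n_left, n_right):
    """Pairs to compare when linking two datasets exhaustively."""
    return n_left * n_right


def dedupe_pair_count(n):
    """Unordered pairs to compare when deduplicating a single dataset."""
    return n * (n - 1) // 2


# Illustrative sizes: a small list of a few hundred records on each side...
small = link_pair_count(650, 650)
# ...versus the same small list against an assumed one million records.
large = link_pair_count(650, 1_000_000)

print(f"small: {small:,} pairs")  # small: 422,500 pairs
print(f"large: {large:,} pairs")  # large: 650,000,000 pairs
```

The pair count scales with the product of the dataset sizes, so every extra record on one side adds a full pass over the other side.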

In this chapter we will introduce record blocking to reduce the number of permutations we need to consider while minimizing the likelihood of missing a true positive match. We will leverage the Splink framework, introduced in the last chapter, to apply the Fellegi-Sunter model and use the expectation-maximization algorithm to estimate the model parameters.
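
The idea behind blocking can be sketched in plain Python before any framework is involved. This example does not use Splink's API; the records, the choice of last name as the blocking key, and the function `blocked_pairs` are all assumptions made for illustration. Only pairs that share the same blocking key are kept as candidates for detailed comparison.

```python
from collections import defaultdict
from itertools import product

# Made-up records for illustration.
left = [
    {"id": "L1", "first": "John", "last": "Smith"},
    {"id": "L2", "first": "Jane", "last": "Jones"},
    {"id": "L3", "first": "Ann", "last": "Patel"},
]
right = [
    {"id": "R1", "first": "Jon", "last": "Smith"},
    {"id": "R2", "first": "Jane", "last": "Jones"},
    {"id": "R3", "first": "Anne", "last": "Patell"},  # typo in the blocking attribute
    {"id": "R4", "first": "Bob", "last": "Smith"},
]


def blocked_pairs(left, right, key):
    """Return (left_id, right_id) pairs whose blocking keys are equal."""
    # Index the right-hand records by blocking key for constant-time lookup.
    index = defaultdict(list)
    for rec in right:
        index[key(rec)].append(rec)
    return [(l["id"], r["id"]) for l in left for r in index.get(key(l), [])]


# Exhaustive comparison: every left record against every right record.
all_pairs = [(l["id"], r["id"]) for l, r in product(left, right)]

# Blocking on lowercased last name keeps only pairs that share that key.
candidates = blocked_pairs(left, right, key=lambda rec: rec["last"].lower())

print(len(all_pairs), len(candidates))  # 12 3
```

Blocking cuts the 12 exhaustive pairs to 3 candidates, but the plausible match between L3 and R3 is lost because of the misspelled last name. This is the trade-off the chapter describes: fewer comparisons against the risk of missing a true positive match.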

Lastly, we will consider how to measure our matching performance over this larger dataset.

Sample Problem

In previous chapters, we considered the challenge of resolving entities across two datasets containing information about members of the UK House of Commons. In this chapter, we extend this resolution challenge to a much larger dataset containing a list of the persons with significant control of registered UK companies.

In the UK, Companies House is an executive agency sponsored by the Department for Business and Trade. It incorporates and dissolves ...

Publisher Resources

ISBN: 9781098148478
