Hands-On Entity Resolution

Book description

Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs.

Author Michael Shearer shows you how to scale up your data matching processes and improve the accuracy of your reconciliations. You'll be able to remove duplicate entries within a single source and join disparate data sources together when common keys aren't available. Using real-world data examples, this book gives you the practical understanding you need to accelerate the delivery of real business value.

With entity resolution, you'll build rich, comprehensive data assets that reveal relationships for marketing and risk management, a key step toward harnessing the full potential of ML and AI. This book covers:

  • Challenges in deduplicating and joining datasets
  • Extracting, cleansing, and preparing datasets for matching
  • Text matching algorithms to identify equivalent entities
  • Techniques for deduplicating and joining datasets at scale
  • Matching datasets containing persons and organizations
  • Evaluating data matches
  • Optimizing and tuning data matching algorithms
  • Entity resolution using cloud APIs
  • Matching using privacy-enhancing technologies

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. Why I Wrote This Book
    3. Navigating This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. 1. Introduction to Entity Resolution
    1. What Is Entity Resolution?
    2. Why Is Entity Resolution Needed?
    3. Main Challenges of Entity Resolution
      1. Lack of Unique Names
      2. Inconsistent Naming Conventions
      3. Data Capture Inconsistencies
      4. Worked Example
      5. Deliberate Obfuscation
      6. Match Permutations
      7. Blind Matching?
    4. The Entity Resolution Process
      1. Data Standardization
      2. Record Blocking
      3. Attribute Comparison
      4. Match Classification
      5. Clustering
      6. Canonicalization
      7. Worked Example
    5. Measuring Performance
    6. Getting Started
  3. 2. Data Standardization
    1. Sample Problem
    2. Environment Setup
    3. Acquiring Data
      1. Wikipedia Data
      2. TheyWorkForYou Data
    4. Cleansing Data
      1. Wikipedia
      2. TheyWorkForYou
    5. Attribute Comparison
    6. Constituency
    7. Measuring Performance
    8. Sample Calculation
    9. Summary
  4. 3. Text Matching
    1. Edit Distance Matching
      1. Levenshtein Distance
      2. Jaro Similarity
      3. Jaro-Winkler Similarity
    2. Phonetic Matching
      1. Metaphone
      2. Match Rating Approach
    3. Comparing the Techniques
    4. Sample Problem
    5. Full Similarity Comparison
    6. Measuring Performance
    7. Summary
  5. 4. Probabilistic Matching
    1. Sample Problem
    2. Single Attribute Match Probability
      1. First Name Match Probability
      2. Last Name Match Probability
    3. Multiple Attribute Match Probability
    4. Probabilistic Models
      1. Bayes’ Theorem
      2. m Value
      3. u Value
      4. Lambda (λ) Value
      5. Bayes Factor
      6. Fellegi-Sunter Model
      7. Match Weight
    5. Expectation-Maximization Algorithm
      1. Iteration 1
      2. Iteration 2
      3. Iteration 3
    6. Introducing Splink
      1. Configuring Splink
      2. Splink Performance
    7. Summary
  6. 5. Record Blocking
    1. Sample Problem
    2. Data Acquisition
      1. Wikipedia Data
      2. UK Companies House Data
    3. Data Standardization
      1. Wikipedia Data
      2. UK Companies House Data
    4. Record Blocking and Attribute Comparison
      1. Record Blocking with Splink
      2. Attribute Comparison
    5. Match Classification
    6. Measuring Performance
    7. Summary
  7. 6. Company Matching
    1. Sample Problem
    2. Data Acquisition
    3. Data Standardization
      1. Companies House Data
      2. Maritime and Coastguard Agency Data
    4. Record Blocking and Attribute Comparison
    5. Match Classification
    6. Measuring Performance
    7. Matching New Entities
    8. Summary
  8. 7. Clustering
    1. Simple Exact Match Clustering
    2. Approximate Match Clustering
    3. Sample Problem
      1. Data Acquisition
      2. Data Standardization
    4. Record Blocking and Attribute Comparison
      1. Data Analysis
      2. Expectation-Maximization Blocking Rules
    5. Match Classification and Clustering
    6. Cluster Visualization
    7. Cluster Analysis
    8. Summary
  9. 8. Scaling Up on Google Cloud
    1. Google Cloud Setup
      1. Setting Up Project Storage
    2. Creating a Dataproc Cluster
    3. Configuring a Dataproc Cluster
    4. Entity Resolution on Spark
    5. Measuring Performance
    6. Tidy Up!
    7. Summary
  10. 9. Cloud Entity Resolution Services
    1. Introduction to BigQuery
    2. Enterprise Knowledge Graph API
      1. Schema Mapping
      2. Reconciliation Job
      3. Result Processing
      4. Entity Reconciliation Python Client
    3. Measuring Performance
    4. Summary
  11. 10. Privacy-Preserving Record Linkage
    1. An Introduction to Private Set Intersection
    2. How PSI Works
    3. PSI Protocol Based on ECDH
      1. Bloom Filters
      2. Golomb-Coded Sets
    4. Example: Using the PSI Process
      1. Environment Setup
      2. Server Code
      3. Client Code
      4. Full MCA and Companies House Sample Example
    5. Summary
  12. 11. Further Considerations
    1. Data Considerations
      1. Unstructured Data
      2. Data Quality
      3. Temporal Equivalence
    2. Attribute Comparison
      1. Set Matching
      2. Geocoding Location Matching
      3. Aggregating Comparisons
    3. Post Processing
    4. Graphical Representation
    5. Real-Time Considerations
    6. Performance Evaluation
      1. Pairwise Approach
      2. Cluster-Based Approach
    7. Future of Entity Resolution
  13. Index
  14. About the Author

Product information

  • Title: Hands-On Entity Resolution
  • Author(s): Michael Shearer
  • Release date: February 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098148485