Chapter 9. Cloud Entity Resolution Services

In the last chapter, we saw how to scale up our entity resolution process to run on a Google Cloud–managed Spark cluster. This approach allowed us to match larger datasets in a reasonable time, but it required us to do quite a bit of setup and management ourselves.

An alternative approach is to use an entity resolution API provided by a cloud provider to perform the hard work for us. Google, Amazon, and Microsoft all offer these services.

In this chapter, we will use the entity reconciliation service, provided as part of Google’s Enterprise Knowledge Graph API, to resolve the MCA and Companies House datasets we examined in Chapters 6 and 8. We will:

  • Upload our standardized datasets to Google’s data warehouse, BigQuery.
  • Provide a mapping of our data schema to a standard ontology.
  • Invoke the API from the console, and then again from a Python script.
  • Use some basic SQL to process the results.

To complete the chapter, we will examine how well the service performs.

Introduction to BigQuery

BigQuery is Google’s fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a platform as a service that supports data querying and analysis using a dialect of SQL.

To begin, we select the BigQuery product from the Google Cloud console. Under ANALYSIS we select “SQL workspace.”

Our first step is to select “Create dataset” from the ellipsis menu alongside our project name, as shown in Figure 9-1 ...
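
For readers who prefer scripting to clicking through the console, the same dataset could also be created from Python. The following is a minimal sketch rather than the chapter's own method: it assumes the google-cloud-bigquery client library is installed and that application default credentials are configured. The function names dataset_id and create_dataset, the project ID, and the dataset name are hypothetical placeholders chosen for illustration.

```python
"""Sketch: create a BigQuery dataset from Python instead of the console."""


def dataset_id(project: str, dataset_name: str) -> str:
    """Build the fully qualified 'project.dataset' ID that BigQuery expects."""
    if not project or not dataset_name:
        raise ValueError("project and dataset_name must both be non-empty")
    return f"{project}.{dataset_name}"


def create_dataset(project: str, dataset_name: str, location: str = "US"):
    """Create the dataset if it does not already exist and return it.

    Assumes the google-cloud-bigquery package is installed and that
    credentials are available in the environment.
    """
    # Imported lazily so dataset_id() stays usable without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    dataset = bigquery.Dataset(dataset_id(project, dataset_name))
    # The location is fixed at creation time; "US" is only an example default.
    dataset.location = location
    # exists_ok=True makes the call safe to repeat if the dataset already exists.
    return client.create_dataset(dataset, exists_ok=True)
```

Calling create_dataset("my-gcp-project", "entity_resolution") with a real project ID in place of the placeholder should leave a dataset that appears under the project in the console, just as if it had been created from the ellipsis menu.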
