Chapter 8. Scaling Up on Google Cloud

In this chapter, we will scale up our entity resolution process so that we can match large datasets in reasonable timeframes. We will use a cluster of virtual machines running in parallel on Google Cloud Platform (GCP) to divide up the workload and reduce the time taken to resolve our entities.

We will walk through how to register a new account on GCP and how to configure the storage and compute services we will need. Once our infrastructure is ready, we will rerun our company matching example from Chapter 6, splitting both the model training and entity resolution steps across a managed cluster of compute resources.

Lastly, we will check that our performance is consistent and make sure we tidy up fully, deleting the cluster and returning the virtual machines we have borrowed so that we don’t continue to run up fees.
