Chapter 7. Working with Data
One of the greatest paradigm shifts in cloud computing is the nearly unlimited storage now available to users. Cheap, scalable blob storage in the form of Google Cloud Storage (GCS) lets administrators start from a standpoint of “never delete data.” Services like BigQuery and Spark on Dataproc let you pay for long-lived storage separately from compute resources, which you pay for by the second. Compute is generally more expensive than storage, so this model saves a great deal of the engineering effort otherwise spent moving, archiving, and retrieving data between disparate storage systems.
The recipes in this chapter offer tips and tricks for working with the various data layers of Google Cloud, from moving data around GCS buckets faster, to automatically archiving long-term data, to some more advanced database techniques.
All code samples for this chapter are in this book’s GitHub repository. You can follow along and copy the code for each recipe by going to the folder with that recipe’s number.
7.1 Speeding Up Cloud Storage Bulk Transfers by Multiprocessing
Problem
Although the gsutil tool performs well and is a great CLI solution for interacting with GCS, sometimes you want to max out your CPU and network bandwidth for a faster transfer. You’ll often do this when transferring a large number of files, whether to GCS, from GCS, or within the GCS service.
Solution
You can leverage the -m flag when using the gsutil command-line tool ...
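As a rough illustration of the -m flag mentioned above, here is a minimal Python sketch that assembles a recursive gsutil copy with -m enabled and can optionally run it. The helper names (build_parallel_copy_command, run_parallel_copy) and the example paths are hypothetical and do not come from the book's repository. The sketch assumes gsutil is installed and authenticated on the machine where the command is actually executed.

```python
import shlex
import subprocess
from typing import List


def build_parallel_copy_command(source: str, destination: str) -> List[str]:
    """Return the argument list for a recursive gsutil copy with -m enabled.

    Hypothetical helper for illustration only.
    """
    # -m is a top-level gsutil option, so it is placed before the "cp" subcommand.
    return ["gsutil", "-m", "cp", "-r", source, destination]


def run_parallel_copy(source: str, destination: str) -> None:
    """Run the copy. Assumes gsutil is installed and authenticated."""
    subprocess.run(build_parallel_copy_command(source, destination), check=True)


if __name__ == "__main__":
    # Hypothetical paths; substitute your own directory and bucket.
    command = build_parallel_copy_command("./local-data", "gs://example-bucket/data")
    # Print the command rather than running it, so the sketch is safe to try anywhere.
    print(shlex.join(command))
```

Building the argument list separately from running it makes it easy to inspect the exact command before any data moves.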