Chapter 9. Data Processing Tools
Google Cloud offers a variety of scalable data processing tools. Dataflow and Dataproc are the most commonly used (outside of BigQuery, which is covered in another chapter). These tools allow you to run open source Apache Spark pipelines (on Dataproc) or Apache Beam pipelines (on Dataflow) in a serverless or near-serverless environment. Dataflow, in particular, is an excellent environment for running large-scale, mission-critical streaming pipelines for real-time analytics, data ingestion, and business logic. There are also low-code and no-code data processing toolsets, such as Cloud Data Fusion. The recipes in this chapter cover some of the most common tasks you’ll perform as you implement solutions on these tools, and include a few more advanced Dataflow pipeline patterns.
All code samples for this chapter are in this book’s GitHub repository. You can follow along and copy the code for each recipe by going to the folder with that recipe’s number.
9.1 Cleaning Data Using the Data Fusion GUI
Problem
You want to clean and join data sets in a repeatable pipeline using a low-code or no-code, GUI-driven tool.
Solution
Cloud Data Fusion lets users work with data from sources such as Cloud Storage (GCS) and BigQuery, author repeatable pipelines in a GUI, and run them at scale and on a schedule, using Dataproc under the hood.
In this example, we will ingest some data from a CSV file into BigQuery, applying transformations and filters along the way.
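Although this recipe is built entirely in the GUI, it can help to picture the read, transform, and filter flow as code. The following is a minimal plain-Python sketch of that shape, not Data Fusion's API and not the recipe's actual pipeline. The sample data, the column names (name, city, amount), and the helper functions clean_row and run_pipeline are all hypothetical, and the final load into BigQuery is left out.

```python
import csv
import io

# Hypothetical sample input; the recipe's actual CSV is not reproduced here.
RAW_CSV = """name,city,amount
 Alice ,london,10.50
Bob,PARIS,
carol,Berlin,7
"""


def clean_row(row):
    """Transform one record: trim whitespace and normalize casing.

    Returns None when the record should be filtered out.
    """
    amount = row["amount"].strip()
    if not amount:
        # Filter step: drop records with a missing amount.
        return None
    return {
        "name": row["name"].strip().title(),
        "city": row["city"].strip().title(),
        "amount": float(amount),
    }


def run_pipeline(text):
    """Read CSV text, apply the transform, and keep only surviving rows."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = (clean_row(record) for record in reader)
    return [record for record in cleaned if record is not None]


rows = run_pipeline(RAW_CSV)
for record in rows:
    print(record)
```

In the GUI, each of these stages corresponds roughly to a node you place on the pipeline canvas: a source, one or more transformations, and a sink.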
From the Cloud Console, navigate to the Data Fusion page. You may ...