Chapter 8. Data Processing Tools

Google Cloud offers a variety of scalable data processing tools. Dataflow and Dataproc are the most commonly used (outside of BigQuery, covered in another chapter). Between them, these tools let you run open source Apache Spark pipelines (on Dataproc) and Apache Beam pipelines (on Dataflow) in a serverless or near-serverless environment. Dataflow, in particular, is an excellent environment for running large-scale, mission-critical streaming pipelines for real-time analytics, data ingestion, and business logic. The recipes in this chapter cover some of the most common tasks you’ll perform as you implement solutions with these tools.

Building a Streaming Pipeline in Dataflow SQL

Problem

You want to build a streaming pipeline that reads from Pub/Sub or BigQuery sources, but you don’t want to write a Python or Java Apache Beam pipeline to run on Dataflow.

Solution

Dataflow SQL allows you to author pipelines purely ...
