Chapter 8. Data Processing Tools
Google Cloud offers a variety of scalable, data processing tools. Dataflow and Dataproc are the most commonly used (outside of BigQuery, covered in another chapter). Both of these tools allow you to run open source Apache Spark and Apache Beam pipelines in a serverless or near-serverless environment. Cloud Dataflow, in particular, is an excellent environment for running large-scale, mission critical streaming pipelines for real-time analytics, data ingestion, and business logic. These recipes are examples of some of the most common tasks you’ll perform as you implement solutions on these tools.
Building a Streaming Pipeline in Dataflow SQL
You want to build a streaming pipeline using various PubSub or BigQuery sources but don’t want to write a Python or Java Apache Beam pipeline to execute on Dataflow
Dataflow SQL allows you to author pipelines purely ...