Chapter 12. Orchestrating Dataproc Workloads
After developing and testing your big data applications, the next crucial step is orchestration, which ties everything together in the data-to-value lifecycle. In GCP, there are several options for orchestrating Spark and Hadoop jobs. These include Cloud Composer, Vertex AI Pipelines, Cloud Functions, and Dataproc workflows. Which option you choose will depend on your preferences and organizational needs.
In this chapter, you’ll get hands-on experience and insights into all of these options:
- Cloud Composer
Learn how to configure and orchestrate Dataproc jobs using Python DAGs.
- Vertex AI Pipelines
Discover how to leverage Vertex AI for running Dataproc Serverless workflows.
- Cloud Functions
Understand how to use Cloud Functions for lightweight, event-driven orchestration.
- Dataproc workflows
Understand how to manage and automate your data-processing tasks.
Understanding the Prerequisites for Installing Cloud Composer
Problem
Cloud Composer has several requirements that must be in place before you set it up. You want to confirm that you have met all of them before beginning the installation.
Solution
Before installing Cloud Composer, a managed Apache Airflow service on Google Cloud, ensure the following prerequisites are met:
- Enable the Composer service in a Google Cloud project
- Determine the appropriate sizes for the scheduler, triggerer, web server, and workers
- Select the network configuration
- Set up the necessary ...
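The first prerequisite, enabling the Composer service, can be sketched in Python as a thin wrapper around the `gcloud services enable` command. This is a minimal sketch and not the book's own method: the function names, the `dry_run` flag, and the `my-project` project ID are hypothetical placeholders, and actually running the command assumes an installed and authenticated gcloud CLI.

```python
import shlex
import subprocess

# Service name of the Cloud Composer API.
COMPOSER_API = "composer.googleapis.com"


def build_enable_command(project_id: str) -> list[str]:
    """Return the gcloud argument list that enables the Composer API."""
    if not project_id:
        raise ValueError("project_id must be a non-empty string")
    return ["gcloud", "services", "enable", COMPOSER_API, f"--project={project_id}"]


def enable_composer_api(project_id: str, dry_run: bool = True) -> str:
    """Render the command as a string and run it only when dry_run is False."""
    command = build_enable_command(project_id)
    rendered = shlex.join(command)
    if not dry_run:
        # Assumption: gcloud is on PATH and already authenticated.
        subprocess.run(command, check=True)
    return rendered


if __name__ == "__main__":
    # "my-project" is a placeholder; substitute your own project ID.
    print(enable_composer_api("my-project"))
```

With the default `dry_run=True`, the function only returns the command string, which makes it safe to review before anything changes in the project.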