Chapter 4. Serverless Spark and Ephemeral Dataproc Clusters
Dataproc Serverless Spark is an autoscaling, serverless product that simplifies running Spark applications: you don’t have to provision or manage infrastructure in order to run Spark jobs.
Ephemeral Dataproc clusters, also known as transient clusters, are temporary clusters that run until specific jobs are completed or terminated. Throughout this book, we’ll refer to them as ephemeral clusters. Both serverless and ephemeral Dataproc clusters support running jobs within a VPC or default network.
In this chapter, you will gain a clear understanding of when to use Dataproc Serverless Spark and ephemeral Dataproc clusters. You will also learn:
- How to submit Spark jobs to Dataproc Serverless
- How to create and run jobs on ephemeral clusters
- How to configure a Spark history server
- How to leverage Spark RAPIDS Accelerator
- How to price and monitor serverless Spark jobs
Let’s dive in and explore how to scale your Spark workloads efficiently!
Running on Dataproc: Serverless Versus Ephemeral Clusters
Problem
Scenario 1: you’re tasked with running a Spark job on Dataproc, with the following three criteria:
- You want to avoid managing or customizing the cluster, including hardware selection.
- There’s no requirement to sequence Spark jobs on the same cluster.
- The objective is to execute a Spark job, not Hadoop MapReduce jobs or HiveQL scripts.
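The three criteria above can be read as a simple checklist. The following is a minimal sketch of that checklist in Python; the `Workload` class, its field names, and the job-type labels are hypothetical names introduced only for illustration and are not part of any Dataproc API.

```python
from dataclasses import dataclass

# Illustrative label for the job type named in Scenario 1 (an assumption,
# not a Dataproc identifier).
SPARK = "spark"


@dataclass
class Workload:
    """Hypothetical description of a workload, used only for this sketch."""
    job_type: str                      # e.g. "spark", "hadoop_mapreduce", "hiveql"
    needs_cluster_customization: bool  # includes choosing hardware
    needs_job_sequencing: bool         # jobs must be sequenced on the same cluster


def matches_scenario_1(workload: Workload) -> bool:
    """Return True only when all three Scenario 1 criteria hold."""
    return (
        workload.job_type == SPARK
        and not workload.needs_cluster_customization
        and not workload.needs_job_sequencing
    )
```

For example, `matches_scenario_1(Workload("spark", False, False))` evaluates to `True`, while a HiveQL script or a job that needs custom hardware does not match.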
Scenario 2: you have a Spark job, a Hadoop job, ...