Dataproc Cookbook

by Narasimha Sadineni, Anuyogam Venkataraman

June 2025

Beginner to intermediate

438 pages

9h 17m

English

O'Reilly Media, Inc.


Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Creating a Dataproc Cluster
Installing Google Cloud CLI (Problem, Solution, Discussion)
Granting Identity and Access Management Privileges to a User (Problem, Solution, Discussion)
Configuring a Network and Firewall Rules (Problem, Solution, Discussion, See Also)
Creating a Dataproc Cluster from a Web UI (Problem, Solution, Discussion)
Creating a Dataproc Cluster Using Gcloud (Problem, Solution, Discussion)
Creating a Dataproc Cluster Using API Endpoints (Problem, Solution, Discussion)
Creating a Dataproc Cluster Using Terraform (Problem, Solution, Discussion)
Creating a Cluster Using Python (Problem, Solution, Discussion)
Duplicating a Dataproc Cluster (Problem, Solution, Discussion)
2. Running Hive, Spark, and Sqoop Workloads
Adding Required Privileges for Jobs (Problem, Solution, Discussion, See Also)
Generating 1 TB of Data Using a MapReduce Job (Problem, Solution, Discussion)
Running a Hive Job to Show Records from an Employee Table (Problem, Solution, Discussion, See Also)
Converting XML Data to Parquet Using Scala Spark on Dataproc (Problem, Solution, Discussion, See Also)
Converting XML Data to Parquet Using PySpark on Dataproc (Problem, Solution, Discussion, See Also)
Submitting a SparkR Job (Problem, Solution, Discussion)
Migrating Data from Cloud SQL to Hive Using Sqoop Job (Problem, Solution, Discussion)
Choosing Deployment Modes When Submitting a Spark Job to Dataproc (Problem, Solution, Discussion, See Also)
3. Advanced Dataproc Cluster Configuration
Creating an Autoscaling Policy (Problem, Solution, Discussion)
Attaching an Autoscaling Policy to a Dataproc Cluster (Problem, Solution, Discussion)
Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Autoscaling Policy (Problem, Solution, Discussion)
Adding Local SSDs to Dataproc Worker Nodes (Problem, Solution, Discussion)
Creating a Cluster with a Custom Image (Problem, Solution, Discussion)
Building a Cluster with Custom Machine Types (Problem, Solution, Discussion)
Bootstrapping Dataproc Clusters with Initialization Scripts (Problem, Solution, Discussion)
Scheduling Automatic Deletion of Unused Clusters (Problem, Solution, Discussion)
Overriding Hadoop Configurations (Problem, Solution, Discussion)
4. Serverless Spark and Ephemeral Dataproc Clusters
Running on Dataproc: Serverless Versus Ephemeral Clusters (Problem, Solution, Discussion)
Running a Sequence of Jobs on an Ephemeral Cluster (Problem, Solution, Discussion)
Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless (Problem, Solution, Discussion, See Also)
Running a Serverless Job Using the Premium Tier Configuration (Problem, Solution, Discussion)
Giving a Unique Custom Name to a Dataproc Serverless Spark Job (Problem, Solution, Discussion, See Also)
Cloning a Dataproc Serverless Spark Job (Problem, Solution, Discussion)
Running a Serverless Job on Spark RAPIDS Accelerator (Problem, Solution, Discussion, See Also)
Configuring a Spark History Server (Problem, Solution, Discussion, See Also)
Writing Spark Events to the Spark History Server from Dataproc Serverless (Problem, Solution, Discussion)
Monitoring Serverless Spark Jobs (Problem, Solution, Discussion, See Also)
Calculating the Price of a Serverless Batch (Problem, Solution, Discussion, See Also)
5. Dataproc on Google Kubernetes Engine
Creating a Kubernetes Cluster (Problem, Solution, Discussion)
Creating a Dataproc Cluster on a GKE Cluster (Problem, Solution, Discussion)
Running Spark Jobs on a Dataproc GKE Cluster (Problem, Solution, Discussion)
Customizing Node Pools (Problem, Solution, Discussion)
Autoscaling in a GKE Cluster (Problem, Solution, Discussion)
Achieving Zonal High Availability for Dataproc Jobs (Problem, Solution, Discussion)
6. Dataproc Metastore
Creating a Dataproc Metastore Service Instance (Problem, Solution, Discussion, See Also)
Attaching a DPMS Instance to One or More Clusters (Problem, Solution, Discussion)
Creating Tables and Verifying Metadata in DPMS (Problem, Solution, Discussion)
Installing an External Hive Metastore (Problem, Solution, Discussion)
Attaching an External Apache Hive Metastore to the Cluster (Problem, Solution, Discussion)
Searching for Metadata in a Dataplex Data Catalog (Problem, Solution, Discussion)
Automating the Backup of a DPMS Instance (Problem, Solution, Discussion)
7. Connecting from Dataproc to GCP Services
Reading from GCS and Writing to a BigQuery Table (Problem, Solution, Discussion)
Reading from a Cloud SQL Table (Problem, Solution, Discussion)
Writing to GCS in Delta Format (Problem, Solution, Discussion)
Integrating a Dataproc-Managed Delta Lake with BigLake (Problem, Solution, Discussion)
Connecting to GCP Services Using Dataproc Templates (Problem, Solution, Discussion)
Spark Job Running on Dataproc Reading from GCS and Writing to Bigtable (Problem, Solution, Discussion, See Also)
8. Configuring Logging in Dataproc
Understanding Different Types of Logs in Dataproc (Problem, Solution, Discussion)
Understanding Cloud Logging (Problem, Solution, Discussion)
Viewing Logs in Cloud Logging (Problem, Solution, Discussion)
Routing Dataproc Logs to Cloud Logging (Problem, Solution, Discussion)
Attaching Custom Labels to Logging (Problem, Solution, Discussion)
Optimizing Cloud Logging Costs (Problem, Solution, Discussion)
Sinking Logs to BigQuery (Problem, Solution, Discussion)
9. Setting Up Monitoring and Dashboards
Monitoring Cluster Status (Problem, Solution, Discussion)
Exploring Predefined Metrics Charts (Problem, Solution, Discussion)
Creating Charts Using Metrics Explorer (Problem, Solution, Discussion)
Creating Dashboards Using Metrics Explorer (Problem, Solution, Discussion)
Setting Up Alerts (Problem, Solution, Discussion)
Migrating Dashboards from One Project to Another (Problem, Solution, Discussion)
Creating Custom Log-Based Metrics (Problem, Solution, Discussion)
10. Dataproc Security
Managing Identities in Dataproc Clusters (Problem, Solution, Discussion)
Securing Your Perimeter Using VPC Service Controls (Problem, Solution, Discussion)
Authenticating Using Kerberos (Problem, Solution, Discussion)
Installing Ranger (Problem, Solution, Discussion)
Securing Cluster Resources Using Ranger (Problem, Solution, Discussion)
Managing Credentials in the Google Cloud Environment (Problem, Solution, Discussion)
Enforcing Restrictions Across All Clusters (Problem, Solution, Discussion)
Tokenizing Sensitive Data (Problem, Solution, Discussion)
11. Performance Tuning and Cost Optimization
Sizing a Dataproc Cluster (Problem, Solution, Discussion)
Choosing the Right Disks for Big Data Workloads on Dataproc (Problem, Solution, Discussion)
Benchmarking Clusters with Performance Tuning (Problem, Solution, Discussion)
Navigating the Spark UI (Problem, Solution, Discussion)
Optimizing Spark Jobs (Problem, Solution, Discussion)
Installing Sparklens for Profiling Spark Applications (Problem, Solution, Discussion, See Also)
Identifying Spark Job Errors and Bottlenecks (Problem, Solution, Discussion)
Understanding the YARN UI (Problem, Solution, Discussion)
Calculating the Cost of a Dataproc Cluster (Problem, Solution, Discussion)
Optimizing Cost in Dataproc Clusters (Problem, Solution, Discussion)
12. Orchestrating Dataproc Workloads
Understanding the Prerequisites for Installing Cloud Composer (Problem, Solution, Discussion)
Deploying a Cloud Composer Environment (Problem, Solution, Discussion)
Scheduling a Job in Composer (Problem, Solution, Discussion)
Parameterizing Variables (Problem, Solution, Discussion)
Scaling Up a Cloud Composer Environment (Problem, Solution, Discussion)
Running Spark Jobs Using Vertex AI Machine Learning Pipelines (Problem, Solution, Discussion)
Scheduling a Dataproc Job in Event Driven Using a Cloud Function (Problem, Solution, Discussion)
Using Dataproc Workflow Templates (Problem, Solution, Discussion)
13. Using Spark Notebooks on Dataproc
Deciding Which Notebook Environments to Choose (Problem, Solution, Discussion)
Configuring Notebooks on a Dataproc Cluster (Problem, Solution, Discussion)
Running Spark Scala and PySpark Notebooks on Dataproc (Problem, Solution, Discussion)
Managing Libraries and Configs (Problem, Solution, Discussion)
Creating Dataproc-Enabled Vertex AI Workbench Instances (Problem, Solution, Discussion)
Executing Notebooks Using Spark Serverless Sessions (Problem, Solution, Discussion)
14. Migrating from On-Premises and Public Cloud Services to GCP
Planning Migration (Problem, Solution, Discussion, See Also)
Data Migration Strategies (Problem, Solution, Discussion)
Migrating Data with STS (Problem, Solution, Discussion)
Accessing AWS S3 Data Using BigLake Tables (Problem, Solution, Discussion)
Migrating Metadata (Problem, Solution, Discussion)
Migrating Applications to Google Cloud (Problem, Solution, Discussion)
Index
About the Authors

Content preview from Dataproc Cookbook

Chapter 11. Performance Tuning and Cost Optimization

In big data engineering, optimizing both performance and cost is important. This requires well-designed big data applications and a deep understanding of them, as well as careful benchmarking. Profiling and benchmarking are usually carried out during or after the application-development phase and help to optimize both performance and cost.

This chapter delves into the crucial aspects of performance tuning and cost optimization within Dataproc. You’ll learn how to size Dataproc clusters, benchmark them, choose appropriate disks, utilize Spark and YARN UIs, optimize Spark jobs, profile them using Sparklens, identify errors, and calculate and optimize the cost of your Dataproc clusters.

Sizing a Dataproc Cluster

Problem

Your team is onboarding a new application onto Dataproc and wants to estimate the size of the cluster it needs.

Solution

Create a list to capture information about the storage and compute requirements and plan the cluster accordingly.

Discussion

To effectively size a Dataproc cluster, you should choose the right set of components in the cluster, including the following:

Nature of the cluster (static, ephemeral, or serverless)
Master and worker machine type
Number of primary workers
Number and type of secondary workers
Disk type and size to be attached to each worker node
Autoscaling configuration
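
The components listed above can be captured in a small planning record before any cluster is created. The following is a minimal sketch, not the authors' method: the class names `WorkerGroup` and `ClusterSizingPlan`, their fields, and the sample values (machine types, disk sizes, and the policy name `nightly-etl-policy`) are all hypothetical and chosen only for illustration. It uses plain Python and does not call any Dataproc API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerGroup:
    """One group of identical worker nodes (hypothetical planning record)."""
    count: int
    machine_type: str          # illustrative value, e.g. "n2-standard-8"
    vcpus_per_node: int        # supplied by the planner, not looked up
    memory_gb_per_node: int
    disk_type: str             # illustrative value, e.g. "pd-balanced"
    disk_size_gb: int


@dataclass
class ClusterSizingPlan:
    """Checklist of the sizing components discussed in the text."""
    nature: str                                   # "static", "ephemeral", or "serverless"
    master_machine_type: str
    primary_workers: WorkerGroup
    secondary_workers: Optional[WorkerGroup] = None
    secondary_worker_type: Optional[str] = None   # illustrative value, e.g. "spot"
    autoscaling_policy: Optional[str] = None      # hypothetical policy name

    def __post_init__(self) -> None:
        # Reject values outside the three cluster natures named in the text.
        if self.nature not in ("static", "ephemeral", "serverless"):
            raise ValueError(f"unknown cluster nature: {self.nature!r}")

    def worker_groups(self) -> List[WorkerGroup]:
        """Return the primary group plus the secondary group, if any."""
        groups = [self.primary_workers]
        if self.secondary_workers is not None:
            groups.append(self.secondary_workers)
        return groups

    def total_vcpus(self) -> int:
        return sum(g.count * g.vcpus_per_node for g in self.worker_groups())

    def total_memory_gb(self) -> int:
        return sum(g.count * g.memory_gb_per_node for g in self.worker_groups())

    def total_disk_gb(self) -> int:
        """Total disk attached across all worker nodes."""
        return sum(g.count * g.disk_size_gb for g in self.worker_groups())


# Sample plan with made-up numbers: 4 primary and 2 secondary workers.
plan = ClusterSizingPlan(
    nature="ephemeral",
    master_machine_type="n2-standard-4",
    primary_workers=WorkerGroup(4, "n2-standard-8", 8, 32, "pd-balanced", 500),
    secondary_workers=WorkerGroup(2, "n2-standard-8", 8, 32, "pd-balanced", 500),
    secondary_worker_type="spot",
    autoscaling_policy="nightly-etl-policy",
)

print(plan.total_vcpus(), plan.total_memory_gb(), plan.total_disk_gb())
```

Keeping the per-node vCPU and memory figures as explicit inputs, rather than deriving them from the machine type name, avoids baking assumptions about machine specifications into the sketch; the totals are simple sums that can be compared against the application's storage and compute requirements.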

To choose the proper components, start by asking the following ...


Publisher Resources

ISBN: 9781098157692
