Dataproc Cookbook

by Narasimha Sadineni, Anuyogam Venkataraman

June 2025

Beginner to intermediate

438 pages

9h 17m

English

O'Reilly Media, Inc.

Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Creating a Dataproc Cluster
Installing Google Cloud CLI (Problem, Solution, Discussion)
Granting Identity and Access Management Privileges to a User (Problem, Solution, Discussion)
Configuring a Network and Firewall Rules (Problem, Solution, Discussion, See Also)
Creating a Dataproc Cluster from a Web UI (Problem, Solution, Discussion)
Creating a Dataproc Cluster Using Gcloud (Problem, Solution, Discussion)
Creating a Dataproc Cluster Using API Endpoints (Problem, Solution, Discussion)
Creating a Dataproc Cluster Using Terraform (Problem, Solution, Discussion)
Creating a Cluster Using Python (Problem, Solution, Discussion)
Duplicating a Dataproc Cluster (Problem, Solution, Discussion)
2. Running Hive, Spark, and Sqoop Workloads
Adding Required Privileges for Jobs (Problem, Solution, Discussion, See Also)
Generating 1 TB of Data Using a MapReduce Job (Problem, Solution, Discussion)
Running a Hive Job to Show Records from an Employee Table (Problem, Solution, Discussion, See Also)
Converting XML Data to Parquet Using Scala Spark on Dataproc (Problem, Solution, Discussion, See Also)
Converting XML Data to Parquet Using PySpark on Dataproc (Problem, Solution, Discussion, See Also)
Submitting a SparkR Job (Problem, Solution, Discussion)
Migrating Data from Cloud SQL to Hive Using Sqoop Job (Problem, Solution, Discussion)
Choosing Deployment Modes When Submitting a Spark Job to Dataproc (Problem, Solution, Discussion, See Also)
3. Advanced Dataproc Cluster Configuration
Creating an Autoscaling Policy (Problem, Solution, Discussion)
Attaching an Autoscaling Policy to a Dataproc Cluster (Problem, Solution, Discussion)
Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Autoscaling Policy (Problem, Solution, Discussion)
Adding Local SSDs to Dataproc Worker Nodes (Problem, Solution, Discussion)
Creating a Cluster with a Custom Image (Problem, Solution, Discussion)
Building a Cluster with Custom Machine Types (Problem, Solution, Discussion)
Bootstrapping Dataproc Clusters with Initialization Scripts (Problem, Solution, Discussion)
Scheduling Automatic Deletion of Unused Clusters (Problem, Solution, Discussion)
Overriding Hadoop Configurations (Problem, Solution, Discussion)
4. Serverless Spark and Ephemeral Dataproc Clusters
Running on Dataproc: Serverless Versus Ephemeral Clusters (Problem, Solution, Discussion)
Running a Sequence of Jobs on an Ephemeral Cluster (Problem, Solution, Discussion)
Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless (Problem, Solution, Discussion, See Also)
Running a Serverless Job Using the Premium Tier Configuration (Problem, Solution, Discussion)
Giving a Unique Custom Name to a Dataproc Serverless Spark Job (Problem, Solution, Discussion, See Also)
Cloning a Dataproc Serverless Spark Job (Problem, Solution, Discussion)
Running a Serverless Job on Spark RAPIDS Accelerator (Problem, Solution, Discussion, See Also)
Configuring a Spark History Server (Problem, Solution, Discussion, See Also)
Writing Spark Events to the Spark History Server from Dataproc Serverless (Problem, Solution, Discussion)
Monitoring Serverless Spark Jobs (Problem, Solution, Discussion, See Also)
Calculating the Price of a Serverless Batch (Problem, Solution, Discussion, See Also)
5. Dataproc on Google Kubernetes Engine
Creating a Kubernetes Cluster (Problem, Solution, Discussion)
Creating a Dataproc Cluster on a GKE Cluster (Problem, Solution, Discussion)
Running Spark Jobs on a Dataproc GKE Cluster (Problem, Solution, Discussion)
Customizing Node Pools (Problem, Solution, Discussion)
Autoscaling in a GKE Cluster (Problem, Solution, Discussion)
Achieving Zonal High Availability for Dataproc Jobs (Problem, Solution, Discussion)
6. Dataproc Metastore
Creating a Dataproc Metastore Service Instance (Problem, Solution, Discussion, See Also)
Attaching a DPMS Instance to One or More Clusters (Problem, Solution, Discussion)
Creating Tables and Verifying Metadata in DPMS (Problem, Solution, Discussion)
Installing an External Hive Metastore (Problem, Solution, Discussion)
Attaching an External Apache Hive Metastore to the Cluster (Problem, Solution, Discussion)
Searching for Metadata in a Dataplex Data Catalog (Problem, Solution, Discussion)
Automating the Backup of a DPMS Instance (Problem, Solution, Discussion)
7. Connecting from Dataproc to GCP Services
Reading from GCS and Writing to a BigQuery Table (Problem, Solution, Discussion)
Reading from a Cloud SQL Table (Problem, Solution, Discussion)
Writing to GCS in Delta Format (Problem, Solution, Discussion)
Integrating a Dataproc-Managed Delta Lake with BigLake (Problem, Solution, Discussion)
Connecting to GCP Services Using Dataproc Templates (Problem, Solution, Discussion)
Spark Job Running on Dataproc Reading from GCS and Writing to Bigtable (Problem, Solution, Discussion, See Also)
8. Configuring Logging in Dataproc
Understanding Different Types of Logs in Dataproc (Problem, Solution, Discussion)
Understanding Cloud Logging (Problem, Solution, Discussion)
Viewing Logs in Cloud Logging (Problem, Solution, Discussion)
Routing Dataproc Logs to Cloud Logging (Problem, Solution, Discussion)
Attaching Custom Labels to Logging (Problem, Solution, Discussion)
Optimizing Cloud Logging Costs (Problem, Solution, Discussion)
Sinking Logs to BigQuery (Problem, Solution, Discussion)
9. Setting Up Monitoring and Dashboards
Monitoring Cluster Status (Problem, Solution, Discussion)
Exploring Predefined Metrics Charts (Problem, Solution, Discussion)
Creating Charts Using Metrics Explorer (Problem, Solution, Discussion)
Creating Dashboards Using Metrics Explorer (Problem, Solution, Discussion)
Setting Up Alerts (Problem, Solution, Discussion)
Migrating Dashboards from One Project to Another (Problem, Solution, Discussion)
Creating Custom Log-Based Metrics (Problem, Solution, Discussion)
10. Dataproc Security
Managing Identities in Dataproc Clusters (Problem, Solution, Discussion)
Securing Your Perimeter Using VPC Service Controls (Problem, Solution, Discussion)
Authenticating Using Kerberos (Problem, Solution, Discussion)
Installing Ranger (Problem, Solution, Discussion)
Securing Cluster Resources Using Ranger (Problem, Solution, Discussion)
Managing Credentials in the Google Cloud Environment (Problem, Solution, Discussion)
Enforcing Restrictions Across All Clusters (Problem, Solution, Discussion)
Tokenizing Sensitive Data (Problem, Solution, Discussion)
11. Performance Tuning and Cost Optimization
Sizing a Dataproc Cluster (Problem, Solution, Discussion)
Choosing the Right Disks for Big Data Workloads on Dataproc (Problem, Solution, Discussion)
Benchmarking Clusters with Performance Tuning (Problem, Solution, Discussion)
Navigating the Spark UI (Problem, Solution, Discussion)
Optimizing Spark Jobs (Problem, Solution, Discussion)
Installing Sparklens for Profiling Spark Applications (Problem, Solution, Discussion, See Also)
Identifying Spark Job Errors and Bottlenecks (Problem, Solution, Discussion)
Understanding the YARN UI (Problem, Solution, Discussion)
Calculating the Cost of a Dataproc Cluster (Problem, Solution, Discussion)
Optimizing Cost in Dataproc Clusters (Problem, Solution, Discussion)
12. Orchestrating Dataproc Workloads
Understanding the Prerequisites for Installing Cloud Composer (Problem, Solution, Discussion)
Deploying a Cloud Composer Environment (Problem, Solution, Discussion)
Scheduling a Job in Composer (Problem, Solution, Discussion)
Parameterizing Variables (Problem, Solution, Discussion)
Scaling Up a Cloud Composer Environment (Problem, Solution, Discussion)
Running Spark Jobs Using Vertex AI Machine Learning Pipelines (Problem, Solution, Discussion)
Scheduling a Dataproc Job in Event Driven Using a Cloud Function (Problem, Solution, Discussion)
Using Dataproc Workflow Templates (Problem, Solution, Discussion)
13. Using Spark Notebooks on Dataproc
Deciding Which Notebook Environments to Choose (Problem, Solution, Discussion)
Configuring Notebooks on a Dataproc Cluster (Problem, Solution, Discussion)
Running Spark Scala and PySpark Notebooks on Dataproc (Problem, Solution, Discussion)
Managing Libraries and Configs (Problem, Solution, Discussion)
Creating Dataproc-Enabled Vertex AI Workbench Instances (Problem, Solution, Discussion)
Executing Notebooks Using Spark Serverless Sessions (Problem, Solution, Discussion)
14. Migrating from On-Premises and Public Cloud Services to GCP
Planning Migration (Problem, Solution, Discussion, See Also)
Data Migration Strategies (Problem, Solution, Discussion)
Migrating Data with STS (Problem, Solution, Discussion)
Accessing AWS S3 Data Using BigLake Tables (Problem, Solution, Discussion)
Migrating Metadata (Problem, Solution, Discussion)
Migrating Applications to Google Cloud (Problem, Solution, Discussion)
Index
About the Authors

Content preview from Dataproc Cookbook

Chapter 2. Running Hive, Spark, and Sqoop Workloads

In the world of big data processing and analysis, Google Cloud’s Dataproc simplifies managing and executing large-scale data workloads. In this chapter, we will cover the essential steps for running various big data jobs on your Dataproc cluster. A job in this context represents a specific task or workload to be executed on the Dataproc cluster. This can be a Hive query for structured data processing, a Spark application for distributed computation, or a Sqoop data transfer for moving data between databases and Hadoop.

To effectively follow along with this chapter, you will need the following prerequisites:

Dataproc API: Ensure that the Dataproc API is enabled for your project. This API is essential for interacting with your cluster.
Existing Dataproc cluster: You will need a Dataproc cluster that has already been created and is running on GCP. If you haven’t set one up yet, Chapter 1 provides guidance on cluster creation.
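
As a rough sketch of how these two prerequisites might be checked from a script, the Python below assembles the corresponding gcloud command lines without running them. The project ID, region, and cluster name are hypothetical placeholders, and the helper function names are illustrative only; they do not come from the book.

```python
import shlex
from typing import List


def enable_dataproc_api_cmd(project_id: str) -> List[str]:
    """Build the gcloud command that enables the Dataproc API for a project."""
    return [
        "gcloud", "services", "enable", "dataproc.googleapis.com",
        f"--project={project_id}",
    ]


def describe_cluster_cmd(project_id: str, region: str, cluster: str) -> List[str]:
    """Build the gcloud command that describes an existing cluster.

    Running it fails if the cluster does not exist, which makes it a
    simple existence check.
    """
    return [
        "gcloud", "dataproc", "clusters", "describe", cluster,
        f"--region={region}",
        f"--project={project_id}",
    ]


if __name__ == "__main__":
    # Hypothetical values; replace them with your own project settings.
    project, region, cluster = "my-project", "us-central1", "my-cluster"
    for cmd in (
        enable_dataproc_api_cmd(project),
        describe_cluster_cmd(project, region, cluster),
    ):
        # Print a copy-pasteable shell command instead of executing it.
        print(shlex.join(cmd))
```

To execute a command rather than print it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Google Cloud CLI is installed and authenticated.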

We will explore the different methods you can use to submit these jobs to your Dataproc cluster, including the Dataproc console UI and the gcloud CLI tool. Throughout the chapter, we’ll provide practical examples to illustrate these concepts.
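
To give a flavor of the gcloud route before the recipes begin, here is a minimal sketch that builds `gcloud dataproc jobs submit` command lines for a Hive query and a Spark application. The cluster name, region, query, main class, and JAR path are assumptions chosen for illustration, and the helper functions are hypothetical rather than taken from the book.

```python
import shlex
from typing import List, Sequence


def hive_job_cmd(cluster: str, region: str, query: str) -> List[str]:
    """Build a gcloud command that submits an inline Hive query."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "hive",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--execute={query}",
    ]


def spark_job_cmd(
    cluster: str,
    region: str,
    main_class: str,
    jars: Sequence[str],
    job_args: Sequence[str] = (),
) -> List[str]:
    """Build a gcloud command that submits a Spark job by main class."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={','.join(jars)}",
    ]
    if job_args:
        # Everything after "--" is passed to the Spark application itself.
        cmd += ["--", *job_args]
    return cmd


if __name__ == "__main__":
    # Hypothetical cluster and region; the SparkPi class and JAR path are
    # assumed to be present on the cluster image.
    print(shlex.join(hive_job_cmd("my-cluster", "us-central1", "SHOW TABLES;")))
    print(shlex.join(spark_job_cmd(
        "my-cluster",
        "us-central1",
        "org.apache.spark.examples.SparkPi",
        ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        ["1000"],
    )))
```

Building the command as a list rather than a single string keeps queries that contain spaces or semicolons intact when the list is later handed to `subprocess.run`.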

Let’s get started!

Adding Required Privileges for Jobs

Problem

You need to grant users the necessary permissions to submit jobs to your Dataproc cluster.

Solution

Use Google Cloud’s IAM to assign appropriate roles to users. At the service level, predefined ...
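
The preview cuts off before naming specific roles, so the following is only a sketch under stated assumptions: it builds a `gcloud projects add-iam-policy-binding` command that grants a project-level role to a principal. The choice of `roles/dataproc.editor` as the default, the project ID, and the user email are all assumptions made for illustration, not the book's recommendation.

```python
import shlex
from typing import List

# Principal prefixes accepted by this sketch (an assumption; IAM supports more).
_MEMBER_PREFIXES = ("user:", "group:", "serviceAccount:")


def grant_role_cmd(
    project_id: str,
    member: str,
    role: str = "roles/dataproc.editor",  # assumed role, not from the source
) -> List[str]:
    """Build the gcloud command that binds an IAM role to a principal."""
    if not member.startswith(_MEMBER_PREFIXES):
        raise ValueError(
            f"member must start with one of {_MEMBER_PREFIXES}, got {member!r}"
        )
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        f"--role={role}",
    ]


if __name__ == "__main__":
    # Hypothetical project and user.
    print(shlex.join(grant_role_cmd("my-project", "user:alice@example.com")))
```

Validating the member prefix up front catches a common mistake (passing a bare email address) before the command ever reaches gcloud.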

Publisher Resources

ISBN: 9781098157692
