Chapter 1. Creating a Dataproc Cluster

This chapter provides a basic understanding of the prerequisites for creating a Dataproc cluster, as well as the components that make up a cluster. We will discuss the various options for creating and customizing Dataproc clusters.

Dataproc is a paid Google Cloud service built on top of open source software (OSS) such as Apache Hadoop and Apache Spark, along with other big data technologies like Kafka, JupyterHub, and Solr. As a managed service, Dataproc abstracts away the creation, updating, management, and deletion of all the required cloud services and resources.

Dataproc can run in three different environments:

  1. Dataproc on Google Compute Engine (GCE)

  2. Dataproc on Google Kubernetes Engine (GKE)

  3. Dataproc Serverless

In this chapter, we focus on the first option: running Dataproc on GCE.

Dataproc on GCE High Level Architecture Diagram
Figure 1-1. Dataproc on GCE High Level Architecture Diagram

Before you start using this product, understand the billing and charges. The service has two types of charges: a charge for the software and a charge for the underlying components (Compute Engine, disks, Cloud Storage, network, etc.). Dataproc’s pay-as-you-go model allows you to pay only for the services you use, for the time you use them. For more information on pricing, refer to the documentation at https://cloud.google.com/dataproc/pricing. Dataproc Serverless has a different pricing model that we will discuss in later chapters of this book.

The first step in the process of creating a Dataproc cluster is to secure a Google Cloud account. If you don’t have one already, sign up for a new Google Cloud account at https://cloud.google.com/.

Tip

Google encourages new users to try out GCP products by offering $300 in free credits across multiple services. Learn more about Google’s free tier products at https://cloud.google.com/free.

Along with this book, the product documentation at https://cloud.google.com/dataproc/docs will help you gain knowledge throughout your learning journey. To stay up to date on product releases, monitor https://cloud.google.com/dataproc/docs/release-notes. Google offers a paid support model to help you with any GCP-related issues; more details can be found at https://cloud.google.com/support. If you work in an enterprise, your organization might already have purchased a support plan. To participate in public discussions and ask questions, join the Google Group at cloud-dataproc-discuss@googlegroups.com.

Let’s get started with the key components to install before you begin working with Dataproc.

1.1 Installing Google Cloud CLI

Problem

You want to install Google Cloud CLI on your machine to interact with GCP services using the command line.

Solution

You can download the gcloud CLI software from the Google Cloud SDK downloads page; the installation instructions differ based on the type of machine (Mac/Windows/Linux) you have.

 

Discussion

Tip

Alternatively, you can use the browser-based Cloud Shell, which comes with the Cloud SDK, gcloud, Cloud Code, an online code editor, and other utilities pre-installed, fully authenticated, and up to date. To access Cloud Shell, open https://console.cloud.google.com?cloudshell=true in your browser.

Mac Users

  • Open https://cloud.google.com/sdk/docs/install#mac in your browser.

  • Check that a supported version of Python is installed on your machine. At the time of writing, the minimum requirement is Python 3. Run python3 -V or python -V to see which version you have.

  • If Python is missing, download and install it from https://www.python.org/downloads/macos/.

  • Determine your Mac’s architecture by running the command uname -m.

  • Download the package that matches your Mac’s architecture (x86_64/arm/x86).

  • The downloaded file is a compressed (tar.gz) archive. Extract it with the command:

    tar -xvzf downloaded-file-name

  • Install the CLI with the command ./google-cloud-sdk/install.sh.

  • After successful installation, run the gcloud init command to authenticate with Google Cloud.

    • It will ask you to configure the default user, project, and region/zone to be used.

You have now downloaded and installed the gcloud CLI and set up authentication, and you are ready to run commands on your machine. You can verify the setup as shown below.
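Once gcloud init completes, you can verify the installation and the defaults you configured:

gcloud --version
gcloud config list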

Windows Users
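  • Open https://cloud.google.com/sdk/docs/install#windows in your browser.

  • Download and run the Google Cloud CLI installer, accepting the option to install the bundled Python if a supported version is not already present on your machine.

  • After installation completes, open a new terminal and run gcloud init to authenticate with Google Cloud and configure the default user, project, and region/zone.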

Linux Users

  • Determine your Linux machine’s architecture (32-bit or 64-bit) using the command getconf LONG_BIT.

  • Check the Python version using the command python3 -V or python -V.

    • The minimum required version is Python 3.

    • If you are on a 64-bit machine, Python is bundled with the installer and there is no need to install it manually. Otherwise, download it from https://www.python.org/downloads/.

  • Download the CLI package that matches your Linux machine’s architecture (32/64 bit) from the hyperlink in step #2 of https://cloud.google.com/sdk/docs/install#linux.

  • The downloaded file is a compressed (tar.gz) archive. Extract it with the command:

    tar -xvzf downloaded-file-name

  • Install the CLI with the command ./google-cloud-sdk/install.sh.

  • Restart your shell/terminal.

  • After successful installation, run the gcloud init command to authenticate with Google Cloud.

    • It will ask you to configure the default user, project, and region/zone to be used.

1.2 Granting IAM Privileges to a User

Problem

You want to grant IAM permissions to a user or service account so that it can create a Dataproc cluster.

Solution

Grant the Dataproc Editor IAM role at the project level:

gcloud projects add-iam-policy-binding <PROJECT_ID> \
 --member="user:<EMAIL_ADDRESS>" \
 --role=roles/dataproc.editor

Grant the Service Account User permission on the Dataproc VM service account:

gcloud iam service-accounts add-iam-policy-binding <compute_engine_default_account> \
 --member="user:<EMAIL_ADDRESS>" \
 --role=roles/iam.serviceAccountUser

Discussion

User or service accounts creating the cluster require IAM privileges. If you are new to Google Cloud, IAM (Identity and Access Management) is the service that controls who can access which resource in Google Cloud. Accessing IAM itself also requires an IAM role (Viewer/Editor/Owner). If you own the project, you will have the Owner role for accessing IAM; otherwise, check with your project or platform admin to get the required access.

Resources in Google Cloud are organized hierarchically, with the Organization as the parent, followed by Folders, Projects, and Services & Resources (GCS buckets, Compute Engine instances, Dataproc clusters, BigQuery tables, etc.).

GCP resources hierarchy for IAM policy inheritance
Figure 1-2. GCP resources hierarchy for IAM policy inheritance

IAM policies created at the parent level (within a hierarchy) are inherited by child components. For instance, the Editor role assigned at the project level will be inherited by all services and resources created within that project. Similarly, an Editor role assigned at the Dataproc service level will be inherited by all clusters within that Dataproc service.
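To see which principals currently hold roles directly on your project, you can inspect the project’s IAM policy (the project ID is a placeholder):

gcloud projects get-iam-policy <PROJECT_ID> \
 --flatten="bindings[].members" \
 --format="table(bindings.role, bindings.members)"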

IAM offers basic roles that can be applied at the project level, as shown in Table 1-1.

Table 1-1. Basic IAM roles for Google Cloud Services
IAM Basic Role   Description
Viewer           Grants read-only access.
Editor           All Viewer permissions, plus access to create, modify, and delete resources.
Owner            All Editor permissions, plus additional high-level administrative permissions, such as managing IAM permissions, setting up billing accounts, and deleting projects.

gcloud command for granting Editor access at the project level:

export PROJECT_ID=<PROJECT_ID>
export USER_EMAIL=<USER_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="user:${USER_EMAIL}" \
 --role=roles/editor

Granting Editor access at the project level gives editor access to all services in the project. To limit a user to only the Dataproc service, use one of the predefined Dataproc roles shown in Table 1-2.

Table 1-2. Predefined IAM roles for Dataproc Service
IAM Role                 Description
Dataproc Administrator   Grants full control over Dataproc resources.
Dataproc Editor          Grants permission to create and manage clusters and view the underlying resources.
Dataproc Viewer          Grants read-only access to Dataproc resources.
Dataproc Worker          Assigned to Compute Engine machines for performing cluster tasks.

To assign the Dataproc Editor role to a user, run the gcloud command:

export PROJECT_ID=<PROJECT_ID>
export USER_EMAIL=<USER_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="user:${USER_EMAIL}" \
 --role=roles/dataproc.editor

Each IAM role is a collection of permissions. Granting the Dataproc Editor role to users gives them permission to create clusters, along with additional privileges to manage and delete clusters. For fine-grained permissions specific to cluster creation, you can create a custom role.

Here is the gcloud command to create a custom role:

export PROJECT_ID=<PROJECT_ID>
gcloud iam roles create custom.dataprocEditor \
 --project=${PROJECT_ID} \
 --title="Custom Dataproc Editor" \
 --description="Custom role for creating and managing Dataproc clusters" \
 --permissions=dataproc.clusters.create
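Before assigning it, you can optionally confirm the custom role was created and inspect its permissions:

gcloud iam roles describe custom.dataprocEditor --project=${PROJECT_ID}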

Now assign this custom role to the user. Note that project-level custom roles are referenced with the projects/PROJECT_ID/roles/ prefix:

export PROJECT_ID=<PROJECT_ID>
export USER_EMAIL=<USER_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="user:${USER_EMAIL}" \
 --role="projects/${PROJECT_ID}/roles/custom.dataprocEditor"

Dataproc internally uses two types of service accounts:

  1. Dataproc VM Service account

  2. Dataproc Service Agent Service account

The Dataproc VM Service account is used to create underlying resources like Compute Engine instances and to perform dataplane operations like reading and writing data to Google Cloud Storage (GCS). Dataproc uses the Compute Engine default service account as the Dataproc VM Service account, but this can be customized using the --service-account option.

Users or service accounts creating clusters must be granted the Service Account User role on the Dataproc VM service account:

export SERVICE_ACCOUNT_EMAIL=<SERVICE_ACCOUNT_EMAIL>
gcloud iam service-accounts add-iam-policy-binding <COMPUTE_ENGINE_DEFAULT_SERVICE_ACCOUNT> \
 --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
 --role=roles/iam.serviceAccountUser
Tip

The project’s Compute Engine default service account can be listed using the following gcloud command:

gcloud iam service-accounts list\
  --filter="displayName:Compute Engine default service account"\
  --project=<PROJECT_ID_HERE>

The Dataproc Service Agent service account is responsible for control plane operations such as creating, updating, and deleting clusters. Dataproc creates this service account automatically, and it cannot be replaced with a custom service account.
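You can confirm the service agent’s bindings by filtering the project’s IAM policy. The service agent email typically follows the pattern service-<project-number>@dataproc-accounts.iam.gserviceaccount.com; treat the pattern as an assumption and verify it in your project:

gcloud projects get-iam-policy <PROJECT_ID> \
 --flatten="bindings[].members" \
 --filter="bindings.members:dataproc-accounts.iam.gserviceaccount.com" \
 --format="table(bindings.role, bindings.members)"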

1.3 Configuring a Network and Firewall rules

Problem

You want to create a new virtual private cloud (VPC) network for hosting virtual machines and attach firewall rules to allow communication between the machines.

Solution

Create a VPC network

gcloud compute networks create <NETWORK_NAME>\
 --subnet-mode auto\
 --description "VPC network hosting dataproc resources"

Attach a firewall rule:

gcloud compute firewall-rules create <FIREWALL_NAME> \
 --network <NETWORK_NAME> \
 --allow [PROTOCOL[:PORT]] \
 --source-ranges <IP_RANGE>

Discussion

Compute Engine instances that are part of a Dataproc cluster must reside within a VPC network to communicate with each other and, when necessary, with external resources. The Dataproc service mandates that all VMs in the cluster be able to communicate with each other using the ICMP, TCP (all ports), and UDP (all ports) protocols.

The default project network typically has subnets created in the range of 10.128.0.0/9. It also includes the ‘default-allow-internal’ firewall rule, permitting communication within this subnet range. Table 1-3 illustrates these requirements. If you are creating a custom network, ensure you establish a rule aligned with Dataproc’s requirements to enable internal communication.

Table 1-3. Firewall rule requirements for Dataproc cluster Network
Direction          ingress
Priority           65534
Source range       10.128.0.0/9 for the default network; for a custom VPC/subnet, use the custom subnet range
Protocols:Ports    tcp:0-65535, udp:0-65535, icmp
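To check which firewall rules already exist on a network (for example, the default-allow-internal rule on the default network), list them with gcloud:

gcloud compute firewall-rules list --filter="network:default"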

To create a VPC network in auto mode

gcloud compute networks create dataproc-vpc\
 --subnet-mode auto\
 --description "VPC network hosting dataproc resources"
Tip

Configuring the subnet mode as ‘auto’ creates one subnet in each available GCP region, allowing you to create a Dataproc cluster in any region.

To prevent the automatic creation of a large number (40+) of subnets in auto mode, you can instead create subnets only in the regions you need. This is a two-step process: first create a VPC, and then add a subnet to it.

Creating a VPC in custom subnet mode creates an empty VPC without any subnets:

gcloud compute networks create dataproc-vpc\
 --subnet-mode custom\
 --description "VPC network hosting dataproc resources"

Creating a subnet in the us-east1 region with the range 10.120.0.0/20:

gcloud compute networks subnets create dataproc-vpc-us-east1-subnet\
 --network=dataproc-vpc\
 --region=us-east1\
 --range=10.120.0.0/20
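To confirm the subnet was created with the expected range, describe it:

gcloud compute networks subnets describe dataproc-vpc-us-east1-subnet \
 --region=us-east1 \
 --format="value(ipCidrRange)"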
Tip

A subnet range determines how many IP addresses the subnet can hold. The range 10.120.0.0/20 spans 4,096 IP addresses (10.120.0.0 through 10.120.15.255). A few of these are reserved, including the network address 10.120.0.0 and the broadcast address 10.120.15.255, so slightly fewer are usable by hosts.

To choose a suitable subnet range for the maximum number of hosts on a Dataproc cluster, you will need to consider the expected number of hosts and allow room for growth.

Resources in a VPC cannot communicate until you create a firewall rule that allows traffic. Attach a firewall rule matching the Dataproc service requirements.

Creating a firewall rule for the custom subnet with the IP range 10.120.0.0/20:

gcloud compute firewall-rules create dataproc-allow-tcp-udp-icmp-all-ports\
 --network dataproc-vpc\
 --allow tcp:0-65535,udp:0-65535,icmp\
 --source-ranges "10.120.0.0/20"
Successful firewall rule creation output
Figure 1-3. Successful firewall rule creation output

See Also

Refer to Google Cloud public documentation to learn more about VPC Network and Firewall rules https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network

1.4 Creating a Dataproc Cluster from the UI

Problem

You want to create a Dataproc cluster using the web UI.

Solution

The Google Cloud Console is a web-based UI for accessing Google Cloud services. It offers the option to create and manage Dataproc clusters.

Discussion

To create a Dataproc cluster using the web UI, first log in to the Google Cloud Console at https://console.cloud.google.com/, as shown in Figure 1-4.

Sign in to Google Cloud Console
Figure 1-4. Sign in to Google Cloud Console

Once logged in to the Google Cloud Console, you will see the dashboard or home page, as shown in Figure 1-5.

Google Cloud Console Home Page
Figure 1-5. Google Cloud Console Home Page

In the search bar, enter the keyword Dataproc and select the service, as shown in Figure 1-6.

Searching for Dataproc service in console
Figure 1-6. Searching for Dataproc service in console

The Dataproc API is not enabled by default in a project. If it is not yet enabled, the console will prompt you to enable the service; click the Enable button, as shown in Figure 1-7.

Enabling Dataproc API service
Figure 1-7. Enabling Dataproc API service
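Alternatively, the Dataproc API can be enabled from the command line:

gcloud services enable dataproc.googleapis.com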
Tip

If a billing account is not linked to the project, you may be asked to link an existing billing account or create a new one.

Selecting the Dataproc service from the search results takes you to the Dataproc service home page. Click Create Cluster to create a new cluster.

Button to create a Dataproc cluster
Figure 1-8. Button to create a Dataproc cluster

Dataproc clusters can be created on Compute Engine or Google Kubernetes Engine (GKE). Select the Cluster on Compute Engine option, as shown in Figure 1-9.

Console UI showing different options for creating Dataproc Cluster
Figure 1-9. Console UI showing different options for creating Dataproc Cluster

To create a basic cluster with all default values, enter just two values, the cluster name and the region, and click the Create button, as shown in Figure 1-10.

Screenshot showing Cluster name and location region entered
Figure 1-10. Screenshot showing Cluster name and location/region entered

Creating a cluster will take up to 90 seconds. Once it is successfully created, you will see the cluster listed on the Dataproc service home page. Click the cluster name hyperlink to view the details of the cluster.

Dataproc service page listing available clusters
Figure 1-11. Dataproc service page listing available clusters

1.5 Creating Dataproc Cluster using gcloud

Problem

Creating clusters manually from the UI is time-consuming when you need multiple clusters. You want to accelerate development and testing by creating clusters from the command line on your local machine.

Solution

Install the Google Cloud CLI on your machine and run the gcloud dataproc clusters create command.

Command to create a cluster with basic configuration:

gcloud dataproc clusters create basic-cluster --region us-central1

Command to create a cluster with custom configuration (machine types, disks, network, etc.):

gcloud dataproc clusters create basic-cluster\
 --region us-central1\
 --zone ""\
 --image-version 2.0-debian10\
 --master-machine-type n1-standard-4\
 --worker-machine-type n1-standard-8\
 --master-boot-disk-type pd-ssd\
 --master-boot-disk-size 100\
 --worker-boot-disk-type pd-ssd\
 --worker-boot-disk-size 200\
 --num-worker-local-ssds 2\
 --network default\
 --enable-component-gateway

Discussion

To create a Dataproc cluster with gcloud, you provide the cluster name and the region where the cluster needs to be created. The following command creates a Dataproc cluster named basic-cluster in region us-central1.

gcloud dataproc clusters create basic-cluster --region us-central1

The cluster created from the command will use the defaults shown in Table 1-4.

Table 1-4. Default values when creating a cluster with only name and region
Property                    Default Value
Number of Master Nodes      1
Number of Primary Workers   2
Machine Type                Chooses a machine type based on Dataproc internal configuration.
Network                     When no network is specified, the cluster uses the network named "default" that is available in the project.
Zone                        Intelligently picks a zone within the specified region.
Dataproc Version            Defaults to the latest available version.
Disk Type                   Standard Persistent Disk
Disk Size                   1000 GB
Component Gateway           Disabled
Tip

These default values can be changed by Google over time. For the latest defaults, refer to the Google documentation (https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create).
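To see the actual values Dataproc applied to a cluster created with defaults, describe it (using the cluster name and region from the example above):

gcloud dataproc clusters describe basic-cluster --region us-central1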

To view the list of clusters available in a region and project:

gcloud dataproc clusters list --project {project_name_here} --region {region_name_here}

To delete a cluster:

gcloud dataproc clusters delete basic-cluster --region=us-central1

 

The minimum required arguments are the cluster name and the region where the cluster will be created. Additional customizations, such as machine types, secondary workers, disk type/size for primary workers and secondary workers, high availability, and component gateway, can also be configured using the gcloud command.

Let’s look at the command to customize a few more components of the cluster:

gcloud dataproc clusters create basic-cluster\
 --region us-central1\
 --zone ""\
 --image-version 2.0-debian10\
 --master-machine-type n1-standard-4\
 --worker-machine-type n1-standard-8\
 --master-boot-disk-type pd-ssd\
 --master-boot-disk-size 100\
 --worker-boot-disk-type pd-ssd\
 --worker-boot-disk-size 200\
 --num-worker-local-ssds 2\
 --network default\
 --enable-component-gateway

In this command, the following components are customized:

region

The region is where your cluster will be created.

Zone

Google Cloud has multiple zones in each region. For example, the us-central1 region has zones us-central1-a, us-central1-b, and us-central1-c. If you do not specify a zone when creating a Dataproc cluster, Dataproc will choose a zone for you in the specified region.

Image-version

A combination of operating system and Hadoop technology stack. Image version 2.0-debian10 comes with the Debian 10 operating system, Apache Hadoop 3.x, and Apache Spark 3.1. It is recommended to explicitly specify the image version when creating Dataproc clusters to ensure consistency in the cluster configuration. Refer to https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters for available and supported Dataproc versions.

Master-machine-type

Compute Engine machine type for the master node.

Master-boot-disk-type

Disk type to be attached to the master node. Accepted values are pd-ssd, pd-standard, and pd-balanced.

Master-boot-disk-size

Size of the master boot disk. By default, values are assumed to be in GB. The value 100 refers to a 100 GB disk attached to the master node.

worker-boot-disk-type

Disk type to be attached to the worker nodes. Accepted values are pd-ssd, pd-standard, and pd-balanced.

Tip

A Dataproc cluster consists of VMs provided by the Google Cloud Compute Engine service. Choose a VM machine type that suits your data processing needs. GCP offers a variety of machine types:

  • General purpose (N1, N2, N2D, T2D, T2A, etc.)

  • Cost optimized (E2)

  • Memory optimized (M1)

  • CPU optimized (C2, C2D, C3)

  • Custom machine types that can be created with custom memory and CPU configurations.

For data pipelines, the general-purpose machine types N2D and N2 are popular choices. Workloads requiring high-performance compute use C3 machine types or GPUs. You can browse the available machine types as shown below.
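The following gcloud command lists the machine types available in a zone; the zone and the name filter here are examples, not requirements:

gcloud compute machine-types list \
 --zones=us-central1-a \
 --filter="name~n2d" \
 --format="table(name, guestCpus, memoryMb)"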

worker-boot-disk-size

Size of the worker boot disk. By default, values are assumed to be in GB. The value 200 in the example refers to a 200 GB disk attached to each worker node.

Tip

Dataproc clusters require storage attached to compute nodes for storing persistent or temporary data. Google Cloud Platform offers the following storage options:

  • PD Standard (Persistent Disk Standard)

  • PD SSD (Persistent Disk SSD)

  • Local SSD

Local SSDs are a recommended choice for Dataproc worker nodes, as they offer greater performance than PD Standard and are less expensive than PD SSD.

num-worker-local-ssds

Local SSDs are recommended storage for worker nodes. They offer higher performance than standard disks with a good price to performance ratio.

num-worker

Number of primary worker nodes. You can also add secondary worker nodes that do compute only and no storage (HDFS).

Tip

Primary workers are the only worker machine types that have a DataNode component for storing HDFS data. Based on the amount of HDFS storage needed, choose the number of primary workers. Not all of your data goes to HDFS. In later chapters, we will cover what gets stored in HDFS versus the local file system versus GCS.

network

Virtual network that pools cloud resources together. When you create a project it comes with a default network.

enable-component-gateway

Creates access to web endpoints for services like the YARN Resource Manager, the NameNode web UI, and the Spark History Server.
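Once a cluster is created with the component gateway enabled, the web interface URLs can be read back from the cluster description. The field path below is a sketch based on the cluster resource’s endpointConfig; verify it against your gcloud version:

gcloud dataproc clusters describe basic-cluster \
 --region us-central1 \
 --format="value(config.endpointConfig.httpPorts)"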

We will learn more about these customizations in the next chapters.

1.6 Creating Dataproc Cluster using API Endpoints

Problem

You want to create a cluster using the REST API so that the cluster creation process is platform independent.

Solution

curl -X POST \
 -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 -H "Content-Type: application/json; charset=utf-8" \
 -d @request.json \
 "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/clusters"

Discussion

Create a JSON request file with all the required configuration.

request.json

{
  "projectId": "dataproctest",
  "clusterName": "dataproc-test-cluster",
  "config": {
    "gceClusterConfig": {
      "networkUri": "default",
      "zoneUri": "us-central1-c"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n2-standard-4",
      "diskConfig": {
        "bootDiskType": "pd-standard",
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "softwareConfig": {
      "imageVersion": "2.1-debian11"
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n2-standard-4",
      "diskConfig": {
        "bootDiskType": "pd-standard",
        "bootDiskSizeGb": 500,
        "numLocalSsds": 2,
        "localSsdInterface": "SCSI"
      }
    }
  },
  "labels": {
    "billing_account": "test-account"
  }
}

Making a REST API call to Google Cloud services requires the user to provide an authorization token. Authenticate from the command line prior to executing the curl command.

To authenticate with a personal account, run the command:

gcloud auth login

To authenticate as a service account using a credentials JSON file, run the command:

gcloud auth activate-service-account \
 --key-file=<credential-json-file-location>

Execute the following curl command, replacing project_name and region-name (e.g., us-central1).

curl -X POST\
 -H "Authorization: Bearer $(gcloud auth print-access-token)"\
 -H "Content-Type: application/json; charset=utf-8"\
 -d @request.json\
 "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/clusters"

Successful execution of the curl command will return output similar to the following:

{
  "name": "projects/{project-name}/regions/{region-name}/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "cluster-name",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
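The response references a long-running operation. You can poll its status with an authenticated GET request on the operation name returned in the response (placeholders as before):

curl -X GET \
 -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/operations/{operation-id}"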

1.7 Creating Dataproc Cluster using Terraform

Problem

You want to automate the provisioning and management of clusters with an Infrastructure as Code (IaC) framework.

Solution

Terraform is an Infrastructure as Code (IaC) tool that allows users to create and maintain cloud infrastructure using a declarative configuration language.

Discussion

Terraform is a widely used IaC tool for creating, maintaining, and managing cloud platform resources. It supports multiple cloud vendors, including AWS, GCP, and Azure.

Install Terraform by following the instructions in the public documentation at https://developer.hashicorp.com/terraform/downloads.

Terraform code execution involves the following commands:

  • init - Initializes the working directory and the state of resources.

  • plan - Runs a preview and shows which changes will be applied on top of the current state of resources.

  • apply - Applies the changes to the resources.

  • destroy - Deletes all the resources.


Following is sample Terraform code to create a basic Dataproc cluster with minimal customization. The code has the following configuration blocks:

  • provider - The provider block configures the interaction between Terraform and Google Cloud Platform. A service account credentials file in JSON format is configured for authentication.

  • google_dataproc_cluster - This resource contains the configuration specific to the Dataproc cluster being created.

  • google_compute_network - Creates the VPC network that hosts the cluster.

  • google_compute_firewall - Creates the firewall rule that allows the cluster VMs to communicate.

provider "google" {
 credentials = file("service-account-credentials-file.json")
 project     = "project-id"
 region      = "us-central1"
}


resource "google_dataproc_cluster" "clusterCreationResource" {
 provider = google
 name     = "basic-cluster"
 region   = "us-central1"

 cluster_config {

    gce_cluster_config {
    network = google_compute_network.dataproc_network.name
  }


   master_config {
     num_instances     = 1
     machine_type      = "n1-standard-4"
   }

   worker_config {
     num_instances     = 2
     machine_type      = "n1-standard-8"

   }

   endpoint_config {
      enable_http_port_access = "true"
    }

 }
}

resource "google_compute_network" "dataproc_network" {
  name                    = "basic-cluster-network"
  auto_create_subnetworks = true
}

resource "google_compute_firewall" "firewall_rules" {
  name    = "basic-cluster-firewall-rules"
  network = google_compute_network.dataproc_network.name

  // Allow ping
  allow {
    protocol = "icmp"
  }
  //Allow all TCP ports
  allow {
    protocol = "tcp"
    ports    = ["1-65535"]
  }
  //Allow all UDP ports
  allow {
    protocol = "udp"
    ports    = ["1-65535"]
  }
  // Restrict the source range to the auto-mode subnet ranges, per the Dataproc requirements above
  source_ranges = ["10.128.0.0/9"]
}

Save the Terraform sample code in a file named main.tf.

Navigate to the folder that has your main.tf file and run the command to initialize Terraform:

terraform init

Run the plan command to preview the changes:

terraform plan

Run the apply command to apply the changes and create the cluster:

terraform apply 

To destroy the cluster, run the destroy command:

terraform destroy
Tip

Terraform maintains the state of all the resources it created in a file named terraform.tfstate. When run multiple times, it compares the configuration with the state file and applies only the updates needed. The destroy command deletes all the resources it created and tracked in the state.

1.8 Creating cluster using Python

Problem

You want to automate the creation of a cluster using the Python programming language.

Solution

Google Cloud Dataproc offers Python client libraries for interacting with Dataproc services. Here is sample code for creating a Dataproc cluster.

from google.cloud import dataproc_v1

def create_dataproc_cluster(project_id, region, cluster_name):
    """Creates a Dataproc cluster."""
    # The client must point to the regional Dataproc endpoint.
    dataproc_cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    operation = dataproc_cluster_client.create_cluster(
        project_id=project_id,
        region=region,
        cluster=cluster
    )

    # Wait for the cluster creation operation to complete.
    result = operation.result()

    print(f"Created Dataproc cluster: {result.cluster_name}")


if __name__ == "__main__":
    project_id = "PROJECT-ID"
    region = "REGION"
    cluster_name = "CLUSTER-NAME"

    create_dataproc_cluster(project_id, region, cluster_name)

Discussion

Running the Python-based SDK requires the google-cloud-dataproc package to be installed. To install google-cloud-dataproc using pip, execute the command:

pip install google-cloud-dataproc

Let’s walk through the Python code step by step to understand how it creates the Dataproc cluster.

The code first imports the dataproc_v1 module from the google.cloud package. This module provides the Python client library for the Google Cloud Dataproc API.

from google.cloud import dataproc_v1

The create_dataproc_cluster() function takes three arguments: the project ID, the region, and the cluster name.

def create_dataproc_cluster(project_id, region, cluster_name)

The function first creates a client object for the ClusterControllerClient class, pointing it at the regional Dataproc endpoint. This class provides methods for creating, managing, and monitoring Dataproc clusters.

dataproc_cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

The function then creates a cluster configuration object. The cluster configuration object specifies the configuration of the cluster, such as the number of master and worker nodes, the machine types for the nodes, and the software that should be installed on the nodes.

# Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

The function then calls the create_cluster() method on the client object. The create_cluster() method creates a new Dataproc cluster and returns an operation object, which can be used to track the progress of the cluster creation.

operation = dataproc_cluster_client.create_cluster(
        project_id=project_id,
        region=region,
        cluster=cluster
    )

The function then waits for the operation to complete and prints a message confirming that the cluster has been created.

result = operation.result()

print(f"Created Dataproc cluster: {result.cluster_name}")

The if __name__ == "__main__": block at the end of the code defines the main entry point for the program. When the program is run, this block is executed first. It assigns values to the variables project_id, region, and cluster_name, and then calls the create_dataproc_cluster() function with these values.

if __name__ == "__main__":
    project_id = "project-id"
    region = "us-central1"
    cluster_name = "basic-cluster"

    create_dataproc_cluster(project_id, region, cluster_name)

To run the code, you can save it as a Python file and then run it from the command line. For example, if you save the code as create_dataproc_cluster.py, you can run it by typing the following command into the command line:

python create_dataproc_cluster.py

1.9 Duplicating a Dataproc Cluster

Problem

Users reported an issue in production. You don’t have access to the production environment, so you want to create an exact replica of the production cluster and verify the issue.

Solution

Export the existing cluster configuration to a file:

gcloud dataproc clusters export <source-cluster-name> \
 --region=<region> \
 --destination prod-cluster-config.yaml

Create a new cluster using the YAML configuration file:

gcloud dataproc clusters import <target-cluster-name> \
 --source prod-cluster-config.yaml \
 --region=<region>

 

Discussion

When working with existing clusters, you may want to view cluster details such as worker details, labels, custom configurations, and component gateway URLs.

gcloud command for viewing an existing cluster’s configuration:

gcloud dataproc clusters describe <cluster-name-here> --region <region>

Creating a new cluster with the same configuration as an existing cluster is a two-step process. First, export the existing cluster configuration to a file. Dataproc offers a gcloud command option to export the configuration in YAML format. At the time of this writing, configuration export is only supported from gcloud and cannot be done from the web UI.

Run a command to export the configuration:

gcloud dataproc clusters export prod-cluster \
 --region=<region> \
 --destination prod-cluster-config.yaml

Upon successful execution of the command, the cluster configuration will be stored in a file named prod-cluster-config.yaml. The cluster name and region are not included in the export because the name must be unique. When creating a new cluster using this configuration, the cluster name and region must be provided.

Run a command to create a new cluster using the configuration in the YAML file:

gcloud dataproc clusters import prod-cluster-duplicate \
 --source prod-cluster-config.yaml \
 --region=<region>
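To verify that the new cluster matches the source, you can export its configuration as well and compare the two files; the filenames here are examples:

gcloud dataproc clusters export prod-cluster-duplicate \
 --region=<region> \
 --destination duplicate-cluster-config.yaml

diff prod-cluster-config.yaml duplicate-cluster-config.yaml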
