Dataproc Cookbook

by Narasimha Sadineni, Anuyogam Venkataraman

June 2025

Beginner to intermediate

438 pages

9h 17m

English

O'Reilly Media, Inc.

Table of Contents

Each recipe is divided into Problem, Solution, and Discussion sections; some also include a See Also section.

Preface
    Who Should Read This Book
    Why We Wrote This Book
    Navigating This Book
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments

1. Creating a Dataproc Cluster
    Installing Google Cloud CLI
    Granting Identity and Access Management Privileges to a User
    Configuring a Network and Firewall Rules
    Creating a Dataproc Cluster from a Web UI
    Creating a Dataproc Cluster Using Gcloud
    Creating a Dataproc Cluster Using API Endpoints
    Creating a Dataproc Cluster Using Terraform
    Creating a Cluster Using Python
    Duplicating a Dataproc Cluster

2. Running Hive, Spark, and Sqoop Workloads
    Adding Required Privileges for Jobs
    Generating 1 TB of Data Using a MapReduce Job
    Running a Hive Job to Show Records from an Employee Table
    Converting XML Data to Parquet Using Scala Spark on Dataproc
    Converting XML Data to Parquet Using PySpark on Dataproc
    Submitting a SparkR Job
    Migrating Data from Cloud SQL to Hive Using Sqoop Job
    Choosing Deployment Modes When Submitting a Spark Job to Dataproc

3. Advanced Dataproc Cluster Configuration
    Creating an Autoscaling Policy
    Attaching an Autoscaling Policy to a Dataproc Cluster
    Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Autoscaling Policy
    Adding Local SSDs to Dataproc Worker Nodes
    Creating a Cluster with a Custom Image
    Building a Cluster with Custom Machine Types
    Bootstrapping Dataproc Clusters with Initialization Scripts
    Scheduling Automatic Deletion of Unused Clusters
    Overriding Hadoop Configurations

4. Serverless Spark and Ephemeral Dataproc Clusters
    Running on Dataproc: Serverless Versus Ephemeral Clusters
    Running a Sequence of Jobs on an Ephemeral Cluster
    Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless
    Running a Serverless Job Using the Premium Tier Configuration
    Giving a Unique Custom Name to a Dataproc Serverless Spark Job
    Cloning a Dataproc Serverless Spark Job
    Running a Serverless Job on Spark RAPIDS Accelerator
    Configuring a Spark History Server
    Writing Spark Events to the Spark History Server from Dataproc Serverless
    Monitoring Serverless Spark Jobs
    Calculating the Price of a Serverless Batch

5. Dataproc on Google Kubernetes Engine
    Creating a Kubernetes Cluster
    Creating a Dataproc Cluster on a GKE Cluster
    Running Spark Jobs on a Dataproc GKE Cluster
    Customizing Node Pools
    Autoscaling in a GKE Cluster
    Achieving Zonal High Availability for Dataproc Jobs

6. Dataproc Metastore
    Creating a Dataproc Metastore Service Instance
    Attaching a DPMS Instance to One or More Clusters
    Creating Tables and Verifying Metadata in DPMS
    Installing an External Hive Metastore
    Attaching an External Apache Hive Metastore to the Cluster
    Searching for Metadata in a Dataplex Data Catalog
    Automating the Backup of a DPMS Instance

7. Connecting from Dataproc to GCP Services
    Reading from GCS and Writing to a BigQuery Table
    Reading from a Cloud SQL Table
    Writing to GCS in Delta Format
    Integrating a Dataproc-Managed Delta Lake with BigLake
    Connecting to GCP Services Using Dataproc Templates
    Spark Job Running on Dataproc Reading from GCS and Writing to Bigtable

8. Configuring Logging in Dataproc
    Understanding Different Types of Logs in Dataproc
    Understanding Cloud Logging
    Viewing Logs in Cloud Logging
    Routing Dataproc Logs to Cloud Logging
    Attaching Custom Labels to Logging
    Optimizing Cloud Logging Costs
    Sinking Logs to BigQuery

9. Setting Up Monitoring and Dashboards
    Monitoring Cluster Status
    Exploring Predefined Metrics Charts
    Creating Charts Using Metrics Explorer
    Creating Dashboards Using Metrics Explorer
    Setting Up Alerts
    Migrating Dashboards from One Project to Another
    Creating Custom Log-Based Metrics

10. Dataproc Security
    Managing Identities in Dataproc Clusters
    Securing Your Perimeter Using VPC Service Controls
    Authenticating Using Kerberos
    Installing Ranger
    Securing Cluster Resources Using Ranger
    Managing Credentials in the Google Cloud Environment
    Enforcing Restrictions Across All Clusters
    Tokenizing Sensitive Data

11. Performance Tuning and Cost Optimization
    Sizing a Dataproc Cluster
    Choosing the Right Disks for Big Data Workloads on Dataproc
    Benchmarking Clusters with Performance Tuning
    Navigating the Spark UI
    Optimizing Spark Jobs
    Installing Sparklens for Profiling Spark Applications
    Identifying Spark Job Errors and Bottlenecks
    Understanding the YARN UI
    Calculating the Cost of a Dataproc Cluster
    Optimizing Cost in Dataproc Clusters

12. Orchestrating Dataproc Workloads
    Understanding the Prerequisites for Installing Cloud Composer
    Deploying a Cloud Composer Environment
    Scheduling a Job in Composer
    Parameterizing Variables
    Scaling Up a Cloud Composer Environment
    Running Spark Jobs Using Vertex AI Machine Learning Pipelines
    Scheduling a Dataproc Job in Event Driven Using a Cloud Function
    Using Dataproc Workflow Templates

13. Using Spark Notebooks on Dataproc
    Deciding Which Notebook Environments to Choose
    Configuring Notebooks on a Dataproc Cluster
    Running Spark Scala and PySpark Notebooks on Dataproc
    Managing Libraries and Configs
    Creating Dataproc-Enabled Vertex AI Workbench Instances
    Executing Notebooks Using Spark Serverless Sessions

14. Migrating from On-Premises and Public Cloud Services to GCP
    Planning Migration
    Data Migration Strategies
    Migrating Data with STS
    Accessing AWS S3 Data Using BigLake Tables
    Migrating Metadata
    Migrating Applications to Google Cloud

Index
About the Authors

Content preview from Dataproc Cookbook

Chapter 8. Configuring Logging in Dataproc

In the world of distributed data processing, logging is an essential tool that empowers you to monitor the health of your clusters, pinpoint bottlenecks, and rapidly diagnose issues. However, the sheer volume and variety of logs generated across the Dataproc ecosystem can be overwhelming. This chapter equips you with the knowledge and strategies you need to navigate the Dataproc logging landscape effectively.

Before we dive in, let’s briefly explore why logging matters. Logging is more than just a stream of text; it’s your window into the inner workings of Dataproc. Logging provides:

Visibility: See what’s happening at each stage of your cluster’s lifecycle, from creation to job execution.
Performance optimization: Identify resource-intensive operations and fine-tune your configurations for maximum efficiency.
Debugging: Quickly isolate the root causes of errors and failures, saving you valuable time and effort.
Security: Monitor for suspicious activity or unauthorized access attempts.

There are challenges with Dataproc logging, though. For example, Dataproc generates logs from multiple sources, including:

Cluster logs: Capture events related to cluster creation, configuration, and operation
Initialization scripts: Record the output of scripts that customize your cluster environment
Service logs: Provide insights into the behavior of core Dataproc services (master, workers, etc.)
Application ...

Publisher Resources

ISBN: 9781098157692
