
Dataproc Cookbook

by Narasimha Sadineni, Anuyogam Venkataraman

June 2025

Beginner to intermediate

438 pages

9h 17m

English

O'Reilly Media, Inc.

Preface
  Who Should Read This Book
  Why We Wrote This Book
  Navigating This Book
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Creating a Dataproc Cluster
  Installing Google Cloud CLI (Problem, Solution, Discussion)
  Granting Identity and Access Management Privileges to a User (Problem, Solution, Discussion)
  Configuring a Network and Firewall Rules (Problem, Solution, Discussion, See Also)
  Creating a Dataproc Cluster from a Web UI (Problem, Solution, Discussion)
  Creating a Dataproc Cluster Using Gcloud (Problem, Solution, Discussion)
  Creating a Dataproc Cluster Using API Endpoints (Problem, Solution, Discussion)
  Creating a Dataproc Cluster Using Terraform (Problem, Solution, Discussion)
  Creating a Cluster Using Python (Problem, Solution, Discussion)
  Duplicating a Dataproc Cluster (Problem, Solution, Discussion)
2. Running Hive, Spark, and Sqoop Workloads
  Adding Required Privileges for Jobs (Problem, Solution, Discussion, See Also)
  Generating 1 TB of Data Using a MapReduce Job (Problem, Solution, Discussion)
  Running a Hive Job to Show Records from an Employee Table (Problem, Solution, Discussion, See Also)
  Converting XML Data to Parquet Using Scala Spark on Dataproc (Problem, Solution, Discussion, See Also)
  Converting XML Data to Parquet Using PySpark on Dataproc (Problem, Solution, Discussion, See Also)
  Submitting a SparkR Job (Problem, Solution, Discussion)
  Migrating Data from Cloud SQL to Hive Using Sqoop Job (Problem, Solution, Discussion)
  Choosing Deployment Modes When Submitting a Spark Job to Dataproc (Problem, Solution, Discussion, See Also)
3. Advanced Dataproc Cluster Configuration
  Creating an Autoscaling Policy (Problem, Solution, Discussion)
  Attaching an Autoscaling Policy to a Dataproc Cluster (Problem, Solution, Discussion)
  Optimizing Cluster Costs with a Mixed On-Demand and Spot Instance Autoscaling Policy (Problem, Solution, Discussion)
  Adding Local SSDs to Dataproc Worker Nodes (Problem, Solution, Discussion)
  Creating a Cluster with a Custom Image (Problem, Solution, Discussion)
  Building a Cluster with Custom Machine Types (Problem, Solution, Discussion)
  Bootstrapping Dataproc Clusters with Initialization Scripts (Problem, Solution, Discussion)
  Scheduling Automatic Deletion of Unused Clusters (Problem, Solution, Discussion)
  Overriding Hadoop Configurations (Problem, Solution, Discussion)
4. Serverless Spark and Ephemeral Dataproc Clusters
  Running on Dataproc: Serverless Versus Ephemeral Clusters (Problem, Solution, Discussion)
  Running a Sequence of Jobs on an Ephemeral Cluster (Problem, Solution, Discussion)
  Executing a Spark Batch Job to Convert XML Data to Parquet on Dataproc Serverless (Problem, Solution, Discussion, See Also)
  Running a Serverless Job Using the Premium Tier Configuration (Problem, Solution, Discussion)
  Giving a Unique Custom Name to a Dataproc Serverless Spark Job (Problem, Solution, Discussion, See Also)
  Cloning a Dataproc Serverless Spark Job (Problem, Solution, Discussion)
  Running a Serverless Job on Spark RAPIDS Accelerator (Problem, Solution, Discussion, See Also)
  Configuring a Spark History Server (Problem, Solution, Discussion, See Also)
  Writing Spark Events to the Spark History Server from Dataproc Serverless (Problem, Solution, Discussion)
  Monitoring Serverless Spark Jobs (Problem, Solution, Discussion, See Also)
  Calculating the Price of a Serverless Batch (Problem, Solution, Discussion, See Also)
5. Dataproc on Google Kubernetes Engine
  Creating a Kubernetes Cluster (Problem, Solution, Discussion)
  Creating a Dataproc Cluster on a GKE Cluster (Problem, Solution, Discussion)
  Running Spark Jobs on a Dataproc GKE Cluster (Problem, Solution, Discussion)
  Customizing Node Pools (Problem, Solution, Discussion)
  Autoscaling in a GKE Cluster (Problem, Solution, Discussion)
  Achieving Zonal High Availability for Dataproc Jobs (Problem, Solution, Discussion)
6. Dataproc Metastore
  Creating a Dataproc Metastore Service Instance (Problem, Solution, Discussion, See Also)
  Attaching a DPMS Instance to One or More Clusters (Problem, Solution, Discussion)
  Creating Tables and Verifying Metadata in DPMS (Problem, Solution, Discussion)
  Installing an External Hive Metastore (Problem, Solution, Discussion)
  Attaching an External Apache Hive Metastore to the Cluster (Problem, Solution, Discussion)
  Searching for Metadata in a Dataplex Data Catalog (Problem, Solution, Discussion)
  Automating the Backup of a DPMS Instance (Problem, Solution, Discussion)
7. Connecting from Dataproc to GCP Services
  Reading from GCS and Writing to a BigQuery Table (Problem, Solution, Discussion)
  Reading from a Cloud SQL Table (Problem, Solution, Discussion)
  Writing to GCS in Delta Format (Problem, Solution, Discussion)
  Integrating a Dataproc-Managed Delta Lake with BigLake (Problem, Solution, Discussion)
  Connecting to GCP Services Using Dataproc Templates (Problem, Solution, Discussion)
  Spark Job Running on Dataproc Reading from GCS and Writing to Bigtable (Problem, Solution, Discussion, See Also)
8. Configuring Logging in Dataproc
  Understanding Different Types of Logs in Dataproc (Problem, Solution, Discussion)
  Understanding Cloud Logging (Problem, Solution, Discussion)
  Viewing Logs in Cloud Logging (Problem, Solution, Discussion)
  Routing Dataproc Logs to Cloud Logging (Problem, Solution, Discussion)
  Attaching Custom Labels to Logging (Problem, Solution, Discussion)
  Optimizing Cloud Logging Costs (Problem, Solution, Discussion)
  Sinking Logs to BigQuery (Problem, Solution, Discussion)
9. Setting Up Monitoring and Dashboards
  Monitoring Cluster Status (Problem, Solution, Discussion)
  Exploring Predefined Metrics Charts (Problem, Solution, Discussion)
  Creating Charts Using Metrics Explorer (Problem, Solution, Discussion)
  Creating Dashboards Using Metrics Explorer (Problem, Solution, Discussion)
  Setting Up Alerts (Problem, Solution, Discussion)
  Migrating Dashboards from One Project to Another (Problem, Solution, Discussion)
  Creating Custom Log-Based Metrics (Problem, Solution, Discussion)

10. Dataproc Security
  Managing Identities in Dataproc Clusters (Problem, Solution, Discussion)
  Securing Your Perimeter Using VPC Service Controls (Problem, Solution, Discussion)
  Authenticating Using Kerberos (Problem, Solution, Discussion)
  Installing Ranger (Problem, Solution, Discussion)
  Securing Cluster Resources Using Ranger (Problem, Solution, Discussion)
  Managing Credentials in the Google Cloud Environment (Problem, Solution, Discussion)
  Enforcing Restrictions Across All Clusters (Problem, Solution, Discussion)
  Tokenizing Sensitive Data (Problem, Solution, Discussion)
11. Performance Tuning and Cost Optimization
  Sizing a Dataproc Cluster (Problem, Solution, Discussion)
  Choosing the Right Disks for Big Data Workloads on Dataproc (Problem, Solution, Discussion)
  Benchmarking Clusters with Performance Tuning (Problem, Solution, Discussion)
  Navigating the Spark UI (Problem, Solution, Discussion)
  Optimizing Spark Jobs (Problem, Solution, Discussion)
  Installing Sparklens for Profiling Spark Applications (Problem, Solution, Discussion, See Also)
  Identifying Spark Job Errors and Bottlenecks (Problem, Solution, Discussion)
  Understanding the YARN UI (Problem, Solution, Discussion)
  Calculating the Cost of a Dataproc Cluster (Problem, Solution, Discussion)
  Optimizing Cost in Dataproc Clusters (Problem, Solution, Discussion)
12. Orchestrating Dataproc Workloads
  Understanding the Prerequisites for Installing Cloud Composer (Problem, Solution, Discussion)
  Deploying a Cloud Composer Environment (Problem, Solution, Discussion)
  Scheduling a Job in Composer (Problem, Solution, Discussion)
  Parameterizing Variables (Problem, Solution, Discussion)
  Scaling Up a Cloud Composer Environment (Problem, Solution, Discussion)
  Running Spark Jobs Using Vertex AI Machine Learning Pipelines (Problem, Solution, Discussion)
  Scheduling a Dataproc Job in Event Driven Using a Cloud Function (Problem, Solution, Discussion)
  Using Dataproc Workflow Templates (Problem, Solution, Discussion)
13. Using Spark Notebooks on Dataproc
  Deciding Which Notebook Environments to Choose (Problem, Solution, Discussion)
  Configuring Notebooks on a Dataproc Cluster (Problem, Solution, Discussion)
  Running Spark Scala and PySpark Notebooks on Dataproc (Problem, Solution, Discussion)
  Managing Libraries and Configs (Problem, Solution, Discussion)
  Creating Dataproc-Enabled Vertex AI Workbench Instances (Problem, Solution, Discussion)
  Executing Notebooks Using Spark Serverless Sessions (Problem, Solution, Discussion)
14. Migrating from On-Premises and Public Cloud Services to GCP
  Planning Migration (Problem, Solution, Discussion, See Also)
  Data Migration Strategies (Problem, Solution, Discussion)
  Migrating Data with STS (Problem, Solution, Discussion)
  Accessing AWS S3 Data Using BigLake Tables (Problem, Solution, Discussion)
  Migrating Metadata (Problem, Solution, Discussion)
  Migrating Applications to Google Cloud (Problem, Solution, Discussion)
Index
About the Authors

Content preview from Dataproc Cookbook

Chapter 6. Dataproc Metastore

In data processing, handling metadata efficiently is essential for maintaining and organizing data. This chapter covers the practical aspects of working with metadata using Apache Hive Metastore-based services. The focus is on the Apache Hive Metastore, a core component of the Hive framework that stores its metadata in a relational database (RDBMS).

The Hive Metastore supports Spark and Hive jobs that operate on structured data stored in tables. These jobs read and write table metadata as part of their work. Hive Metastore-based services give data engineers and analysts a consistent way to manage that metadata and help keep their data intact and accessible. Let’s explore some of the key concepts, advantages, and integration methods of Hive Metastore.

The key concepts and components of Hive Metastore are:

Metastore: A centralized repository that stores and manages metadata about data, including table definitions, column schemas, and partition information
Catalog: A logical grouping of databases within the metastore
Database: A container for tables and other data objects within the metastore
Table: A collection of structured data organized into rows and columns, along with its schema and other properties
Partition: A logical division of a table based on specific criteria, enabling efficient data management and querying
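
The containment hierarchy described by these terms can be sketched as plain Python data classes. This is an illustrative model only, not the Hive Metastore's actual storage schema or client API: the class names, the add_partition and lookup helpers, the catalog name "hive", the "retail" database, the "sales" table, and the gs://example-bucket path are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Partition:
    """A logical division of a table, identified by its partition-key values."""
    values: Tuple[str, ...]
    location: str  # where the partition's data files live (hypothetical path)


@dataclass
class Table:
    """Structured data plus its schema and partition layout."""
    name: str
    columns: Dict[str, str]  # column name -> column type
    partition_keys: List[str] = field(default_factory=list)
    partitions: List[Partition] = field(default_factory=list)

    def add_partition(self, values: Sequence[str], location: str) -> None:
        # Each partition must supply exactly one value per partition key.
        if len(values) != len(self.partition_keys):
            raise ValueError("expected one value per partition key")
        self.partitions.append(Partition(tuple(values), location))


@dataclass
class Database:
    """A container for tables."""
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)


@dataclass
class Catalog:
    """A logical grouping of databases."""
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)


@dataclass
class Metastore:
    """The central repository that holds every catalog."""
    catalogs: Dict[str, Catalog] = field(default_factory=dict)

    def lookup(self, catalog: str, database: str, table: str) -> Table:
        # Walk the hierarchy: metastore -> catalog -> database -> table.
        return self.catalogs[catalog].databases[database].tables[table]


def build_example() -> Metastore:
    """Build a tiny metastore with one partitioned table (all names hypothetical)."""
    sales = Table(
        name="sales",
        columns={"order_id": "bigint", "amount": "double"},
        partition_keys=["sale_date"],
    )
    sales.add_partition(
        ["2024-01-01"], "gs://example-bucket/sales/sale_date=2024-01-01"
    )
    retail = Database(name="retail", tables={"sales": sales})
    catalog = Catalog(name="hive", databases={"retail": retail})
    return Metastore(catalogs={"hive": catalog})


if __name__ == "__main__":
    table = build_example().lookup("hive", "retail", "sales")
    print(table.name, table.partition_keys, len(table.partitions))
```

A job that queries a table resolves it through the same path (catalog, then database, then table) and uses the partition entries to decide which data locations to read.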

The ...

Publisher Resources

ISBN: 9781098157692
