Data Engineering with Databricks Cookbook

by Pulkit Chadha

May 2024

Beginner to intermediate

438 pages

9h 41m

English

Packt Publishing

Contributors: About the author, About the reviewers

The evolving landscape of data engineering
A pragmatic approach to data engineering
Key features
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Sections: Getting ready, How to do it…, How it works…, There’s more…, See also
Get in touch
Share Your Thoughts
Download a free PDF copy of this book

Technical requirements
Reading CSV data with Apache Spark: How to do it…, There’s more…, See also
Reading JSON data with Apache Spark: How to do it…, There’s more…, See also
Reading Parquet data with Apache Spark: How to do it…, See also
Parsing XML data with Apache Spark: How to do it…, There’s more…, See also
Working with nested data structures in Apache Spark: How to do it…, There’s more…, See also
Processing text data in Apache Spark: How to do it…, There’s more…, See also
Writing data with Apache Spark: How to do it…, There’s more…, See also

Technical requirements
Applying basic transformations to data with Apache Spark: How to do it…, There’s more…, See also
Filtering data with Apache Spark: How to do it…, There’s more…, See also
Performing joins with Apache Spark: How to do it…, There’s more…, See also
Performing aggregations with Apache Spark: How to do it…, There’s more…, See also
Using window functions with Apache Spark: How to do it…, There’s more…
Writing custom UDFs in Apache Spark: How to do it…, There’s more…, See also
Handling null values with Apache Spark: How to do it…, There’s more…, See also

Technical requirements
Creating a Delta Lake table: How to do it…, There’s more…, See also
Reading a Delta Lake table: How to do it…, There’s more…, See also
Updating data in a Delta Lake table: How to do it…, See also
Merging data into Delta tables: How to do it…, There’s more…, See also
Change data capture in Delta Lake: How to do it…, See also
Optimizing Delta Lake tables: How to do it…, There’s more…, See also
Versioning and time travel for Delta Lake tables: How to do it…, There’s more…, See also
Managing Delta Lake tables: How to do it…, See also

Technical requirements
Configuring Spark Structured Streaming for real-time data processing: Getting ready, How to do it…, How it works…, There’s more…, See also
Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming: Getting ready, How to do it…, How it works…, There’s more…, See also
Defining transformations and filters on a Streaming DataFrame: Getting ready, How to do it…, See also
Configuring checkpoints for Structured Streaming in Apache Spark: Getting ready, How to do it…, How it works…, There’s more…, See also
Configuring triggers for Structured Streaming in Apache Spark: Getting ready, How to do it…, How it works…, See also
Applying window aggregations to streaming data with Apache Spark Structured Streaming: Getting ready, How to do it…, There’s more…, See also
Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming: Getting ready, How to do it…, There’s more…, See also

Technical requirements
Writing the output of Apache Spark Structured Streaming to a sink such as Delta Lake: Getting ready, How to do it…, How it works…, See also
Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming: Getting ready, How to do it…, See also
Merging or applying Change Data Capture on Apache Spark Structured Streaming and Delta Lake: Getting ready, How to do it…, There’s more…
Joining streaming data with static data in Apache Spark Structured Streaming and Delta Lake: Getting ready, How to do it…, There’s more…, See also
Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake: Getting ready, How to do it…, There’s more…, See also
Monitoring real-time data processing with Apache Spark Structured Streaming: Getting ready, How to do it…, There’s more…, See also

Technical requirements
Monitoring Spark jobs in the Spark UI: How to do it…, See also
Using broadcast variables: How to do it…, How it works…, There’s more…
Optimizing Spark jobs by minimizing data shuffling: How to do it…, See also
Avoiding data skew: How to do it…, There’s more…
Caching and persistence: How to do it…, There’s more…
Partitioning and repartitioning: How to do it…, There’s more…
Optimizing join strategies: How to do it…, See also

Technical requirements
Optimizing Delta Lake table partitioning for query performance: How to do it…, There’s more…, See also
Organizing data with Z-ordering for efficient query execution: How to do it…, How it works…, See also
Skipping data for faster query execution: How to do it…, See also
Reducing Delta Lake table size and I/O cost with compression: How to do it…, How it works…, See also

Technical requirements
Building Databricks workflows: How to do it…, See also
Running and managing Databricks Workflows: How to do it…, See also
Passing task and job parameters within a Databricks Workflow: How to do it…, See also
Conditional branching in Databricks Workflows: How to do it…, See also
Triggering jobs based on file arrival: Getting ready, How to do it…, See also
Setting up workflow alerts and notifications: How to do it…, There’s more…, See also
Troubleshooting and repairing failures in Databricks Workflows: How to do it…, See also

Technical requirements
Creating a multi-hop medallion architecture data pipeline with Delta Live Tables in Databricks: How to do it…, How it works…, See also
Building a data pipeline with Delta Live Tables on Databricks: How to do it…, See also
Implementing data quality and validation rules with Delta Live Tables in Databricks: How to do it…, How it works…, See also
Quarantining bad data with Delta Live Tables in Databricks: How to do it…, See also
Monitoring Delta Live Tables pipelines: How to do it…, See also
Deploying Delta Live Tables pipelines with Databricks Asset Bundles: Getting ready, How to do it…, There’s more…, See also
Applying changes (CDC) to Delta tables with Delta Live Tables: How to do it…, See also

Technical requirements
Connecting to cloud object storage using Unity Catalog: Getting ready, How to do it…, See also
Creating and managing catalogs, schemas, volumes, and tables using Unity Catalog: Getting ready, How to do it…, See also
Defining and applying fine-grained access control policies using Unity Catalog: Getting ready, How to do it…, See also
Tagging, commenting, and capturing metadata about data and AI assets using Databricks Unity Catalog: Getting ready, How to do it…, See also
Filtering sensitive data with Unity Catalog: Getting ready, How to do it…, See also
Using Unity Catalogs lineage data for debugging, root cause analysis, and impact assessment: Getting ready, How to do it…, See also
Accessing and querying system tables using Unity Catalog: Getting ready, How to do it…, See also

Technical requirements
Using Databricks Repos to store code in Git: Getting ready, How to do it…, There’s more…, See also
Automating tasks by using the Databricks CLI: Getting ready, How to do it…, There’s more…, See also
Using the Databricks VSCode extension for local development and testing: Getting ready, How to do it…, See also
Using Databricks Asset Bundles (DABs): Getting ready, How to do it…, See also
Leveraging GitHub Actions with Databricks Asset Bundles (DABs): Getting ready, How to do it…, See also

Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book

Overview

In "Data Engineering with Databricks Cookbook," you'll learn how to efficiently build and manage data pipelines using Apache Spark, Delta Lake, and Databricks. This recipe-based guide offers techniques to transform, optimize, and orchestrate your data workflows.

What this book will help me do

Master Apache Spark for data ingestion, transformation, and analysis.
Learn to optimize data processing and improve query performance with Delta Lake.
Manage streaming data processing with Spark Structured Streaming capabilities.
Implement DataOps and DevOps workflows tailored for Databricks.
Enforce data governance policies using Unity Catalog for scalable solutions.

Author(s)

Pulkit Chadha is a Senior Solutions Architect at Databricks. With extensive experience in data engineering and big data applications, he brings practical insights into implementing modern data solutions. His educational writing focuses on giving data professionals actionable knowledge.

Who is it for?

This book is ideal for data engineers, data scientists, and analysts who want to deepen their knowledge of managing and transforming large datasets. Readers should have an intermediate understanding of SQL and Python programming and be familiar with basic data architecture concepts. It is especially well suited to professionals working with Databricks or similar cloud-based data platforms.