Delta Lake: Up and Running

Book description

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the quality of their data. Delta Lake's open source format offers a robust lakehouse framework on top of cloud object stores such as Amazon S3, ADLS, and GCS.

This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. Because the ultimate goal of building data pipelines and applications is to gain insights from data, you'll understand how your choice of storage solution determines the robustness and performance of the pipeline, from raw data to insights.

You'll learn how to:

  • Use modern data management and data engineering techniques
  • Understand how ACID transactions bring reliability to data lakes at scale
  • Run streaming and batch jobs against your data lake concurrently
  • Execute update, delete, and merge commands against your data lake
  • Use time travel to roll back and examine previous data versions
  • Build a streaming data quality pipeline following the medallion architecture
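
To give a sense of what these exercises look like in practice, here is a minimal PySpark sketch, assuming a local Spark session with the delta-spark package installed (pip install delta-spark); the table path and app name are illustrative, not taken from the book:

    import pyspark
    from delta import configure_spark_with_delta_pip

    # Configure a local SparkSession with the Delta Lake extensions.
    builder = (
        pyspark.sql.SparkSession.builder.appName("delta-quickstart")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write a small DataFrame as a Delta table (path is hypothetical).
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/demo")

    # Read the table back, then query its first version with time travel.
    spark.read.format("delta").load("/tmp/delta/demo").show()
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo").show()

Chapter 2 walks through this kind of setup in detail, and Chapter 6 covers time travel and data retention.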

Table of contents

  1. Preface
    1. How to Contact Us
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. Acknowledgments
  2. 1. The Evolution of Data Architectures
    1. A Brief History of Relational Databases
    2. Data Warehouses
      1. Data Warehouse Architecture
      2. Dimensional Modeling
    3. Data Warehouse Benefits and Challenges
    4. Introducing Data Lakes
    5. Data Lakehouse
      1. Data Lakehouse Benefits
      2. Implementing a Lakehouse
    6. Delta Lake
    7. The Medallion Architecture
    8. The Delta Ecosystem
      1. Delta Lake Storage
      2. Delta Sharing
      3. Delta Connectors
    9. Conclusion
  3. 2. Getting Started with Delta Lake
    1. Getting a Standard Spark Image
    2. Using Delta Lake with PySpark
    3. Running Delta Lake in the Spark Scala Shell
    4. Running Delta Lake on Databricks
    5. Creating and Running a Spark Program: helloDeltaLake
    6. The Delta Lake Format
      1. Parquet Files
      2. Writing a Delta Table
    7. The Delta Lake Transaction Log
      1. How the Transaction Log Implements Atomicity
      2. Breaking Down Transactions into Atomic Commits
      3. The Transaction Log at the File Level
      4. Scaling Massive Metadata
    8. Conclusion
  4. 3. Basic Operations on Delta Tables
    1. Creating a Delta Table
      1. Creating a Delta Table with SQL DDL
      2. The DESCRIBE Statement
      3. Creating Delta Tables with the DataFrameWriter API
      4. Creating a Delta Table with the DeltaTableBuilder API
      5. Generated Columns
    2. Reading a Delta Table
      1. Reading a Delta Table with SQL
      2. Reading a Table with PySpark
    3. Writing to a Delta Table
      1. Cleaning Out the YellowTaxis Table
      2. Inserting Data with SQL INSERT
      3. Appending a DataFrame to a Table
    4. Using the OverWrite Mode When Writing to a Delta Table
      1. Inserting Data with the SQL COPY INTO Command
      2. Partitions
    5. User-Defined Metadata
      1. Using SparkSession to Set Custom Metadata
      2. Using the DataFrameWriter to Set Custom Metadata
    6. Conclusion
  5. 4. Table Deletes, Updates, and Merges
    1. Deleting Data from a Delta Table
      1. Table Creation and DESCRIBE HISTORY
      2. Performing the DELETE Operation
      3. DELETE Performance Tuning Tips
    2. Updating Data in a Table
      1. Use Case Description
      2. Updating Data in a Table
      3. UPDATE Performance Tuning Tips
    3. Upsert Data Using the MERGE Operation
      1. Use Case Description
      2. The MERGE Dataset
      3. The MERGE Statement
      4. Analyzing the MERGE Operation with DESCRIBE HISTORY
      5. Inner Workings of the MERGE Operation
    4. Conclusion
  6. 5. Performance Tuning
    1. Data Skipping
    2. Partitioning
      1. Partitioning Warnings and Considerations
    3. Compact Files
      1. Compaction
      2. OPTIMIZE
    4. ZORDER BY
      1. ZORDER BY Considerations
    5. Liquid Clustering
      1. Enabling Liquid Clustering
      2. Operations on Clustered Columns
      3. Liquid Clustering Warnings and Considerations
    6. Conclusion
  7. 6. Using Time Travel
    1. Delta Lake Time Travel
      1. Restoring a Table
      2. Restoring via Timestamp
      3. Time Travel Under the Hood
      4. RESTORE Considerations and Warnings
      5. Querying an Older Version of a Table
    2. Data Retention
      1. Data File Retention
      2. Log File Retention
      3. Setting File Retention Duration Example
      4. Data Archiving
    3. VACUUM
      1. VACUUM Syntax and Examples
      2. How Often Should You Run VACUUM and Other Maintenance Tasks?
      3. VACUUM Warnings and Considerations
    4. Change Data Feed
      1. Enabling the CDF
      2. Viewing the CDF
      3. CDF Warnings and Considerations
    5. Conclusion
  8. 7. Schema Handling
    1. Schema Validation
      1. Viewing the Schema in the Transaction Log Entries
      2. Schema on Write
      3. Schema Enforcement Example
    2. Schema Evolution
      1. Adding a Column
      2. Missing Data Column in Source DataFrame
      3. Changing a Column Data Type
      4. Adding a NullType Column
    3. Explicit Schema Updates
      1. Adding a Column to a Table
      2. Adding Comments to a Column
      3. Changing Column Ordering
      4. Delta Lake Column Mapping
      5. Renaming a Column
      6. Replacing the Table Columns
      7. Dropping a Column
      8. The REORG TABLE Command
      9. Changing Column Data Type or Name
    4. Conclusion
  9. 8. Operations on Streaming Data
    1. Streaming Overview
      1. Spark Structured Streaming
      2. Delta Lake and Structured Streaming
    2. Streaming Examples
      1. Hello Streaming World
      2. AvailableNow Streaming
      3. Updating the Source Records
      4. Reading a Stream from the Change Data Feed
    3. Conclusion
  10. 9. Delta Sharing
    1. Conventional Methods of Data Sharing
      1. Legacy and Homegrown Solutions
      2. Proprietary Vendor Solutions
      3. Cloud Object Storage
    2. Open Source Delta Sharing
      1. Delta Sharing Goals
    3. Delta Sharing Under the Hood
      1. Data Providers and Recipients
      2. Benefits of the Design
    4. The delta-sharing Repository
      1. Step 1: Installing the Python Connector
      2. Step 2: Installing the Profile File
      3. Step 3: Reading a Shared Table
    5. Conclusion
  11. 10. Building a Lakehouse on Delta Lake
    1. Storage Layer
      1. What Is a Data Lake?
      2. Types of Data
      3. Key Benefits of a Cloud Data Lake
    2. Data Management
    3. SQL Analytics
      1. SQL Analytics via Spark SQL
      2. SQL Analytics via Other Delta Lake Integrations
    4. Data for Data Science and Machine Learning
      1. Challenges with Traditional Machine Learning
      2. Delta Lake Features That Support Machine Learning
      3. Putting It All Together
    5. Medallion Architecture
      1. The Bronze Layer (Raw Data)
      2. The Silver Layer
      3. The Gold Layer
      4. The Complete Lakehouse
    6. Conclusion
  12. Index
  13. About the Author

Product information

  • Title: Delta Lake: Up and Running
  • Author(s): Bennie Haelen, Dan Davis
  • Release date: October 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098139728