Delta Lake: Up and Running

by Bennie Haelen, Dan Davis

October 2023

Beginner to intermediate

264 pages

6h 45m

English

O'Reilly Media, Inc.

Preface
How to Contact Us; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; Acknowledgment
1. The Evolution of Data Architectures
A Brief History of Relational Databases; Data Warehouses; Data Warehouse Architecture; Dimensional Modeling; Data Warehouse Benefits and Challenges; Introducing Data Lakes; Data Lakehouse; Data Lakehouse Benefits; Implementing a Lakehouse; Delta Lake; The Medallion Architecture; The Delta Ecosystem; Delta Lake Storage; Delta Sharing; Delta Connectors; Conclusion
2. Getting Started with Delta Lake
Getting a Standard Spark Image; Using Delta Lake with PySpark; Running Delta Lake in the Spark Scala Shell; Running Delta Lake on Databricks; Creating and Running a Spark Program: helloDeltaLake; The Delta Lake Format; Parquet Files; Writing a Delta Table; The Delta Lake Transaction Log; How the Transaction Log Implements Atomicity; Breaking Down Transactions into Atomic Commits; The Transaction Log at the File Level; Scaling Massive Metadata; Conclusion
3. Basic Operations on Delta Tables
Creating a Delta Table; Creating a Delta Table with SQL DDL; The DESCRIBE Statement; Creating Delta Tables with the DataFrameWriter API; Creating a Delta Table with the DeltaTableBuilder API; Generated Columns; Reading a Delta Table; Reading a Delta Table with SQL; Reading a Table with PySpark; Writing to a Delta Table; Cleaning Out the YellowTaxis Table; Inserting Data with SQL INSERT; Appending a DataFrame to a Table; Using the OverWrite Mode When Writing to a Delta Table; Inserting Data with the SQL COPY INTO Command; Partitions; User-Defined Metadata; Using SparkSession to Set Custom Metadata; Using the DataFrameWriter to Set Custom Metadata; Conclusion
4. Table Deletes, Updates, and Merges
Deleting Data from a Delta Table; Table Creation and DESCRIBE HISTORY; Performing the DELETE Operation; DELETE Performance Tuning Tips; Updating Data in a Table; Use Case Description; Updating Data in a Table; UPDATE Performance Tuning Tips; Upsert Data Using the MERGE Operation; Use Case Description; The MERGE Dataset; The MERGE Statement; Analyzing the MERGE operation with DESCRIBE HISTORY; Inner Workings of the MERGE Operation; Conclusion
5. Performance Tuning
Data Skipping; Partitioning; Partitioning Warnings and Considerations; Compact Files; Compaction; OPTIMIZE; ZORDER BY; ZORDER BY Considerations; Liquid Clustering; Enabling Liquid Clustering; Operations on Clustered Columns; Liquid Clustering Warnings and Considerations; Conclusion
6. Using Time Travel
Delta Lake Time Travel; Restoring a Table; Restoring via Timestamp; Time Travel Under the Hood; RESTORE Considerations and Warnings; Querying an Older Version of a Table; Data Retention; Data File Retention; Log File Retention; Setting File Retention Duration Example; Data Archiving; VACUUM; VACUUM Syntax and Examples; How Often Should You Run VACUUM and Other Maintenance Tasks?; VACUUM Warnings and Considerations; Changing Data Feed; Enabling the CDF; Viewing the CDF; CDF Warnings and Considerations; Conclusion
7. Schema Handling
Schema Validation; Viewing the Schema in the Transaction Log Entries; Schema on Write; Schema Enforcement Example; Schema Evolution; Adding a Column; Missing Data Column in Source DataFrame; Changing a Column Data Type; Adding a NullType Column; Explicit Schema Updates; Adding a Column to a Table; Adding Comments to a Column; Changing Column Ordering; Delta Lake Column Mapping; Renaming a Column; Replacing the Table Columns; Dropping a Column; The REORG TABLE Command; Changing Column Data Type or Name; Conclusion
8. Operations on Streaming Data
Streaming Overview; Spark Structured Streaming; Delta Lake and Structured Streaming; Streaming Examples; Hello Streaming World; AvailableNow Streaming; Updating the Source Records; Reading a Stream from the Change Data Feed; Conclusion
9. Delta Sharing
Conventional Methods of Data Sharing; Legacy and Homegrown Solutions; Proprietary Vendor Solutions; Cloud Object Storage; Open Source Delta Sharing; Delta Sharing Goals; Delta Sharing Under the Hood; Data Providers and Recipients; Benefits of the Design; The delta-sharing Repository; Step 1: Installing the Python Connector; Step 2: Installing the Profile File; Step 3: Reading a Shared Table; Conclusion

10. Building a Lakehouse on Delta Lake
Storage Layer; What Is a Data Lake?; Types of Data; Key Benefits of a Cloud Data Lake; Data Management; SQL Analytics; SQL Analytics via Spark SQL; SQL Analytics via Other Delta Lake Integrations; Data for Data Science and Machine Learning; Challenges with Traditional Machine Learning; Delta Lake Features That Support Machine Learning; Putting It All Together; Medallion Architecture; The Bronze Layer (Raw Data); The Silver Layer; The Gold Layer; The Complete Lakehouse; Conclusion
Index
About the Author

Content preview from Delta Lake: Up and Running

Chapter 7. Schema Handling

Traditionally, data lakes have operated under the principle of schema on read and have struggled to enforce schema on write. This means there is no predefined schema when data is written to storage; a schema is applied only when the data is processed. For analytics and data platforms, it is imperative that your table format enforces the schema on write, both to avoid introducing breaking changes into downstream processes and to maintain proper data quality and integrity.

While it is essential to adhere to schema on write, we must also acknowledge that in today’s fast-paced business climate, data sources, analytics needs, and the structure of the data itself are constantly changing. Schemas must therefore be flexible enough to evolve over time in order to capture new and changing information.

The schema challenges often seen in traditional data lakes can be grouped under two key schema handling features that any data platform and table format, regardless of the storage layer, must support:

Schema enforcement: This is the process of ensuring that all data added to a table conforms to the table’s schema, where the schema defines the table structure as a list of column names, their data types, and any optional constraints. Requiring data to fit the structure of a defined schema helps to maintain the quality and consistency of the data, ...
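
The idea of schema enforcement on write can be illustrated with a small, library-free Python sketch. This is not Delta Lake's API or implementation; `TABLE_SCHEMA`, `SchemaMismatchError`, and `enforce_schema` are hypothetical names chosen for illustration, the schema is reduced to column names and Python types, and filling absent columns with `None` is an assumption of this sketch.

```python
from typing import Any


class SchemaMismatchError(ValueError):
    """Raised when a row does not conform to the table schema (hypothetical name)."""


# Hypothetical table schema: column name -> expected Python type.
TABLE_SCHEMA: dict[str, type] = {"ride_id": int, "vendor": str, "fare": float}


def enforce_schema(
    schema: dict[str, type], rows: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Validate every row before anything is written.

    Rows with unknown columns or mismatched types are rejected.
    Columns absent from a row are filled with None (an assumption of this sketch).
    """
    validated = []
    for i, row in enumerate(rows):
        # Reject columns that the table schema does not define.
        unknown = set(row) - set(schema)
        if unknown:
            raise SchemaMismatchError(f"row {i}: unknown columns {sorted(unknown)}")
        # Reject values whose type differs from the declared column type.
        for column, expected in schema.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected):
                raise SchemaMismatchError(
                    f"row {i}: column '{column}' expects {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        validated.append({column: row.get(column) for column in schema})
    return validated


# Usage: the conforming batch is appended; the nonconforming batch is rejected whole.
table: list[dict[str, Any]] = []
table.extend(enforce_schema(TABLE_SCHEMA, [{"ride_id": 1, "vendor": "A", "fare": 12.5}]))
try:
    table.extend(enforce_schema(TABLE_SCHEMA, [{"ride_id": 2, "vendor": "B", "fare": "12.50"}]))
except SchemaMismatchError as err:
    print(f"write rejected: {err}")
```

Validating the whole batch before appending any row keeps the table unchanged when a single row fails, which mirrors the all-or-nothing intent of enforcing the schema at write time.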

Publisher Resources

ISBN: 9781098139711