Chapter 3. Essential Delta Lake Operations

This chapter explores the essential operations of using Delta Lake for your data management needs. Since Delta Lake functions as the storage layer and participates in the interaction layer of data applications, it makes sense to begin with the foundational operations of persistent storage systems. You already know that Delta Lake provides ACID guarantees, but focusing on CRUD operations (see Figure 3-1) will point us more toward the question "How do I use Delta Lake?" This would be a woefully short story (and consequently this would be a short book) if that were all you needed to know, however, so we will also look at several additional capabilities that are vital to working with Delta Lake tables: merge operations, conversion from so-called vanilla Parquet files, and table metadata.
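To make the roadmap concrete, here is a minimal sketch of these operations in Spark SQL syntax. The table name, columns, and paths are illustrative placeholders, not examples from this book:

```sql
-- Create: define a Delta Lake table
CREATE TABLE people (id INT, name STRING) USING DELTA;

-- CRUD: insert, read, update, and delete rows
INSERT INTO people VALUES (1, 'Ada'), (2, 'Grace');
SELECT * FROM people;
UPDATE people SET name = 'Grace Hopper' WHERE id = 2;
DELETE FROM people WHERE id = 1;

-- Merge: upsert changes from a source table into a target table
MERGE INTO people AS t
USING people_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Convert an existing "vanilla" Parquet directory in place
CONVERT TO DELTA parquet.`/data/people_parquet`;
```

Each of these statements is covered in depth later in the chapter.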

Note

Except where specified, SQL will refer to the Spark SQL syntax for simplicity's sake. If you are using Trino or another SQL engine with Delta Lake, you can find additional details either in Chapter 4, which explores more of the Delta Lake ecosystem, or in the relevant documentation. For the same reason, the Python examples will all use the Spark-based Delta Lake API, and equivalent SQL and Python examples are presented throughout. It is also possible to perform the same operations using the PySpark DataFrame API, and examples of that are shown where it makes sense to do so.

Figure 3-1. Create, read, update, and delete (CRUD) operations are among the most fundamental ...
