Introduction

Whether your title is data engineer or another data-oriented profession (we see you, analysts and scientists), you’ve likely heard the term ETL. There’s a good chance ETL is a part of your life, even if you don’t know it!

Short for extract, transform, load, ETL describes the foundational workflow most data practitioners are tasked with—taking data from a source system, changing it to suit their needs, and loading it into a target.
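
To make that concrete, here is a minimal sketch of an ETL job in Python using only the standard library. It is purely illustrative; the file, column, and table names (orders.csv, warehouse.db, and so on) are assumptions, not tied to any particular system.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from the source system's CSV export.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: normalize types and drop rows we can't use.
        for row in rows:
            if row.get("order_id"):
                yield (int(row["order_id"]),
                       row["customer"].strip().lower(),
                       float(row["amount"]))

    def load(records, db_path="warehouse.db"):
        # Load: write the cleaned records into the target table.
        with sqlite3.connect(db_path) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS orders "
                         "(order_id INTEGER, customer TEXT, amount REAL)")
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)

    load(transform(extract("orders.csv")))

Real pipelines add scheduling, error handling, and observability around these steps, but the shape (extract, transform, load) stays the same.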

Want to help product leaders make data-driven decisions? ETL builds the critical tables for your reports. Want to train the next iteration of your team’s machine learning model? ETL creates quality datasets. Are you trying to bring more structure and rigor to your company’s storage policies to meet compliance requirements? ETL will bring process, lineage, and observability to your workflows.

If you want to do anything with data, you need a reliable process or pipeline. That holds true for everything from classic business intelligence (BI) workloads to cutting-edge advancements like large language models (LLMs) and AI.

The Brave New World of AI

The data world has seen many trends come and go; some have transformed the space, and some have turned out to be short-lived fads. The most recent is, without a doubt, generative AI.

At every turn, there’s chatter about AI, LLMs, and chatbots. This recent fascination with AI, sparked largely by the release of OpenAI’s ChatGPT, extends well beyond media coverage and research circles: many organizations now see AI as an essential strategic investment…and who wants to be left behind?

The true value in LLMs comes from embeddings or from fine-tuning models on clean, curated datasets. These techniques make it possible to build models with domain-specific knowledge while reducing common errors such as hallucination.

Of course, meaningful embeddings are derived from, you guessed it—clean datasets. In that sense, AI is built on data transformation. Its success depends heavily on the ability to create consistent, high-quality datasets at scale. Data needs to be moved, mutated, and merged in a single location—one might say extracted, transformed, and loaded.
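
As a rough illustration, the "transform" step for an LLM workload might look like the following sketch, which deduplicates and normalizes raw text before it ever reaches an embedding model. The record shape and filtering rules are assumptions for the example, not a prescription.

    def prepare_corpus(raw_records):
        # Clean and deduplicate raw text before embedding or fine-tuning.
        seen, cleaned = set(), []
        for rec in raw_records:
            text = " ".join(rec.get("body", "").split())  # collapse whitespace
            if len(text) < 20:        # drop fragments too short to be useful
                continue
            if text.lower() in seen:  # drop exact duplicates
                continue
            seen.add(text.lower())
            cleaned.append({"id": rec["id"], "text": text})
        return cleaned

    corpus = prepare_corpus([
        {"id": 1, "body": "  The export job  fails every night at 2am.  "},
        {"id": 2, "body": "The export job fails every night at 2am."},
        {"id": 3, "body": "ok"},
    ])
    # corpus now holds one clean record, ready for an embedding model downstream.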

That’s right—even the most cutting-edge tech has roots back to ETL.

A Changing Data Landscape

In addition to the recent surge in generative AI, other trends have reshaped the data landscape over the past decade. One such trend is the increasing prominence of streaming data. Companies are now generating vast quantities of real-time data through sensors, websites, mobile applications, and more. This shift necessitates the real-time ingestion and processing of data for immediate decision making. Data engineers are therefore challenged to extend beyond traditional batch processing to construct and manage continuous pipelines capable of handling large volumes of streaming data.
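
The sketch below is a toy stand-in for that shift: instead of processing one nightly batch, events are aggregated continuously in small windows as they arrive. The event shape, window size, and simulated source are assumptions; a production pipeline would read from a system such as Kafka or Kinesis.

    import itertools
    import random
    import time

    def event_stream():
        # Simulate an unbounded stream of page-view events.
        for _ in itertools.count():
            time.sleep(0.01)
            yield {"user": f"user_{random.randint(1, 5)}", "ts": time.time()}

    def windowed_counts(events, window_seconds=1.0, max_windows=3):
        # Aggregate events in small tumbling windows instead of one big batch.
        window_start, counts = time.time(), {}
        for event in events:
            if event["ts"] - window_start >= window_seconds:
                print(counts)  # emit the closed window downstream
                window_start, counts = event["ts"], {}
                max_windows -= 1
                if max_windows == 0:
                    return
            counts[event["user"]] = counts.get(event["user"], 0) + 1

    windowed_counts(event_stream())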

Another noteworthy development is the emergence of data lakehouse architectures. The data lakehouse represents a novel concept, seeking to merge the capabilities of data warehouses and data lakes. Leveraging new storage technologies like Delta Lake, which enhance the reliability and performance of data lakes, the lakehouse model combines the cost-effective, scalable storage of data lakes with the efficient transaction processing of data warehouses. This amalgamation enables the execution of both AI workloads (typically handled in data lakes) and analytics workloads (usually conducted in data warehouses) within a singular framework. This integration significantly reduces the complexities associated with maintaining parallel architectures, ensuring consistent data governance, and managing data duplication.
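
For a feel of what that looks like in practice, here is a hedged sketch using PySpark with the open source Delta Lake package (delta-spark). The table path, schema, and data are invented for illustration; they assume a local Spark environment with pyspark and delta-spark installed.

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Build a Spark session with the Delta Lake extensions enabled.
    builder = (
        SparkSession.builder.appName("lakehouse-sketch")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write a small table to open file storage with ACID guarantees.
    df = spark.createDataFrame(
        [(1, "sensor_a", 21.5), (2, "sensor_b", 19.8)],
        ["id", "device", "temp_c"],
    )
    df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/readings")

    # The same files can now serve BI queries and ML feature pipelines alike.
    spark.read.format("delta").load("/tmp/lakehouse/readings").show()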

While ETL is a long-standing concept in data management, its relevance remains undiminished in the modern data landscape. A critical consideration now is how ETL processes can adapt to encompass both batch and streaming data, and how they can be effectively integrated within a data lakehouse architecture. This guide aims to illuminate these aspects, helping you understand ETL in light of these evolving trends.

What About ELT (and Other Flavors)?

As you delve into data engineering, you may come across terms like ELT in addition to ETL. You might be thinking, “Wow, these guys should hire a proofreader,” but rest assured, they’re actually different terms.

The key difference lies in the sequence: in ELT, everything is loaded into a staging resource first and transformed downstream. ELT has increasingly become the norm, supplanting ETL in many scenarios; as many say, “storage is cheap.” The term “ETL” has been in use for decades, so it remains the common shorthand even when ELT is more accurate. We are now in an era of “store first, act later,” enabled by the decreasing cost of cloud storage and the ease of data generation.
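
Revisiting the earlier toy data, an ELT version of the same job lands raw rows in a staging table first and does the transformation afterward, inside the target, with SQL. Again, the file and table names are assumptions made for the sketch.

    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")

    # Extract + Load: land the raw rows untouched in a staging table.
    conn.execute("CREATE TABLE IF NOT EXISTS staging_orders "
                 "(order_id TEXT, customer TEXT, amount TEXT)")
    with open("orders.csv", newline="") as f:
        rows = [(r["order_id"], r["customer"], r["amount"])
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)

    # Transform: shape the data downstream, inside the target, when needed.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM staging_orders
        WHERE order_id IS NOT NULL AND order_id != ''
    """)
    conn.commit()
    conn.close()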

For analytics, the prevailing approach is to retain all potentially useful data. Advancements like the medallion architecture and the data lakehouse support this approach with features such as easy schema evolution and time travel. We’ll discuss those and more throughout this guide.

Although we predominantly use the term “ETL,” it’s important to note that the principles and considerations discussed are applicable to both ETL and ELT, as well as other variations like reverse ETL: the practice of syncing cleaned data from the warehouse or lakehouse back into business tools. No, reverse ETL != LTE, and yes, this is confusing, but we digress.

Whether the term “ETL” precisely describes your current process or not, comprehending the fundamentals of data ingestion, transformation, and orchestration remains crucial. This also extends to best practices in areas like observability, troubleshooting, scaling, and optimization. We hope that this guide will be a valuable resource, regardless of the specific data processing methodology you employ.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher.

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/understandingETL.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

Though we all stand on the shoulders of giants, this guide in particular would not have been possible without mentorship, help, and support from some very dedicated and caring individuals.

First, thank you to my partners from O’Reilly and Databricks: Aaron Black, who gave me the opportunity to write; Gary O’Brien, who was a stellar development editor (and confidante!); Ori Zohar, who helped shape the guide as a whole; and both Sumit Makashir and Pier Paolo Ippolito for their excellent and attentive technical reviews.

Thank you to Zander Matheson for your help in understanding streaming and stream processing. Along with developing an amazing tool (Bytewax), Zander has been a great friend and a general data guru.

Thank you to Aleks Tordova and the Coalesce team, who partnered to write my first guide and have provided me with ample opportunities to learn and grow.

Thanks to my family, who provided unconditional support for my journey—in data and life—despite my flaws, idiosyncrasies, and general tomfoolery. Thank you, Jasmine, Violet, and Paul (and pups Enzo and Rocky!).

Next, I am blessed with some amazing friends who’ve supported me as I moved across the country, took a new job, wrote this guide, and continued my path of self-discovery. There were many texts, Slacks, phone calls, and memes that helped me through the tough times. In alphabetical order, thank you, JulieAnn, Kandace, Rob, Srini, and Tyson.

Last, thank you to the data community. To the individuals who contribute to open source and present at conferences, the practitioners who wake up every day looking to improve, the educators and mentors who keep us moving forward as a field, and all of the authors whose texts, ideas, and content have helped us get to where we are today: I can’t wait to see what we accomplish next!
