Chapter 2. Building Successful Data Lakes
Early attempts to build data lakes often missed the mark and were labeled data swamps. The key reason was too much focus on collecting data and admiring new big data technologies, and not enough on connecting the dots. The result was a mishmash of data with no clear definitions or governance. The current approach, as this chapter shows, is much more structured.
This approach focuses more on discovering the source data, tagging it, and creating a semantic layer so that businesses can consume the data quickly. Time to value is of the essence. The data in modern data lakes is also subject to corporate or organizational policies. Finally, as data lakes have matured, automation has made them more reliable, more repeatable, and flexible enough to incorporate new data sources and deliver more business use cases.
A modern data lake consists of the following building blocks:
- Data ingestion and integration
- Persistence
- Governance
- Analytics and business intelligence
- Data science (ML and AI)
We start our journey by looking at data ingestion and integration.
Ingestion and Integration
Building data warehouses requires the well-known extract, transform, load (ETL) or extract, load, transform (ELT) process. In data lakes, the extract and load part of ETL is called data ingestion and is usually the first step in building a cloud data lake. The goal of the data ingestion architecture is to allow new data sources to be quickly and securely ...
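As a rough illustration of the extract-and-load step described above, the following Python sketch lands a source file in a raw zone without transforming it, leaving any transformation for later. The function name `ingest_file`, the `raw/<source>/load_date=<date>` folder layout, and the sample file are assumptions made for this example only; the chapter does not prescribe them, and a real cloud data lake would typically write to object storage rather than a local directory.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def ingest_file(source_file: Path, lake_root: Path, source_name: str, load_date: date) -> Path:
    """Extract and load only: copy the source file unchanged into a raw zone."""
    # Hypothetical layout: <lake_root>/raw/<source_name>/load_date=YYYY-MM-DD/
    target_dir = lake_root / "raw" / source_name / f"load_date={load_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / source_file.name
    # Copy contents and file metadata as-is; no transformation happens here.
    shutil.copy2(source_file, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)

        # Stand-in for a file exported from a source system.
        source = tmp_path / "orders.csv"
        source.write_text("order_id,amount\n1,9.99\n")

        landed = ingest_file(source, tmp_path / "lake", "sales_db", date(2024, 1, 15))
        print(landed.relative_to(tmp_path))
```

Keeping the landed copy identical to the source means a new source can be added by pointing the same function at a different file and source name, while cleansing and modeling can happen downstream.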