Chapter 2. Building Successful Data Lakes

Initial attempts to build data lakes often missed the mark and ended up labeled as data swamps. The key reason was too much focus on collecting data and admiring new big data technologies, and not enough on connecting the dots. The outcome was a mishmash of data with no clear definitions or governance. The current approach is much more structured, as this chapter shows.

The current approach focuses more on discovering the source data, tagging it, and creating a semantic layer so that business users can consume the data quickly. Time to value is of the essence. In addition, the data in modern data lakes is subject to corporate or organizational policies. Finally, as data lakes have matured, automation has made them more reliable and repeatable, and flexible enough to incorporate new data sources and deliver more business use cases.

A modern data lake consists of the following building blocks:

  • Data ingestion and integration

  • Persistence

  • Governance

  • Analytics and business intelligence

  • Data science (ML and AI)

We start our journey by looking at data ingestion and integration.

Ingestion and Integration

Building data warehouses requires the well-known extract, transform, load (ETL) or extract, load, transform (ELT) process. In data lakes, the extract and load part of ETL is called data ingestion and is usually the first step in building a cloud data lake. The goal of the data ingestion architecture is to allow new data sources to be quickly and securely ...
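The split between ingestion (extract and load) and a later transform step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference architecture: the `raw/<source>/load_date=<date>` folder layout, the file names, the `ingest` and `transform` helpers, and the `order_id`/`amount` columns are all hypothetical, and a local directory stands in for cloud object storage.

```python
import csv
import json
import tempfile
from datetime import date
from pathlib import Path


def ingest(source_name, raw_bytes, lake_root, load_date):
    """Extract-and-load only: land the source bytes unchanged in a raw zone.

    The directory layout below is an assumed convention, not a standard.
    """
    target_dir = lake_root / "raw" / source_name / f"load_date={load_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Write the payload exactly as received; no parsing or cleaning here.
    target = target_dir / "part-0000.csv"
    target.write_bytes(raw_bytes)

    # Record minimal metadata next to the data so it can be discovered later.
    metadata = {
        "source": source_name,
        "load_date": load_date.isoformat(),
        "bytes": len(raw_bytes),
    }
    (target_dir / "_metadata.json").write_text(json.dumps(metadata))
    return target


def transform(raw_file):
    """The deferred 'T': parse and type the raw rows when a consumer needs them."""
    with raw_file.open(newline="") as handle:
        return [
            {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            for row in csv.DictReader(handle)
        ]


if __name__ == "__main__":
    # Hypothetical source extract; a temporary directory plays the role of the lake.
    payload = b"order_id,amount\n1,9.50\n2,20\n"
    with tempfile.TemporaryDirectory() as tmp:
        landed = ingest("orders", payload, Path(tmp), date(2024, 1, 15))
        print(landed.relative_to(tmp))
        print(transform(landed))
```

Keeping `ingest` free of parsing logic means a new source can be landed quickly, while typing and cleaning happen in a separate step that can change without re-extracting the data.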
