What Is a Data Lake?

by Alex Gorelik
November 2020
Beginner to intermediate
68 pages
1h 44m
English
O'Reilly Media, Inc.

Chapter 2. Building Successful Data Lakes

Initial attempts to build data lakes missed the mark and were labeled data swamps. The key reason was too much focus on collecting data and admiring new big data technologies, and not enough on connecting the dots. The outcome was a mishmash of data with no clear definitions or governance. The current approach is much more structured, as this chapter shows.

This approach focuses more on discovering the source data, tagging it, and creating a semantic layer so that businesses can quickly consume the data. Time to value is of the essence. The data in modern data lakes is also subject to corporate or organizational policies. Finally, as data lakes have matured, automation has made them more reliable, more repeatable, and flexible enough to incorporate new data sources and deliver more business use cases.
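To make the discover-and-tag idea concrete, here is a minimal sketch of an in-memory catalog, assuming nothing about any particular catalog product. The names `CatalogEntry`, `Catalog`, `register`, `tag`, and `find`, as well as the dataset and tag names, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one discovered source dataset."""
    name: str
    source: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """Toy in-memory catalog: register datasets, tag them, find them by tag."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, name: str, source: str) -> CatalogEntry:
        # Discovery step: record that a dataset exists and where it came from.
        entry = CatalogEntry(name=name, source=source)
        self._entries[name] = entry
        return entry

    def tag(self, name: str, *tags: str) -> None:
        # Tagging step: attach business terms to a registered dataset.
        self._entries[name].tags.update(tags)

    def find(self, tag: str) -> list[str]:
        # Consumption step: look up datasets by business term, sorted by name.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)


catalog = Catalog()
catalog.register("crm_customers", source="crm_db")
catalog.register("web_clicks", source="clickstream")
catalog.tag("crm_customers", "customer", "pii")
catalog.tag("web_clicks", "customer")

print(catalog.find("customer"))
print(catalog.find("pii"))
```

A real semantic layer would add definitions, ownership, and policy checks on top of tags like these; the sketch only shows why tagging makes data findable by business term rather than by physical location.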

A modern data lake consists of the following building blocks:

  • Data ingestion and integration

  • Persistence

  • Governance

  • Analytics and business intelligence

  • Data science (ML and AI)

We start our journey by looking at data ingestion and integration.

Ingestion and Integration

Building data warehouses requires the well-known extract, transform, load (ETL) or extract, load, transform (ELT) process. In data lakes, the extract and load part of ETL is called data ingestion and is usually the first step in building a cloud data lake. The goal of the data ingestion architecture is to allow new data sources to be quickly and securely ...
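The extract-and-load step described above can be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: a CSV string stands in for a source system, a temporary directory stands in for cloud storage, and the `raw/<source>` folder layout, the file name, and the functions `extract` and `load_raw` are all hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def extract(source_csv: str) -> list:
    """Extract: read rows from a source (here, a CSV string stands in for it)."""
    return list(csv.DictReader(io.StringIO(source_csv)))


def load_raw(rows: list, lake_root: Path, source_name: str) -> Path:
    """Load: write rows unchanged into a raw zone, one folder per source."""
    target_dir = lake_root / "raw" / source_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("w", encoding="utf-8") as f:
        for row in rows:
            # No transformation here: values are stored exactly as extracted.
            f.write(json.dumps(row) + "\n")
    return target


source = "id,amount\n1,10.50\n2,7.25\n"
lake_root = Path(tempfile.mkdtemp())  # stand-in for a cloud storage bucket

rows = extract(source)
raw_file = load_raw(rows, lake_root, "orders")
print(raw_file.relative_to(lake_root))
```

Keeping the load step free of transformations means a new source can be added by writing only an extractor; any transform step then runs later against the stored raw files.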

Publisher Resources

ISBN: 9781492088899