What Is a Data Lake?

Name: What Is a Data Lake?
Author: Alex Gorelik
ISBN: 9781492088882

November 2020

Beginner to intermediate

68 pages

1h 44m

English

O'Reilly Media, Inc.

1. Introduction to Data Lakes
    Building a Successful Data Lake
    Advantages of a Cloud Data Lake Platform
    The Data Swamp
    Conclusion
2. Building Successful Data Lakes
    Ingestion and Integration
        ETL/ELT, MapReduce
        Self-Service Data Preparation
        Integration Platform as a Service
        Data Virtualization
    Persistence
        Why Use Zones?
        Storage Technologies
    Governance
        Regulatory Compliance
        Access Control
        Data Quality
    BI and Self-Service Analytics
    Advanced Analytics: Data Science, AI/ML
    Conclusion
3. AWS, Azure, and GCP Architecture
    Amazon Web Services
    Microsoft Azure
    Google Cloud Platform
    Which Service Should You Use?
    Conclusion
4. Architecting Multiple Data Lakes
    To Merge or Not to Merge?
        Reasons for Keeping Data Lakes Separate
        Advantages of Merging Data Lakes
    Building Multiple Data Lakes on the Same Cloud Platform
    Virtual Data Lakes
        Data Federation
        Data Fabric
    Catalogs and Data Oceans
    Conclusion

Content preview from What Is a Data Lake?

Chapter 2. Building Successful Data Lakes

Early attempts to build data lakes often missed the mark and were labeled data swamps. The key reason was too much focus on collecting data and admiring new big data technologies, and not enough on connecting the dots. The outcome was a mishmash of data with no clear definitions or governance. The current approach is much more structured, as this chapter shows.

This approach focuses more on discovering the source data, tagging it, and creating a semantic layer so that businesses can quickly consume the data. Time to value is of the essence. Also, the data in modern data lakes is subject to corporate or organizational policies. Finally, as data lakes have matured, automation has helped make them more reliable, more repeatable, and flexible enough to incorporate new data sources and deliver more business use cases.
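To make the idea of discovering and tagging source data concrete, here is a minimal Python sketch. It is only an illustration: the `CatalogEntry` class, the `TAG_RULES` mapping, and the dataset names are hypothetical and do not come from the book or from any particular catalog product. It shows how tags attached at discovery time could let a consumer find data by business meaning rather than by physical location.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """One discovered source dataset plus the business tags attached to it."""
    source: str
    columns: List[str]
    tags: Set[str] = field(default_factory=set)


# Hypothetical rules: a column-name fragment maps to a business tag.
TAG_RULES: Dict[str, str] = {"email": "pii", "ssn": "pii", "amount": "finance"}


def tag_entry(entry: CatalogEntry) -> CatalogEntry:
    """Attach every tag whose fragment appears in one of the column names."""
    for column in entry.columns:
        for fragment, tag in TAG_RULES.items():
            if fragment in column.lower():
                entry.tags.add(tag)
    return entry


def find_by_tag(catalog: List[CatalogEntry], tag: str) -> List[str]:
    """Return the sources carrying a given tag, in catalog order."""
    return [entry.source for entry in catalog if tag in entry.tags]


# Two made-up sources, tagged as they are "discovered".
catalog = [
    tag_entry(CatalogEntry("crm.customers", ["id", "Email", "name"])),
    tag_entry(CatalogEntry("erp.invoices", ["invoice_id", "amount", "customer_email"])),
]

print(find_by_tag(catalog, "pii"))
print(find_by_tag(catalog, "finance"))
```

Real catalogs typically infer tags from data profiling as well as column names; matching on name fragments is used here only to keep the sketch short.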

A modern data lake consists of the following building blocks:

Data ingestion and integration
Persistence
Governance
Analytics and business intelligence
Data science (ML and AI)

We start our journey by looking at data ingestion and integration.

Ingestion and Integration

Building data warehouses requires the well-known extract, transform, load (ETL) or extract, load, transform (ELT) process. In data lakes, the extract and load parts of ETL are together called data ingestion, which is usually the first step in building a cloud data lake. The goal of the data ingestion architecture is to allow new data sources to be quickly and securely ...
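The difference between ETL, ELT, and ingestion comes down to the order of the same three steps. The following Python sketch illustrates that ordering under simplifying assumptions: sources and targets are plain in-memory lists, and the function names (`run_etl`, `ingest`, `run_elt`) and the "raw zone" variable are hypothetical names chosen for this example, not an API from the book or any tool.

```python
def extract(source_rows):
    """Read raw records from a source (here, an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Normalize records: tidy the name and convert the amount to a float."""
    return [
        {"name": row["name"].strip().title(), "amount": float(row["amount"])}
        for row in rows
    ]


def load(rows, target):
    """Append records to a target store (here, a list) and return it."""
    target.extend(rows)
    return target


def run_etl(source_rows, warehouse):
    """ETL: transform before loading, so only cleaned data reaches the target."""
    return load(transform(extract(source_rows)), warehouse)


def ingest(source_rows, raw_zone):
    """Ingestion: extract and load only, so data lands unchanged."""
    return load(extract(source_rows), raw_zone)


def run_elt(source_rows, raw_zone):
    """ELT: ingest first, then transform the data already in the lake."""
    ingest(source_rows, raw_zone)
    return transform(raw_zone)


source = [{"name": "  ada lovelace ", "amount": "10.5"}]

warehouse = run_etl(source, [])
raw_zone = []
curated = run_elt(source, raw_zone)

print(warehouse)  # cleaned records only
print(raw_zone)   # original records, untouched
print(curated)    # cleaned records derived from the raw copy
```

In this sketch the ELT path keeps the untouched source records alongside the cleaned ones, which is one reason ingestion can be treated as a separate first step: the transformation can be changed and rerun later without going back to the source.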


Publisher Resources

ISBN: 9781492088899
