The Enterprise Big Data Lake

by Alex Gorelik
March 2019
Beginner to intermediate
221 pages
6h 35m
English
O'Reilly Media, Inc.

Chapter 4. Starting a Data Lake

As discussed in the previous chapter, the promise of the data lake is to store the enterprise’s data in a way that maximizes its availability and accessibility for analytics and data science. But what’s the best way to get started? This chapter discusses various paths enterprises take to build a data lake.

Apache Hadoop is an open source project that's frequently used for this purpose. While there are many alternatives, especially in the cloud, Hadoop-based data lakes illustrate the advantages of a data lake well, so we are going to use Hadoop as an example. We'll begin by reviewing what Hadoop is and some of its key advantages for supporting a data lake.

The What and Why of Hadoop

Hadoop is a massively parallel storage and execution platform that automates many of the difficult aspects of building a highly scalable and available cluster. It has its own distributed filesystem, HDFS (although some Hadoop distributions, such as those from MapR and IBM, provide their own filesystems to replace HDFS). HDFS automatically replicates data on the cluster to achieve high parallelism and availability. For example, if Hadoop uses the default replication factor of three, it stores each block on three different nodes. This way, when a job needs a block of data, the scheduler can choose among three different nodes and pick the best one based on what other jobs are running on each node, what other data is located there, and so forth. Furthermore, ...
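
The replica choice described above can be illustrated with a toy model. The Python sketch below is not HDFS's actual placement or scheduling logic. The round-robin placement, the function names place_blocks and pick_node, the node names, and the "fewest running jobs" rule are all assumptions made for the example. It only shows why holding three copies of a block gives the scheduler three candidate nodes to choose from.

```python
REPLICATION_FACTOR = 3  # the default replication factor mentioned in the text


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin placement, used here only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive positions in the node list are always distinct nodes.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def pick_node(block, placement, running_jobs):
    """Pick the replica holder with the fewest running jobs (assumed rule)."""
    return min(placement[block], key=lambda node: running_jobs.get(node, 0))


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    placement = place_blocks(["b0", "b1", "b2", "b3"], nodes)

    # Hypothetical load: number of jobs currently running on each node.
    running_jobs = {"node1": 4, "node2": 1, "node3": 2}

    for block, holders in placement.items():
        chosen = pick_node(block, placement, running_jobs)
        print(f"{block}: replicas on {holders}, scheduled on {chosen}")
```

With a single copy of each block, the job would have to run wherever that copy lives, however busy the node is. With three copies, the scheduler can avoid a heavily loaded node, and losing one node still leaves two copies available.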



ISBN: 9781491931547