Chapter 1. Introduction to Data Lakes

Data-driven decision making is changing how we work and live. From data science, machine learning, and advanced analytics to real-time dashboards, people are demanding data to help make decisions. Companies like Google, Amazon, and Facebook are data-driven juggernauts that are taking over traditional businesses by leveraging data. Financial services organizations and insurance companies have always been data driven, with quants and automated trading leading the way. The Internet of Things (IoT) is changing manufacturing, transportation, agriculture, and healthcare.

From governments and corporations in every vertical, to nonprofits and educational institutions, data is being seen as a game changer. Artificial intelligence (AI) and machine learning (ML) are permeating all aspects of our lives. According to Forbes in 2018, we have generated 90% of the world’s data in the last two years, and according to the World Economic Forum, we expect to generate more than 463 exabytes (that’s 463,000,000,000,000,000,000 bytes) per day by 2025. The world is literally bingeing on data because of the potential it represents.

We even have a term for this binge: big data, defined by Doug Laney of Gartner in terms of the three quantitative Vs (volume, variety, and velocity), to which he later added two qualitative Vs (veracity and value). Volume refers to the increased amount of data, typically measured in petabytes and often generated by IoT devices. Variety refers to the wide range of data formats; popular formats include Parquet, JavaScript Object Notation (JSON), and Avro, but data can arrive in pretty much any format. Velocity refers to the constant stream of data from a variety of sources, frequently IoT devices. While these three Vs describe the shape of big data, veracity and value refer to another aspect: if data is not well understood or trustworthy (that is, if it doesn’t have veracity), extracting value from it will be difficult. This touches on the topic of infonomics, the methodology of assigning value to data.

With so much variety, volume, and velocity, the legacy systems and processes that were used in data warehousing can no longer support the data needs of the enterprise. A revolution is occurring in data management around the way data is stored, processed, managed, and provided to decision makers. Big data technology is enabling scalability and cost efficiency that are orders of magnitude greater than what’s possible with traditional data management infrastructure. Self-service is taking over from the carefully crafted manual and labor-intensive approaches of the past, where armies of IT professionals created well-governed data warehouses and data marts, but took months to make any changes.

The data lake is a popular approach that harnesses the power of big data technology and marries it with the agility of self-service. Most large enterprises today either have deployed or are in the process of deploying data lakes.

The term data lake was invented and first described by James Dixon, CTO of Pentaho, who wrote in his blog: “If you think of a data mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” The critical points here are that a data lake stores raw data, usually in its original schema, and makes it available to many types of users and use cases.

While many data lakes started on premises, usually on top of Apache Hadoop, today most data lakes are built in the cloud. This movement to the cloud has occurred for many reasons, and this report covers the numerous advantages as well as the ways that major cloud providers are handling various aspects of a data lake.

This report is intended for IT executives and practitioners who are considering building a data lake in the cloud, migrating a data lake from on premises to the cloud, or struggling to make an existing cloud data lake productive and widely adopted. This chapter describes phases of data lake maturity, including the most common failure of data lakes—the data swamp—and explains what it takes to have a successful data lake. Let’s start with various big data deployments.

Building a Successful Data Lake

While we might tend to call any place that keeps data in the cloud using big data technology a data lake, not every deployment of data using big data technologies is a data lake. This section describes various deployments and explains that just uploading data to the cloud does not make a data lake. A data lake is a more advanced stage in an organization’s data maturity model that combines storage of data for current and future needs with user self-service for a broad user base. Figure 1-1 illustrates how big data deployments are classified based on the data lake maturity model.

Figure 1-1. The four stages of maturity
Data puddle

A single-purpose or single-project data mart built using big data technology

Data pond

A collection of data puddles or an offload of an existing data warehouse

Data lake

This is different from a data pond in two important ways. First, a data lake supports self-service: business users are able to find and use data sets that they want to use without having to rely on help from the IT department. Second, a data lake contains data that business users might possibly want, even if no project requires it at this time. Usually, a data lake has multiple zones, ranging from a landing zone with raw data to a gold (or production) zone with clean, trustworthy data. We cover these in more detail in Chapter 2.

Data ocean

This expands self-service data and data-driven decision-making to all enterprise data, wherever it may be, regardless of whether it was loaded into the data lake. Many organizations are realizing that copying data from original sources to a common storage platform does not add any value until that data is needed. Data in a data lake is kept in its original form. Instead of copying it over all the time, a data ocean creates a sort of virtual data lake, so the data looks like it is in the data lake, but gets physically provisioned on demand, as described in Chapter 4.
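The on-demand provisioning idea behind a data ocean can be illustrated with a minimal sketch. Everything here is hypothetical: the `VirtualLake` class, its methods, and the two example sources are invented for illustration and do not correspond to any particular product. The point is only that the catalog can list every data set while data is physically copied the first time someone asks for it.

```python
from typing import Callable, Dict, List


class VirtualLake:
    """Hypothetical catalog that lists all data sets but copies them lazily."""

    def __init__(self, sources: Dict[str, Callable[[], List[dict]]]):
        self._sources = sources          # name -> function that fetches from the origin
        self._provisioned: Dict[str, List[dict]] = {}  # data physically copied so far

    def catalog(self) -> List[str]:
        """Every data set looks available, whether or not it has been copied."""
        return sorted(self._sources)

    def provisioned(self) -> List[str]:
        """Data sets that have actually been copied into the lake."""
        return sorted(self._provisioned)

    def get(self, name: str) -> List[dict]:
        """Provision the data set on first access, then serve the stored copy."""
        if name not in self._provisioned:
            self._provisioned[name] = self._sources[name]()
        return self._provisioned[name]


# Made-up sources standing in for operational systems.
fetch_log: List[str] = []


def fetch_crm() -> List[dict]:
    fetch_log.append("crm")
    return [{"customer": "acme"}]


def fetch_erp() -> List[dict]:
    fetch_log.append("erp")
    return [{"invoice": 1}]


lake = VirtualLake({"crm": fetch_crm, "erp": fetch_erp})
first = lake.get("crm")
second = lake.get("crm")   # served from the provisioned copy, no second fetch
```

After these calls, both data sets appear in the catalog, but only `crm` has been copied, and it was fetched from its source once.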

As maturity grows from a puddle to a pond to a lake to an ocean, the amount and variety of data and the number of users grow, sometimes dramatically. The usage pattern moves from one of high-touch IT involvement to self-service, and the data grows beyond what’s needed for immediate projects.

So what does it take to have a successful data lake? As with any project, aligning it with the company’s business strategy and having executive sponsorship and broad buy-in are a must. In addition, based on discussions with dozens of companies deploying data lakes with varying levels of success, three key prerequisites can be identified:

  • The right platform that supports the three quantitative Vs of big data—volume, variety, and velocity—in a cost-effective way that is future-proof, since the lakes are being built for the long term

  • The right data, which includes data in various formats, loaded through different ingestion methods

  • The right interfaces that enable self-service for users through a variety of tools and APIs

The subsequent chapters cover how various cloud technologies and platforms support these requirements.

Advantages of a Cloud Data Lake Platform

Cloud platforms are great for data lakes and have a critical advantage over on-premises ones: they provide elasticity of computing and storage. With elastic storage, as data grows and shrinks, the platform takes care of expanding or shrinking storage automatically. Since most platforms provide multitier storage, data can also be distributed among different tiers to optimize price performance. Similarly, elastic computing allows users to scale the number and capacity of computing resources up and down as needed.

This provides two important advantages: you can quickly execute very large jobs (e.g., data ingestion, analytics, ML, or any other computing-intensive operations) by deploying more nodes, and you pay for only what you consume. For example, if an on-premises data lake typically needs about 10 computing nodes, but periodically, for some analytic jobs, needs 1,000, you may compromise and build a 100-node data lake with the result that most of the time 90 out of 100 nodes are idle, while some very large jobs take 10 times longer because they have to run on 100 nodes instead of 1,000. In the cloud, the same data lake would normally deploy (and pay for) 10 computing nodes, but when a large job is running, would provision 1,000 nodes to get the job completed quickly. With this cloud scale, cloud providers can offer low-cost multitier storage that optimizes cost based on usage.
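The arithmetic in this example can be checked with a short sketch. The function names are made up for illustration, and the slowdown calculation assumes perfectly linear scaling, which real jobs only approximate.

```python
def idle_nodes(provisioned: int, typically_used: int) -> int:
    """Nodes that sit idle when only the typical workload is running."""
    return provisioned - typically_used


def slowdown(nodes_wanted: int, nodes_available: int) -> float:
    """How many times longer a job runs on fewer nodes than it wants.

    Assumes perfectly linear scaling (an idealization).
    """
    return max(1.0, nodes_wanted / nodes_available)


TYPICAL_NODES = 10     # everyday workload
PEAK_NODES = 1_000     # occasional large analytic job
ON_PREM_NODES = 100    # the fixed-size compromise cluster

on_prem_idle = idle_nodes(ON_PREM_NODES, TYPICAL_NODES)
on_prem_slowdown = slowdown(PEAK_NODES, ON_PREM_NODES)
cloud_slowdown = slowdown(PEAK_NODES, PEAK_NODES)

print(f"on premises: {on_prem_idle} of {ON_PREM_NODES} nodes idle, "
      f"large jobs run {on_prem_slowdown:.0f}x longer")
print(f"cloud: pay for {TYPICAL_NODES} nodes normally, "
      f"large jobs run at full speed (factor {cloud_slowdown:.0f})")
```

Under the linear-scaling assumption, a job that would take one hour on 1,000 nodes takes ten hours on the 100-node compromise cluster, while 90 of those 100 nodes sit idle the rest of the time.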

Cloud providers also support the other requirements of a data lake by providing a variety of filesystems and object stores that allow users to store data in different formats, support schema on read (which does not require a predefined schema when writing new data sets), and provide access to this data to any approved project or program.
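Schema on read can be illustrated with a small sketch using only the Python standard library. The records, field names, and the `read_with_schema` helper are invented for illustration; real lakes would typically rely on a query engine rather than hand-written code. The idea shown is that records are written without any enforced schema, and each reader decides which fields it cares about and how to type them.

```python
import io
import json

# Raw records as they might land in the lake. No schema was enforced on write,
# so the fields and value types differ between records (all values are made up).
RAW_EVENTS = io.StringIO(
    '{"device": "pump-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}\n'
    '{"device": "pump-2", "temp_c": 19, "firmware": "1.2"}\n'
)

# The reader, not the writer, chooses the fields and their types.
READ_SCHEMA = {"device": str, "temp_c": float}


def read_with_schema(lines, schema):
    """Yield one dict per JSON line, keeping only schema fields and casting them.

    Fields missing from a record are returned as None.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, READ_SCHEMA))
for row in rows:
    print(row)
```

A different project could read the same raw records with a different schema, for example one that keeps `firmware` and ignores `temp_c`, without rewriting the stored data.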

Because our requirements and the world we live in are in flux, it is critical to ensure that the data we have can be used to help with our future needs. Today, if data is stored in a relational database, it can be accessed only through that database. By contrast, object stores like Amazon Simple Storage Service (S3), the Hadoop Distributed File System (HDFS), and other big data object stores and filesystems are modular. The same file can be used by various processing engines and programs, from Apache Hive queries (Hive provides a SQL interface to files) to Python scripts to Apache Spark and many other types of programs. Structured and unstructured files can all be stored and processed in the same data lake. Because big data technology is evolving rapidly, this modularity gives people confidence that their data lakes are future-proof, so future projects will still be able to access the data in the data lake.
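As a rough illustration of one file serving several consumers, the sketch below writes a small CSV file and then reads it two ways: with a plain Python script and through a SQL interface. The file contents and function names are made up, a local temporary file stands in for an object in a store such as S3, and the standard-library `sqlite3` module stands in for a SQL engine such as Hive (it is not Hive and, unlike Hive, it loads the rows into its own table before querying).

```python
import csv
import os
import sqlite3
import tempfile

# One small file; in a real lake this would be an object in the store.
path = os.path.join(tempfile.mkdtemp(), "orders.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(
        [["region", "amount"], ["east", "10"], ["west", "5"], ["east", "7"]]
    )


def total_amount(csv_path):
    """Consumer 1: a plain Python script that totals the amounts."""
    with open(csv_path, newline="") as f:
        return sum(int(row["amount"]) for row in csv.DictReader(f))


def amount_by_region(csv_path):
    """Consumer 2: a SQL query over the same file (sqlite3 as a stand-in engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    with open(csv_path, newline="") as f:
        rows = [(r["region"], int(r["amount"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    )
    conn.close()
    return result


print(total_amount(path))
print(amount_by_region(path))
```

Neither consumer needs to know about the other, and a third tool could be added later without changing how the file is stored.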

Of course, cloud data lakes are not always the right answer, usually because of regulatory or security concerns. We cover these in more detail when we talk about hybrid architectures in Chapter 4.

The next section covers common mistakes companies make when building a data lake.

The Data Swamp

While data lakes always start out with good intentions, sometimes they take a wrong turn and end up as data swamps. A data swamp is a data pond that has grown to the size of a data lake but failed to attract a wide data analyst community, usually because of a lack of self-service and governance facilities.

At best, the data swamp is used like a data pond, and at worst, it is not used at all. Often, while various teams use small areas of the lake for their projects (the white data pond area in Figure 1-2), the majority of the data is dark, undocumented, and unusable.

Figure 1-2. A data swamp

When data lakes first came onto the scene, a lot of companies rushed out to provision object stores and Hadoop clusters and fill them with raw data, without a clear understanding of how it would be utilized. This led to the creation of massive data swamps with millions of files containing petabytes of data and no way to make sense of that data.

Only the most sophisticated users were able to navigate the swamps, usually by carving out small puddles that they and their teams could use. Furthermore, governance regulations precluded opening up the swamps to a broad audience without protecting sensitive data. Since no one could tell where the sensitive data was, users could not be given access, and the data largely remained unusable and unused. One data scientist shared his experience of how his company built a data lake and encrypted all the data in the lake to protect it. The company then required data scientists to prove that the data they wanted was not sensitive before it would decrypt it and let them use it. This created a catch-22: because everything was encrypted, the data scientist I talked to couldn’t find anything, much less prove that it was not sensitive. As a result, no one was using the data lake (or, as he called it, the swamp). We cover techniques and tools that can be used to avoid building data swamps in the next chapter.


While building a data lake requires many things to come together (executive sponsorship, organizational alignment, budgeting, and many other aspects inherent to any massive and often enterprise-wide project), this report focuses mostly on the technical aspects of building a data lake. We have now covered the phases of data lake maturity, common mistakes that lead to the creation of data swamps, and the requirements for creating a successful data lake, which should put you on the road to success.

The next step is architecting your data lake and selecting the right platform and technologies. Several major cloud platform vendors provide many, often overlapping, technologies. In addition, many other vendors offer unique solutions with specific advantages. The rest of this report is going to help you navigate through this confusing technology landscape. Chapter 2 covers ingestion options: how to populate or hydrate your data lake with data. Chapter 3 presents platform options. Chapter 4 covers advanced architectures for multivendor and hybrid cloud and on-premises data lakes.
