Chapter 1. Overview

Almost every large organization has an enterprise data warehouse (EDW) in which to store important business data. The EDW is designed to capture the essence of the business from other enterprise systems such as customer relationship management (CRM), inventory, and sales transactions systems, and allow analysts and business users to gain insight and make important business decisions from that data.

But new technologies, including streaming and social data from the Web or from connected devices on the Internet of Things (IoT), are driving much greater data volumes, higher expectations from users, and a rapid globalization of economies. Organizations are realizing that traditional EDW technologies can’t meet their new business needs.

As a result, many organizations are turning to Apache Hadoop. Hadoop adoption is growing quickly, with 26% of enterprises surveyed by Gartner in mid-2015 already deploying, piloting, or experimenting with the next-generation data-processing framework. Another 11% plan to deploy within the year, and an additional 7% within 24 months.[1]

Organizations report success with these early endeavors in mainstream Hadoop deployments across retail, healthcare, and financial services use cases. But currently Hadoop is primarily used as a tactical rather than strategic tool, supplementing as opposed to replacing the EDW. That’s because organizations question whether Hadoop can meet their enterprise service-level agreements (SLAs) for availability, scalability, performance, and security.

Until now, few companies have managed to recoup their investments in big data initiatives using Hadoop. Global organizational spending on big data exceeded $31 billion in 2013, and this is predicted to reach $114 billion in 2018.[2] Yet only 13% of these companies have achieved full-scale production for their big data initiatives using Hadoop.

One major challenge with traditional EDWs is their schema-on-write architecture, the foundation for the underlying extract, transform, and load (ETL) process required to get data into the EDW. With schema-on-write, enterprises must design the data model and articulate the analytic frameworks before loading any data. In other words, they need to know ahead of time how they plan to use that data. This is very limiting.
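
To make the constraint concrete, here is a minimal sketch of schema-on-write in Python. It uses the standard-library sqlite3 module as a stand-in for an EDW; the table name, columns, and sample records are hypothetical and chosen only for illustration. Records that do not match the predefined model cannot be loaded as they are.

```python
import sqlite3

# Hypothetical EDW table: the schema is fixed before any data is loaded.
SCHEMA = "CREATE TABLE sales (customer_id INTEGER NOT NULL, amount REAL NOT NULL)"
EXPECTED_FIELDS = {"customer_id", "amount"}


def load_schema_on_write(records):
    """Load records into the predefined table; return (connection, rejected)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rejected = []
    for record in records:
        # Only records whose fields exactly match the predefined columns are accepted.
        if set(record) != EXPECTED_FIELDS:
            rejected.append(record)
            continue
        conn.execute(
            "INSERT INTO sales (customer_id, amount) VALUES (:customer_id, :amount)",
            record,
        )
    conn.commit()
    return conn, rejected


records = [
    {"customer_id": 1, "amount": 19.99},
    # A field the data model never anticipated:
    {"customer_id": 2, "amount": 5.00, "channel": "mobile"},
]
conn, rejected = load_schema_on_write(records)
loaded = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(loaded, len(rejected))
```

In this sketch the second record is set aside because the model has no place for its extra field. Accepting it would mean changing the schema first, which is the kind of upfront design work the text describes.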

In response, organizations are taking a middle ground. They are starting to extract and place data into a Hadoop-based repository without first transforming the data the way they would for a traditional EDW. After all, one of the chief advantages of Hadoop is that organizations can dip into the repository for analysis as needed. Analytic frameworks are created in an ad hoc manner, with little or no prep work required.

Driven by both enormous data volumes and cost (Hadoop can be 10 to 100 times less expensive to deploy than traditional data warehouse technologies), enterprises are starting to defer the labor-intensive processes of cleaning up data and developing schemas until they’ve identified a clear business need.

In short, they are turning to data lakes.

What Is a Data Lake?

A data lake is a central location in which to store all your data, regardless of its source or format. It is typically, although not always, built using Hadoop. The data can be structured or unstructured. You can then use a variety of storage and processing tools—typically tools in the extended Hadoop family—to extract value quickly and inform key organizational decisions.

Because all data is welcome, data lakes are an emerging and powerful approach to the challenges of data integration in a traditional EDW, especially as organizations turn to mobile and cloud-based applications and the IoT.

Some of the benefits of a data lake include:

  • The kinds of data from which you can derive value are unlimited.
    You can store all types of structured and unstructured data in a data lake, from CRM data to social media posts.

  • You don’t have to have all the answers upfront.
    Simply store raw data; you can refine it as your understanding and insight improve.

  • You have no limits on how you can query the data.
    You can use a variety of tools to gain insight into what the data means.

  • You don’t create any more silos.
    You gain democratized access with a single, unified view of data across the organization.

The differences between EDWs and data lakes are significant. An EDW is fed data from a broad variety of enterprise applications. Naturally, each application’s data has its own schema. The data thus needs to be transformed to conform to the EDW’s own predefined schema.

Designed to collect only data that is controlled for quality and conforms to an enterprise data model, the EDW can answer only a limited number of questions. However, it is eminently suitable for enterprise-wide use.

Data lakes, on the other hand, are fed information in its native form. Little or no processing is performed for adapting the structure to an enterprise schema. The structure of the data collected is therefore not known when it is fed into the data lake, but only found through discovery, when read.
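
As a rough illustration of structure being found through discovery, the following Python sketch scans raw records at read time and reports which fields and value types appear. The sample records and the discover_structure helper are hypothetical; a real data lake would read files from HDFS or similar storage rather than an in-memory list.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (one JSON document per line).
RAW_LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "device": "phone"}',
    '{"sensor": "t-101", "temp_c": 21.5}',
]


def discover_structure(lines):
    """Scan raw records at read time and collect the fields and types that appear."""
    fields = {}
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            fields.setdefault(name, set()).add(type(value).__name__)
    return fields


structure = discover_structure(RAW_LINES)
for name in sorted(structure):
    print(name, sorted(structure[name]))
```

Nothing was declared before the records were stored. The field names and types emerge only when the data is read, and records with different shapes sit side by side.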

The biggest advantage of data lakes is flexibility. By allowing the data to remain in its native format, a far greater—and timelier—stream of data is available for analysis.

Table 1-1 shows the major differences between EDWs and data lakes.

Table 1-1. Differences between EDWs and data lakes

Attribute | EDW | Data lake
Schema | Schema-on-write | Schema-on-read
Scale | Scales to large volumes at moderate cost | Scales to huge volumes at low cost
Access methods | Accessed through standardized SQL and BI tools | Accessed through SQL-like systems, programs created by developers, and other methods
Workload | Supports batch processing, as well as thousands of concurrent users performing interactive analytics | Supports batch processing, plus an improved capability over EDWs to support interactive queries from users
Data | Cleansed | Raw
Complexity | Complex integrations | Complex processing
Cost/efficiency | Efficiently uses CPU/IO | Efficiently uses storage and processing capabilities at very low cost
Benefits | Transform once, use many; clean, safe, secure data; provides a single enterprise-wide view of data from multiple sources; easy to consume data; high concurrency; consistent performance; fast response times | Transforms the economics of storing large amounts of data; supports Pig and HiveQL and other high-level programming frameworks; scales to execute on tens of thousands of servers; allows use of any tool; enables analysis to begin as soon as the data arrives; allows usage of structured and unstructured content from a single store; supports agile modeling by allowing users to change models, applications, and queries

Drawbacks of the Traditional EDW

One of the chief drawbacks of the schema-on-write of the traditional EDW is the enormous time and cost of preparing the data. For a major EDW project, extensive data modeling is typically required. Many organizations invest in standardization committees that meet and deliberate over standards, and can take months or even years to complete the task at hand.

These committees must do a lot of upfront definition work. First, they need to delineate the problem(s) they wish to solve. Then they must decide what questions they need to ask of the data to solve those problems. From that, they design a database schema capable of supporting those questions. Because it can be very difficult to bring in new sources of data once the schema has been finalized, the committee often spends a great deal of time deciding what information is to be included, and what should be left out. It is not uncommon for committees to be gridlocked on this particular issue for weeks or months.

With this approach, business analysts and data scientists cannot ask ad hoc questions of the data—they have to form hypotheses ahead of time, and then create the data structures and analytics to test those hypotheses. Unfortunately, the only analytics results are ones that the data has been designed to return. This issue doesn’t matter so much if the original hypotheses are correct—but what if they aren’t? You’ve created a closed-loop system that merely validates your assumptions—not good practice in a business environment that constantly shifts and surprises even the most experienced businesspersons.

The data lake eliminates all of these issues. Both structured and unstructured data can be ingested easily, without any data modeling or standardization. Structured data from conventional databases is placed into the rows of the data lake table in a largely automated process. Analysts choose which tags and tag groups to assign, typically drawn from the original tabular information. The same piece of data can be given multiple tags, and tags can be changed or added at any time. Because the schema for storing does not need to be defined up front, expensive and time-consuming modeling is not needed.
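
The tagging idea can be sketched in a few lines of Python. The TagCatalog class, its methods, and the item identifiers are hypothetical and stand in for whatever catalog a real data lake uses. The sketch only shows that one item can carry several tags and that tags can change without touching the stored data.

```python
from collections import defaultdict


class TagCatalog:
    """Hypothetical in-memory catalog mapping data items to free-form tags."""

    def __init__(self):
        self._tags = defaultdict(set)

    def tag(self, item_id, *tags):
        """Attach one or more tags to an item."""
        self._tags[item_id].update(tags)

    def untag(self, item_id, tag):
        """Remove a tag from an item if it is present."""
        self._tags[item_id].discard(tag)

    def find(self, tag):
        """Return the sorted IDs of all items carrying the given tag."""
        return sorted(item for item, tags in self._tags.items() if tag in tags)


catalog = TagCatalog()
catalog.tag("orders/row-42", "customer", "sales")  # multiple tags on one item
catalog.tag("reviews/post-7", "customer")
catalog.tag("orders/row-42", "audit-2015")         # tags can be added later
catalog.untag("orders/row-42", "sales")            # or changed
print(catalog.find("customer"))
```

Because the tags live beside the data rather than in its storage schema, re-tagging requires no reload or remodeling.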

Key Attributes of a Data Lake

To be classified as a true data lake, a Big Data repository has to exhibit three key characteristics:

  • It is a single shared repository of data, typically stored within the Hadoop Distributed File System (HDFS).
    Hadoop data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal auditing activities. In a traditional EDW, by contrast, once data has undergone transformations, aggregations, and updates, it is challenging to piece the data back together when needed, and organizations struggle to determine its provenance.

  • It includes orchestration and job scheduling capabilities (for example, via YARN).
    Workload execution is a prerequisite for Enterprise Hadoop, and YARN provides resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters, ensuring analytic workflows have access to the data and the computing power they require.

  • It contains a set of applications or workflows to consume, process, or act upon the data.
    Easy user access is one of the hallmarks of a data lake, because organizations preserve the data in its original form. Whether structured, unstructured, or semi-structured, data is loaded and stored as is. Data owners can then easily consolidate customer, supplier, and operations data, eliminating technical, and even political, roadblocks to sharing data.

The Business Case for Data Lakes

EDWs have been many organizations’ primary mechanism for performing complex analytics, reporting, and operations. But they are too rigid to work in the era of Big Data, where large data volumes and broad data variety are the norms. It is challenging to change EDW data models, and field-to-field integration mappings are rigid. EDWs are also expensive.

Perhaps more importantly, most EDWs require that business users rely on IT to do any manipulation or enrichment of data, largely because of the inflexible design, system complexity, and intolerance for human error in EDWs.

Data lakes solve all these challenges, and more. As a result, almost every industry has a potential data lake use case. For example, organizations can use data lakes to get better visibility into data, eliminate data silos, and capture 360-degree views of customers.

With data lakes, organizations can finally unleash Big Data’s potential across industries.

Freedom from the rigidity of a single data model

Because data can be unstructured as well as structured, you can store everything from blog postings to product reviews. And the data doesn’t have to be consistent to be stored in a data lake. For example, you may have the same type of information in very different data formats, depending on who is providing the data. This would be problematic in an EDW; in a data lake, however, you can put all sorts of data into a single repository without worrying about schemas that define the integration points between different data sets.

Ability to handle streaming data

Today’s data world is a streaming world. Streaming has evolved from rare use cases, such as sensor data from the IoT and stock market data, to very common everyday data, such as social media.

Fitting the task to the tool

Data stored in an EDW works well for certain kinds of analytics. But when you are using Spark, MapReduce, or other newer processing models, preparing data for analysis in an EDW can take more time than performing the actual analytics. In a data lake, these tools can process data efficiently without excessive prep work. Integrating data involves fewer steps because data lakes don’t enforce a rigid metadata schema. Instead, schema-on-read lets users define a custom schema within their queries, applied when each query executes.
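
A minimal sketch of schema-on-read, assuming a small raw CSV file and a hypothetical read_with_schema helper. Each query supplies its own schema, a mapping from column name to converter, and the raw text is never rewritten. Tools in the Hadoop family do this at far larger scale, and this is not how any particular engine implements it.

```python
import csv
import io

# Hypothetical raw file kept in the lake as plain text; every value is still a string.
RAW_CSV = "id,amount,country\n1,19.99,US\n2,5.00,DE\n3,12.50,US\n"


def read_with_schema(raw_text, schema):
    """Apply a caller-supplied schema (column -> converter) while reading."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        yield {column: convert(row[column]) for column, convert in schema.items()}


# Two queries, two schemas, one unchanged raw file.
revenue_schema = {"country": str, "amount": float}
id_schema = {"id": int}

revenue_by_country = {}
for row in read_with_schema(RAW_CSV, revenue_schema):
    country = row["country"]
    revenue_by_country[country] = revenue_by_country.get(country, 0.0) + row["amount"]

ids = [row["id"] for row in read_with_schema(RAW_CSV, id_schema)]
print(revenue_by_country, ids)
```

The first query treats amount as a number and ignores id; the second does the opposite. Neither required the data to be remodeled before it was stored.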

Easier accessibility

Data lakes also solve the challenges of data integration and accessibility that plague EDWs. Using Big Data Hadoop infrastructures, you can bring together ever-larger data volumes for analytics, or simply store them for some as-yet-undetermined future use. Instead of imposing a monolithic, enterprise-wide data model, the data lake allows you to put off modeling until you actually use the data, which creates opportunities for better operational insights and data discovery. This advantage only grows as data volumes, variety, and metadata richness increase.

Reduced costs

Because of economies of scale, some Hadoop users claim they pay less than $1,000 per terabyte for a Hadoop cluster. Although numbers can vary, business users understand that because it’s no longer excessively costly for them to store all their data, they can maintain copies of everything by simply dumping it into Hadoop, to be discovered and analyzed later.

Scalability

Big Data is typically defined as the intersection of volume, variety, and velocity. EDWs are notorious for not being able to scale beyond a certain volume because of architectural restrictions. Data processing takes so long that organizations are prevented from exploiting all their data to its fullest extent. With Hadoop, petabyte-scale data lakes are both cost-efficient and relatively simple to build and maintain at whatever scale is desired.

Data Management and Governance in the Data Lake

If you use your data for mission-critical purposes—purposes on which your business depends—you must take data management and governance seriously. Traditionally, organizations have used the EDW because of the formal processes and strict controls required by that approach. But as we’ve already discussed, the growing volume and variety of data are overwhelming the capabilities of the EDW. The other extreme—using Hadoop to simply do a “data dump”—is out of the question because of the importance of the data.

In early use cases for Hadoop, organizations frequently loaded data without attempting to manage it in any way. Although situations still exist in which you might want to take this approach (particularly since it is both fast and cheap), in most cases this type of data dump isn’t optimal. When the data is not standardized, when errors are unacceptable, and when accuracy is a high priority, a data dump will work against your efforts to derive value from the data. This is especially the case as Hadoop transitions from an add-on feature to a truly central aspect of your data architecture.

The data lake offers a middle ground. A Hadoop data lake is flexible, scalable, and cost-effective—but it can also possess the discipline of a traditional EDW. You must simply add data management and governance to the data lake.

Once you decide to take this approach, you have four options for action.

Address the Challenge Later

The first option is the one chosen by many organizations, which simply ignore the issue and load data freely into Hadoop. Later, when they need to discover insights from the data, they attempt to find tools that will clean the relevant data.

If you take this approach, machine-learning techniques can sometimes help discover structures in large volumes of disorganized and uncleansed Hadoop data.

But there are real risks to this approach. To begin with, even the most intelligent inference engine needs to start somewhere in the massive amounts of data that can make up a data lake. This means necessarily ignoring some data. You therefore run the risk that parts of your data lake will become stagnant and isolated, and contain data with so little context or structure that even the smartest automated tools—or human analysts—don’t know where to begin. Data quality deteriorates, and you end up in a situation where you get different answers to the same question of the same Hadoop cluster.

Adapt Existing Legacy Tools

In the second approach, you attempt to leverage the applications and processes that were designed for the EDW. Software tools such as Informatica, IBM InfoSphere DataStage, and Ab Initio perform the same ETL processes you used when importing clean data into your EDW, and all of them require an ETL grid to perform transformations. You can use them when importing data into your data lake.

However, this method tends to be costly, and only addresses a portion of the management and governance functions you need for an enterprise-grade data lake. Another key drawback is that the ETL happens outside the Hadoop cluster, slowing down operations and adding to the cost, as data must be moved outside the cluster for each query.

Write Custom Scripts

With the third option, you build a workflow using custom scripts that connect processes, applications, quality checks, and data transformation to meet your data governance and management needs.
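
A custom-script workflow of this kind might look like the following Python sketch. The check, the transformation, and the run_pipeline function are hypothetical examples, not part of any Hadoop tool. They show only how quality checks and transformations get wired together by hand.

```python
def check_required_fields(record, required=("id", "email")):
    """Quality check: every required field must be present and non-empty."""
    return all(record.get(field) for field in required)


def normalize_email(record):
    """Transformation: trim whitespace and lowercase the email field."""
    return {**record, "email": record["email"].strip().lower()}


def run_pipeline(records, checks, transforms):
    """Chain quality checks and transformations; return (clean, quarantined)."""
    clean, quarantined = [], []
    for record in records:
        # A record that fails any check is set aside rather than transformed.
        if not all(check(record) for check in checks):
            quarantined.append(record)
            continue
        for transform in transforms:
            record = transform(record)
        clean.append(record)
    return clean, quarantined


incoming = [
    {"id": 1, "email": " Ann@Example.com "},
    {"id": 2, "email": ""},  # fails the quality check
]
clean, quarantined = run_pipeline(incoming, [check_required_fields], [normalize_email])
print(clean, quarantined)
```

Every new source or rule means another hand-written function and another change to the wiring, so the maintenance burden grows with the lake.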

This is currently a popular choice for adding governance and management to a data lake. Unfortunately, it is also the least reliable. You need highly skilled analysts steeped in the Hadoop and open source communities to discover and leverage open source tools or functions designed to perform particular management or governance operations or transformations. They then need to write scripts to connect all the pieces together. If you can find the skilled personnel, this is probably the cheapest route to go.

However, this process only gets more time-consuming and costly as you grow dependent on your data lake. After all, custom scripts must be constantly updated and rebuilt. As more data sources are ingested into the data lake and more purposes found for the data, you must revise complicated code and workflows continuously. As your skilled personnel arrive and leave the company, valuable knowledge is lost over time. This option is not viable in the long term.

Deploy a Data Lake Management Platform

The fourth option involves emerging solutions that have been purpose-built to deal with the challenge of ingesting large volumes of diverse data sets into Hadoop. These solutions allow you to catalogue the data and support the ongoing process of ensuring data quality and managing workflows. You put a management and governance framework over the complete data flow, from managed ingestion to extraction. This approach is gaining ground as the optimal solution to this challenge.
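
To give a feel for what cataloguing data at ingestion can mean, here is a small Python sketch. The ingest and record_quality functions, the catalog fields, and the file name are all hypothetical. Commercial platforms expose their own interfaces, and this is not a description of any of them.

```python
import hashlib
from datetime import datetime, timezone


def ingest(catalog, source, name, payload):
    """Register a raw payload (bytes) in the catalog at ingestion time."""
    entry = {
        "name": name,
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # fingerprint of the raw data
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "quality": "unchecked",
    }
    catalog[name] = entry
    return entry


def record_quality(catalog, name, passed):
    """Store the outcome of a later data-quality check in the catalog entry."""
    catalog[name]["quality"] = "passed" if passed else "failed"


lake_catalog = {}
payload = b"id,email\n1,ann@example.com\n"
ingest(lake_catalog, "crm", "customers_2015.csv", payload)
record_quality(lake_catalog, "customers_2015.csv", passed=True)
print(lake_catalog["customers_2015.csv"]["quality"])
```

Capturing source, size, checksum, and quality status as the data arrives keeps a record of where each file came from and whether it has been checked.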

How to Deploy a Data Lake Management Platform

This book focuses on the fourth option, deploying a Data Lake Management Platform. We first define data lakes and how they work. Then we provide a data lake reference architecture designed by Zaloni to represent best practices in building a data lake. We’ll also talk about the challenges that companies face building and managing data lakes.

The most important chapters of the book discuss why an integrated approach to data lake management and governance is essential, and describe the sort of solution needed to effectively manage an enterprise-grade lake. The book also delves into best practices for consuming the data in a data lake. Finally, we take a look at what’s ahead for data lakes.

[1] Gartner. “Gartner Survey Highlights Challenges to Hadoop Adoption.” May 13, 2015.

[2] Capgemini Consulting. “Cracking the Data Conundrum: How Successful Companies Make Big Data Operational.” 2014.
