Chapter 4. Curating the Data Lake
To leverage a data lake as a core data platform, and not just an adjunct staging area for the EDW, enterprises need to impose proper governance. Organizations with many potential use cases require the mature controls and context found in traditional EDWs before they will trust their business-critical applications to a data lake. A cost-effective scale-out platform is exciting, but without controls in place no one will trust it: the data lake still needs the management and governance layer that organizations are accustomed to having in traditional EDW environments.
For example, consider a bank aggregating risk data across different lines of business into a common risk-reporting platform to comply with the Basel Committee on Banking Supervision's BCBS 239 regulation. The data has to be of very high quality and have well-documented lineage to ensure the reports are correct, because the bank depends on those reports to decide how much capital to carry. Without this lineage, there is no guarantee that the data is accurate.
Hadoop makes perfect sense for this kind of workload: it can scale out as you bring together large volumes of risk data from different lines of business. From that perspective, Hadoop works well. But what Hadoop lacks is the metadata layer, along with quality and governance controls. To succeed at applying data lakes to these kinds of business use cases, you need rigorous controls in place.
To strike a balance between the rigid, inflexible structure of a traditional EDW and the performance and low cost of the so-called “data swamp,” organizations can deploy integrated management and governance platforms that allow them to manage, automate, and execute operational tasks in the data lake. This saves them both development time and money.
Data Governance
It’s important to note that in addition to the tools required to maintain governance, a process, frequently a manual one, is also required. That process can be as simple as assigning stewards to new data sets, or as involved as forming an enterprise data council for the data lake to establish data definitions and standards.
Questions to ask when considering goals for data governance:
- Quality and consistency: Is the data of sufficient quality and consistency to be useful to business users and data scientists in making important discoveries and decisions?
- Policies and standards: What are the policies and standards for ingesting, transforming, and using data, and are they observed uniformly throughout the organization?
- Security, privacy, and compliance: Is access to sensitive data limited to those with the proper authorization?
- Data lifecycle management: How will we manage the lifecycle of the data? At what point will we move it from expensive, Tier-1 storage to less expensive storage mechanisms?
Integrating a Data Lake Management Solution
A data lake management solution needs to be integrated because the alternative is to perform the best-practice functions listed above in silos, wasting a great deal of time stitching together different point products. You would end up spending significant resources on the plumbing layer of the data lake, the platform itself, rather than on something of real value to the business, such as the analyses and insights your business users gain from the data.
Having an integrated platform improves your time-to-market for insights and analytics tremendously, because all of these aspects fit together. As you ingest data, the metadata is captured. As you transform the data into a refined form, lineage is automatically captured. And as the data comes in, you have rules that inspect the data for quality—so whatever data you make available for consumption goes through these data quality checks.
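To make the idea concrete, here is a minimal sketch, in plain Python, of how ingestion, transformation, and quality checks might share one context that captures metadata and lineage as a side effect of each step. The IntegratedPipeline class and its methods are illustrative assumptions, not the API of any particular product.

```python
# A minimal sketch (not any vendor's API): one pipeline object captures
# metadata at ingest, lineage at each transform, and applies quality rules
# before anything is published for consumption.
import datetime


class IntegratedPipeline:
    def __init__(self):
        self.metadata = {}   # technical/operational attributes per source
        self.lineage = []    # ordered record of what happened, and when

    def _log(self, step, detail):
        self.lineage.append((datetime.datetime.utcnow().isoformat(), step, detail))

    def ingest(self, source, records):
        self.metadata[source] = {"record_count": len(records)}
        self._log("ingest", source)
        return records

    def transform(self, records, fn, name):
        self._log("transform", name)
        return [fn(r) for r in records]

    def publish(self, records, rule):
        good = [r for r in records if rule(r)]
        self._log("publish", f"{len(good)} of {len(records)} records passed quality checks")
        return good


pipeline = IntegratedPipeline()
raw = pipeline.ingest("pos_sales", [{"store_id": "S001", "amount": 19.99}])
refined = pipeline.transform(raw, lambda r: {**r, "amount_cents": round(r["amount"] * 100)}, "to_cents")
published = pipeline.publish(refined, rule=lambda r: r["amount"] >= 0)
print(pipeline.lineage)
```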
Data Acquisition
Although you have many options when it comes to getting data into Hadoop, doing so in a managed way means that you have control over what data is ingested, where it comes from, when it arrives, and where in Hadoop it is stored. A well-managed data ingestion process simplifies the onboarding of new data sets and therefore the development of new use cases and applications.
As we discussed in Chapter 3, the first challenge is ingesting the data—getting the data into the data lake. An integrated data lake management platform performs managed ingestion: it moves data from source systems into the data lake through a repeatable process, and if anything fails in the daily ingest cycle, operational functions take care of it.
For example, a platform implementing managed ingestion can raise notifications and capture logs, so that you can debug why an ingestion failed, fix the problem, and restart the process. This is all tied to the post-processing that happens once the data is stored in the data lake.
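As an illustration of what “managed” means in practice, the following sketch wraps a daily batch ingest with logging, retries, and a notification on repeated failure. The ingest_source and notify functions are placeholders for whatever your platform actually calls; nothing here is a specific product's interface.

```python
# A minimal sketch of a managed daily ingest: repeatable, logged, retried, and
# noisy on failure. ingest_source() and notify() are placeholders, not a
# specific platform's API.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("managed_ingest")


def notify(message):
    # Placeholder: in practice this might page an operator or open a ticket.
    log.warning("NOTIFY: %s", message)


def ingest_source(source):
    # Placeholder for the actual copy from the source system into the lake.
    log.info("ingesting %s", source)


def run_daily_ingest(sources, max_retries=3):
    for source in sources:
        for attempt in range(1, max_retries + 1):
            try:
                ingest_source(source)
                break
            except Exception as exc:
                log.error("attempt %d for %s failed: %s", attempt, source, exc)
                if attempt == max_retries:
                    notify(f"ingestion of {source} failed after {max_retries} attempts")
                else:
                    time.sleep(60 * attempt)  # back off before restarting


if __name__ == "__main__":
    run_daily_ingest(["crm_extract", "pos_transactions"])
```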
Additionally, as more and more workloads move toward streaming, whatever data management functions you applied to batch ingestion (when data arrived periodically) now need to be applied to data that streams in continuously. Integrated data lake management platforms should be able to detect, based on the SLAs you set, when certain streams are not being ingested.
A data lake management platform should ensure that the capabilities available in the batch ingestion layer are also available in the streaming ingestion layer. Metadata still needs to be captured and data quality checks need to be performed for streaming data. And you still need to validate that the record format is correct, and that the record values are correct by doing range checks or reference integrity checks.
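The same checks can be expressed per record for a stream. Below is a minimal sketch of a record validator performing a format check, a range check, and a reference-integrity check, plus a simple SLA test that flags a stream whose latest event is older than an agreed threshold. The field names, reference data, and thresholds are assumptions for illustration.

```python
# A minimal sketch of per-record checks on a streaming ingest: a format check,
# a range check, and a reference-integrity check, plus a simple SLA test.
import datetime

KNOWN_STORE_IDS = {"S001", "S002", "S003"}  # assumed reference data


def validate(record):
    errors = []
    amount = record.get("amount")
    if not isinstance(amount, (int, float)):
        errors.append("amount missing or not numeric")   # format check
    elif not 0 <= amount <= 100_000:
        errors.append("amount out of range")              # range check
    if record.get("store_id") not in KNOWN_STORE_IDS:
        errors.append("unknown store_id")                 # reference-integrity check
    return errors


def within_sla(last_event_time, max_gap_minutes=15):
    # Flag a stream whose most recent event is older than the agreed SLA window.
    return datetime.datetime.utcnow() - last_event_time <= datetime.timedelta(minutes=max_gap_minutes)


print(validate({"store_id": "S004", "amount": -5}))
```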
By using a data management solution purpose-built to provide these capabilities, you build the foundation for a well-defined data pipeline. Of course, you need the right processes, too, such as assigning stewards for new data sets that get ingested.
Data Organization
When you store data, depending on the use case, you may have security and encryption requirements to consider. Data may need to be masked or tokenized, and protected with proper access controls.
A core attribute of the data lake architecture is that multiple groups share access to centrally stored data. While this is very efficient, you have to make sure that all users have appropriate permission levels to view this information. For example, in a healthcare organization, certain information is deemed private by law, such as PHI (Protected Health Information), and violators—organizations that don’t protect this PHI—are severely penalized.
The data preparation stage is often where sensitive data, such as financial and health information, is protected. An integrated management platform can perform masking (where data from a field is completely removed) and tokenization (changing parts of the data to something innocuous). This type of platform ensures you have a policy-based mechanism, like access control lists, that you can enforce to make sure the data is protected appropriately.
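As a rough illustration of the difference between the two techniques, the sketch below masks a field by removing its value and tokenizes another by replacing it with a surrogate that only a protected lookup can reverse. The hash-based token and in-memory vault are simplifications for illustration, not a recommendation for production key management.

```python
# A minimal sketch contrasting masking (value removed outright) with
# tokenization (value replaced by an innocuous surrogate).
import hashlib

_token_vault = {}  # in practice this mapping lives in a secured store


def mask(record, field):
    protected = dict(record)
    protected[field] = None  # value removed entirely
    return protected


def tokenize(record, field, secret="change-me"):
    protected = dict(record)
    value = str(protected[field])
    token = hashlib.sha256((secret + value).encode()).hexdigest()[:16]
    _token_vault[token] = value
    protected[field] = token
    return protected


patient = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "..."}
print(mask(patient, "diagnosis"))
print(tokenize(patient, "ssn"))
```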
It’s also important to consider the best format for storing the data. You may need to keep it in the raw format in which it arrived, but you may also want to store it in a format that is more consumable for business users, so that queries run faster. For example, queries against columnar data sets return results much faster than queries against typical row-oriented data sets. You may also want to compress the data, since it may arrive in large volumes, to save on storage.
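For example, assuming pandas and pyarrow are available, converting a raw newline-delimited JSON extract into compressed Parquet might look like the following sketch (paths are illustrative):

```python
# A minimal sketch of converting a raw row-oriented extract into compressed,
# columnar Parquet. Assumes pandas and pyarrow are installed.
import pandas as pd

# Raw data as it arrived, e.g. newline-delimited JSON.
df = pd.read_json("raw/sales_2024-01-01.json", lines=True)

# Columnar, compressed copy for the consumable zone.
df.to_parquet("refined/sales_2024-01-01.parquet", compression="snappy")
```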
When storing data, the platform should also ideally enable you to automate data lifecycle management functions. For example, you may store the data in different zones of the data lake, depending on different SLAs: as raw data comes in, you might keep it for 30 days in a “hot zone” of frequently used data, then move it to a warm zone for 90 days, and after that to a cold zone for seven years, where queries are much less frequent.
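A simple policy-driven mover along these lines might look like the sketch below, which promotes files out of the hot zone once they pass the 30- and 90-day thresholds. The paths and thresholds are assumptions; a real platform would drive them from policy metadata rather than hard-coded values.

```python
# A minimal sketch of tiered lifecycle management: files are promoted out of
# the hot zone once they age past the configured thresholds.
import os
import shutil
import time

# Evaluated in order: oldest threshold first, so a 100-day-old file goes to cold.
POLICIES = [
    (90, "/data/lake/cold"),   # older than 90 days -> cold zone
    (30, "/data/lake/warm"),   # older than 30 days -> warm zone
]


def apply_lifecycle(hot_zone="/data/lake/hot"):
    now = time.time()
    for name in os.listdir(hot_zone):
        path = os.path.join(hot_zone, name)
        age_days = (now - os.path.getmtime(path)) / 86400
        for threshold, destination in POLICIES:
            if age_days >= threshold:
                shutil.move(path, os.path.join(destination, name))
                break


if __name__ == "__main__":
    apply_lifecycle()
```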
Data Catalog
With HDFS, the data you ingest is first broken up into blocks and then written in a distributed manner across the cluster. However, you often need to see what data sets exist in the data lake, the properties of those data sets, their ingestion history, their data quality, and the key performance indicators (KPIs) of the data as it was ingested. You should also be able to see the data profile and all the metadata attributes, whether business, technical, or operational. All of this needs to be abstracted to a level where users can understand it and use the data effectively; this is where the data lake catalog comes in.
Your management platform should make it easy to create a data catalog, and to provide that catalog to business users, so they can easily search it—whether searching for source system, schema attributes, subject area, or time range. This is essential if your business users are to get the most out of the data lake, and use it in a swift and agile way.
With a data catalog, users can find data sets that are curated, so that they don’t spend time cleaning up and preparing the data. This has already been done for them, particularly in cases of data that has made it to the trusted area. Users are thus able to select the data sets they want for model building without involving IT, which shortens the analytics timeline.
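To show what such a catalog entry might carry, here is a minimal sketch of a searchable catalog record with the attributes mentioned above (source system, subject area, schema fields, time of first ingestion) plus a curated flag. The structure and sample entries are purely illustrative.

```python
# A minimal sketch of a catalog entry and a search helper over it.
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    name: str
    source_system: str
    subject_area: str
    schema_fields: list
    first_ingested: date
    curated: bool = False          # has this data set been cleaned and standardized?
    tags: list = field(default_factory=list)


def search(catalog, subject_area=None, field_name=None, curated_only=False):
    results = catalog
    if subject_area:
        results = [e for e in results if e.subject_area == subject_area]
    if field_name:
        results = [e for e in results if field_name in e.schema_fields]
    if curated_only:
        results = [e for e in results if e.curated]
    return results


catalog = [
    CatalogEntry("pos_sales", "POS", "sales", ["store_id", "amount", "ts"],
                 date(2024, 1, 1), curated=True),
    CatalogEntry("weather_raw", "public weather feed", "external", ["station", "temp"],
                 date(2024, 2, 1)),
]
print(search(catalog, subject_area="sales", curated_only=True))
```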
Capturing Metadata
Metadata is extraordinarily important to managing your data lake. An integrated data lake management platform makes metadata creation and maintenance an integral part of the data lake processes. This is essential, as without effective metadata, data dumped into a data lake may never be seen again.
Many requirements may be defined by your organization's central data authority, chief data officer, or data stewards in the lines of business, who may want to specify the various attributes and entities of the data they bring into the data lake.
Metadata is critical for making sure data is leveraged to its fullest. Whether manually collected or automatically created during data ingestion, metadata allows your users to locate the data they want to analyze. It also provides clues for future users to understand the contents of a data set and how it could be reused.
As data lakes grow deeper and more important to the conduct of daily business, metadata is a vital tool in ensuring that the data we pour into these lakes can be found and harnessed for years to come. There are three distinct but equally important types of metadata to collect: technical, operational, and business metadata, as shown in Table 4-1.
Table 4-1. The three types of metadata

| Type of metadata | Description | Example |
|---|---|---|
| Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (the fields and their types) |
| Operational | Captures lineage, quality, profile, and provenance of the data | Source and target locations of the data, size, number of records, lineage |
| Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules |
Technical metadata captures the form and structure of each data set: for example, the type of data file (text, JSON, Avro) and the structure of the data (the fields and their types), along with other technical attributes. This is either automatically associated with a file upon ingestion or discovered manually after ingestion. Operational metadata captures the lineage, quality, profile, and provenance of the data at both the file and the record level, including items such as the source and target locations, size, and number of records; an integrated data management platform can capture this automatically upon ingestion. Business metadata captures what the user needs to know about the data, such as the business names, the descriptions of the data, the tags, the quality, and the masking rules for privacy; this information typically must be entered and tagged manually.
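Pulling the three types together, a single data set's metadata record might look something like the sketch below. The field names and values are illustrative, not a prescribed schema.

```python
# A minimal sketch of the three metadata types from Table 4-1 attached to a
# single data set. Field names and values are illustrative only.
dataset_metadata = {
    "technical": {
        "format": "Avro",
        "fields": {"store_id": "string", "amount": "double", "ts": "timestamp"},
    },
    "operational": {  # typically captured automatically at ingest
        "source": "sftp://pos-gateway/daily/",
        "target": "/data/lake/raw/pos_sales/2024-01-01/",
        "record_count": 1_284_023,
        "lineage": ["raw/pos_sales", "refined/pos_sales"],
    },
    "business": {  # typically supplied by data stewards
        "name": "Point-of-sale transactions",
        "description": "Daily transaction-level sales from all stores",
        "tags": ["sales"],
        "masking_rules": [],
    },
}
```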
All of these types of metadata should be created and actively curated—otherwise, the data lake is simply a wasted opportunity. Additionally, leading integrated data management solutions will possess file and record level watermarking features that enable you to see the data lineage, where data moves, and how it is used. These features safeguard data and reduce risk, as the data manager will always know where data has come from, where it is, and how it is being used.
Data Preparation
Making it easy for business users to access and use the data that resides in the Hadoop data lake, without depending on IT assistance, is critical to meeting the business goals the data lake was created to serve in the first place.
However, just adding raw data to the data lake does not make that data ready for use by data and analytics applications: data preparation is required. Inevitably, data will come into the data lake with a certain amount of errors, corrupted formats, or duplicates. A data management platform makes it easier to adequately prepare and clean the data using built-in functionality that delivers data security, quality, and visibility. Through workflow orchestration, rules are automatically applied to new data as it flows into the lake.
For example, Bedrock allows you to automatically orchestrate and manage the data preparation process from simple to complex, so that when your users are ready to analyze the data, the data is available.
Data preparation capabilities of an integrated data lake management platform should include:
- Data tagging, so that searching and sorting become easier
- Converting data formats to make executing queries against the data faster
- Executing complex workflows to integrate updated or changed data
Whenever you do any of these data preparation functions, you need metadata that shows the lineage from a transformation perspective: what queries were run? When did they run? What files were generated? You need to create a lineage graph of all the transformations that happen to the data as it flows through the pipeline.
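A bare-bones way to capture that transformation lineage is to log, for every run, the query text, its inputs, its outputs, and a timestamp, as in the following sketch (the query and paths are invented for illustration):

```python
# A minimal sketch of a transformation lineage log: which query ran, when it
# ran, what it read, and what it produced.
import datetime

lineage_log = []


def record_transformation(query, inputs, outputs):
    lineage_log.append({
        "query": query,
        "inputs": inputs,
        "outputs": outputs,
        "ran_at": datetime.datetime.utcnow().isoformat(),
    })


record_transformation(
    query="SELECT store_id, SUM(amount) FROM raw_sales GROUP BY store_id",
    inputs=["/data/lake/raw/pos_sales/2024-01-01/"],
    outputs=["/data/lake/refined/sales_by_store/2024-01-01.parquet"],
)
print(lineage_log)
```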
Additionally, when going from raw to refined, you might want to watermark the data by assigning a unique ID for each record of the data, so you can trace a record back to its original file. You can watermark at either the record or file level. Similarly, you may need to do format conversions as part of your data preparation, for example, if you prefer to store the data in a columnar format.
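Record-level watermarking can be as simple as stamping each record with a unique ID plus a pointer back to its source file and offset, as in this sketch (field names are illustrative):

```python
# A minimal sketch of record-level watermarking: each record receives a unique
# ID and a pointer back to the file and offset it came from.
import uuid


def watermark(records, source_file):
    for offset, record in enumerate(records):
        record["_record_id"] = str(uuid.uuid4())
        record["_source_file"] = source_file
        record["_source_offset"] = offset
    return records


print(watermark([{"amount": 19.99}], "raw/pos_sales/2024-01-01.json"))
```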
Other issues can arise. You may have changes in data coming from source systems: how do you reconcile that changed data with the original data sets you brought in? You should be able to maintain a time series of these changes as they occur.
A data management platform can do all of this, and ensure that all necessary data preparation is completed before the data is published into the data lake for consumption.
Data Provisioning
Self-service consumption is essential for a successful data lake. Different types of users consume the data, and they are looking for different things—but each wants to access the data in a self-service manner, without the help of IT:
The Executive
An executive is usually a person in senior management looking for high-level analyses that can help her make important business decisions. For example, an executive could be looking for predictive analytics of product sales based on history and on analytical models built by data scientists. In an integrated data lake management platform, data would be ingested from various sources (some streaming, some batch) and then processed in batches to produce insights, with the final results visualized using Tableau or Excel. Another common example is an executive who needs a 360-degree view of a customer, including metrics from every level of the organization (pre-sales, sales, and customer support) in a single report.
The Data Scientist
Data scientists typically look at the data sets and try to build models on top of them, performing exploratory ad hoc analyses to prove or develop a thesis about what they see. Data scientists who want to build and test their models will find the data lake useful because they have access to all of the data, not just a sample. Additionally, they can build scripts in Python and run them on a cluster to get a response in hours rather than days.
The Business Analyst
Business analysts usually try to correlate some of the data sets, and create an aggregated view to slice and dice using a business intelligence or visualization tool.
With a traditional EDW, business analysts had to come up with reporting requirements and wait for IT to build a report or export the data on their behalf. Now, business analysts can ask “what-if” questions of the data lake on their own. For example, a business analyst might ask how much sales were affected by weather patterns, based on historical data and information from public data sets combined with in-house data sets in the data lake. Without involving IT, he could consult the catalog to see which data sets have been cleaned and standardized and run queries against that data.
A Downstream System
A fourth type of consumer is a downstream system, such as an application or a platform, that receives the raw or refined data. Leading companies are building new applications and products on top of their data lakes, so those applications are also consumers of the data. They may consume the data through RESTful APIs or other API mechanisms on an ongoing basis. For example, if the downstream application is a database, the data lake can ingest and transform the data and send the final aggregated data to the downstream system for storage.
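For instance, pushing a refined aggregate to a downstream system over REST might look like the sketch below, assuming the requests library and a placeholder endpoint URL:

```python
# A minimal sketch of delivering an aggregate to a downstream system over REST.
# Assumes the requests library; the endpoint URL is a placeholder.
import requests

aggregated = [
    {"store_id": "S001", "total_sales": 52340.75},
    {"store_id": "S002", "total_sales": 48711.10},
]

response = requests.post(
    "https://downstream.example.com/api/v1/sales-aggregates",  # placeholder
    json=aggregated,
    timeout=30,
)
response.raise_for_status()
```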
Benefits of an Automated Approach
Taking an integrated data management approach to a data lake ensures that each business unit does not build a separate cluster—a common practice with EDWs. Instead, you build a data lake with a shared enterprise cluster. An integrated management platform provides the governance and the multi-tenant capabilities to do this, and to implement best practices for governance without impacting the speed or agility of the data lake. This type of platform enables you to:
- Understand provenance
- Track the source and lineage of any data loaded into the data lake. This gives you traceability of the data, tells you where it came from, when it came in, how many records it has, and if the data set was created from other data sets. These details allow you to establish accountability, and you can use this information to do impact analysis on the data.
- Understand context
- Record data attributes such as the purpose for which the data was collected, the sampling strategies employed in its collection, and any data dictionaries or field names associated with it. These pieces of information make your organization much more productive as you progress along the analytics pipeline to derive insights from the data.
- Track updates
- Log each time new data is loaded from the same source and record any changes to the original data introduced during an update. You need to do this in cases where data formats keep changing. For example, say you are working with a retail chain, with thousands of point-of-sale (POS) terminals sending data from 8,000-plus stores in the United States. These POS terminals are gradually upgraded to newer versions, but not everything can be upgraded on a given day—and now you have multiple formats of data coming in. How do you keep track as the data comes in? How do you know what version it maps to? How do you associate it with the right metadata and right structures so that it can be efficiently used for building the analytics? All of these questions can be answered with a robust integrated data lake management platform.
- Track modifications
- Record when data is actively changed, and know by whom and how it was done. If there are format changes, you can track them as you go from version 1 to version 2, so you know which version of the data you are processing and the structure or schema associated with that version.
- Perform transformations
- Convert data from one format to another to de-duplicate, correct spellings, expand abbreviations, or add labels. Driven by metadata, these transformations are greatly streamlined, and because they are based on metadata, they can accommodate changes in a much more dynamic manner. For example, say you have a record format with 10 fields and perform a transformation based on metadata describing those 10 fields. If you decide to add an additional field, you can adjust that transformation without having to rebuild it from scratch; the transformation is driven by and integrated with the metadata (see the sketch after this list).
- Track transformations
- Performing transformations is a valuable ability, but an additional, essential requirement involves keeping track of the transformations you have accomplished. With a leading integrated data management platform, you can record the ways in which data sets are transformed. Say you perform a transformation from a source to a target format: you can track the lineage so that you know, for example, that this file and these records were transformed to a new file in this location and in this format, which now has this many records.
- Manage metadata
- Manage all of the metadata associated with all of the above, making it easy to track, search, view, and act upon all of your data. Because you are using an integrated approach, much of the technical metadata can be discovered from the incoming data, and the operational metadata can be captured automatically without any manual steps. This gives you a much more streamlined approach to collecting metadata.
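Referenced from the “Perform transformations” item above, here is a minimal sketch of a metadata-driven transformation: the field list and casts live in a metadata dictionary, so adding an eleventh field means adding one metadata entry rather than rewriting the transformation. The field names and casts are assumptions for illustration.

```python
# A minimal sketch of a metadata-driven transformation: the fields and their
# casts live in metadata, so adding a field means editing metadata, not code.
FIELD_METADATA = {
    "store_id": str,
    "amount": float,
    "ts": str,
    # add a new field here and the transform below picks it up automatically
}


def transform(raw_record):
    # Select and cast fields based purely on the metadata definition.
    return {
        name: cast(raw_record[name]) if name in raw_record else None
        for name, cast in FIELD_METADATA.items()
    }


print(transform({"store_id": "S001", "amount": "19.99", "ts": "2024-01-01T10:00:00Z"}))
```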