
The early appeal of Hadoop, the unconstrained flexibility and freedom to store any type and volume of raw data, is actually what stalls many Hadoop projects as they move to production use cases. That’s not to say storing raw data has no value. Hadoop’s “big bucket of data” approach to storing information is still compelling and can add significant value to businesses that embrace it. This relatively cheap and rapid approach makes sense in certain situations: when the data lacks variety, when errors are acceptable (in a sandbox, for example), or when the accuracy and comprehensiveness of the data aren’t critical.

However, when data is complex, in flux, derived from multiple sources, or subject to regulatory scrutiny, a managed approach to data curation in the Hadoop data lake becomes essential. Without data management, data lakes suffer from a lack of visibility, transparency, and quality control, and run into operational inefficiencies and issues with data governance, security, and compliance.

Nothing in Hadoop’s architecture makes data curation difficult, but nothing requires it, either. It’s all too easy to skip data management entirely, overlooking an essential step in deriving value from your Hadoop data lake.

You can have the best of both worlds

Now that Hadoop is becoming a central component of next-generation enterprise data architectures, enterprises in industries that require controls, permissions, safeguards, and data audits must consider data governance practices for the data lake. Data governance is a set of policies and procedures that manage the use, access, availability, quality, and security of data across an enterprise. Data governance policies make data available to more business users by putting enterprise-wide permissions and controls in place. They also protect businesses and their customers from risks such as fraud, and enable companies to comply with regulations in sensitive industries like health care, financial services, and national security.

The challenge for IT is to deliver the necessary controls without turning Hadoop into something as slow, inflexible, and expensive as the enterprise data warehouse (EDW) of the past. By using a data management platform built for Hadoop, it’s now possible to find a middle ground: a data lake implementation that retains Hadoop’s flexibility, scalability, and cost-effectiveness, while introducing some of the controls and rigor of a traditional EDW deployment.

Advantages of a managed data lake

When companies have many different sources of data, and maybe even multiple instances of Hadoop and other analytics solutions, it’s easy to lose track of basic information, i.e., metadata. Metadata, or “data about data,” describes data from a technical, operational, or business standpoint. Technical metadata defines the structure and form of the data. Operational metadata tracks where the data came from, who loaded it, and how it has moved from raw data to transformed data sets. Business metadata is the information users need to find data for analysis. Without metadata, companies don’t know what data they have, and they can’t trust the data’s quality—making it impossible to implement data governance, and difficult to derive business value from the data.
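As a rough illustration (the class and field names here are hypothetical, not taken from any particular platform), the three categories of metadata described above might be modeled like this:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three metadata categories: technical,
# operational, and business. Real platforms use richer schemas.

@dataclass
class TechnicalMetadata:
    # Defines the structure and form of the data
    file_format: str          # e.g., "parquet"
    schema: dict              # field name -> type

@dataclass
class OperationalMetadata:
    # Tracks where the data came from and who loaded it
    source: str               # originating system
    loaded_by: str            # user or service that loaded it
    lineage: list = field(default_factory=list)  # raw -> transformed history

@dataclass
class BusinessMetadata:
    # Information users need to find data for analysis
    description: str
    owner: str
    tags: list = field(default_factory=list)

@dataclass
class DatasetMetadata:
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata

# Example record for a customer data set landed from a CRM system
example = DatasetMetadata(
    technical=TechnicalMetadata("parquet", {"id": "int", "name": "string"}),
    operational=OperationalMetadata("crm", "etl-service"),
    business=BusinessMetadata("Customer accounts", "sales", ["pii"]),
)
```

Keeping all three views attached to each data set is what lets users both find data and trust it.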

Just as EDW operators learned in the past, there are huge advantages to retaining control over the ways in which data is ingested, stored, managed, modified, and used. Having a well-managed data ingestion process provides operational benefits such as the ability for IT to troubleshoot and diagnose ingestion issues. More importantly, it simplifies the onboarding of new data sets and therefore the development of new use cases and applications.

What is a “well-managed data ingestion process”? It means having control over how data is ingested, where it comes from, when it arrives, and where it lands in the data lake. All steps of the ingestion pipeline should be defined in advance, tracked, and logged, and the process should be repeatable and scalable.
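To make those properties concrete, here is a minimal sketch of a tracked ingestion step. It is an assumption-laden toy: a plain dict stands in for the lake’s landing zone (HDFS in practice), and the audit-record fields are invented for illustration.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

def ingest(source: str, records: list, landing_zone: dict) -> dict:
    """Load records into the landing zone and return an audit record.

    Every load is logged with its source, arrival time, record count,
    and landing path, so each step is tracked and repeatable.
    """
    entry = {
        "source": source,
        "arrived_at": datetime.now(timezone.utc).isoformat(),
        "record_count": len(records),
        "landing_path": f"/raw/{source}",   # predefined landing location
    }
    landing_zone[entry["landing_path"]] = records
    log.info("ingested %s", json.dumps(entry))  # audit trail for IT
    return entry

lake = {}  # stand-in for HDFS
audit = ingest("crm", [{"id": 1}, {"id": 2}], lake)
```

Because every load produces a structured audit record, IT can diagnose ingestion issues after the fact, and onboarding a new source is just another call to the same tracked process.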

Enter the data management platform

To be effective, a data management platform for the Hadoop ecosystem must be able to:

  • Track the source of any data loaded into the data lake.
  • Record attributes, such as the reason the data was collected, the sampling strategies employed in its collection, and any data dictionaries, field names, etc., associated with it.
  • Track updates, such as logging when new data is loaded from the same source, and record any changes to the original data introduced during an update.
  • Record when data is actively changed, by whom, and in what way.
  • Perform transformations, such as converting data from one format to another, deduplicating, correcting spelling, expanding abbreviations, adding labels, etc.
  • Track transformations by recording the ways in which data sets are transformed.
  • Manage metadata—making it easy to track, search, view, and act upon data.
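The “perform transformations” and “track transformations” items above can be combined in one pattern: every transformation appends an entry to the data set’s lineage. The sketch below is a hypothetical illustration of that idea, not the API of any real platform.

```python
from datetime import datetime, timezone

def transform(dataset: dict, fn, description: str, user: str) -> dict:
    """Apply a transformation and record it in the data set's lineage,
    so every derived data set carries its own history: what changed,
    by whom, and when."""
    new_records = [fn(r) for r in dataset["records"]]
    lineage = dataset["lineage"] + [{
        "what": description,
        "who": user,
        "when": datetime.now(timezone.utc).isoformat(),
    }]
    return {"records": new_records, "lineage": lineage}

# Example: a cleanup transformation (trim whitespace, normalize case)
raw = {"records": [{"name": " alice "}, {"name": "bob"}], "lineage": []}
clean = transform(
    raw,
    lambda r: {"name": r["name"].strip().title()},
    "trim whitespace and normalize case",
    "etl-service",
)
```

Because the lineage travels with the data, a user inspecting `clean` can see exactly how it was derived from the raw source, which is the foundation for both trust and governance.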

Competitive enterprises need ways to maximize and accelerate the value that can be derived from data. A unified data management platform removes the obstacles associated with building and operationalizing an enterprise-ready Hadoop data lake. Over time, data lakes will only grow deeper and more important to everyday business operations.

This post is part of a collaboration between O'Reilly Media and Zaloni. See our statement of editorial independence.

Article image: The "Jung Hua" dam and reservoir. (source: By Vegafish on Wikimedia Commons).