Deploying a hybrid Hadoop architecture

Metadata, governance, and other considerations for building ground-to-cloud.

By Ben Sharma
February 18, 2016

As clouds—both private and public—mature from security and multi-tenancy perspectives, the trend toward hybrid Hadoop deployment will continue to accelerate. According to a September 2015 Gartner report, nearly 30% of enterprises plan continued or new use of a hybrid Hadoop deployment, and more than a quarter plan to deploy Hadoop in the cloud. In fact, Gartner predicts that the hybrid cloud model is about two to five years away from becoming mainstream.

Why hybrid plus Hadoop?

Today, most Hadoop data lakes are on-premise. At the same time, a growing number of enterprises are building cloud-based data lakes to complement, and even replace, on-premise Hadoop deployments. One driver is the rapid growth of data generated by sensors and devices, which becomes most useful when mashed up with other data, such as customer data.


A hybrid approach allows companies to keep customer data on-premise, while leveraging the cloud to store and process data that is less sensitive, such as that coming from sensors and other devices. Increased demand for faster time-to-insight to support business decisions is also fueling the shift toward a hybrid approach. Building a data lake in the cloud allows companies to create a federated data lake management platform that spans on-premise and cloud-based computing. This type of platform gives end users a holistic view of, and access to, data across the enterprise. In addition, cost is a big factor. Cloud services like Amazon Web Services (AWS) or Microsoft Azure provide cheaper, on-demand storage and enable elastic compute, where companies can spin Hadoop clusters up and down as business demands.
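To make the elasticity point concrete, here is a minimal sketch using boto3 against Amazon EMR; the cluster name, instance types, and IAM roles are illustrative assumptions rather than a recommended configuration.

```python
# Minimal sketch: elastic Hadoop compute on AWS EMR via boto3.
# The cluster name, instance types, and IAM roles are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def spin_up_cluster():
    """Launch a small Hadoop cluster on demand."""
    response = emr.run_job_flow(
        Name="elastic-hadoop-demo",          # hypothetical cluster name
        ReleaseLabel="emr-5.0.0",
        Applications=[{"Name": "Hadoop"}],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",   # EMR's default instance role
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]

def spin_down_cluster(cluster_id):
    """Terminate the cluster when the work is done, to stop paying for it."""
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```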

Key challenges: Visibility and data governance

At Zaloni, we are seeing this trend play out in our customer base, and it is bringing unique data management challenges to the forefront. In a hybrid environment, keeping track of data can be extremely challenging. That’s why it’s critical to think about how data—of all types and formats—will be ingested. Key considerations include what metadata needs to be captured in order to maintain a catalog, and how to create automated, repeatable ingestion pipelines that give end users the information they need about the data, no matter where it resides. Metadata is essential for managing, migrating, accessing, and deploying a big data solution; without it, enterprises have limited visibility into the data itself and cannot trust its quality—which negates the value of having the data in the first place.
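As a simplified illustration, the sketch below shows what capturing metadata inside a repeatable ingestion step might look like, using only the Python standard library; the in-memory dict stands in for a real metadata store, and the field names are assumptions.

```python
# A minimal sketch of metadata capture inside a repeatable ingestion step.
# The in-memory dict stands in for a real metadata store; field names
# are illustrative.
import hashlib
import os
from datetime import datetime, timezone

catalog = {}  # file path -> metadata record

def ingest(path, source):
    """Register a file in the catalog as part of ingesting it."""
    with open(path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()
    catalog[path] = {
        "size_bytes": os.path.getsize(path),  # technical: what the data is
        "checksum": checksum,                 # basis for later quality checks
        "source": source,                     # operational: where it came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return catalog[path]
```

Because every ingest runs through the same function, each new data source lands in the catalog with the same minimum set of metadata, wherever the underlying storage lives.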

Another challenge is data governance. While 82% of companies know they face external regulatory requirements, 44% say they don’t have a defined data governance policy, according to a Rand survey. Data governance comprises the policies and procedures a company uses to manage the use, access, availability, quality, and security of its data. Bringing consistency to data practices across departments and data sources is a complex undertaking. Yet data governance is becoming more urgent as more hybrid environments come online, the volume of big data continues to grow, and Hadoop becomes attractive to industries in which data controls and audits are required.

“Ground-to-cloud” data lake management

Data lake management is the key to maximizing and accelerating the value that can be derived from big data. Whether a Hadoop data lake is deployed on-premise or in the cloud, enterprises need to be proactive about capturing metadata. A practical way to do this, and to reduce some of the complexity, is with a vendor-neutral data lake management platform. Working seamlessly across cloud and physical environments, such a platform helps enterprises ensure data quality, improve data visibility, and protect sensitive data.

Ensure data quality

Metadata is what allows users to find the right data, understand it, trust its quality, and trust the validity of their analyses. A data lake management platform captures more than technical metadata; operational and business metadata are also important, so that business end users can discover the data they need. As enterprises move the data lake to the cloud to deploy Hadoop on a broader scale, metadata lets business users confidently access various types of data through more intuitive analytics tools.
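As a hypothetical illustration of those layers, a catalog record might combine technical, operational, and business metadata as sketched below, with a small search function driving discovery; the record structure and field names are assumptions, not any particular platform's schema.

```python
# Minimal sketch: layering business metadata over technical and operational
# metadata so end users can discover data by business terms, not file paths.
# Record structure and the discover() helper are illustrative assumptions.
records = [
    {
        "path": "/lake/raw/turbine_sensors/2016-02-01.avro",
        "technical": {"format": "avro", "size_bytes": 734003200},
        "operational": {"ingested_at": "2016-02-01T04:10:00Z",
                        "source": "plant-7"},
        "business": {"domain": "manufacturing",
                     "description": "turbine vibration readings"},
    },
]

def discover(term):
    """Find datasets whose business metadata mentions the search term."""
    term = term.lower()
    return [
        r["path"] for r in records
        if term in " ".join(r["business"].values()).lower()
    ]

print(discover("vibration"))
# -> ['/lake/raw/turbine_sensors/2016-02-01.avro']
```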

Create end-to-end visibility

A hybrid environment requires an enterprise-wide data catalog to keep track of data and enable search and query across big data systems, including Hadoop clusters in the cloud. With a data lake management platform, file- and record-level watermarking lets users trace data lineage: where data came from, where it moves, and how it is used. This safeguards data and reduces risk, because the data manager always knows where data originated, where it is now, and how it is being used.
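One way to picture this is a content-hash watermark paired with an append-only lineage log, as in the standard-library sketch below; the event fields and in-memory log are illustrative assumptions, not a specific product's mechanism.

```python
# Minimal sketch: a content-hash "watermark" plus an append-only lineage
# log, so every copy or transformation of a file can be traced to its
# origin. The event fields and in-memory log are illustrative.
import hashlib
from datetime import datetime, timezone

lineage_log = []  # append-only list of movement events

def watermark(path):
    """Fingerprint file contents so the same data is recognizable anywhere."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def record_movement(src, dst, action):
    """Log where data came from, where it went, and why it moved."""
    lineage_log.append({
        "watermark": watermark(src),
        "from": src,
        "to": dst,
        "action": action,  # e.g. "copy-to-cloud", "anonymize"
        "at": datetime.now(timezone.utc).isoformat(),
    })

def trace(mark):
    """Reconstruct the full path a piece of data has taken."""
    return [event for event in lineage_log if event["watermark"] == mark]
```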

Protect sensitive information

In certain industries, such as health care, keeping data like patient medical information private is a huge concern. Using a data lake management platform, enterprises can implement a multi-tenancy structure, specify which information end users are permitted to view, and enforce data privacy with protections like masking (obscuring the sensitive characters in a field) and tokenization (replacing a sensitive value with an innocuous surrogate).
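A simplified sketch of those two protections might look like the following, where a keyed hash stands in for a real token vault; the field names, secret handling, and token format are illustrative assumptions.

```python
# Minimal sketch of masking and tokenization. A keyed hash stands in for
# a real token vault; field names and secret handling are illustrative.
import hashlib
import hmac

SECRET = b"rotate-me"  # in practice, manage this key with a KMS

def mask(value, visible=4):
    """Obscure all but the last few characters of a sensitive field."""
    return "*" * (len(value) - visible) + value[-visible:]

def tokenize(value):
    """Replace a sensitive value with a stable, innocuous surrogate.
    The same input always yields the same token, so joins still work."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "4812-7753-0921", "ssn": "078-05-1120"}
safe = {
    "patient_id": mask(record["patient_id"]),  # '**********0921'
    "ssn": tokenize(record["ssn"]),            # opaque without the key
}
print(safe)
```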

Conclusion

A hybrid environment makes a lot of sense for many enterprises, and will be a natural progression as more companies look to leverage the cloud for their data lakes. However, to be successful, companies must pause to review and address their data lake management foundation—and not underestimate how important metadata is to deriving value from big data, no matter where it resides.

This post is part of a collaboration between O’Reilly and Zaloni. See our statement of editorial independence.
