Tips for managing metadata in a data lake
Metadata is central to a modern data architecture.
Modern data architectures promise broader access to more and different types of data in order to enable an increasing number of data consumers to employ data for business-critical use cases. Examples of such use cases include product development, personalized customer experience, fraud detection, regulatory compliance, and data monetization.
Data-focused enterprises must explore several key questions: What, exactly, is a “modern data architecture”? How can we ensure that what we build supports our business strategy? And how do we make our system agile enough to scale and accommodate new types of data in the future? The answers to all of these questions have to do with metadata.
Three categories of metadata
Metadata, or information about data, gives you the ability to understand lineage, quality, and lifecycle, and provides crucial visibility into today’s data-rich environments. Metadata also enables data governance, which consists of policies and standards for the management, quality, and use of data, all critical for managing data and data access at the enterprise level. Without proper governance, many “modern” data architectures built to democratize data access initially show promise, but fail to deliver.
Metadata falls into three categories: technical, operational, and business. Technical metadata captures the form and structure of each data set, such as its size, its schema, and the types of data it contains. Operational metadata captures the lineage, quality, profile, and provenance of data. It may also record the number of rejected records and the success or failure of a job. Business metadata captures what the data means to the end user, making data fields easier to find and understand. It includes business names, descriptions, tags, quality, and masking rules.
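To make the three categories concrete, here is a minimal sketch of how they might be modeled as records attached to a data set. The class names, field names, and sample values are all hypothetical, chosen only to mirror the categories described above. They do not reflect the schema of any particular data lake platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TechnicalMetadata:
    """Form and structure of a data set."""
    file_format: str            # e.g. "json" or "avro"
    schema: dict[str, str]      # field name -> data type
    size_bytes: int


@dataclass
class OperationalMetadata:
    """Provenance and job outcome for one ingestion."""
    source_system: str
    ingested_at: datetime
    job_succeeded: bool
    rejected_records: int = 0


@dataclass
class BusinessMetadata:
    """What the data means to the end user."""
    business_name: str
    description: str
    tags: list[str] = field(default_factory=list)
    masked_fields: list[str] = field(default_factory=list)  # masking rules


@dataclass
class DatasetMetadata:
    """All three metadata categories for a single data set."""
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


# Illustrative values only.
orders = DatasetMetadata(
    name="orders",
    technical=TechnicalMetadata(
        file_format="json",
        schema={"order_id": "string", "customer_email": "string", "total": "double"},
        size_bytes=2048,
    ),
    operational=OperationalMetadata(
        source_system="web_store",
        ingested_at=datetime.now(timezone.utc),
        job_succeeded=True,
    ),
    business=BusinessMetadata(
        business_name="Customer Orders",
        description="One row per order placed online.",
        tags=["sales"],
        masked_fields=["customer_email"],
    ),
)
```

Keeping the categories as separate records is one possible design choice. It lets different processes own different parts: an ingestion job can fill in the technical and operational records, while business users maintain the business record.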
Managing with a data lake
Today’s forward-looking organizations increasingly rely on a data lake to create a 360-degree view of their data and to gain more flexibility for data analysis and discovery in support of evolving business strategies.
To successfully manage data in a data lake, you need a framework for capturing technical, operational, and business metadata so you can discover and leverage your data for various use cases. A data lake management platform is one way to automate the management of your metadata. For example, a platform can capture metadata automatically when data arrives and as you transform it, and tie that metadata to specific definitions, for instance in an enterprise business glossary. An enterprise-wide business glossary, with definitions agreed upon by business users, ensures that all users interpret the same data consistently according to a shared set of rules and concepts, and it can be updated automatically as your metadata changes.
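As a rough illustration of tying captured metadata to a business glossary, the sketch below maps physical field names to agreed business terms and flags any field that has no term yet. The glossary entries, the field-to-term mapping, and the `link_to_glossary` function are hypothetical. A real platform would manage these through its own catalog.

```python
# Hypothetical glossary: agreed business term -> definition.
GLOSSARY = {
    "Customer ID": "Unique identifier assigned to a customer at sign-up.",
    "Order Total": "Sum of line-item prices for an order, including tax.",
}

# Hypothetical mapping from physical field names to glossary terms.
FIELD_TO_TERM = {
    "cust_id": "Customer ID",
    "customer_id": "Customer ID",
    "order_total": "Order Total",
}


def link_to_glossary(field_names):
    """Split fields into those linked to a glossary term and those left unlinked."""
    linked, unlinked = {}, []
    for name in field_names:
        # Field names are matched case-insensitively.
        term = FIELD_TO_TERM.get(name.lower())
        if term in GLOSSARY:
            linked[name] = {"term": term, "definition": GLOSSARY[term]}
        else:
            unlinked.append(name)
    return linked, unlinked


linked, unlinked = link_to_glossary(["cust_id", "ORDER_TOTAL", "promo_code"])
```

Here `cust_id` and `ORDER_TOTAL` resolve to shared definitions, while `promo_code` ends up in `unlinked`, signaling that business users still need to agree on a term for it.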
A data lake relies on effective metadata management capabilities to simplify and automate common data management tasks. An incorrect metadata architecture can prevent data lakes from making the transition from an analytical sandbox or proof of concept (POC) using limited data sets and one use case, to a production-ready, enterprise-wide data platform supporting many users and multiple use cases—in other words, a modern data architecture.
Data lake architectures look very different from traditional data architectures. One central difference is that data lakes should be organized into zones that serve specific functions. This is important to create a transparent, logical system that will support ingestion and management of different types of data now and in the future. Metadata is critical here, as data is organized into zones based on the metadata applied to it:
- Start with a staging area: This is where data first comes into the data lake. Particularly for businesses in highly regulated industries, the staging area or transient zone is where data can be tokenized or masked to protect personally identifiable information (PII) or other sensitive data. From the staging area, it is common to create new and different transformed data sets that either feed net-new applications running directly on the data lake or, if desired, feed existing enterprise data warehouse (EDW) platforms.
- Create additional zones: It’s important to define a raw zone, refined zone, trusted zone, and sandbox area. The raw zone contains data in its original form. The refined zone is where you can create refined data sets from raw data, define new structures for common data models, and do some data cleansing and quality checks. The trusted zone is an area for master data sets, such as product codes, that can be combined with refined data to create data sets for end-user consumption. And finally, the sandbox is an area for data scientists or business analysts to play with data and to build more efficient analytical models on top of the data lake.
- Automate metadata capture: Ideally, you want to automate the capture of metadata upon data ingestion, and create repeatable and reliable ingestion processes. A data lake management platform can automatically generate metadata at ingestion time, whether you import Avro, JSON, or XML files or ingest data from relational databases. Automation is essential for building a scalable architecture, one that will grow with your business over time.
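The zone and automation ideas above can be sketched in a few lines. This is a simplified illustration, not how any specific data lake management platform works. It infers a schema from newline-delimited JSON at ingestion time, counts rejected records, and records the zone as metadata. The function names `capture_metadata` and `promote`, the zone labels, and the sample data are assumptions made for the example.

```python
import json
from datetime import datetime, timezone

# Zone labels taken from the list above; the exact names are an assumption.
ZONES = ("transient", "raw", "refined", "trusted", "sandbox")


def capture_metadata(name, json_lines, zone="raw"):
    """Infer a schema and job statistics from newline-delimited JSON text."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    schema, accepted, rejected = {}, 0, 0
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected += 1  # unparseable line counts as a rejected record
            continue
        if not isinstance(record, dict):
            rejected += 1  # only JSON objects are treated as records
            continue
        accepted += 1
        for key, value in record.items():
            # Collect every Python type name observed for each field.
            schema.setdefault(key, set()).add(type(value).__name__)
    return {
        "dataset": name,
        "zone": zone,
        "technical": {
            "format": "json",
            "schema": {k: sorted(v) for k, v in schema.items()},
        },
        "operational": {
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "accepted_records": accepted,
            "rejected_records": rejected,
            "job_succeeded": accepted > 0,
        },
    }


def promote(entry, target_zone):
    """Return a copy of a catalog entry placed in another zone, with simple lineage."""
    if target_zone not in ZONES:
        raise ValueError(f"unknown zone: {target_zone}")
    promoted = dict(entry)
    promoted["zone"] = target_zone
    promoted["derived_from"] = {"dataset": entry["dataset"], "zone": entry["zone"]}
    return promoted


sample = '{"order_id": "A1", "total": 19.5}\n{"order_id": "A2", "total": null}\nnot json\n'
entry = capture_metadata("orders", sample)
refined = promote(entry, "refined")
```

With the sample input, two records are accepted and the malformed third line is counted as rejected. Because the zone is just another piece of metadata, moving a data set from the raw zone to the refined zone becomes a metadata update that also records where the data came from.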
A deeper dive into metadata
To realize maximum value from a data lake, you must be able to ensure data quality and reliability, and democratize access to data. Democratizing access means giving access to more users across the organization and making it faster for users to identify the data they want to use. All of this critical functionality is dependent on putting in place a robust, scalable framework that captures and manages metadata. Metadata is truly the key to a successful next-generation data architecture.
This post is a collaboration between O’Reilly and Zaloni. See our statement of editorial independence.