Chapter 1. Data Catalogs

A data catalog is a collection of metadata describing data assets and their usage. Modern data catalogs provide relevant functionality to support metadata management, enrichment, and search. They not only help users find relevant data but guide them on proper use of that data. Data catalogs help answer the questions:

  • How can I find relevant data?

  • Once I find data, can I use it?

  • Should I use it?

  • How should I use it?

Cataloging and managing metadata in enterprises is not a new practice. Metadata repositories have existed since the 1970s and relational databases have had metadata catalogs since their early days. However, in the years since, the technology surrounding data and the role of data in the enterprise have both changed substantially.

Enterprise data landscapes have grown more sophisticated: the “3 Vs” of big data (volume, velocity, and variety) are widely known. The legislative environment mandating compliant data usage also continues to grow in complexity as more people (and AI-powered programs) access and use data in new ways.1 Moreover, the growing adoption of cloud computing and SaaS results in more data residing outside the enterprise's infrastructure and control. As a result, collecting, managing, and using comprehensive and accurate metadata has become paramount, and modern data catalogs are the tools that enable these practices.

Modern data catalogs have grown in maturity and sophistication to address new and increasingly complex challenges. They now provide a comprehensive set of functionalities to integrate with other enterprise data tools and to support automatic collection and enrichment of metadata, using advanced techniques such as machine learning, natural language processing, and crowdsourcing.

Companies and developers alike recognize the increasing importance of modern data catalogs. In fact, a proliferation of tools and projects to build enterprise data catalogs reflects this growing interest. There are currently a number of companies specializing in enterprise data catalogs, such as Alation, Informatica, and Collibra. Many companies have built their own data catalog software and some have made them available for free.2 Additionally, all major cloud providers (AWS, GCP, and Microsoft Azure) have data catalog offerings.3

In this chapter, we describe the content of a data catalog, present a sample of features and example applications, and conclude with a summarizing framework of data catalog features.

What Is in a Data Catalog?

Data catalogs contain metadata describing data assets and other related assets in an enterprise. To make this more concrete, it is helpful to take a closer look at the various types of dataset metadata and some related examples.

Google Data Catalog distinguishes between technical metadata (e.g., schema information) and business metadata (structured tags), whereas the Ground project provides a more comprehensive framework to understand metadata. The Ground project introduces the ABC model of metadata, which categorizes metadata into application (information that describes how the data can be interpreted for use), behavioral (information about how data was created and used over time), and change (information about the version history of data).

It is important to note that metadata can describe various aspects of data assets and their relationships. Table 1-1 lists common metadata categories.

Table 1-1. Common metadata categories

Metadata category | Examples
Core metadata | Title, description, creation date, and owner
Access metadata | Information about systems that host the data and how the data can be accessed
Schema | Information about the various fields in the data along with their descriptions, type information, and other related information
Classification and tagging | Tags that link a dataset to a business glossary or to some defined classification within an enterprise
Versioning | Links to previous and newer versions
Relationships | Relationships with other data assets and relationships with other entities within an enterprise, such as people and dashboards
Content description | Statistics about the content of the various fields in the data
Lineage | Links from the data to its upstream and downstream datasets and other derived data products
Usage | Information about how often the data is used and by whom
Data quality | Information about the completeness, accuracy, and validity of the data

This list is not exhaustive, nor does it mean that a data catalog must include all these types of metadata. However, the list is provided to highlight the key role a data catalog can play by integrating information from various systems into one central, accessible place. In the next section, we describe how the richness of a data catalog’s content enables an equally rich set of applications and opportunities.
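To make a few of the categories in Table 1-1 concrete, the following is a minimal sketch of how one catalog entry might be represented. The class names, field names, and sample values are all hypothetical, chosen for illustration; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class FieldInfo:
    """Schema metadata for one field (column) of a dataset."""
    name: str
    dtype: str
    description: str = ""


@dataclass
class CatalogEntry:
    """One dataset's metadata, grouped loosely by the categories in Table 1-1."""
    # Core metadata
    title: str
    description: str
    created: date
    owner: str
    # Access metadata: where the data lives
    location: str = ""
    # Schema
    fields: list[FieldInfo] = field(default_factory=list)
    # Classification and tagging
    tags: set[str] = field(default_factory=set)
    # Lineage: titles of upstream datasets
    upstream: list[str] = field(default_factory=list)
    # Usage: number of recorded reads
    read_count: int = 0


# Hypothetical example entry
orders = CatalogEntry(
    title="orders",
    description="One row per customer order",
    created=date(2024, 1, 15),
    owner="sales-data-team",
    location="warehouse.sales.orders",
    fields=[FieldInfo("order_id", "int"), FieldInfo("email", "string")],
    tags={"sales", "pii"},
    upstream=["raw_orders"],
)
```

A real catalog would typically store such records in a database or graph store and populate most of them automatically rather than by hand.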

Data Catalog Features and Example Applications

A data catalog supports search and discovery of data assets for both data consumers and producers. Robust search requires that a data catalog have the ability to collect metadata about datasets, keep them updated, and make them searchable. A data catalog, therefore, should support extracting metadata from common data sources, such as databases, file systems, APIs, and business intelligence (BI) tools. It should also adapt to new, popular data types, such as unstructured and streaming data.
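As a rough illustration of what "making metadata searchable" can mean, the sketch below builds a small inverted index over dataset names, descriptions, and tags. The dataset entries and function names are assumptions made for this example; production catalogs generally rely on a dedicated search engine instead.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(entries):
    """Map each word to the set of dataset names whose metadata contains it."""
    index = defaultdict(set)
    for name, meta in entries.items():
        text = " ".join([name, meta["description"], *meta["tags"]])
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of datasets that match every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Hypothetical metadata collected from source systems
entries = {
    "orders": {"description": "Customer orders by day", "tags": ["sales"]},
    "customers": {"description": "Customer master data", "tags": ["crm", "pii"]},
}
index = build_index(entries)
matches = search(index, "customer sales")
```

Keeping the index current as sources change is the harder part in practice; this sketch rebuilds it from scratch each time.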

As enterprise tools, data catalogs should ensure secure access to their contents. They should also be scalable. It is a misconception that data is big and metadata is small. On the contrary! When metadata tracks many aspects of different versions of a large number of assets, the metadata repository itself grows large. Thus, data catalogs need to be architected to be scalable and performant; they must be designed to handle large amounts of metadata.

Beyond these basic functionalities, other features support innovation and collaboration. The following list illustrates how a data catalog enables innovation around data in an enterprise:

Recommendation and guided navigation

Searching datasets is not as simple or straightforward as a text search. A data catalog can use various explicit and implicit quality signals when ranking datasets for recommendation. Like a Google search, a data catalog can guide users to the most trusted data that comes from a reliable source and is frequently used. Furthermore, a data catalog can recommend domain experts who are automatically identified based on actual data usage.
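One way to picture ranking with explicit and implicit signals is a scoring function that boosts a text-match score by usage and trust. The sketch below is only an assumption about how such signals could be combined: the signal names and the weights are placeholders, not values from any real catalog.

```python
import math


def rank_score(text_match, monthly_reads, certified, avg_rating):
    """Combine a text-match score in [0, 1] with simple trust signals.

    The weights are arbitrary placeholders, not tuned values.
    """
    popularity = math.log1p(monthly_reads)  # implicit signal; log dampens heavy usage
    trust = 1.0 if certified else 0.0       # explicit signal from a data steward
    rating = avg_rating / 5.0               # explicit signal from users (0 to 5 stars)
    return text_match * (1.0 + 0.3 * popularity + 0.5 * trust + 0.2 * rating)


# Hypothetical search candidates
candidates = [
    {"name": "orders_v1", "text_match": 0.9, "monthly_reads": 3,
     "certified": False, "avg_rating": 2.5},
    {"name": "orders_clean", "text_match": 0.8, "monthly_reads": 400,
     "certified": True, "avg_rating": 4.5},
]
ranked = sorted(
    candidates,
    key=lambda c: rank_score(
        c["text_match"], c["monthly_reads"], c["certified"], c["avg_rating"]
    ),
    reverse=True,
)
```

Here the certified, heavily used dataset outranks the one with the slightly better text match, which is the behavior the paragraph above describes.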

Intelligent extraction of implicit and missing information

Some metadata, such as descriptions and tags, is typically provided by data creators and stewards. Other metadata, such as schema and creation date, is provided by the tools that manage data. Still other metadata is implicit: it can be inferred by looking at the data itself and the context of its usage. The following sidebar describes an innovative use of implicit metadata.

Techniques like data profiling can infer valuable metadata about data quality. Moreover, behavioral information extracted from data usage provides social signals about data quality. Which datasets are the most popular? How are they used? Who uses them? A widely used dataset is more likely to be trustworthy, although popularity alone does not guarantee quality. Approaches based on machine learning and natural language processing (NLP) can also be utilized here (see the following sidebar).
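A very small example of data profiling: given the values of one column, compute content statistics that a catalog could store as content-description and data-quality metadata. The function name and the chosen statistics are illustrative assumptions; real profilers compute many more measures.

```python
def profile_column(values):
    """Compute simple content statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "row_count": total,
        "null_fraction": (total - len(present)) / total if total else 0.0,
        "distinct_count": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical sample of an "age" column
ages = [34, None, 27, 34, 51, None, 45, 19]
stats = profile_column(ages)
```

A high null fraction or an implausible minimum or maximum is the kind of signal a catalog can surface next to the dataset.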

Collaboration and crowdsourcing

A data catalog can capture metadata from users actively, by soliciting their feedback. Wiki-like articles around data assets play host to common knowledge within an enterprise—and serve as living documents, open for experts to update. Ratings and resources are also useful; a data catalog can, for instance, allow users to rate a dataset or link to related resources and help articles.

Furthermore, a data catalog can enable discussion or questions/answers about a dataset to take place within the catalog itself. This keeps all information in one place, fosters a sense of community among data users, and supports a self-service learning environment.

Managing sensitive data

Sensitive data, such as personally identifiable information (PII), needs to be managed carefully in order to comply with regulations like GDPR and CCPA. A data catalog needs to support discovering, classifying, and tagging sensitive data assets. It needs to go beyond identifying sensitive data and guide users on compliant data usage as well. At a minimum, a data catalog can surface compliance information to the user at the point of data consumption.
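Discovering and tagging sensitive data can be sketched, in its simplest form, as pattern matching over sampled column values. The two regular expressions, the tag names, and the 0.8 threshold below are illustrative assumptions only; real classifiers cover many more data types and usually combine patterns with column names and machine learning.

```python
import re

# Illustrative patterns only; real detectors need far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(samples, threshold=0.8):
    """Return PII tags whose pattern matches at least `threshold` of the non-empty samples."""
    values = [s for s in samples if s]
    if not values:
        return set()
    tags = set()
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            tags.add(tag)
    return tags


# Hypothetical sampled values from a column
email_tags = classify_column(["a@example.com", "b.c@example.org", "", "d@example.net"])
```

The resulting tags could then be attached to the column's metadata so that compliance information appears wherever the column is shown.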

Interoperability and extensibility

A data catalog can provide further functionality (e.g., for visualization and lineage analysis) by integrating with other specialized tools. Furthermore, a data catalog can expose its internal services via open and expressive APIs to allow building custom functionality (see the following sidebar).

A Framework to Characterize Data Catalogs

The list of features and functionalities a data catalog provides can be daunting and hard to comprehend. This makes it difficult to understand the main traits of a given data catalog or to compare various data catalogs. In this section, we provide a framework to understand and judge a data catalog along three key aspects: broad connectivity, intelligence, and active governance.

Broad connectivity

Data catalogs with broad connectivity have flexible and extensible data models. They capture metadata and represent not only data assets in an enterprise, but related entities, such as metrics, charts, AI features, and users. Catalogs with broad connectivity are designed to easily integrate with other systems in an enterprise. They expose their internal services via open and expressive APIs to allow for further extensibility.
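Relationships between data assets and related entities such as metrics and dashboards are naturally modeled as a graph. As a minimal sketch, assuming a hypothetical set of "feeds into" edges, a breadth-first traversal can list everything downstream of an asset, which is the basis of a simple impact analysis.

```python
from collections import deque

# Hypothetical edges: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "raw_orders": ["orders_clean"],
    "orders_clean": ["revenue_metric", "orders_dashboard"],
    "revenue_metric": ["orders_dashboard"],
}


def downstream_of(asset, edges):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


affected = downstream_of("raw_orders", DOWNSTREAM)
```

The same traversal works regardless of entity type, which is one reason a flexible graph-like model suits catalogs that track more than datasets.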

Intelligence

Intelligence allows catalogs to go beyond capturing only explicit metadata. Intelligence enables catalogs to incorporate human knowledge, both passively (by tracking human usage and popularity of assets) and actively (by crowdsourcing tribal knowledge and incorporating users’ feedback). These catalogs employ advanced techniques, such as machine learning and NLP, to enrich collected metadata, extract links and relationships, and infer implicit and missing information.

Active governance

Active governance guides users as they find and use data. A data catalog with active governance surfaces compliance information about sensitive data at the point of use, encourages users to choose canonical, high-quality data assets, and provides a way to ask domain experts for help. It also actively helps users ensure compliant usage of data with features such as masking, which anonymizes PII for user personas who are restricted from viewing it under the GDPR.
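Masking by persona can be sketched as a filter applied to each row before it is shown. In the example below, the persona names and the policy set are hypothetical, and the masking replaces a value with a token derived from its SHA-256 hash. Note that unsalted hashing is closer to pseudonymization than to true anonymization; it is used here only to keep the sketch short.

```python
import hashlib

# Hypothetical persona names; a real catalog would read these from its access policies.
PERSONAS_ALLOWED_PII = {"privacy_officer"}


def mask_value(value):
    """Replace a value with a short, stable token derived from its SHA-256 hash."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "masked_" + digest[:8]


def apply_masking(row, pii_fields, persona):
    """Return a copy of the row with PII fields masked unless the persona may view them."""
    if persona in PERSONAS_ALLOWED_PII:
        return dict(row)
    return {k: mask_value(str(v)) if k in pii_fields else v for k, v in row.items()}


# Hypothetical row and PII tags
row = {"order_id": 7, "email": "a@example.com"}
masked = apply_masking(row, {"email"}, "analyst")
```

Because the set of PII fields can come from the catalog's own classification tags, the masking policy stays in step with what the catalog has discovered.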

Summary

We introduced enterprise data catalogs in this chapter. We described the role of a data catalog, its structure, and the functionality and features it provides. We provided a framework to understand the characteristics of a data catalog along three aspects: broad connectivity, intelligence, and active governance. In the next chapter, we discuss types of data catalogs.
