Chapter 4. Enterprise Data Catalog Business Impact
In this chapter, we describe the business impact—both qualitative and quantitative—of data catalogs. We then provide a number of concrete use cases that a data catalog supports. In particular, we discuss self-service analytics, data governance and guided data usage, data operations, and cloud and multicloud migration.
Catalog Business Impact
As described before, the core value of an enterprise data catalog is that it is the central place to bring together all information about data in an enterprise. The emphasis on central and all information makes an enterprise data catalog more efficient than a combination of siloed tool-adjunct data catalogs in an enterprise. Furthermore, the enterprise data catalog integrates, interlinks, and enriches the various pieces of metadata. It is the place where the value of the whole becomes greater than the sum of its parts.
Few resources report quantitatively on the impact of data catalogs. One notable exception is a total economic impact (TEI) study, conducted by Forrester Consulting and commissioned by Alation, that examined the potential return on investment (ROI) enterprises may realize by deploying Alation.
Forrester interviewed seven customers with experience using the Alation Data Catalog. Based on this, in October 2019, Forrester reported the following risk-adjusted present value (PV) quantified benefits of using enterprise data catalogs:
- Analyst productivity improved due to shortened data discovery. Improvements amount to savings of $2.7 million.
- Business user productivity improved from self-service. Improvements amount to savings of $584,182.
- Data engineer productivity improved due to user self-service. Improvements amount to savings of $165,065.
- Savings from faster onboarding of new analysts amount to $286,085.
A 2019 Gartner report predicted that by 2021, organizations that offer a curated catalog of internal and external data to diverse users would realize twice the business value from their data and analytics investments as those that do not. The report also stated that by 2022, over 60% of traditional IT-led data catalog projects that do not use ML to assist in finding and inventorying data distributed across a hybrid/multicloud ecosystem would fail to be delivered on time.
Data catalogs facilitate data discovery and usage, support data governance and collaboration around data, and help ensure compliant use of data. In the next section, we discuss how these functionalities can be utilized in a number of use cases to drive business value.
Catalog Use Cases
This section discusses the support data catalogs can provide in four use cases: self-service business intelligence, data governance, data operations, and cloud migration.
Self-Service Business Intelligence
In the era of digital disruptions, the businesses that win are more agile and can make decisions at the speed of market change and competition. This means that the pool of decision makers needs to be enlarged. The historical request model—which makes IT a bottleneck—no longer works.
Instead, business users, business analysts, and data scientists need the ability to self-discover trustworthy data. Business users need to then make decisions with timely, trustworthy data. This puts the data catalog front and center in self-service BI.
Self-service BI initiatives help organizations become more data-driven and democratize access to data. But data can’t be used if it can’t be found. Search and discovery of trustworthy data is a core value of enterprise data catalogs, and the value extends well beyond business users.
It is often said that data scientists and data analysts spend only 20% of their time doing data analysis work, with 80% consumed by data “issues.” The bulk of their time is spent finding, evaluating, understanding, and preparing data before analysis can begin. A data catalog aims to invert this ratio, enabling data analysts and data scientists to spend 20% of their time looking for data and 80% performing analysis.
The value of this, to both the organization as a whole and to individual analysts, cannot be overstated. Not only does the organization benefit from improved efficiency, collaboration, and innovation at scale, but analysts and others also benefit tremendously from improved job satisfaction. Analysts do not enjoy hunting, gathering, and verifying the trustworthiness of data, and have ranked these tasks as the most unpleasant chores of the job.
Another important item highlighted in the Forrester report referenced previously is the role of a data catalog to speed up onboarding new analysts. In a project the authors worked on, catalog data was used to provide personalized recommendations of datasets that are potentially of interest to a new employee based on looking at access patterns of other team members and people with similar roles.
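The recommendation idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `access_log` format, the `recommend_datasets` function, and all user and dataset names are hypothetical, and a real catalog would likely weight recency and role similarity as well.

```python
from collections import Counter

def recommend_datasets(access_log, peers, already_seen=(), top_n=3):
    """Rank datasets by how many distinct peers have accessed them.

    access_log: iterable of (user, dataset) pairs (hypothetical catalog export).
    peers: teammates or people in a similar role to the new employee.
    """
    peers = set(peers)
    seen = set(already_seen)
    # Count each (peer, dataset) pair once so one heavy user does not dominate.
    distinct_pairs = {(u, d) for u, d in access_log if u in peers and d not in seen}
    counts = Counter(d for _, d in distinct_pairs)
    # Most widely used first; ties broken by name for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [d for d, _ in ranked[:top_n]]

access_log = [
    ("ana", "sales_daily"), ("ana", "sales_daily"), ("ana", "customers"),
    ("ben", "sales_daily"), ("ben", "web_events"),
    ("cho", "customers"), ("cho", "sales_daily"),
    ("dee", "hr_payroll"),  # not a peer, so ignored
]
print(recommend_datasets(access_log, peers=["ana", "ben", "cho"]))
# ['sales_daily', 'customers', 'web_events']
```

Counting distinct peers rather than raw accesses is one possible design choice; it favors datasets that are broadly relevant to the team over those used intensively by a single person.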
By facilitating self-service BI, data catalogs improve employees’ productivity, reduce time to insight, and positively impact employees’ satisfaction and, therefore, retention.
Data Governance and Guided Data Usage
Data users need to first know where to find relevant data, and data catalogs are essential tools to address this need. However, after finding the data, users need to understand the data, as well as know how (and whether) to use it. There are a number of ways a data catalog can guide the proper usage of data.
Dataset and expert recommendation
In a self-service environment with multiple publishers, it’s impossible to completely avoid data redundancy and overlapping. Multiple data assets with similar content, but possibly with varying quality, will exist. A data catalog can guide users to trusted data that comes from a reliable source and is frequently used. A data catalog can also use various explicit and implicit quality signals when ranking datasets for recommendation. Some of those signals are discussed next. Furthermore, a data catalog can recommend domain experts who are automatically identified based on actual data usage.
Certified datasets
Subject matter experts can provide endorsement to high-quality datasets that can be trusted. This can be in the form of a star ranking or certification flag. Similarly, deprecated or unmaintained data can be flagged or given a low star ranking, not unlike restaurant reviews on Yelp or product reviews on Amazon. These certifications are automatically utilized by the catalog at the point of data use. Potential data consumers can quickly identify trustworthy data and save time. Data certification complements the recommendations by offering a way to promote data through curation.
Data quality
In addition to explicit dataset certifications, a data catalog can integrate data quality signals from dedicated external systems or perform data profiling to surface quality characteristics of various data sources. These quality signals are accessible to the users intending to use a dataset and can also be used when recommending a dataset or ranking it in a search result.
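As a rough illustration of how explicit and implicit signals might be combined when ranking datasets, consider the sketch below. The `DatasetSignals` fields, the weights in `trust_score`, and the dataset names are all assumptions made for this example; an actual catalog would tune or learn its own scoring.

```python
from dataclasses import dataclass

@dataclass
class DatasetSignals:
    name: str
    certified: bool       # explicit endorsement by a subject matter expert
    deprecated: bool      # explicit flag for unmaintained data
    monthly_users: int    # implicit signal: how widely the data is used
    quality_score: float  # 0.0-1.0, e.g. from profiling or an external tool

def trust_score(d, max_users):
    """Blend usage, quality, and certification; deprecated data scores zero."""
    if d.deprecated:
        return 0.0
    usage = d.monthly_users / max_users if max_users else 0.0
    # Illustrative weights only: 40% usage, 40% quality, 20% certification.
    return 0.4 * usage + 0.4 * d.quality_score + (0.2 if d.certified else 0.0)

def rank(datasets):
    """Return datasets ordered from most to least trustworthy."""
    max_users = max((d.monthly_users for d in datasets), default=0)
    return sorted(datasets, key=lambda d: trust_score(d, max_users), reverse=True)

candidates = [
    DatasetSignals("orders_old", certified=False, deprecated=True, monthly_users=40, quality_score=0.95),
    DatasetSignals("orders_copy", certified=False, deprecated=False, monthly_users=100, quality_score=0.6),
    DatasetSignals("orders_v2", certified=True, deprecated=False, monthly_users=80, quality_score=0.9),
]
print([d.name for d in rank(candidates)])
# ['orders_v2', 'orders_copy', 'orders_old']
```

Here the certified dataset outranks a more heavily used but lower-quality copy, and the deprecated dataset drops to the bottom despite its high profiling score.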
Data Operations
Data flows within an enterprise are becoming increasingly complex, due to the increased use of SaaS tools to manage and process data as well as the growing number of initiatives for self-service ETL. Consequently, managing these flows can be expensive. Without the support of tools, manual management of these data flows requires a potentially large dedicated team of engineers. Furthermore, it becomes very challenging to avoid the risks associated with using low-quality data to support decisions and noncompliant use of sensitive data.
Here are a number of related use cases that a data catalog can support.
Maintaining data delivery SLAs
It is common for data teams to provide datasets bound to defined SLAs in terms of frequency of updates and freshness. The freshness of a given dataset is a function of the freshness of all its upstream data sources. Accurate management of data SLAs is essential to earning user confidence in data and to supporting informed decision making. A data catalog’s support of automatic querying and visualization of the data flow helps assess the feasibility of SLAs and the impact of upstream changes and delays.
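To make the dependency between freshness and upstream sources concrete, here is a minimal sketch. It assumes the catalog can export an acyclic lineage map (`upstream`) and the hours since each dataset was last refreshed (`age_hours`); both structures, the function names, and the dataset names are hypothetical. The sketch treats a dataset as only as fresh as its stalest ancestor.

```python
def effective_age_hours(dataset, upstream, age_hours, _cache=None):
    """Age of the stalest data feeding `dataset`, including the dataset itself.

    upstream: dict mapping a dataset to the datasets it reads from (assumed acyclic).
    age_hours: dict mapping a dataset to hours since its last refresh.
    """
    if _cache is None:
        _cache = {}
    if dataset not in _cache:
        ages = [age_hours[dataset]]
        ages += [effective_age_hours(u, upstream, age_hours, _cache)
                 for u in upstream.get(dataset, [])]
        _cache[dataset] = max(ages)
    return _cache[dataset]

def sla_violations(sla_hours, upstream, age_hours):
    """Datasets whose effective age exceeds their freshness SLA."""
    return sorted(d for d, limit in sla_hours.items()
                  if effective_age_hours(d, upstream, age_hours) > limit)

upstream = {
    "revenue_dashboard": ["orders_clean"],
    "orders_clean": ["orders_raw", "fx_rates"],
}
age_hours = {"revenue_dashboard": 1, "orders_clean": 2, "orders_raw": 3, "fx_rates": 30}
sla_hours = {"revenue_dashboard": 24, "orders_clean": 48}

print(effective_age_hours("revenue_dashboard", upstream, age_hours))  # 30
print(sla_violations(sla_hours, upstream, age_hours))  # ['revenue_dashboard']
```

The dashboard was rebuilt an hour ago, yet it misses its 24-hour SLA because one upstream source is 30 hours old, which is the kind of issue lineage information helps surface.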
Handling data quality issues and incident response
When a particular dataset has a quality issue, the data operations team needs to understand the impact of the issue on all downstream data products. Once the issue is fixed, downstream data products need to be backfilled, usually in a specific order that respects their interdependency. Lineage data within a data catalog can be used to support, or possibly automate, such tasks.
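One way to derive both the impacted datasets and a safe backfill order from lineage is a graph traversal followed by a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the `upstream` map and dataset names are hypothetical stand-ins for lineage exported from a catalog.

```python
from graphlib import TopologicalSorter

def downstream_of(source, upstream):
    """All datasets that directly or transitively read from `source`."""
    # Invert the "reads from" map into a "feeds" map.
    feeds = {}
    for dataset, parents in upstream.items():
        for parent in parents:
            feeds.setdefault(parent, set()).add(dataset)
    impacted, stack = set(), [source]
    while stack:
        for child in feeds.get(stack.pop(), ()):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

def backfill_order(source, upstream):
    """Order in which impacted datasets can be rebuilt after `source` is fixed."""
    impacted = downstream_of(source, upstream)
    # Keep only dependencies among impacted datasets; other inputs are unaffected.
    sub = {d: [p for p in upstream.get(d, []) if p in impacted] for d in impacted}
    return list(TopologicalSorter(sub).static_order())

upstream = {
    "orders_clean": ["orders_raw"],
    "daily_revenue": ["orders_clean"],
    "exec_dashboard": ["daily_revenue", "orders_clean"],
    "hr_report": ["hr_raw"],
}
print(backfill_order("orders_raw", upstream))
# ['orders_clean', 'daily_revenue', 'exec_dashboard']
```

Datasets outside the blast radius, such as `hr_report` here, are left alone, and every dataset is rebuilt only after the impacted datasets it depends on.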
Data deprecation
Similar to data quality issues, deprecating a dataset or a field of a dataset requires understanding the impact on downstream data products. This requires a data operations team to understand not only what is affected but who is affected and needs to be contacted.
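Answering "who is affected" amounts to joining lineage with ownership metadata. The following sketch assumes field-level lineage (`feeds`) and an `owners` map are available from the catalog; these structures, the `deprecation_notices` function, and all asset and team names are illustrative assumptions.

```python
from collections import defaultdict, deque

def deprecation_notices(target, feeds, owners):
    """Map each owner to the downstream assets affected by deprecating `target`.

    feeds: dict mapping an asset to the assets that read from it.
    owners: dict mapping an asset to its owning team or person.
    """
    affected, queue = set(), deque([target])
    while queue:
        for child in feeds.get(queue.popleft(), ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    notices = defaultdict(list)
    for asset in sorted(affected):
        # Assets without a registered owner are grouped for manual follow-up.
        notices[owners.get(asset, "unowned")].append(asset)
    return dict(notices)

feeds = {
    "customers.email": ["crm_export.email", "marketing_list.contact"],
    "crm_export.email": ["churn_model.features"],
}
owners = {
    "crm_export.email": "sales-ops",
    "marketing_list.contact": "marketing",
    "churn_model.features": "sales-ops",
}
print(deprecation_notices("customers.email", feeds, owners))
```

In this example, deprecating the `customers.email` field would require notifying two teams, one of which owns an asset two hops downstream.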
In summary, a data catalog is an essential tool for data operations teams. A data catalog supports team efficiency and improves the quality of the data provided as a foundation for decision making.
Cloud and Multicloud Migration
Cloud computing services offer enterprises many promising advantages, including cost reductions, instant scalability, and new innovations available only in the cloud. Organizations are increasingly migrating from on-premises resources to the cloud or adopting a hybrid model with only part of the infrastructure managed and owned by the organization itself.
However, migrating to the cloud is a challenging task. The seamless scalability provided by the cloud and its pay-per-usage cost model can complicate migration initiatives and often incur unanticipated costs. During the migration, organizations find themselves in a state of flux as data is spread across on-premises and cloud environments. Data catalogs can help organizations during the planning and the execution of data migration to the cloud.
A complete “lift-and-shift” strategy brings all of the organization’s data to the cloud, so data that is never used still incurs unnecessary cost. When planning a migration, data catalogs help identify the commonly used, relevant data assets that need to be moved to the cloud, reducing costs and optimizing the cloud data environment. Additionally, the data catalog helps IT prioritize and manage the migration by providing information about data dependencies.
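A simple planning pass over catalog metadata might look like the sketch below. It assumes the catalog can provide recent access counts per dataset (`access_counts`) and a lineage map (`upstream`); the `plan_migration` function, the threshold, and the dataset names are hypothetical. Datasets that are used, plus everything they depend on, are selected for migration, and the rest become candidates to archive or leave behind.

```python
def plan_migration(access_counts, upstream, min_accesses=1):
    """Split datasets into those to migrate and those that can stay behind.

    access_counts: dict mapping a dataset to its recent access count.
    upstream: dict mapping a dataset to the datasets it reads from.
    """
    used = {d for d, n in access_counts.items() if n >= min_accesses}
    to_migrate, stack = set(), list(used)
    while stack:
        dataset = stack.pop()
        if dataset not in to_migrate:
            to_migrate.add(dataset)
            # A used dataset is only useful in the cloud if its inputs move too.
            stack.extend(upstream.get(dataset, []))
    leave_behind = set(access_counts) - to_migrate
    return sorted(to_migrate), sorted(leave_behind)

access_counts = {"sales_daily": 120, "orders_raw": 0, "tmp_export_2017": 0, "legacy_audit": 0}
upstream = {"sales_daily": ["orders_raw"]}

migrate, skip = plan_migration(access_counts, upstream)
print(migrate)  # ['orders_raw', 'sales_daily']
print(skip)     # ['legacy_audit', 'tmp_export_2017']
```

Note that `orders_raw` is migrated even though nobody queries it directly, because a frequently used dataset depends on it. Usage data alone would have missed this.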
During data migration, analysts can no longer be sure where to go for the best, most appropriate data. Data catalogs signal to users when and where data is being migrated from the old to new system. Therefore, data consumers can continue to find the data they need, regardless of where it resides during the migration journey.
In summary, data catalogs support and accelerate business goals when migrating to the cloud. Data catalogs help reduce the cost of migrating to the cloud, increase productivity by providing transparency throughout the migration journey, and accelerate adoption, increasing the value of migrated data to the business.
Summary
This chapter described the potential business impact of data catalogs and discussed a number of use cases that illustrate their potential as a business-critical investment. It is worth keeping two things in mind:
- Remember the previous chapter’s recommended practice of having a clear use case to start with when implementing a data catalog and to continuously measure the achieved impact.
- Be aware of the common pitfall of using a specialized catalog for a chosen use case without considering the enterprise’s future needs.