Chapter 1. How to Think About Data Governance

Data governance involves establishing robust data and process controls, implementing data standards, and employing effective data-handling practices that optimize data utilization to improve business outcomes while minimizing risk. This fosters trust and enables informed decision making across the organization. But knowing how to implement data governance is even more important than knowing what data governance is.

Promoters of data governance often justify the program by espousing the value of data governance, focusing on areas such as data quality and consistency, data integration and interoperability, and data access and security. This approach is misguided. Instead, it is better to work backwards from important (that is, funded) business initiatives.

It’s vital to think about the true purpose of data governance. The purpose of data governance is to ensure that data supports business initiatives. It’s that simple. But it’s also powerful—and often missed. Every successful data governance program starts by attaching itself to one or more funded business initiatives and delivering the required governance for targeted initiatives—not simply by chasing the value of data governance in and of itself. If you stick to this principle, your data governance program will support the most important strategic goals of the company (via funded business initiatives) while also building coherent, organized, and trustworthy data resources in the process.

Let Business Initiatives Drive Your Data Governance Program

Smart companies understand that data governance initiatives should support—not compete with—their business initiatives.

To illustrate the importance of this seemingly subtle difference, let’s consider an example. We once worked with an agricultural firm where an internal audit revealed some crucial data governance shortcomings, such as ineffective data quality management, lack of identified data owners and stewards, limited use of a data catalog, and other issues.

While all of these observations were accurate and important, the problems occurred when the company began attempting to close the gaps that were uncovered in the audit. Instead of aligning data governance capabilities to specific business initiatives, the company addressed the gaps directly, planning the implementation of data quality capability, a data catalog, role assignments, and so on. The company had even clearly articulated a business value proposition and respectable return on investment estimates based on closing these gaps.

Although this approach seemed appropriate, the progress of the data governance program was slow, with little sense of urgency and waning executive support. The results were particularly frustrating because data leaders in the organization were in fact following popular advice they had acquired from a variety of experts.

We advised a simple but profound shift. Instead of proposing the value of data governance directly, we advised the company to identify a funded business initiative and align data governance to support it. It turned out that there was an ongoing initiative to transform the company to precision farming.

The data governance program approached the leadership of the precision farming initiative and offered to prioritize its data governance work to support the precision farming business initiative.

The team shifted from focusing on the goal of “good data” or a “more mature data governance capability” to supporting the business initiative of precision farming—with dramatic results.

Aligning systematically produced:

A new sense of urgency
Rededicated focus
Contagious momentum
Clear understanding of how data governance can support precision farming

These shifts occurred because the data governance program was now considered vital to the success of the company transformation already under way. Each use case delivered for precision farming relied on the data governance program. For example, the program enabled the integrity and quality of key metrics, such as data about yield, soil, weather, and crops obtained through sensors, drones, satellites, and a variety of internal and external sources. The program also identified new data handling policies to ensure that personally identifiable information (PII) about the farms was not inadvertently used at the risk of violating U.S. Department of Agriculture policies. The initiative was a success.

With the data and associated data management capabilities in place, the company was then in a position to support other business initiatives by reusing and extending data resources.

At Amazon, we call this philosophy “working backwards.” The idea is to start with a vision of the final business outcome. In the case of the agricultural firm, the final outcome was associated with the precision farming business initiative, not data governance. After the data governance program was positioned as a supporting character in the larger play rather than the star of the show, progress was much easier, more focused, and dramatically more valuable.

What Are the Key Challenges with Data Governance?

Gartner emphasizes that successful digital businesses need solid data governance, which Gartner defines as the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics.¹ Gartner also names artificial intelligence (AI) risk, trust, and security management as the top strategic technology trend for 2024. In addition, continuous threat exposure and democratizing AI are in the top 10. Each of these is associated with critical aspects of data governance.

Yet a recent Gartner study also predicts that through 2025, 80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance.²

With data governance identified as such a critical priority, why does it so often go so wrong? Even when appropriately aligning data governance to support business initiatives, organizations find implementing data governance challenging for several reasons:

Data grows exponentially in both size and variability, requiring systems to constantly monitor for data quality or model bias changes in real time.
Data spreads across multiple purpose-built data stores, and getting access to analyze an organization’s data is slow and typically not understood by all team members.
As data is used by more users for more use cases, it becomes challenging to tie a data or model inference to an acceptable use policy.
Machine learning (ML) models, model features, and data transformations are not always transparent.
Data workers are not sure they can trust that generative AI models are returning results that align with acceptable use policies.
Skill set gaps and staff turnover mean fewer employees have historical knowledge of proprietary data or acceptable use policies.
Ethics policies in data governance become more important as ML and AI are more widely adopted.
Proving compliance with a regulation requires collecting the appropriate level of information that ties to that regulation, which is time-consuming.

Let’s take a look at how these problems manifest themselves in the real world.

Consider Big Finance, a financial services firm in the United States. Big Finance recognized the importance of data governance and therefore implemented strict access controls to protect client data. However, its approach was to limit data access even in cases where clients expected their financial advisor to have access to the information. In addition, data resources were disconnected across lines of business (banking, mortgage, credit card, etc.), making a holistic view of customer data difficult. As a result, the relationship with the customer was hindered in several ways:

Limited data access: With restricted access to client data, financial advisors had difficulty gaining access to data associated with the client’s transactions across lines of business, even when explicitly authorized by the customer. The client was then required to personally convey detailed information to the advisor or wait for access to be granted to the advisor, wasting valuable time. Customers became frustrated, creating a higher risk of churn.
Inefficient issue resolution and service experience: Even when data access was granted, financial services advisors had to navigate through multiple systems and rely on manual processes to gather the necessary information to better understand the customer’s history, profile, preferences, and so on. This not only hindered the advisor’s ability to counsel the client, but it also made it difficult for the advisor to help resolve issues and act on the client’s behalf in other lines of business, such as to negotiate better credit card rates in consideration of the total business with the client. Despite the advisors’ earnest efforts, clients still experienced disparities and had to repeat information when contacting representatives from other lines of business, leading to even more customer frustration.
Missed cross-sell opportunities: Lacking a holistic view of product data across lines of business, financial advisors had inadequate information to identify potential cross-selling or upselling opportunities when relevant, such as umbrella insurance policies or mortgage refinances, resulting in lost revenue and suboptimal customer experience.
Inaccurate recommendations: Robo-advisors driven by ML models that had access only to inconsistent and siloed data made inaccurate recommendations to the customer. These suboptimal recommendations eroded the trust of customers, who felt that the firm did not understand their financial positions or goals, and raised the risk of bad financial advice.

Although Big Finance focused on data governance and protecting customer data, the ineffective application of policy and lack of data integrity undermined the ability of financial advisors to adequately advise the client, leaving the company vulnerable to competitors who could.

The Three Pillars of Good Data Governance

To ensure the data is ready for targeted business initiatives, you need to think holistically about what it means to manage the data effectively and what it means to make sure the data is in the right condition to support the business initiatives.

With the data governance program aligned to support business initiatives, we advocate adopting three fundamental pillars of effective data governance:

Curating your data: Identify and manage your most valuable data sources so you can limit the proliferation and transformation of critical data assets. Also, ensure that the right data is accurate, is fresh, and has sensitive information identified.
Understanding your data: Enhance your ability to capture and share the context and meaning of your data and ML models through data profiling, data lineage, automated business summaries, data cataloging, and model governance. This context is managed through metadata.
Protecting your data: Strike the right balance between data privacy, security, and access. Ensure data is protected through data security, data classification, and data lifecycle management.
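As a rough illustration of how the three pillars can meet in a single piece of metadata, the following minimal Python sketch models a catalog entry for one dataset. Everything in it is an assumption made for illustration: the `DatasetRecord` fields, the classification labels, the `field_yield` dataset, and the helper functions are hypothetical and do not correspond to any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; all field names are illustrative assumptions."""
    name: str
    owner: str                    # understand: who answers questions about the data
    description: str              # understand: business meaning of the data
    last_refreshed: datetime      # curate: when the data was last updated
    sensitive_columns: set = field(default_factory=set)  # curate: flagged sensitive fields
    classification: str = "internal"  # protect: assumed labels such as internal/restricted


def is_fresh(record, max_age, now):
    """Curate: treat the dataset as fresh if it was refreshed within max_age."""
    return now - record.last_refreshed <= max_age


def visible_columns(record, columns, clearance):
    """Protect: hide sensitive columns unless the caller holds 'restricted' clearance."""
    if clearance == "restricted":
        return list(columns)
    return [c for c in columns if c not in record.sensitive_columns]


# Example entry for a made-up precision farming dataset.
now = datetime(2024, 3, 21, tzinfo=timezone.utc)
yield_data = DatasetRecord(
    name="field_yield",
    owner="agronomy-team",
    description="Harvest yield per field, collected from sensors",
    last_refreshed=now - timedelta(hours=6),
    sensitive_columns={"farm_owner_name"},
    classification="restricted",
)

fresh = is_fresh(yield_data, timedelta(days=1), now)
columns_for_analyst = visible_columns(
    yield_data, ["farm_owner_name", "yield_tons"], clearance="internal"
)
```

In this sketch, the same record answers a curation question (is the data fresh?), an understanding question (who owns it and what does it mean?), and a protection question (which columns may this user see?). That is one way to picture why so much of governance is managed through metadata.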

A modern data governance framework not only facilitates data accessibility and control but also exhibits the following essential characteristics:

Eliminates friction between using data for decisions and avoiding business risk: Providing your business with automated approaches to discover, understand, request access, and start using data reduces time to value while also adhering to corporate policies.
Improves decision making and reduces time to value: Providing your business with high-quality data reduces the risk of making incorrect decisions and reduces time to value. This is accomplished by implementing data quality frameworks, utilizing data validation techniques, and establishing data stewardship roles to help provide more accurate, reliable, and consistent data for applications and decision makers.
Increases trust in your data: Providing your business with a clear understanding of the quality of the data will help them feel confident in making decisions when using that data. This confidence is accomplished through data quality reporting, data definitions, lineage information, and observability into data usage through data transparency. Here again, you manage this information in your metadata.
Reduces data management costs: Automating data management such as data cataloging and data profiling reduces data duplication, optimizes data storage and infrastructure utilization, improves query performance, and saves valuable time for data engineers. This results in cost savings by reducing redundant and overlapping data pipelines, minimizing data management activities, and optimizing data storage and processing costs.
Complies with data privacy and data residency regulations: Obtaining consent for data usage, implementing data anonymization techniques, deleting data, and adhering to data localization requirements supports compliance with legal and regulatory frameworks, such as the General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA).

Overall, good data governance combines the right mix of people, processes, and technology to ensure that data is ready to meet the needs of targeted business initiatives.

Data Governance Considerations for Popular Architectures

Let’s take a look at how governing data applies to common elements of enterprise data architectures:

Data lakes

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Historically, data lakes bridged the business intelligence capabilities of data warehouses with storage capabilities more suited for ML. They also brought new data governance challenges with schema-on-read, evolving schemas, semistructured datasets, and unstructured datasets. The emergence of data lakes drove the wide adoption of cross-storage technical catalogs and business catalogs to bring together a better understanding of the data that needs to be governed with permissions.

Transactional data lakes: A transactional data lake is a type of data lake that not only stores data at scale but also supports transactional operations, ensures that data is accurate and consistent, and allows you to track how data and data structure change over time. Transactional data lakes are supported by open-table formats such as Apache Iceberg, Linux Foundation Delta Lake, and Apache Hudi. With this additional support of the transactional capabilities of traditional databases in a data lake, customers can more easily meet regulatory needs such as the right to be forgotten when they need to delete data.
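To make the link between transactional guarantees and deletion requests more concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a transactional store. It is not how Apache Iceberg, Delta Lake, or Apache Hudi are used in practice. The table names, columns, and the `forget_customer` helper are hypothetical, and physically removing data from older snapshots or file versions in a real data lake is a separate concern that this sketch does not address.

```python
import sqlite3


def forget_customer(conn, customer_id):
    """Delete every row for one customer in a single transaction.

    Returns the number of rows removed. If any delete fails, the
    connection context manager rolls back all of them together.
    """
    tables = ("orders", "profiles")  # hypothetical table names
    removed = 0
    with conn:  # commits on success, rolls back on an exception
        for table in tables:
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE customer_id = ?", (customer_id,)
            )
            removed += cursor.rowcount
    return removed


# Set up a small in-memory example with two customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (customer_id INTEGER, email TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 7.5)],
)
conn.commit()

removed = forget_customer(conn, 1)
remaining_profiles = conn.execute("SELECT COUNT(*) FROM profiles").fetchone()[0]
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The point of the sketch is the all-or-nothing behavior: a deletion request either removes the customer's data from every table or leaves everything untouched, so the data is never left in a half-deleted state.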

Data warehouses

A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data flows into a data warehouse from transactional systems, data lakes, and other sources, typically on a regular cadence. Business analysts, data engineers, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications. Data warehouses enable organizations to maintain a unified and consistent view of highly structured and refined data. By defining data ownership, access rights, and quality standards, data governance ensures that data in warehouses is well managed, making it easier for users to trust and utilize data for applications and business decision making.

Feature stores

Feature stores are purpose-built repositories to store, share, and manage features for ML models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used repeatedly by multiple teams, and feature quality is critical to ensure a highly accurate model.
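The following minimal sketch shows the idea of registering a feature once, enforcing a simple quality rule when values are written, and letting any team read the same values for training or inference. It is an in-memory toy under stated assumptions: `FeatureDefinition`, `InMemoryFeatureStore`, the value ranges, and the team names are hypothetical and do not reflect the API of any real feature store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical feature metadata: who owns it and what values are valid."""
    name: str
    owner: str
    description: str
    min_value: float
    max_value: float


class InMemoryFeatureStore:
    """Toy feature store that keeps definitions and values in dictionaries."""

    def __init__(self):
        self._definitions = {}
        self._values = {}  # (feature name, entity id) -> value

    def register(self, definition):
        """Register a feature once so that every team shares one definition."""
        if definition.name in self._definitions:
            raise ValueError(f"feature already registered: {definition.name}")
        self._definitions[definition.name] = definition

    def put(self, feature_name, entity_id, value):
        """Store a value, rejecting anything outside the declared range."""
        definition = self._definitions[feature_name]  # KeyError if unregistered
        if not definition.min_value <= value <= definition.max_value:
            raise ValueError(f"{feature_name}={value} is out of range")
        self._values[(feature_name, entity_id)] = value

    def get_vector(self, entity_id, feature_names):
        """Return feature values for one entity, in the requested order."""
        return [self._values[(name, entity_id)] for name in feature_names]


store = InMemoryFeatureStore()
store.register(FeatureDefinition(
    "song_rating", "playlist-team", "Average rating given by the listener", 0.0, 5.0))
store.register(FeatureDefinition(
    "listening_seconds", "playlist-team", "Average listening duration per song", 0.0, 36000.0))

store.put("song_rating", "listener-1", 4.5)
store.put("listening_seconds", "listener-1", 212.0)

vector = store.get_vector("listener-1", ["song_rating", "listening_seconds"])
```

Keeping the owner, description, and valid range next to the feature itself is one way to make reused features governable: a second team can see what a feature means and can rely on the same validation the first team applied.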

Databases

Databases are used to store data of all kinds, using purpose-built and specialized databases, such as time-series, graph, relational, and vector databases. By defining data policies and auditing procedures and adhering to data quality rules, data governance facilitates data sharing among different applications while ensuring compliance with regulations and organizational standards.

Files

Files store data in widely accessible forms or for special purpose applications such as Microsoft Excel, PDFs, and images. Data governance policies set guidelines on data usage, version control, and data sharing to maintain data accuracy and security across these files.

Data mesh

A data mesh is an architectural framework that enables distributed, decentralized ownership. Organizations have multiple data sources from different lines of business that must be integrated for analytics. A data mesh architecture effectively unites the disparate data sources and links them together through centrally managed data sharing and governance guidelines. Business functions can maintain control over how shared data is accessed, who accesses it, and in what formats it’s accessed. A data mesh adds complexity to the architecture but also brings efficiency by improving data access, security, and scalability.

Let’s dive deeper into the three pillars of data governance, starting with the first one: curating your data.

¹ “Data Governance”, Gartner Glossary, accessed March 21, 2024.

² Laurence Goasduff, “Choose Adaptive Data Governance over One-Size-Fits-All for Greater Flexibility”, Gartner, April 11, 2022.


Data Governance with AWS by Kevin Lewis, Jason Berkowitz, Ina Felsheim, Joseph D. Stec
