Chapter 1. Why the Industry Now Needs Data Contracts

We believe that data contracts, agreements between data producers and consumers that are established, updated, and enforced via an API, are necessary for scaling and maintaining data quality within an organization. Unfortunately, data quality and its foundations, such as data modeling, have been severely deprioritized with the rise of big data, cloud computing, and the Modern Data Stack. Though these advancements enabled the prolific use of data within organizations and codified professions such as data science and data engineering, their ease of use also came with a lack of constraints, leading many organizations to take on substantial data debt. With pressure for data teams to move from R&D to actually driving revenue, as well as the shift from model-centric to data-centric AI, organizations are once again accepting that data quality is a must-have instead of a nice-to-have. Before going in depth about what data contracts are and how to implement them, this chapter highlights why our industry has forgone data quality best practices, why we're prioritizing data quality again, and the unique conditions of the data industry post-2020 that warrant data contracts as the way to drive data quality.

Garbage-In Garbage-Out Cycle

Talk to any data professional and they will fervently cite the mantra of "garbage-in, garbage-out" as the root cause of most data mishaps and limitations. Despite us all agreeing on where the problem lies within the data lifecycle, we still struggle to produce and utilize quality data.

Modern Data Management

From the outside looking in, data management seems relatively simple—data is collected from a source system, moved through a series of storage technologies to form what we call a pipeline, and ultimately ends up in a dashboard that is used to make a business decision. This impression could be easily forgiven. The consumers of data such as analysts, product managers, and business executives rarely see the mass of infrastructure responsible for transporting data, cleaning and validating it, discovering the data, transforming it, and creating data models. Like an indescribably large logistics operation, the cost and scale of the infrastructure required to control the flow of data to the right parts of the organization is virtually invisible, working silently in the background out of sight.

At least, that should be the case. With the rise of machine learning and artificial intelligence, data is increasingly taking the spotlight, yet organizations still struggle to extract value from it. The pipelines used to manage data flow are breaking down, the data scientists hired to build and deploy ML models can't move forward until data quality issues are resolved, and executives make million-dollar "data-driven" decisions that turn out to be wrong. As the world continues its transition to the cloud, our silent data infrastructure is not so silent anymore. Instead, it's groaning under the weight of scale, both in terms of volume and organizational complexity. Exactly at the point in time when data is poised to become the most operationally valuable it's ever been, our infrastructure is in the worst position to deliver on that goal.

The data team is in disarray. Data engineering organizations are flooded with tickets as pipelines actively fail across the company. Even worse, silent failures result in data changing in backwards-incompatible ways with no one noticing, leading to multi-million dollar outages and, worse still, a loss of trust in the data from employees and customers alike. Data engineers are often caught in the crossfire between downstream teams who don't understand why their important data suddenly looks different today than it did yesterday, and data producers who are ultimately responsible for these changes but have no insight into who is leveraging their data and for what reason. The business is often confused about the role data engineers are meant to play. Are they responsible for fixing any data quality issue, even those they didn't cause? Are they accountable for rapidly rising cloud compute spend in an analytical database? If not, who is?

This state of the world is a result of data debt. Data debt refers to the outcome of incremental technology choices made to expedite the delivery of a data asset like a pipeline, dashboard, or training set for machine learning models. Data debt is the primary villain of this book. It inhibits our ability to deploy data assets when we need them, destroys trust in the data, and makes iterative data governance almost impossible. The pain of data engineering teams is caused by data debt - either managing it directly, or its secondary impacts on other teams. Over the subsequent chapters you will learn what causes data debt, why it is more difficult to handle than software debt, and how it can ultimately cripple the data functions of an organization.

What is data debt?

If you have worked in any form of engineering organization at scale, you have likely seen the words 'tech debt' repeated dozens of times by concerned engineers who wipe sweat from their brows while discussing a future with 100x the request volume to their service.

Simply put, tech debt is a result of short term decisions made to deploy code faster at the expense of long term stability. Imagine a software development team working on a web application for an e-commerce company. They have a tight deadline to release a new feature, so they decide to take a shortcut and implement the feature quickly without refactoring some existing code. The quick implementation works, and they meet their deadline, but it’s not a very efficient or maintainable solution.

Over time, the team starts encountering issues with the implementation. The new feature’s code is tightly coupled with the existing codebase which makes it challenging to add or modify other features without causing unintended side effects. Bugs related to the new feature keep popping up, and every time they try to make changes, it takes longer than expected due to the lack of proper documentation. The cost incurred to fix the initial implementation with something more scalable is debt. At some point, this debt has to be paid or the engineering team will suffer slowing deployment velocity to a crawl.

Like tech debt, data debt is a result of short term decisions made for the benefit of speed. However, data debt is much worse than software-oriented tech debt for a few reasons. First, in software, the typical tradeoff that results in tech debt sacrifices maintainability and scale in favor of speed: that is, how easy it is for engineers to work within the codebase, and how many customers and requests can be serviced. The operational function of the application, which is intended to solve a core customer problem, is still being delivered. In data, however, the primary value proposition is trustworthiness. If the data that appears in our dashboards, machine learning models, and customer-facing applications can't be trusted, then it is worthless. The less trust we have in our data, the less valuable it will be. Data debt directly affects trustworthiness. By building data pipelines quickly without the core components of a high quality infrastructure such as data modeling, documentation, and semantic validity, we are directly impacting the core value of the data itself. As data debt piles up, the data becomes more untrustworthy over time.

Going back to our e-commerce example, imagine the shortcut implementation didn't just make the code difficult to maintain, but that every additional feature layered on top made the product increasingly difficult to use until there were no customers left. That would be the equivalent of data debt.

Second, data debt is far harder to unwind than technical debt. In most modern software engineering organizations, teams have moved or are currently moving to microservices. Microservices are an architectural pattern that changes how applications are structured by decomposing them into a collection of loosely connected services that communicate through lightweight protocols. A key objective of this approach is to enable the development and deployment of services by individual teams, free from dependencies on others.

By minimizing interdependencies within the code base, developers can evolve their services with minimal constraints. As a result, organizations scale easily, integrate with off-the-shelf tooling as and when it's needed, and organize their engineering teams around service ownership. The structure of microservices allows tech debt to be self-contained and locally addressed. Tech debt that affects one service does not necessarily affect other services to the same degree, and this allows each team to manage its own backlog without having to consider scaling challenges for the entire monolith.

The data ecosystem is based on a set of entities which represent common business objects. These business objects correspond to important real world or conceptual domains within a business. As an example, a freight technology company might leverage entities such as shipments, shippers, carriers, trucks, customers, invoices, contracts, accidents, and facilities. Entities are nouns - they are the building blocks of questions which can ultimately be answered by queries.

However, most useful data in an organization goes through a set of transformations built by data engineers, analysts, or analytics engineers. Transformations combine real-world, domain-level data into logical aggregates called facts, which are leveraged in metrics used to judge the health of a business. A "customer churn" metric, for example, combines data from the customer entity and the payment entity. "Completed shipments per facility" would combine data from the shipment entity and the facility entity. Because constructing metrics can be time consuming, most queries written in a company depend on both core business objects and aggregations. Meaning, data teams are tightly coupled to each other and the artifacts they produce - a distinct difference from microservices.
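
As a rough illustration, and assuming hypothetical customers and payments tables, a churn-style metric might be assembled along these lines:

-- Hypothetical sketch: customers with no payment in the last 90 days
SELECT
    COUNT(*) AS churned_customers
FROM customers c
LEFT JOIN payments p
    ON p.customer_id = c.customer_id
    AND p.payment_date >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
WHERE p.customer_id IS NULL;

Even this small sketch already couples the metric to two upstream entities; change either table and the metric silently changes with it.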

This tight coupling means that data debt which builds up in a data environment can't be easily changed in isolation. Even a small adjustment to a single query can have huge downstream implications, radically altering reports and even customer-facing data initiatives. It's almost impossible to know where data is being used, how it's being used, and how important the data asset in question is to the business. The more data debt piles up between producers and consumers, the more challenging it is to untangle the web of queries, filters, and poorly constructed data models which limits visibility downstream.

To summarize: Data debt is a vicious cycle. It depreciates trust in the data, which attacks the core value of what data is meant to provide. Because data teams are tightly coupled to each other, data debt cannot be easily fixed without causing ripple effects through the entire organization. As data debt grows, the lack of trustworthiness compounds exponentially, eventually infecting nearly every data domain and resulting in organizational chaos. The spiral of data debt is the biggest problem to solve in data, and it’s not even close.

Garbage In / Garbage Out

Data debt is prominent across virtually all industry verticals. At first glance, it appears as though managing debt is simply the default state of data teams: a fate to which every data organization is doomed to follow even when data is taken seriously at a company. However, there is one company category that rarely experiences data debt for a reason you might not expect: startups.

When we say startup, we are referring to an early stage company in the truest sense of the word. Around 20 software engineers or fewer, a lean but functioning data team (though it may be only one or two data engineers and a few analysts), and a product that has either found product market fit or is well on its way. We have spoken to dozens of companies that fit this profile, and nearly 100% of them report not only having minimal data debt, but having virtually no data quality issues at all. The reason this occurs is simple: the smaller the engineering organization, the easier it is to communicate when things change.

Most large companies have complex management hierarchies with many engineers and data teams rarely interacting with each other. For example, Convoy’s engineering teams were split into ‘pods,’ a term taken from Spotify’s product organizational model. Pods are small teams built around core customer problems or business domains that maximize for agility, independent decision making, and flexibility. One pod focused on supporting our operations team by building machine learning models to prioritize the most important issues filed by customers. Another worked on Convoy’s core pricing model, while a third might focus on supplying real-time analytics on shipment ETA to our largest partners.

While each team rolled up to a larger organization, the roadmaps were primarily driven by product managers in individual contributor roles. The product managers rarely spoke to other pods unless they needed something from them directly. This resulted in some significant problems for data teams when new features were ultimately shipped. A software engineer managing a database may decide to change a column name, remove a column, change the business logic, stop emitting an event, or make any number of other changes that are problematic for downstream consumers. Data consumers would often be the first to notice the change because something looked off in their dashboard, or the machine learning model began to produce incorrect predictions.

At smaller startups, data engineers and other data developers have not yet split into multiple siloed teams with differing strategies. Everyone is part of the same team, with the same strategy. Data developers are aware of virtually every change that is deployed, and can easily raise their hand in a meeting or pull the lead engineer aside to explain the problem. In a fast-moving organization with dozens, hundreds, or thousands of engineers, this is no longer possible. This breakdown in communication results in the most often referenced phenomenon in data quality: Garbage In, Garbage Out (GIGO).

GIGO occurs when data that does not meet a stakeholder's expectations enters a data pipeline. GIGO is problematic because it can only be dealt with retrospectively, meaning there will always be some cost to resolve the problem. In some cases, that cost could be severe, such as lost revenue from an executive making a poor decision off a low quality dashboard whose results changed meaningfully overnight, or a machine learning model making incorrect predictions about a customer's buying behavior. In other cases, the cost could be less severe - a dashboard shows wrong numbers that can be easily fixed before the next presentation with a simple CASE statement. However, even straightforward hotfixes point to a more serious issue brewing beneath the surface: the growth of data debt.

As the number of retroactive fixes grows over time, institutional knowledge hotspots begin to build up in critical areas within the data ecosystem. SQL files 1,000 lines long are completely indecipherable to everyone besides the first data engineer in the company. It is not clear what the data means, who owns it, where it comes from, or why an active_customers table seems to be transforming NULLs into a value called 'returned' without any explanation in the documentation.
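
A hypothetical sketch of what one of those undocumented hotfixes might look like, buried deep inside a long transformation (the table and column names here are illustrative, not from a real system):

-- Hypothetical hotfix buried in a 1,000-line transformation:
-- nobody remembers why NULL statuses became 'returned'
SELECT
    customer_id,
    CASE
        WHEN status IS NULL THEN 'returned'
        ELSE status
    END AS status
FROM active_customers_raw;

Each patch like this works in the moment, but collectively they become the institutional knowledge hotspots described above.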

Over time, data debt caused by GIGO starts to increase exponentially as the ratio of software to data developers grows larger. The number of deployments increases from a few times per week to hundreds or thousands of times per day. Breaking changes become a common occurrence, while many data quality issues impacting the contents of the data itself (business logic) can go unnoticed for days, weeks, or even months. When the problem has grown enough that it is noticeably slowing down analysts and data scientists from doing their work, you have already reached a tipping point: without drastic action, there is essentially no way out. The data debt will continue to mount and create an increasingly broken user experience. Data engineers and data scientists will quit the business for a less painful working environment, and the business value of a company's most meaningful data asset will degrade.

While GIGO is the most prominent cause of data debt, challenges around data are also rooted in the common architectures we adopt.

The Death of Data Warehouses

Beginning in the late 1980s and extending through today, the Data Warehouse has remained a core component of nearly every data ecosystem, and the foundation of virtually all analytical environments.

The Data Warehouse is one of the most cited concepts in all of data, and it is essential to understand at a root level as we dive into the causes of the explosion of data debt. Bill Inmon is known as the "Father of the Data Warehouse," and for good reason: he created the concept. In Inmon's own words:

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.

According to Inmon, the data warehouse is more than just a repository; it's a subject-oriented structure that aligns with the way an organization thinks about its data from a semantic perspective. This structure provides a holistic view of the business, allowing decision-makers to gain a deep understanding of trends and patterns, ultimately leveraging data for visualizations, machine learning, and operational use cases.

In order for a data structure to be a warehouse it must fulfill three core capabilities.

  • First, the data warehouse is designed around the key subjects of an organization, such as customers, products, sales, and other domain-specific aspects.

  • Second, data in a warehouse is sourced from a variety of upstream systems across the organization. The data is unified into a single common format, resolving inconsistencies and redundancies. This integration is what creates a single source of truth and allows data consumers to take reliable upstream dependencies without worrying about replication.

  • Third, data in a warehouse is collected and stored over time, enabling historical analysis. This is essential for time bounded analytics, such as understanding how many customers purchased a product over a 30 day window, or observing trends in the data that can be leveraged in machine learning or other forms of predictive analytics. Unlike operational databases that constantly change as new transactions occur, a data warehouse is non-volatile. Once data is loaded into the warehouse, it remains unchanged, providing a stable environment for analysis.
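
For example, the 30-day window mentioned in the third point could be expressed as a simple time-bounded query. The sketch below is purely illustrative and assumes a hypothetical orders table with a purchase timestamp:

-- Hypothetical sketch: customers who purchased a given product in the last 30 days
SELECT
    COUNT(DISTINCT customer_id) AS purchasing_customers
FROM orders
WHERE product_id = 42
  AND order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);

Because the warehouse is non-volatile, the same query run over a historical window should return the same answer a year from now.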

The creation of a Data Warehouse usually begins with an Entity Relationship Diagram (ERD), as illustrated in Figure 1-1. ERDs represent the logical and semantic structure of a business's core operations and are meant to provide a map that can be used to guide the development of the Warehouse. An entity is a business subject that can be expressed in a tabular format, with each row corresponding to a unique subject unit. Each entity is paired with a set of dimensions that contain specific details about the entity in the form of columns. For example, a customer entity might contain dimensions such as:

Customer_id

A unique string identifying each new customer registered to the website

Birthday

A datetime which a customer fills out during the registration process

FirstName

The first name of the customer

LastName

The last name of the customer

Figure 1-1. Example of an entity relationship diagram
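
To make this concrete, the customer entity above might be realized as a physical table along the following lines. This is a minimal, hypothetical sketch; the table name and column types are illustrative assumptions rather than a prescribed design:

-- Hypothetical physical table for the customer entity
CREATE TABLE customers_table (
    customer_id VARCHAR(36) PRIMARY KEY,  -- unique string assigned at registration
    birthday    DATETIME,                 -- collected during registration
    first_name  VARCHAR(100),
    last_name   VARCHAR(100)
);

Foreign keys, discussed next, would be added as additional columns that reference other entities.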

An important type of dimension in ERD design is the foreign key. Foreign keys are unique identifiers which allow analysts to combine data across multiple entities in a single query. As an example, the customers_table might contain the following relevant foreign keys:

Address_id

A unique address field which maps to the address_table and contains city, county, and zip code.

Account_id

A unique account identifier which contains details on the customers account data, such as their total rewards points, account registration date, and login details.

By leveraging foreign keys, it is possible for a data scientist to easily derive the number of logins per user, or count the number of website registrations by city or state.
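
For instance, reusing the hypothetical customers_table and address_table described above, counting website registrations by city becomes a single join on the foreign key:

-- Hypothetical sketch: website registrations by city via the address foreign key
SELECT
    a.city,
    COUNT(DISTINCT c.customer_id) AS registrations
FROM customers_table c
JOIN address_table a
    ON a.address_id = c.address_id
GROUP BY a.city;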

The relationship that any entity has to another is called its cardinality. Cardinality is what allows analysts to understand the nature of the relationship between entities, which is essential to performing trustworthy analytics at scale. For instance, if the customer entity has a 1-to-1 relationship with the accounts entity, then we would never expect to see more than one account tied to a user or email address.
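
Cardinality assumptions like this can also be checked directly. As a minimal sketch, again using the hypothetical customers_table, a 1-to-1 relationship between customers and accounts implies the following query should return zero rows:

-- Hypothetical cardinality check: accounts tied to more than one customer
SELECT
    account_id,
    COUNT(DISTINCT customer_id) AS customer_count
FROM customers_table
GROUP BY account_id
HAVING COUNT(DISTINCT customer_id) > 1;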

These mappings can’t be done through intuition alone. The process of determining the ideal set of entities, their dimensions, foreign keys, and cardinality produces the conceptual data model, while the outcome of converting this semantic map into tables, columns, and indices which can be queried through analytical languages like SQL is the physical data model. Both practices taken together represent the process of data modeling. It is only through rigorous data modeling that a Data Warehouse can be created.

The original meanings of Data Warehousing and Data Modeling are both essential to understand in order to grasp why the data ecosystem today is in such disrepair. But before we get to the modern era, let's understand a bit more about how Warehouses were used in their heyday.

The Pre-Modern Era

Data Warehouses predate the popularity of the cloud, the rise of software, and even the proliferation of the Internet. During the pre-internet era, setting up a data warehouse involved a specialized implementation within an organization’s internal network. Companies needed dedicated hardware and storage systems due to the substantial volumes of data that data warehouses were designed to handle. High-performance servers were deployed to host the data warehouse environment, wherein the data itself was managed using Relational Database Management Systems (RDBMS) like Oracle, IBM DB2, or Microsoft SQL Server.

The use cases that drove the need for Data Warehouses were typically operational in nature. Retailers could examine customer buying behavior, seasonal foot traffic patterns, and buying preferences between franchises in order to create more robust inventory management. Manufacturers could identify bottlenecks in their supply chain and track production schedules. Ford famously saved over $1 billion by leveraging a Data Warehouse along with Kaizen-based process improvements to streamline operations with better data. Airlines leveraged Data Warehouses to plan their optimal flight routes, departure times, and crew staffing sizes.

However, creating Data Warehouses was not a cheap process. Specialists needed to be hired to design, construct, and manage the implementations of ERDs, data models, and ultimately the Warehouse itself. Software engineering teams needed to work in tight coordination with data leadership in order to coordinate between OLTP and OLAP systems. Expensive ETL tools like Informatica, Microsoft SQL Server Integration Services (SSIS), Talend, and more required experts to implement and operate. All in all, the transition to a functioning Warehouse could take several years, millions of dollars in technology spend, and dozens of specialized headcount.

The supervisors of this process were called Data Architects. Data Architects were multidisciplinary engineers with computer science backgrounds and a specialty in data management. The architect would design the initial data model, build the implementation roadmap, buy and onboard the tooling, communicate the roadmap to stakeholders, and manage the governance and continuous improvement of the Warehouse over time. They served as bottlenecks to the data, ensuring their stakeholders and business users were always receiving timely, reliable data that was mapped to a clear business need. This was a best-in-class model for a while, but then things started to change…

Software Eats the World

In 2011, venture capitalist Marc Andreessen, founder of the legendary VC firm Andreessen Horowitz, wrote an essay titled “Why Software Is Eating the World,” published in The Wall Street Journal. In the essay, Andreessen explained the rapid and transformative impact that software and technology were having across various industries, best exemplified by the following quote:

Software programming tools and Internet-based services make it easy to launch new global software-powered start-ups in many industries — without the need to invest in new infrastructure and train new employees. In 2000, when my partner Ben Horowitz was CEO of the first cloud computing company, Loudcloud, the cost of a customer running a basic Internet application was approximately $150,000 a month. Running that same application today in Amazon’s cloud costs about $1,500 a month.

Marc Andreessen

The 2000s marked a period of incredible change in the business sector. After the dotcom bubble of the late 90s, new global superpowers had emerged in the form of high-margin, low-cost internet startups with a mind-boggling pace of technology innovation and growth. Sergey Brin and Larry Page had grown Google from a search engine operating out of a Stanford dorm room in 1998 to a global advertising behemoth with a market capitalization of over $23 billion by the late 2000s. Amazon had all but replaced Walmart as the dominant force in commerce, Netflix had killed Blockbuster, and Facebook had grown from a college people-search app to $153 million a year in revenue in only three years.

One of the most important internal changes caused by the rise of software companies was the propagation of AGILE. AGILE was a software development methodology popularized by the consultancy Thoughtworks. Up until this point, software releases were managed sequentially and typically required teams to design, build, test, and deploy entire products end-to-end. The waterfall model was similar to movie releases, where the customer gets a complete product that has been thoroughly validated and gone through rigorous QA. However, AGILE was different. Instead of waiting for an entire product to be ready to ship, releases were managed far more iteratively with a heavy focus on rapid change management and short customer feedback loops.

Companies like Google, Facebook, and Amazon were all early adopters of AGILE. Mark Zuckerberg once famously said that Facebook’s development philosophy was to ‘move fast and break things.’ This speed of deployment allowed AGILE companies to rapidly pivot their businesses closer and closer to what customers were most likely to pay for. The alignment of customer needs and product features achieved a sort of nirvana referred to by venture capitalists as ‘product market fit.’ What took traditional businesses decades to achieve, internet companies could achieve in only a few years.

With AGILE as the focal point of software oriented businesses, the common organizational structure began to evolve. The software engineer became the focus of the R&D department, which was renamed to Product for the sake of brevity and accuracy. New roles began to emerge which played support to the software engineer: UX designers who specialized in web and application design; product managers who combined project management with product strategy and business positioning to help create the roadmap; and analysts and data scientists who collected logs emitted by applications to determine core business metrics like sign-up rates, churn, and product usage. Teams became smaller and more specialized, which allowed engineers to ship code even faster than before.

With the technical divide growing, traditional offline businesses were beginning to feel pressure from the market and investors to make the transition into AGILE tech companies. The speed at which software-based businesses were growing was alarming, and there was a deep-seated concern that any competitor who adopted this mode of company building could emerge out the other end with too much of an advantage to catch. Around 2015, the term ‘digital transformation’ began to explode in popularity as top management consulting firms such as McKinsey, Deloitte, and Accenture pushed offerings to modernize the technical infrastructure of many traditionally offline companies by building apps, websites, and, most importantly for these authors, driving a move from on-premise databases to the cloud.

The biggest promise of the cloud was one of cost savings and speed. Companies like Amazon (AWS), Microsoft (Azure), and Google (GCP) removed the need to maintain physical servers which eliminated millions of dollars in infrastructure and human capital overhead. This fit within the AGILE paradigm, which allowed companies to move far faster by offloading complexity to a service provider. Companies like McDonalds, Coca-Cola, Unilever, General Electric, Capital One, Disney, and Delta Airlines are all examples of entrenched Fortune 500 businesses that made massive investments in order to digitally transform and ultimately transition to the cloud.

A Move Towards Microservices

In the early 2000s, by far the most common software architectural style was the monolith. A monolithic architecture is a traditional software design approach where an application is built as a single, interconnected unit, with tightly integrated components and a unified database. All modules are part of the same codebase, and the application is deployed as a whole. Due to the highly coupled nature of the codebase, it is challenging for developers to maintain monoliths as they become larger, leading to a significant slowdown in shipping velocity. In 2011 (the same year as Marc Andreessen’s iconic Wall Street Journal article), a new paradigm was introduced: microservices.

Microservices are an architectural pattern that changes how applications are structured by decomposing them into a collection of loosely connected services that communicate through lightweight protocols. A key objective of this approach is to enable the development and deployment of services by individual teams, free from dependencies on others. By minimizing interdependencies within the code base, developers can evolve their services with minimal constraints. As a result, organizations scale easily, integrate with off-the-shelf tooling as and when it’s needed, and organize their engineering teams around service ownership. The term Microservice was first introduced by Thoughtworks consultants and gained prominence through Martin Fowler’s influential blog.

Software companies excitedly transitioned to microservices in order to further decouple their infrastructure and increase development velocity. Companies like Amazon, Netflix, Twitter, and Uber were some of the fastest growing businesses leveraging the architectural pattern, and the impact on scalability was immediate.

Our services are built around microservices. A microservices-based architecture, where software components are decoupled and independently deployable, is highly adaptable to changes, highly scalable, and fault-tolerant. It enables continuous deployment and frequent experimentation.

Werner Vogels, CTO of Amazon

Data Architecture in Disrepair

In the exciting world of microservices, the cloud, and AGILE, there was one silent victim: the data architect. The data architect’s original role was to be a bottleneck, designing ERDs, controlling access to the flow of data, designing OLTP and OLAP systems, and acting as a vendor for a centralized source of truth. In the new world of decentralization, siloed software engineering teams, and high development velocity, the data architect was (rightly) perceived as a barrier to speed.

The years it took to implement a functional data architecture were far too long. Spending months creating the perfect ERD was too slow. The data architect lost control: they could no longer dictate to software engineers how to design their databases, and without a central data model from which to operate, the upstream conceptual and physical data models fell out of sync. Data needed to move fast, and product teams didn’t want a third party providing well-curated data on a silver platter. Product teams were more interested in building MVPs, experimenting, and iterating until they found the most useful answer. The data didn’t need to be perfect, at least not to begin with.

In order to facilitate more rapid data access from day one, the data lake pattern emerged. Data lakes are centralized repositories where structured, semi-structured, and unstructured data can be stored at any scale for little cost. Examples of data lakes are Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. Analytics and data science teams can either extract data from the lake, or analyze and query it directly with frameworks like Apache Spark or SaaS vendors such as Looker or Tableau.

While data lakes were effective in the short term, they lacked the well-defined schemas required by OLAP databases. Ultimately, this prevented the data from being used in a more structured way by the broader organization. Analytical databases like Redshift, BigQuery, and Snowflake emerged, where data teams could begin to do more complex analysis over longer periods of time. Data was pulled from source systems into the data lake and analytical environments so that data teams had access to fresh data when and where it was needed.

Businesses felt they needed data engineers who built pipelines that moved data between OLTP systems, data lakes, and analytical environments more than they needed data architects and their more rigid, inflexible design structures.

With the elimination of the data architect came the gradual phasing out of the Data Warehouse. In many businesses, Data Warehouses exist in name only. There is no cohesive data model, no clearly defined entities, and what does exist certainly does not act as a comprehensive integration layer that reflects business units nearly 1:1 in code. Data Engineers do their best these days to define common business units in their analytical environment, but they are pressured from both sides - data consumers and data producers. The former are always pushing to go faster and ‘break things’ while the latter operate in silos, rarely aware of where the data they produce is going or how it is being used.

There is likely no way to put the genie back in the bottle. AGILE, software oriented businesses, and microservices add real value, and allow businesses to experiment faster and more effectively than their slower counterparts. What is required now is an evolution of both the Data Architecture role and the Data Warehouse itself. A solution built for modern times, delivering incremental value where it matters.

Rise of the Modern Data Stack

The term ‘Modern Data Stack’ (MDS) refers to a suite of tools that are designed for cloud based data architectures. These tools overlap with their offline counterparts but in most cases are more componentized and expand to a variety of additional use cases. The Modern Data Stack is a hugely important part of a startup’s toolkit. Because new companies are cloud native - meaning they were designed in the cloud from day one - performing analytics or machine learning at any scale will eventually require adopting some or all of the Modern Data Stack.

The Big Players

Snowflake is the largest and most popular vendor in the Modern Data Stack, often cited as the company that first established the term. A cloud oriented analytical database, Snowflake went public with its initial public offering (IPO) on September 16, 2020. During its IPO, the company was valued at around $33.3 billion, making it one of the largest software IPOs in history. Its cloud-native architecture, which separates compute from storage, offers significant scalability, eliminating hardware limitations and manual tuning. This ensures high performance even under heavy workloads, with dynamic resource allocation for each query. Snowflake could save compute-heavy companies hundreds of thousands to millions of dollars, while providing a slew of integrations with other data products.

There are other popular alternatives to Snowflake like Google BigQuery, Amazon Redshift, and Databricks. While Snowflake has effectively cornered the market on cloud-based SQL transformations, Databricks provides the most complete environment for unified analytics, built on top of the world’s most popular open source large-scale distributed data processing framework: Spark. With Databricks, data scientists and analysts can easily manage their Spark clusters, generate visualizations using languages like Python or Scala, train and deploy ML models, and much more. Snowflake and Databricks have entered a certifiable data arms race, their competition for supremacy ushering in a new wave of data-oriented startups over the course of the late 2010s and early 2020s.

The analytical databases, however, are not alone. The Extract, Load, and Transform (ELT) tool Fivetran has gained widespread recognition as well. While not publicly traded as of 2023, Fivetran’s impact on the modern data landscape remains notable. Thanks to a user-friendly interface and a collection of pre-built connectors, Fivetran allows data engineers to connect directly to data sources and destinations, which organizations can quickly leverage to extract and load data from databases, applications, and APIs. Fivetran has become the de facto mechanism for early stage companies to move data between sources and destinations.

Short for data build tool, dbt is one of the fastest growing open source components of the Modern Data Stack. With modular transformations driven by YAML, dbt provides a CLI that allows data and analytics engineers to create transformations while leveraging a software engineering oriented workflow. The hosted version of dbt extends the product from just transformations to a YAML-based metrics layer that allows data teams to define and store facts and their metadata, which can be leveraged in experimentation and analysis downstream.
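
To give a flavor of that workflow, a dbt model is simply a SQL SELECT statement saved to a file; a minimal, hypothetical sketch might look like the following (the model and source names are assumptions, and a companion YAML file would add documentation and tests):

-- models/monthly_active_customers.sql (hypothetical dbt model)
-- dbt resolves ref() to the upstream model and manages the dependency graph
SELECT
    customer_id,
    COUNT(*) AS visit_count
FROM {{ ref('stg_visits') }}
WHERE visit_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY customer_id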

This is but a sampling of the tooling in the MDS. Dozens of companies and categories have emerged over the past decade, ranging from orchestration systems such as Airflow and Dagster, data observability tooling such as Monte Carlo and Anomalo, and cloud-native data catalogs like Atlan and Select Star, to metrics repositories, feature stores, experimentation platforms, and on and on. (Disclaimer: these tools are all startups formed during the COVID valuation boom - some or all of them may not be around by the time you read this book!)

The velocity of data tooling has accelerated in recent years primarily due to the simplicity of the integrations most tools have with the most dominant analytical databases in the space: Snowflake, BigQuery, Redshift, and Databricks. These tools all provide developer friendly APIs and expose well structured metadata which can be accessed and leveraged to perform analysis, write transformations, or otherwise be queried.

Rapid Growth

The Modern Data Stack grew rapidly in the ten years between 2012-2022. This was the time teams began transitioning from on-prem only applications to the cloud, and it was sensible for their data environments to follow shortly after. After adopting the core tooling like S3 for data lakes and an analytical data environment such as Snowflake or Redshift, businesses realized they lacked core functionality in data movement, data transformation, data governance, and data quality. Companies needed to replicate their old workflows in a new environment, which led data teams to rapidly acquire a suite of tools in order to make all the pieces work smoothly.

Other internal factors contributed to the acquisition of new tools as well. IT teams, which were most commonly responsible for procurement, began to be supplanted with the rise of Product Led Growth (PLG). The main mode of selling software for the previous decade was top-down sales. Salespeople would get into rooms with important business executives, walk through a lengthy slide deck that laid out the value proposition of their platform and its pricing, and then work on a proof of concept (POC) over a period of many months in order to prove the value of the software. This process was slow, required significant sign-off from multiple stakeholders across the organization, and ultimately led to much more expensive day-one platform fees. PLG changed that.

Product Led Growth is a sales process that allows the ‘product to do the talking.’ SaaS vendors would make their products free to try, or low cost enough that teams could independently manage a sign-off without looping in an IT team. This allowed teams to get hands-on with the product, integrate it with their day-to-day workflows, and see if the tool solved the problems they had without a big initial investment. It was good for the vendors as well.

Data infrastructure companies were often funded by venture capital firms due to the high upfront R&D required. In the early days of a startup, venture capitalists tend to put more weight on customer growth over revenue. This is because high usage implies product market fit, and getting a great set of “logos” (customers) implies that if advanced, well known businesses were willing to take a risk on an early, unknown product then many other companies would be willing to do the same. By making it far easier for individual teams to use the tool for free or cheap, vendors could radically increase the number of early adopters and take credit for onboarding well known businesses.

In addition to changing procurement methodologies and sales processes, the resource allocation for data organizations grew substantially over the last decade. As Data Science evolved from a fledgling discipline into a multi-billion dollar per year category, businesses began to invest more than ever into headcount for scientists, researchers, data engineers, analytics engineers, analysts, managers and CDOs.

Finally, early and growth stage data companies were suddenly venture capital darlings after the massive Snowflake initial public offering. Data went from an interesting nice-to-have to an indisputable long-term opportunity. Thanks to low interest rates and a massive economic boost to tech companies during COVID, billions of dollars were poured into data startups, resulting in an explosion of vendors across all parts of the stack. Data technology was so in demand that it wasn’t unheard of for a firm to invest in multiple companies that might be in competition with one another!

Problems in Paradise

Despite all the excitement for the Modern Data Stack, noticeable cracks began to emerge over time. Teams that were beginning to reach scale were complaining: the amount of tech debt was growing rapidly, pipelines were breaking, data teams weren’t able to find the data they needed, and analysts were spending the majority of their time searching for and validating data instead of putting it to good use on ROI-generating data products. So what happened?

First, software engineering teams were no longer engaging in proper business-wide data modeling or entity relationship design. This meant that there was no single source of truth for the business. Data was replicated across the cloud in many different microservices. Without data architects serving as the stewards of the data, there was nothing to prevent unique implementations of the same concept, repeated dozens of times, each with data that might yield opposite results!

Second, data producers had no relationship with data consumers. Because it was easier and faster to dump data into a data lake than to build explicit interfaces for specific consumers and use cases, software engineers threw their data over the fence to data engineers, whose job it was to rapidly construct pipelines with tools like Airflow, dbt, and Fivetran. While these tools completed the job quickly, they also created significant distance between the production and analytical environments. Making a change in a production database had no guardrails. There was no information about who was using that data (if it was being used at all), where the data was flowing, why it was important, or what expectations on the data were essential to its function.

Third, data consumers began to lose trust in the data. When a data asset changed upstream, downstream consumers were forced to bear the cost of that change on their own. Generally that meant adding a filter on top of their existing SQL query in order to account for the issue. For example, suppose an analyst wrote a query intended to answer the question “How many active customers does the company have this month?” The definition of active may be derived from the visits table, which records information every time a user opens the application. The BI team may then decide that a single visit is not enough to capture the intent behind the word ‘active’: a user may be checking notifications but not actually using the platform, so the minimum visit count is set to 3, as in the following query.

WITH visit_counts AS (
    SELECT
        customer_id,
        COUNT(*) AS visit_count
    FROM
        visits
    WHERE
        DATE_FORMAT(visit_date, '%Y-%m') =
        DATE_FORMAT(CURDATE(), '%Y-%m')
    GROUP BY
        customer_id
)
SELECT
    COUNT(DISTINCT customers.customer_id) AS active_customers
FROM
    customers
LEFT JOIN
    visit_counts ON
    visit_counts.customer_id = customers.customer_id
WHERE
    COALESCE(visit_counts.visit_count, 0) >= 3;

However, over time, changes upstream and downstream impact the evolution of this query in subtle ways. The software engineering team decides to distinguish between visits and impressions. An impression is ANY application open to any new or previous screen, whereas a visit is defined as a period of activity lasting at least 10 seconds. Before, all ‘visits’ were counted towards the active customer count. Now some percentage of those visits would be logged as impressions. To account for this, the analyst writes a CASE WHEN statement that encodes the new impression logic, then sums the total number of impressions and visits to effectively get the same answer as their previous query using the updated data.

WITH impressions_counts AS (
    SELECT
        customer_id,
        -- Reconstruct the old 'visit' definition from the new event model
        SUM(CASE WHEN duration_seconds >= 10 THEN 1 ELSE 0 END) AS visit_count,
        SUM(CASE WHEN duration_seconds < 10 THEN 1 ELSE 0 END) AS impression_count
    FROM
        impressions
    WHERE
        DATE_FORMAT(impression_date, '%Y-%m') =
        DATE_FORMAT(CURDATE(), '%Y-%m')
    GROUP BY
        customer_id
    HAVING
        (visit_count + impression_count) >= 3
)
SELECT
    COUNT(DISTINCT customers.customer_id) AS active_customers
FROM
    customers
LEFT JOIN
    impressions_counts ON
    impressions_counts.customer_id = customers.customer_id
WHERE
    -- Visits and impressions together must meet the old threshold of 3
    COALESCE(impressions_counts.visit_count, 0) +
    COALESCE(impressions_counts.impression_count, 0) >= 3;

The more the upstream changes, the longer these queries become. All the context about why CASE statements or WHERE clauses exist is lost. When new data developers join the company and go looking for existing definitions of common business concepts, they are often shocked at the complexity of the queries being written and cannot interpret the layers of tech debt that have crusted over in the analytical environment. Because these queries are not easily parsed or understood, data teams have to review directly with software engineers what the data coming from source systems means, why it was designed a particular way, and how to JOIN it with other core entities. Teams would then reinvent the wheel for their own purposes, leading to duplication and growing complexity, beginning the cycle anew.

Fourth, the costs of data tools began to spiral out of control. Many of the MDS vendors use usage-based pricing - essentially, ‘pay for what you use.’ Usage-based pricing is a great model when you can reasonably control and scale your usage of a product over time. However, the pricing model becomes punishing when the growth of a service snowballs outside the control of its primary managers. As queries in the analytical environment became increasingly complex, the cloud bill grew exponentially to match. The increased data volumes resulted in higher bills from all types of MDS tools - each charging usage-based rates that continued to skyrocket.

Almost overnight, the data team was larger than it had ever been, more expensive than it had ever been, more complicated than it had ever been, and delivering less business value than it ever had.

The Shift to Data-centric AI

While there are earlier papers on arXiv mentioning “data-centric AI,” its widespread acceptance was driven by Dr. Andrew Ng’s and DeepLearningAI’s 2021 campaign advocating for the approach. In short, data-centric AI is the process of increasing machine learning model performance by systematically improving the quality of training data, either in the collection or preprocessing phase. This is in contrast to model-centric AI, which relies on further tuning an ML model, increasing the cloud computing power, or utilizing an updated model to increase performance. Through the work of his AI lab and conversations with industry peers, Ng noticed a pattern where data-centric approaches vastly outperformed model-centric approaches. We highly encourage you to watch Ng’s webinar linked in the further resources section, but two key examples from Ng best encapsulate why the data industry is shifting towards a data-centric AI approach.

First, Andrew Ng highlights how underlying data impacts fitting ML models for the following conditions represented in Figure 1-3:

Small Data, High Noise:

Results in poor-performing models, as numerous best-fit lines could be applied to the data and thus diminish the ability for the model to predict values. This often requires ML practitioners to go back and collect more data or remediate the data collection process to have more consistent data.

Big Data, High Noise:

Results in an ML model being able to find the general pattern, where practitioners utilizing a model-centric approach can realize gains by tuning the ML model to account for the noise. Though ML practitioners can reach an acceptable prediction level via model-centric approaches, Ng argues that for many use cases, there is better ROI in taking the time to understand why training data is so noisy.

Small Data, Small Noise:

Represents the data-centric approach where high-quality and curated data results in ML models being able to easily find patterns for prediction. Such methods require an iterative approach to improving the systems in which data is collected, labeled, and preprocessed before model training.

We encourage you to try out the accompanying Python code in Example 1-3 to get an intuitive understanding of how noise (i.e. data quality issues) can impact an ML model’s ability to identify patterns.

Figure 1-3. The impact of the varying levels of data volume and noise on predictability.
Example 1-3. Code to generate data volume and noise graphs in Figure 1-3.
import numpy as np
import matplotlib.pyplot as plt


def generate_exponential_data(min_x, max_x, num_points, noise):
    # Generate an exponential curve and add Gaussian noise to simulate
    # data quality issues of varying severity.
    x_data = np.linspace(min_x, max_x, num_points)
    y_data = np.exp(x_data * 2)
    y_noise = np.random.normal(loc=0.0, scale=noise, size=x_data.shape)
    return x_data, y_data + y_noise


def plot_curved_line_example(min_x, max_x, num_points, noise, plot_title):
    # Fix the random seed so each scenario is reproducible.
    np.random.seed(10)
    x_data, y_data = generate_exponential_data(min_x, max_x, num_points, noise)
    plt.scatter(x_data, y_data)
    plt.title(plot_title)
    plt.show()


# Each scenario varies the data volume (num_points) and the noise level.
example_params = {
    'small_data_high_noise': {
        'num_points': 100,
        'noise': 25.0,
        'plot_title': 'Small Data, High Noise (100 Points)'
    },
    'big_data_high_noise': {
        'num_points': 1000,
        'noise': 25.0,
        'plot_title': 'Big Data, High Noise (1000 Points)'
    },
    'small_data_low_noise': {
        'num_points': 100,
        'noise': 1.0,
        'plot_title': 'Small Data, Small Noise (100 Points)'
    },
    # 'UPDATE_THIS_EXAMPLE': {
    #     'num_points': 1,
    #     'noise': 1.0,
    #     'plot_title': 'Your Example'
    # }
}

for scenario, params in example_params.items():
    plot_curved_line_example(
        min_x=0,
        max_x=2.5,
        num_points=params['num_points'],
        noise=params['noise'],
        plot_title=params['plot_title']
    )

Second, Andrew Ng provided the analogy of comparing an ML engineer to a chef. The colloquial understanding among ML practitioners is that 80% of your time is spent preparing and cleaning data, while the remaining 20% is actually training your ML model. Similarly for a chef, 80% of their time is spent sourcing and preparing ingredients for mise en place, while the remaining 20% is actually cooking the food. Though a chef can improve the food substantially by improving cooking techniques, the chef can also improve the food by sourcing better ingredients, which is arguably easier than mastering cooking techniques. The same holds true for ML practitioners, as they can improve their models via tuning (the model-centric approach) or improve the underlying data during the collection, labeling, and preprocessing stages (the data-centric approach). Furthermore, Ng found that for the same amount of effort, teams that leveraged a data-centric approach produced better-performing models than teams using model-centric approaches.

Diminishing ROI of Improving ML Models

Incrementally improving machine learning models follows the Pareto Principle, where 80% of the gains in improving the model itself are achieved through the first 20% of the effort. With a model-centric approach, each additional improvement, such as going from 93% to 95% accuracy, requires exponentially more effort.

Figure 1-4. The Pareto Principle when tuning ML models.

Furthermore, taking a model-centric approach to AI often requires a substantial amount of data, which is why big tech SaaS companies, with their access to massive amounts of weblogs, have been the first successful adopters of machine learning at scale. Ng argues that as AI branches outside of these big tech domains into fields such as healthcare and manufacturing, ML practitioners are going to need to adopt methods that account for having access to substantially less data. For example, a single person generates roughly 80MB of healthcare imaging and medical record data a year, compared to the roughly 50GB of data a single user generates per month through their browsing activity.

In addition, Ng argues that even in big data use cases, ML practitioners still need to wrestle with the challenges of small data. Specifically, after ML models are tuned on large datasets, further gains come from accounting for the long tail of use cases, which is ultimately a small data problem as well. Taking a data-centric AI approach to these long-tail problems within big data can provide more gains with substantially less effort than optimizing the ML model.

Commoditization of Data Science Workflows

While machine learning and AI have been in development for decades, it wasn’t until around 2010 that the practice gained widespread adoption within industry. This is apparent in the number of ML vendors growing from roughly 5 companies in 2014 to more than 200 companies in 2023, as illustrated in Matt Turck’s yearly data vendor landscapes in Figure 1-5.

Figure 1-5. ML vendor landscape growth in a decade, as presented by Matt Turck.

Furthermore, as the data industry matured, less emphasis has been placed on developing ML models and the focus has instead turned to putting ML models into production. Early data science teams could get by with a few STEM PhDs toiling away in Jupyter notebooks for R&D purposes, or sticking to traditional statistical learning methods such as regression or random forest algorithms. Fast forward to today, and data scientists have a plethora of advanced models they can quickly download from GitHub, or they can leverage AutoML via dedicated vendors or products within their cloud provider. There are also entire open-source ecosystems such as scikit-learn and TensorFlow that have made developing ML models easier than ever before. It’s simply not enough for a data team to create ML models to drive value within an organization; the value is generated by their ability to reliably deploy machine learning models in production.

Finally, the rise of generative AI has further entrenched this trend of commoditized data science workflows. With a single API call that costs fractions of a cent, anyone can leverage some of the most powerful deep learning models ever built. For context, in 2020 Mark put an NLP model in production for an HR tech startup looking to summarize employee survey free-text responses, utilizing the spaCy library. At the time, spaCy abstracted away the need to fine-tune an NLP model, which is why it was chosen for our quick feature development cycle. If tasked with the same project today, it would be unsound not to strongly consider using a large language model (LLM) for the task, as no amount of tuning spaCy NLP models could compete with the power of LLMs. In other words, the process of developing and deploying an NLP model has been commoditized to a simple API call to OpenAI.

Data’s Rise Over ML in Creating a Competitive Advantage

In parallel with the commoditization of data science workflows, the competitive advantage of ML models in themselves is decreasing. The amount of specialized knowledge, resources, and effort necessary to train and deploy an ML model in production is significantly lower than it was even five years ago. This lower barrier to entry means that the realized gains of machine learning are no longer relegated to big tech and advanced startups. Especially among traditional businesses outside of tech, the implementation of advanced ML models is not only possible but expected. Thus, the competitive advantage of ML models in themselves has diminished.

The best representation of this reduced competitive advantage is once again the emergence of generative AI. The development of ChatGPT, and of other generative AI models from big tech, was the culmination of decades of research, model training on expensive GPUs, and an unfathomable amount of web-based data. The requirements to develop these models were cost prohibitive for most companies, and thus the models in themselves maintained a competitive advantage - until recently. At the time of writing, open-source and academic communities have been able to replicate and release similarly powerful generative AI models within months of the releases of their closed-source counterparts.

Therefore, the way companies can maintain their competitive advantage, in a market where machine learning is heavily commoditized, is through the underlying data itself. Through a data-centric AI approach, taking the time to generate and curate high quality data assets unique to a respective business will extract the most value out of these powerful but commoditized AI models. Furthermore, the data generated or processed by a business is unique to that business and hard, if not impossible, to replicate. The winners of this new shift in our data industry won’t be the ones who can merely implement AI technology, but rather the ones who can control the quality of the data they leverage with AI.

Conclusion

In this chapter we provided historical and market context for why data quality has been deprioritized in the data industry for the past two decades. In addition, we highlighted how data quality is once again being deemed integral as we evolve beyond the Modern Data Stack era and shift towards data-centric AI. In summary, this chapter covered:

  • What data debt is and how it relates to garbage-in garbage-out

  • The death of the data warehouse and the subsequent rise of the Modern Data Stack

  • The shift from model-centric to data-centric AI

In Chapter 2, we will define data quality and how it fits within the current state of the data industry, as well as highlight how current data architecture best practices create an environment that leads to data quality issues.
