Chapter 4. An Introduction to Data Contracts
The previous three chapters focused heavily on the why of data contracts and the problem they aim to solve. Starting in this chapter, we shift to defining what exactly the data contract architecture is and what its workflow looks like. This chapter serves as the theoretical foundation before we turn to real-world implementations in Chapter 5 and guide you through implementing data contracts yourself in Chapters 6 and 7. Beyond that foundation, we also discuss the key stakeholders and workflows involved in utilizing data contracts.
Collaboration is Different in Data
Depending on who you ask, ‘collaboration’ could either refer to a nebulous executive buzzword or a piece of software that makes vague and lofty promises about improving the relationships between team members who already have quite a bit of work to do.
For the purposes of this book, our definition of technical collaboration is the following: “Collaboration refers to a form of distributed development where multiple individuals and teams, regardless of their location, can contribute to a project efficiently and effectively.”
Teams may collaborate online or offline, with or without a comprehensive working model. The goal of collaboration is to increase the quality of software development through human-in-the-loop review and continuous improvement.
In the modern technology ecosystem, there are many components to collaboration that have become part of the standard developer lifecycle and release process. Version control, which allows multiple contributors to work on a project without interfering with each other’s changes, is a fundamental component of collaboration facilitated through source code management systems like Git.
Pull requests (PRs) are another common collaborative feature that has become an essential component of DevOps. Contributors make changes in the code branches they maintain, then propose those changes to the main codebase using PRs. PRs are then reviewed by software engineers on the same team to ensure high code quality and consistency.
With automation, collaboration around the codebase provides short feedback loops between developers: bugs are mitigated prior to release, the team maintains a shared understanding of which changes are being merged to production and how they impact the codebase, and there is greater accountability for releases, which are now often made multiple times per day.
All of this is ultimately done to improve code quality. If you are an engineer, imagine what life would be like if your company wasn’t using software like GitHub, GitLab, Azure DevOps, or Bitbucket. Teams would need to make manual backups of their codebase and rely on centralized locking systems to ensure only one developer worked on a particular file at a time. Teams would also have to constantly check in with each other to plan which parts of the codebase were being worked on by whom, and all updates would be shared through email or FTP. With so much room for human error, a significant amount of time would be spent dealing with bugs and software conflicts, as illustrated in Figure 4-1.
However, effective collaboration is not only about implementing core underlying technology—it ultimately hinges on the interaction patterns between change-makers and change-approvers. These interaction patterns will differ depending on the change being made, the significance of the change, and the organizational design of the company.
GitHub, the world’s most popular platform for hosting Git repositories, was launched in 2008 by Tom Preston-Werner, Chris Wanstrath, and PJ Hyett. By 2011, GitHub was home to over 1 million code repositories, and it grew tenfold over the following two years. GitHub not only solved a significant technical challenge by creating a simple-to-use interface for Git and a host of collaboration-oriented functionality, but it also allowed leadership teams excited by the principles of Agile software development to meaningfully switch to new organizational structures that were better aligned with rapid, iterative release schedules and federation.
These organizational structures placed software engineers at the center of product teams. Product teams were composed of a core engineering team, reinforced by a variety of other roles such as product and program managers, designers, data scientists, and data engineers. Each product team acted as a unique microorganism within the company, pursuing distinct goals and objectives that, while being in line with the company’s top level initiatives, were tailored to specific application components. GitHub allowed developers on these teams to further subdivide their tasks, many of which overlapped, without consequence.
Despite federated collaboration becoming a core component of a product development team’s workflow, this same cycle of review and release is not present for data organizations, or for the software engineers who maintain data sources such as production databases, APIs, or event streams. This is primarily because software engineers operate within teams, while data flows between teams, as illustrated in Figure 4-2.
Each member of a product team has a built-in incentive to take collaboration seriously. Bad code shipped by an engineer could cause a loss of revenue or delays to the feature roadmap. Because product teams are judged on their component-level goals, each engineer is equally accountable for changes to their codebase.
However, data is a very different story. In most cases, data is stored in source systems owned by product teams which flow downstream to data developers in finance, marketing, or product.
Upstream and downstream teams have unique goals and timelines. Due to the highly iterative and experimental nature of constructing a query (or, put another way, asking and answering a question), it is often unclear at the moment a query is created whether it will be useful. This leads to a split in the way data is used and a divergence of responsibilities and ownership.
When upstream software engineers make changes to their database, those changes are reviewed by their own team but not by the downstream teams that leverage this data for a set of totally different use cases. Because of the isolated nature of product teams, upstream engineers are kept in the dark about how these changes affect others in the company, and downstream teams are given no time to collaborate and provide feedback to their data providers upstream. By the time data teams detect that something has changed, it is already too late—pipelines have been broken, dashboards display incorrect results, and the product team that made the change has already moved on.
The primary goal of data contracts is to solve this problem. Contracts are a mechanism for expanding software-oriented collaboration to data teams, bringing quality to data through human-in-the-loop review, just as the same systems facilitated code quality for product teams. The next section goes into more detail about who these stakeholders are and how they collaborate.
The Stakeholders You Will Work With
As stated in the previous section, data contracts center on collaboration in managing complex systems that leverage software and data. To collaborate well, you must first understand who you are working with. The following sections detail the various stakeholders and their roles in the data lifecycle with respect to data contracts.
The Role of Data Producers
Thus far we have talked quite a bit about the gaps between data producers and data consumers that culminate in data quality issues. But before we go further into data contracts as a solution, it is essential to understand the responsibilities of everyone involved in collecting, transforming, and using data. By understanding the incentive structures of the key players in this complex value chain of data movement, we can better identify how technology helps us address and overcome challenges in communication between each party.
Data does not materialize from thin air. It must be explicitly collected by software engineers building applications leveraged by customers or other services. We call the engineers responsible for collecting and storing this data data producers. While some roles, such as data engineers, can be both producers and consumers, and non-technical teams such as salespeople or even customers themselves can be responsible for data entry, for now we will focus primarily on software engineers, who take on most of the data production work in a company.
Most structured data can be broken down into a collection of events which correspond with actions undertaken either by some customer leveraging the application or the system itself. In general, there are two types of events which are useful to downstream consumers.
- Transactional Events:
-
Transactional events refer to individual transaction records logged when a customer or system takes some explicit action that typically (but not always) results in a change in state. An example of a transactional event might be a customer placing an order, a back-end service processing and validating that order, or the order being released and delivered to a customer’s address. Transactional events are either stored in relational database management systems (RDBMSs) or emitted using stream processing frameworks such as Apache Kafka.
- Clickstream Events:
-
Clickstream events are built to capture a user’s interactions with a web or mobile application for the purpose of tracking engagement with features. An example of a clickstream event might be a user starting a new web session, a customer adding an item to their cart, or a shopper conducting a web search. There are many tools available to capture clickstream events such as Amplitude, Google Analytics, or Segment.
Clickstream events are typically limited in their shape and contents by the software development kits (SDKs) provided by third-party companies to emit, collect, and analyze behavioral data. These events are often stored in the vendor’s cloud storage environment, where product managers and data scientists can conduct funnel analysis, perform A/B tests, and segment users into cohorts to see behavioral trends across groups. Clickstream data can then be loaded from these third-party providers into first-party analytical databases such as Snowflake, Redshift, or BigQuery.
Transactional events depend entirely on the role of the application and can vary wildly in terms of their schema, data contents, and complexity. For example, in some banks, transactional events may contain hundreds or even thousands of columns with detailed information about the transaction itself, the account, the payer, the payee, the branch, the transaction channel, security information (heavily leveraged by fraud detection models), any fees or charges, and much more. In many cases, there is so much data recorded in transactional events that much of the schema goes unused or is simply not leveraged in a meaningful way by the service.
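To make the two event types concrete, the following is a minimal, hypothetical sketch of each as a Python record; the field names and values are our own illustrations rather than a standard schema.

# Hypothetical examples of the two event types; field names and values are
# illustrative only, not a standard schema.

# A transactional event: records a state change in the application, typically
# persisted to an RDBMS table or emitted to a stream.
order_placed = {
    "event_type": "order_placed",
    "order_id": "ord_10293",
    "customer_id": "cust_551",
    "amount": 84.99,
    "currency": "USD",
    "payment_method": "credit_card",
    "created_at": "2024-03-01T14:22:05Z",
}

# A clickstream event: records a user interaction, usually captured by a
# vendor SDK (a Segment- or Amplitude-style payload) for product analytics.
product_added = {
    "event": "Product Added",
    "anonymous_id": "a1b2c3",
    "user_id": "cust_551",
    "properties": {"product_id": "sku_789", "cart_id": "cart_42", "price": 84.99},
    "context": {"page": "/products/sku_789", "device": "mobile"},
    "timestamp": "2024-03-01T14:21:58Z",
}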
Because transactional events record data that is seen as relevant only to the functioning of the application owned by the engineering team (as opposed to clickstream events, which are designed specifically for product analytics), it is often said that transactional data is a by-product of the application. Data producers treat their relational databases as an extension of their applications, and will generally modify them using continuous integration and continuous deployment (CI/CD) processes similar to their typical software release process.
In the modern data environment, the data producer’s role stops once the data has been collected, processed, and stored in an accessible format. In the past, the responsibility of the data producer was also heavily influenced by design and data architecture. However, in federated product environments this function is becoming increasingly rare and is often seen as unneeded. It is easier and faster for the data producer to collect whatever data they would like from the application, store it all in their database in the format most beneficial for their application, and then make the data accessible to data consumers so that they might leverage it for analysis.
To be clear, this is not a statement on what we think is right or wrong, but simply how a significant number of data producers in technologically advanced organizations actually behave.
The Role of Data Consumers
A data consumer is a company employee who has access to data provided by a data producer and leverages that data to construct pipelines, answer questions, or build machine learning models. Data consumers are split into two categories: technical and non-technical.
Technical data consumers are what you might call data developers: engineers or analysts who leverage data-oriented programming languages like SQL, Spark, Python, and R either to analyze data in large volumes in order to discover trends and insights, or to construct pipelines that allow data to flow through a series of scheduled transformations before being leveraged for decision making. Technical data consumers deal with the nitty-gritty of data management: data discovery, query generation, documentation, data cleaning, data validation, and much more.
Non-technical data consumers are often business users or product managers who lack the technical ability to construct queries themselves (or can only do so on pre-cleaned and validated data). This category of user typically asks the questions and ultimately holds the decision-making power over which action to take depending on what the data shows.
For the remainder of this chapter, we’ll focus on the technical data consumer, though we’ll return to the role non-technical teams play later. There are many different types of technical data consumers; Table 4-1 illustrates where each fits into the data lifecycle:
- Data Engineers:
-
Data engineers are most commonly known as software engineers with a data specialty. They focus on extracting data from source systems, moving data into cheap file storage (such as data lakes or delta lakes) and eventually into analytical databases, cleaning and formatting the incoming data for proper use, and ensuring pipelines are orchestrated properly. Data engineers form the backbone of most data teams and are responsible for ensuring the entire business gets up-to-date, high-quality data on a schedule.
- Data Scientists:
-
Data science was once billed as one of the sexiest jobs in the world; data scientists are computer engineers with a greater focus on statistics and machine learning. Data scientists are most frequently associated with creating features to be leveraged by ML models, conducting and evaluating hypothesis tests, and developing predictive models to better forecast the success of key business objectives.
- Data Analyst:
-
Data analysts are SQL specialists. They primarily leverage SQL to construct queries that answer business questions. Analysts are on the front lines of query composition: it is their responsibility to understand where data is flowing from, whether it is trustworthy, and what it means.
- Analytics Engineers:
-
Analytics engineering is a relatively new discipline that revolves around applying software development best practices to analytics. Analytics engineers (or BI engineers) leverage data modeling techniques and tooling like dbt (data build tool) to construct well-defined data models that can be leveraged by analysts and data scientists.
- Data Platform Engineer:
-
Even newer than analytics engineers, data platform engineers are responsible for the implementation, adoption, and ongoing growth of the data infrastructure and core datasets. Platform engineers often have data engineering or software engineering backgrounds, and are mainly concerned with selecting the right analytical databases, streaming solutions, data catalogs, and orchestration systems for a company’s needs.
Data Consumer Role | Data Infra Management | Data Ingestion | Transaction Data Operations | Data Replication | Analytical Data Operations | Analytics, Insights, and Dashboards | ML Model Building | Actioning on Insights and Predictions |
---|---|---|---|---|---|---|---|---|
Data Platform Engineer | X | X | X | X | X | |||
Data Engineer | X | X | X | X | ||||
Analytics Engineer | X | X | ||||||
Data Analyst | X | |||||||
Data Scientist | X | X | ||||||
Internal Business Stakeholder | X | |||||||
External User of Data Product | X |
Most technical data consumers follow a similar workflow when beginning a new project:
-
Define or receive requirements from non-technical consumers
-
Attempt to understand what data is available and where it comes from
-
Investigate if data is clearly understood and trustworthy
-
If it is not, try to locate the source of the data and set up time with data producers to better understand what the data means and how it is instrumented in the application
-
Validate the data
-
Create a query (and potentially a more comprehensive pipeline) that leverages valuable data assets
Data consumers access data through a variety of channels depending on their skill set and background. Data engineers extract data in batch from multiple sources using open source ELT technologies like Airbyte, or closed source tooling such as Fivetran. Data engineers can also build infrastructure that moves data in real time. The most common example of this is change data capture (CDC). CDC captures record-level changes in upstream source systems and pushes each record into an analytical environment, most commonly using streaming technologies such as Apache Kafka or Redpanda. Once data arrives in the analytical environment, data engineers then clean, structure, and transform it into key business objects that can be leveraged by other consumers downstream.
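To make the CDC flow concrete, the following is a minimal sketch (in Python, standard library only) of flattening a Debezium-style change event into a row for the analytical environment; the message shape, table, and field names are simplified assumptions rather than the exact format any particular tool emits.

import json

# A simplified, Debezium-style change event as it might arrive on a Kafka topic
# for an upstream "orders" table (the shape and field names are illustrative).
raw_message = """
{
  "op": "u",
  "before": {"order_id": 10293, "status": "pending"},
  "after":  {"order_id": 10293, "status": "shipped"},
  "source": {"table": "orders", "ts_ms": 1709302925000}
}
"""

def to_analytical_row(message):
    """Flatten a change event into a row suitable for an analytical table."""
    event = json.loads(message)
    row = dict(event["after"] or {})      # post-change state (None on deletes)
    row["_op"] = event["op"]              # c = create, u = update, d = delete
    row["_source_table"] = event["source"]["table"]
    row["_changed_at_ms"] = event["source"]["ts_ms"]
    return row

print(to_analytical_row(raw_message))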
Data Scientists, Analytics Engineers, and Analysts typically deal with data after it has already been processed by data engineers. Depending on the use case, their most common activities might include dimensional modeling and creating data marts, building machine learning models, or constructing views which can then be used to power dashboards or reporting.
The output of most data consumers’ work is queries. Queries are code, written in a language designed for either analytical or transactional databases, most commonly SQL, though Python is also a popular data science alternative. Depending on the complexity of the operation, queries can range from simple 5-10 line code blocks to incredibly complex files with hundreds or even thousands of lines full of CASE WHEN statements. As these queries are leveraged to answer business questions, they might be extended, replicated, or otherwise modified in a variety of ways.
The Impact of Producers and Consumers
As you might imagine, data producers play a large role in the day-to-day work of data consumers. Data producers control the creation of, and access to, source data. Source data represents something as close to the ground truth as possible, given that it is collected directly from applications. Data consumers often prefer to use upstream data because re-leveraging queries built by other data consumers can be challenging. The meaning of a query is ultimately subjective: a data scientist may have an opinion on what makes an active customer ‘active’ or a lost order ‘lost.’ These opinions are often baked into the query code with very little explanation or documentation provided. Depending on the length and complexity of the query, it might be almost impossible for other data scientists to fully understand what their fellow data team member meant.
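As a hypothetical illustration of how such opinions get baked into query code, consider a query in which one analyst’s definition of an “active” customer lives only inside a CASE WHEN; the table and column names here are invented for illustration.

# Hypothetical query: the definition of "active" (an order in the last 30 days
# OR a login in the last 7) exists only here, undocumented, so anyone reusing
# this query inherits an opinion they may not share.
ACTIVE_CUSTOMERS_SQL = """
SELECT
    customer_id,
    CASE
        WHEN last_order_date >= CURRENT_DATE - INTERVAL '30 days' THEN 'active'
        WHEN last_login_date >= CURRENT_DATE - INTERVAL '7 days'  THEN 'active'
        ELSE 'inactive'
    END AS customer_status
FROM customer_summary
"""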
Note
This problem is exacerbated over time. Data consumers regularly leave their place of employment and take the knowledge of their code, and the institutional knowledge it represents, along with them.
For these reasons, data consumers love going to the source, in the same way that it’s always easier to get information about a particular person directly from that individual than to rely on rumors and hearsay. However, going to the source can be challenging, because it requires understanding the lineage of your data ecosystem. Lineage refers to the spider web of connections that tie data assets to each other within an analytical environment. The older and more dense the environment, the more challenging it can be to track lineage across nodes in the graph. The lineage graph also creates an additional problem: change management.
As data producers update their application, they regularly make changes to its code. Software changes may or may not affect data structures, such as the schema or contents of data-generating objects like transactional events, logs, or analytical events. Because there is no baseline that sets the expected state of these data objects, data producers effectively make their changes in the dark, as illustrated in Figure 4-3. While integration testing can help catch integration errors in the code itself, and production management software like LaunchDarkly (feature management) or Datadog (observability) can detect or prevent issues from degrading the customer experience, this quality assurance applies only to the application layer, NOT the data layer, where data consumers do their work. The data layer is hidden behind the lineage graph. Such convoluted and tangled data environments make it very difficult for application developers to understand how and where their data is being used.
One option for data producers might be to treat ANY impact on the data layer as part of CI/CD. Unfortunately, this rarely works out well. First, at most large-scale businesses the amount of data in a data lake is so large that it is rare for even 25% of it to have meaningful utilization within the company. Slowing down your engineering team to ensure integration testing for data that isn’t even relevant or useful to data consumers is a waste of time. Producers must be free to iterate so long as they have limited dependencies.
Second, backwards-incompatible changes are oftentimes hard requirements. Product teams regularly ship new features, refactor old code, and modify events according to a grander architectural design that has buy-in from across the organization. While these changes may catch downstream consumers unaware, they often have too much momentum to be stopped. Data producers might attempt to communicate migrations to data consumers by sending emails, announcing their intentions during design reviews, or posting plans in Slack channels. Unfortunately, this feedback loop rarely reaches data consumers, who (again, due to complex lineage) seldom realize they will be impacted by an upstream change even when given the context.
When change does happen, it is rarely detected in advance. More often than not, downstream consumers (and in the worst-case scenario, non-technical business stakeholders) are the first to notice a problem exists, which sets them on a long-winded chase to identify and resolve errors caused by an upstream team that barely knows they exist.
The Trials and Tribulations of Data Consumers Managing Data Quality
Most data practitioners feel a chill run down their spine when they hear the phrase “These numbers look off.” Such statements often lead to hours or days of digging into data and the systems behind it to unearth and fix data quality issues. Often, data teams are the “face of failure” for these data quality issues within the business, despite many issues arising from upstream changes outside of their control. In Mark’s time in startups as a data scientist and data engineer, he developed a repeatable pattern for resolving these data quality issues in organizations whose data maturity was relatively early. While this process is fairly manual, many organizations find themselves in similar positions. The data quality resolution process consists of the following steps, which we have written about in more detail previously (https://dataproducts.substack.com/p/the-data-quality-resolution-process):
-
“0. Stakeholder Surfaces Issue”
-
“1. Issue Triage”
-
“2. Requirements Scoping”
-
“3. Issue Replication”
-
“4. Data Profiling”
-
“5a. Downstream Pipeline Investigation”
-
“5b. Upstream Pipeline Investigation”
-
“5c. Consult Technical Stakeholders”
-
“6. Pre-Deploy - Implement DQ Fix”
-
“7. Deploy - Implement DQ Fix”
-
“8. Stakeholder Communication”
Note that a majority of these steps are not technical but rather center around communication across the business to triage breaks within the data lifecycle and coordinate a solution among multiple teams. In more detail, each step consists of the following from the perspective of the analytical database:
- Stakeholder Surfaces Issue:
-
The best-case scenario is that a data consumer, such as a data analyst, surfaces a data quality issue before downstream business stakeholders notice. Getting ahead of the data quality issue is less about preventing downstream users from noticing and more about making sure your stakeholders are confident that you are handling issues in a timely manner. How you respond to such requests shapes the data culture among your stakeholders.
- Issue Triage:
-
While it’s important to respond quickly, data teams should not try to solve the data issue until they have properly triaged it as an urgent fix, something to assign for later, or something not to work on at all. Key to this is having managers on the data team serve as a buffer and set expectations. In addition, requests should never be accepted within individual chat channels (e.g. direct Slack messages) but rather directed towards a shared, visible channel or ticketing system such as Jira.
- Requirements Scoping:
-
Mark’s early career mistakes in data often stemmed from jumping straight into solving the problem without properly scoping the issue. There are often tradeoffs between the effort and impact of a fix, so you need to consult with the downstream stakeholder to understand why this data is important and how it impacts their workflows. In addition, this step further fosters trust with your stakeholder, as it shows the due diligence you are undertaking and includes them in the resolution process.
- Issue Replication:
-
Once the problem is scoped, replicating it provides the first clues as to where the data quality issue stems from. In addition, it prevents one from pursuing a data quality issue that is actually the result of human error, which instead implies a communication or process issue. Typically this can be done either by using the data product in question or by pulling data from the source table via SQL.
- Data Profiling:
-
At this stage, one can run a series of aggregates and cuts of the tables in question. While not exhaustive, and context dependent, the following are great starting points:
-
Data Timeliness
-
Null Patterns
-
Data Count Spikes or Drops
-
Counts by Aggregate (e.g. Org Id)
-
Reviewing the data lineage of impacted tables
-
The goal isn’t to find the issue, but rather to reduce the scope of the issue surface so you can deep dive in a targeted fashion. These quick data searches become the hypotheses to test. (A minimal code sketch of such profiling queries appears after this list of steps.)
- Downstream Pipeline Investigation:
-
With the hypotheses created from data profiling, assumptions are tested via downstream investigation in the analytical databases and data products. While the issue may be caused by an upstream change, its visibility is often greatest downstream. Again, the emphasis is on building a complete picture of the data quality issue and potentially discovering other impacted surfaces that were not part of the original problem report. Two common issues that surface at this stage are the following:
-
A bug was introduced in the SQL transformation code such as a misuse of a JOIN, a WHERE clause missing an edge case, or pulling data from the wrong table (e.g. user_table vs. user_information_table).
-
The SQL code no longer aligns with evolving business logic, and thus transformations need to be updated accordingly.
-
- Upstream Pipeline Investigation:
-
After exploring downstream impacts, the next step is to trace the data lineage upstream to the transactional database. Specifically, this means going beyond the database and reviewing the underlying code that generates the data, conducts the CRUD operations, and captures logs. In this step, lineage moves beyond tables and databases to also include function calls and their inheritance. For example, the `product_table` in the database is updated by the `product_sold_count()` function within the `ProductSold` class in the `product_operations.py` file (example below):
-
# product_operations.py example code
import db_helper_functions as db_helper


class ProductSold:
    def __init__(self) -> None:
        # Connect to the database
        self.db_connection = db_helper.connect_to_database()

    def product_sold_count(self, product_id, sold_count):
        # Update the product_table in the database with the new count
        # <python code implementing logic>

        # Commit the update and close the database connection
        self.db_connection.commit()
        self.db_connection.close()
-
Often the most “hidden” data quality issues lurk within codebases outside the scope of the data team. Without this step, data teams often create additional transformations downstream to deal with the symptom quickly rather than addressing the root cause.
- Consult Technical Stakeholders:
-
With the facts collected, one may think the solution is apparent, but knowing the solution is only half the battle if the source issue is outside your technical jurisdiction (e.g. limited read-write access to the transactional database). The other half is convincing upstream teams that the proposed solution is correct and that it’s worth prioritizing over their current work. Thus, the emphasis is on consulting with stakeholders (rather than making requests of them) to make them stakeholders in the solution and to understand where it fits within their priorities. Furthermore, there may be some nuances that only those who regularly work in the upstream system would be aware of.
- Pre-Deploy - Implement DQ Fix:
-
Once a solution is determined, take the results from the data profiling stage as a baseline reference and identify which values should change and which should stay the same. This is key both to ensuring that the solution does not introduce further data quality issues and to documenting the due diligence of resolving the issue for your impacted stakeholders.
-
For downstream data quality issues, this typically looks like changing the underlying SQL transformation code until one is satisfied with the expected behavior of the data. Ideally these SQL files are version controlled via a tool like dbt, and thus will go through code review before the database is changed. For upstream data quality issues, changes will certainly go through code review, but the challenge is instead getting a separate team to implement the fix. This will hopefully not be an issue if the “Consult Technical Stakeholders” stage goes well, but one does need to account for a timeline outside of their control.
- Deploy - Implement DQ Fix:
-
Once changes are confirmed and merged into the main branch of the code repository, the solution needs to be deployed into production and then monitored to ensure the changes work as expected. Once the change to the underlying code has been deployed, backfilling the impacted data should be considered and implemented if warranted.
- Stakeholder Communication:
-
One area many technical teams forget to consider is communication among stakeholders, especially the business leaders impacted by data quality. Resolving data quality issues in a timely manner is largely an effort to mitigate lost trust in the data organization, so simply resolving the issue silently is not enough. Key stakeholders need to be continually informed of the status of the data quality issue, its timeline to resolution, and its ultimate resolution. The way in which a data quality issue is handled is just as important to maintaining trust in the data organization as the issue being resolved.
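As referenced in the Data Profiling step above, the following is a minimal sketch of what those profiling queries might look like in practice; it assumes a generic DB-API connection and a hypothetical analytics.orders table, both of which are placeholders rather than a prescribed setup.

# A minimal data-profiling sketch; `conn` is assumed to be any DB-API 2.0
# connection to the analytical database, and analytics.orders is a placeholder.
PROFILING_QUERIES = {
    # Data timeliness: how stale is the most recently loaded record?
    "max_loaded_at": "SELECT MAX(loaded_at) FROM analytics.orders",
    # Null patterns: did a key column suddenly start arriving empty?
    "null_customer_ids": "SELECT COUNT(*) FROM analytics.orders WHERE customer_id IS NULL",
    # Count spikes or drops: daily row counts over the last two weeks.
    "daily_counts": (
        "SELECT DATE(loaded_at) AS day, COUNT(*) AS row_count "
        "FROM analytics.orders "
        "WHERE loaded_at >= CURRENT_DATE - INTERVAL '14 days' "
        "GROUP BY 1 ORDER BY 1"
    ),
    # Counts by aggregate (e.g., org_id): is one segment driving the anomaly?
    "counts_by_org": (
        "SELECT org_id, COUNT(*) AS row_count "
        "FROM analytics.orders GROUP BY 1 ORDER BY 2 DESC"
    ),
}

def profile(conn):
    """Run each profiling query and collect results to narrow the issue surface."""
    results = {}
    cursor = conn.cursor()
    for name, sql in PROFILING_QUERIES.items():
        cursor.execute(sql)
        results[name] = cursor.fetchall()
    cursor.close()
    return results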
Though this process is tremendously useful in handling data quality issues, it is still quite manual and reactive. While there are numerous alternatives for automating and scaling data quality (e.g. data observability), we believe that data contracts are the ideal choice for implementing a solution that involves multiple teams along the same data lifecycle within one’s organization.
An Alternative: The Data Contract Workflow
The current state of resolving data quality issues revolves around reactive processes that require considerable iteration when breaking changes occur. Furthermore, the largest bottleneck in this data quality process is coordinating the various parties needed to resolve a breaking change, especially parties for whom data quality is not a focus.
The data contract workflow moves the data quality resolution process from reactive to proactive: constraints, owners, and resolution protocols are established well before a breaking change. In addition, while data contracts ideally prevent a breaking change altogether, in cases where a contract violation is unavoidable, the relevant parties are automatically made aware so they can resolve it accordingly, preventing the stakeholder coordination bottleneck.
The Data Contract Workflow
As highlighted in Figure 4-4, the data contract workflow consists of the following steps:
-
Data constraint identified by data consumer.
-
Data consumer requests a data contract for an asset.
-
Data producer confirms the data contract is viable.
-
Data contract confirmed as code.
-
Data producer creates a pull request to change a data asset.
-
Automatically check if requested change violates a data contract.
-
7a) Data asset owners are notified of data contract violation and change follows failure protocol.
-
7b) Data asset updated for downstream processes.
Let’s look at each step in detail:
- 1. Data constraints identified by a data consumer.
-
As mentioned in Chapter 1, data revolves around the needs of data consumers, and thus it is the consumers who set forth the data quality requirements. This is because the data consumer is the interface between available data assets and the operationalization of those assets to drive value for the business. Though it’s possible for a data producer to be aware of such business nuances, their work is often far removed from the business stakeholders and their needs. A great example of this divide is comparing the business knowledge of a data analyst and a software engineer. While both can be business savvy, the data analyst’s job revolves around answering questions for the business with data, and thus they will likely be more abreast of the pertinent requirements of the business in real time.
- 2. Data consumer requests a data contract for an asset.
-
One of the most important roles of a data consumer is to translate business requirements into technical requirements with respect to data, as highlighted in Figure 4-5. This is reflected in the fact that business stakeholders are often relegated to interfacing with data via dashboards rather than accessing raw data directly. Therefore, we need to differentiate between “technical data consumers” and “business data consumers” when thinking about the data contract workflow. Technical data consumers take their knowledge of business and data requirements and request a data contract for data producers to abide by.
- 3. Data producer confirms the data contract is viable.
-
While technical data consumers are skilled at matching business requirements to technical requirements, their limitation is understanding how those technical requirements align with the entire software system. Thus, the data producer will be the one to determine the viability of the request and make necessary adjustments to the proposed data contract.
-
For example, a data consumer may be aware that the degree of data freshness required by the business is one day for a specific business need. Meeting that requirement could be a simple update to a data pipeline schedule, or it could require a massive refactor to scale the capabilities of the data pipeline. The data producer can support the data consumer in becoming aware of these constraints and communicating such limitations to the business.
- 4. Data contract confirmed as code.
-
Data contracts place a heavy emphasis on the automatic prevention of, and alerting on, data quality violations, but the step of creating the data contract is actually the most important. Specifically, this step serves as a forcing function for data teams to communicate their needs, inform producers of the business implications of data assets, and establish owners and workflows for when a violation is encountered. As noted in the previous step, it’s not a one-way request but rather a negotiation among stakeholders to align on how to best serve the business with data.
-
Furthermore, since the data contract is stored as version-controlled code (typically YAML files), the evolution of the mapping from business to technical requirements is preserved over time. This historical information is gold for technical data consumers, who often work with data spanning the timelines of various product and business changes.
- 5. Data producer creates a pull request to change a data asset.
-
This step is largely self-explanatory, as version-controlled code and code reviews are a minimum requirement for any software system. With that said, changes to data assets made to meet changing software requirements may seem innocuous, but these changes are the fuel for major technical fires that start off as silent failures. This is because data producers are often not privy to downstream business implications and are removed from the fallout of such failures until after a root cause analysis is conducted. Data contracts move these requirements from being downstream and obscured to being readily available for any technical stakeholder to review.
- 6. Automatically check if requested change violates a data contract.
-
Once data contracts are in place for the relevant data assets, data quality prevention can happen in the developer workflow rather than as a reactionary response. Specifically, CI/CD requires new pull requests for code changes to pass a set of tests, and data contract checks fit within this workflow. (A minimal sketch of such a check appears after this list of steps.)
- 7a. Data asset owners are notified of data contract violation and change follows failure protocol.
-
As stated earlier in this chapter, it’s not enough to just be aware of data quality issues. Instead, only the relevant stakeholders need to be notified and given the context that motivates them to take action and resolve the data quality issue. While data quality is a requirement for data consumers, the state of the data has limited impact on the constraints of data producers, who primarily focus on software. As illustrated in Table 4-2, data contracts realign the impact of data quality with the motivations of data producers via alerts on their pull requests.
-
Table 4-2. Differences between data producers and consumers
 | Data Producer | Data Consumer |
---|---|---|
Problem | Need to update the underlying software system to align with changing technical requirements. | Need to ensure underlying data is trustworthy to drive meaningful insights for the business. |
Motivations | Need to pass CI/CD checks to merge pull requests. Don’t want to have a business-critical failure point back to a code change where they were notified of the risk. | Ensuring proper due diligence surrounding the data has been conducted to know the limitations of a data asset. Having insights accepted by the business, especially among executives. |
Outcome | Robust software that not only accounts for technical tradeoffs, but also critical business logic tied to data. | Reduced time spent investigating and resolving data quality issues for critical business workflows. |
-
It’s important to note that a data contract violation does not equate to an automatic, full block of the change. Again, the role of data contracts is to serve the needs of the business with respect to data. It may be the case that the contract itself needs to be changed, or that a technical tradeoff is being made in which data quality is a lower priority. Regardless, the contract spec itself will specify whether the violation results in a hard or soft failure.
- 7b. Data assets updated for downstream processes.
-
After the data contract CI/CD checks have passed, the code with the requested change to the data asset will be merged into the main branch and eventually deployed to production. Ideally this workflow will happen with minimal intervention, but in the case of a data contract violation, the resolution process will be documented on the pull request itself, given that the relevant stakeholders have been notified to engage.
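To ground steps 4 and 6, here is a minimal sketch of a contract stored as YAML and the kind of check a CI job might run against a proposed schema; the spec fields, file layout, and failure protocol are simplified assumptions, not a reference implementation of any particular data contract standard.

import yaml  # requires the PyYAML package

# A hypothetical contract spec as it might live in version control.
CONTRACT_YAML = """
dataset: orders
owner_producer: payments-eng
owner_consumer: analytics
failure_protocol: hard   # hard = block the pull request, soft = warn only
columns:
  - {name: order_id, type: bigint, nullable: false}
  - {name: customer_id, type: bigint, nullable: false}
  - {name: amount, type: numeric, nullable: false}
  - {name: created_at, type: timestamp, nullable: false}
"""

def check_schema_against_contract(proposed_schema, contract_text):
    """Return a list of violations; an empty list means the change passes."""
    contract = yaml.safe_load(contract_text)
    violations = []
    for col in contract["columns"]:
        proposed = proposed_schema.get(col["name"])
        if proposed is None:
            violations.append(f"missing column: {col['name']}")
        elif proposed["type"] != col["type"]:
            violations.append(
                f"type change on {col['name']}: {col['type']} -> {proposed['type']}"
            )
    return violations

# Example: a pull request proposes dropping customer_id and retyping amount.
proposed = {
    "order_id": {"type": "bigint"},
    "amount": {"type": "float"},
    "created_at": {"type": "timestamp"},
}
print(check_schema_against_contract(proposed, CONTRACT_YAML))
# A CI job would exit non-zero on violations when failure_protocol is "hard".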
What makes this workflow powerful is that it’s not relegated to a single section of the data lifecycle (e.g. tools that only focus on the data warehouse), but rather works anywhere data moves from a source to a target. In the next section, we’ll discuss where one can implement data contracts and the various tradeoffs throughout the data lifecycle.
An Overview of Where to Implement Data Contracts
The emphasis of the data contract workflow is abstracting away the business logic and data quality expectations surrounding data assets, exposing them like an API. By creating this abstraction, stakeholders no longer need to interact with a multitude of touchpoints to understand data and resolve quality issues. Instead, they need only interact with the contract itself and engage only when alerted by a contract violation.
Data contract architecture can be bundled into four distinct components:
-
Data Assets
-
Contract Definition
-
Detection
-
Prevention
We will dive deep into each component at a conceptual level in Chapter 5; Chapter 6 will cover the open source tools we recommend for building data contracts; and Chapter 7 will provide an end-to-end implementation.
In addition to the numerous components, there are also multiple stages in the data lifecycle in which you can implement data contracts, as illustrated in Figure 4-6.
Below are the considerations for each stage of the data lifecycle:
- A. Third-Party Data → Transactional Database (OLTP):
-
While we recommend going as far upstream as possible with data contracts, third-party data is one of the most challenging areas in which to enforce them. While possible, it is unlikely that a third party would agree to additional restrictions unless your organization has leverage. With that said, while you can’t control the third-party data itself, data contracts can be utilized between ingestion and your transactional database as a way to triage data that violates a contract and to provide early alerting.
- B. Product Surface → Transactional Database:
-
Changes to the underlying schema and semantics of CRUD operations are often the root of data quality issues, making this an ideal location for a first implementation of data contracts. One huge consideration, though, is whether your team oversees the transactional database; often software engineering teams control this database, which means this would not be an ideal location for a first implementation if you are a data team.
- C. Transactional Database → Analytical Database (OLAP):
-
The data pipeline between transactional databases and analytical databases is where we recommend most organizations start implementing data contracts. Both software engineering and data teams are active stakeholders in this stage of the data lifecycle, with data teams often having considerably more autonomy to implement changes. In addition, this lifecycle stage is often the most upstream source of data for analytical and ML workflows, and thus has the highest probability of preventing major data quality issues.
- D. Third-Party Data → Analytical Database:
-
Similar to third-party data entering transactional databases, putting contracts between third-party data and an analytical database is often not feasible. One major exception is cloud-based customer relationship management (CRM) platforms (e.g. Salesforce and HubSpot), which are often synced to an analytical database via data connectors. While the data is third-party, CRMs still allow customization of the underlying data models and columns, which can be controlled with data contracts.
- E. Analytical Database → Data Products:
-
From our conversations with organizations interested in data contracts, many data teams strongly consider first placing data contracts on their analytical database to control the data transformations used for analytics, machine learning, and dashboards. While valuable, we strongly encourage teams to first focus on the stage between transactional and analytical databases and then work downstream. Starting too far downstream diminishes your ability to prevent data quality issues before they are stored in the analytical database. With that said, this is an excellent location once the upstream data contracts are in place.
- F. Analytical Database → Transactional Database:
-
While it’s possible to place data contracts on “reverse ETL” workflows, it’s less common, as ideally data contracts on the analytical database are already preventing data quality issues arising from data transformations. Furthermore, unless the engineering team is spearheading the implementation of data contracts, requests for contracts here are unlikely, as this stage usually falls within the purview of the data team.
- G. Data Products → Business Users:
-
Placing data contracts between analytical databases and downstream consumers serves as a way to ensure trust and consistency for the data being served via data products. While too far downstream to prevent data quality issues, it enables data teams to tier the data they serve and set expectations. For example, when making key business decisions, would you rather rely on a dashboard that was using data under contract or one that wasn’t?
The key deciding factor in where to implement data contracts is the relationship between upstream producers and downstream consumers. While ideally this relationship will always be on good terms, realistically it’s more nuanced and challenging. In Chapters 10 through 12, we will detail how to navigate these nuances in team dynamics within the business and build buy-in.
The Maturity Curve of Data Contracts: Awareness, Collaboration, and Ownership
While data contracts are a technical mechanism for identifying and resolving data quality and data governance issues upstream, cultural change is equally as important to address as technology. Culture change here refers to the shifts in behavior required of data producers, data consumers, and leadership teams in order to help their company onboard to data contracts and successfully roll out federated governance at scale.
It’s important to acknowledge that not all companies are equally ready for data contracts from day one. Some businesses are already data-driven: these companies understand and appreciate the value of data and its capacity to power operational and analytical functions from the top of the organization to the bottom. Others are simply data aware: they understand data is used at their company, but outside the data organization there isn’t much recognition of the need to invest in infrastructure, tooling, and process. Others are data ignorant: their journey with data has barely started!
To make matters more complex, these differences in maturity and communication may differ not only between organizations but WITHIN organizations, across teams. For example, a machine learning team may have a sophisticated appreciation for the value of data while the sales team may not. In our experience, it is important to acknowledge these differences exist. Companies cannot be treated as monoliths! To succeed with the implementation of data contracts, data heroes must take a pragmatic approach that rolls out contracts along a maturity curve, depending on organizational and team-based readiness.
The data contract maturity curve has three steps: Awareness, Collaboration, and Ownership. We’ll outline each step, as well as its corresponding goal.
1) Awareness
Goal: Create visibility into how downstream teams use upstream data
The goal of the Awareness phase is to help both data producers and data consumers become aware of their responsibilities as active stakeholders in the data supply chain. As mentioned above, data contract requirements must start from the consumers, who are the only ones with an explicit understanding of their own use cases and data expectations. To limit the surface over which data teams must begin implementing contracts, it is advisable to select a subset of the most useful downstream data assets, known as tier-one data products.
“Data product” is a term with many different definitions. We prefer looking to our software engineering counterparts for directional guidance. A software product is the sum of many engineering systems organized to serve an explicit business need. Products have interfaces (APIs, user interfaces) and backends. In the same way, a data product is the sum of data components organized to serve a business need, as illustrated in Figure 4-7.
A dashboard is a data product. It is the sum of many components such as visualizations, queries, and a data pipeline. The interface: a drag-and-drop editor, charts, and data tables. The back-end: data pipeline and data sources. This framing rings true for a model’s training set, embedded data products, or other data applications. The data contract should be built in service of these products, not the other way around.
In the Awareness phase, data producers must understand that changes they make to data can harm consumers. Most data producers are operating in a black box with regard to the data they emit. They don’t want to cause outages, but without any context provided pre-deployment, it is incredibly challenging to avoid doing so.
Even without a producer-defined data contract in place, producers should still be aware of when they are making code changes that will affect data, exactly what those changes will impact, and who they should speak to before shipping. This pre-deployment awareness drives accountability and, most importantly, conversation.
2) Collaboration
Goal: Ensure data is protected at the source through contracts
Once data producers have some understanding of how their changes will impact others in the organization, they are faced with a choice: (a) make the breaking change and knowingly cause an outage, or (b) communicate with the data consumers that the change is coming. The second option is better for a wide variety of reasons that should be obvious!
This resolves most problems for data consumers. They are informed in advance before breaking changes are made, have time to prepare, and can potentially delay or deter the software engineers’ deployment by advocating for their own use cases of the data. This sort of change management functions in a similar way to pull requests: just as an engineer asks for feedback about their code change, with a consumer-driven contract they may also “ask” for feedback about their change to data.
Note
The importance of this collaboration happening pre-deployment, driven by context, cannot be stressed enough. Once code has been merged, it is no longer the responsibility of the engineer. You can’t be held accountable for a change you were never informed about!
3) Contract Ownership
This shift-left towards data accountability resolves problems but also creates new challenges. Imagine you are a software engineer who regularly ships code changes that affect your database. Every time you do so, you see that there are dozens of downstream consumers, each with critical dependencies on your data. It’s not impossible to communicate to each of them which changes are coming, but doing so for every consumer is incredibly time-consuming! Not only that, but it turns out certain consumers have taken dependencies on data they shouldn’t have, or are misusing the data that you provide.
At this point, it is beneficial for producers to define a data contract, for the following reasons:
-
Producers now understand the use cases and consumers/customers
-
Producers can explicitly define which fields to make accessible to the broader organization
-
Producers have clear processes in place for change management, contract versioning, and contract evolution
Data producers clearly understand how changes to their data impact others, have a clear sense of accountability for their data, and can apply data contracts where they matter most to the business. In short, consumer-defined contracts create problem visibility, and visibility creates culture change. The next section details the outcomes of enabling this change.
Outcomes of Implementing Data Contracts
There are several extremely important outcomes that occur as a result of data contracts being implemented across your organization. Some of them are obvious and can be measured quantitatively, while others are softer but nevertheless have some of the greatest impact on culture change and working conditions for data developers. The three core outcomes are the following:
- Faster Data Science and Analytics Iteration Speed:
-
When Mark was a data scientist, a majority of his time on any project was spent 1) sourcing data within the data lake that was of high enough quality to use, and 2) understanding the quirks of the data in question and validating it. Specifically, in his previous role at an HR-tech company, one of the most important data assets was “manager status of an employee.” Despite being a critical data asset for an HR company, it was constantly changing as new customers created edge cases or the product evolved. For example, an enterprise customer would change employee management systems, and the ingested employee data would shift from a daily batch job to a monthly one; unfortunately, management status doesn’t align well with a monthly cadence. Thus, the same exploration and validation stage was repeated on every new project involving the same data asset.
-
Implementing data contracts considerably cuts down this exploration and validation stage. First, having data assets under contract creates a shortlist of data to use for data science projects and ensures that proper due diligence has already taken place. Second, the data contract itself documents the quirks of the data one needs to be aware of and the expectations placed on it. Third, since data contracts are version controlled, data teams also have a log of past changes in constraints and assumptions, as well as a mechanism to document and enforce new ones as they arise or evolve. In short, data contracts provide a mechanism for handling the discovery and validation of data assets at scale while disseminating this information across teams in a version-controlled manner.
- Developer and Data Team Communication:
-
In Chapter 3, we referenced Dunbar’s number and Conway’s law as two powerful phenomena within businesses that shape how technical teams communicate (or fail to communicate) with each other. Specifically, an increase in employee count corresponds with an exponential rise in potential connections that ultimately breaks down communication, a breakdown that is reflected in the technical systems built by the organization. Data contracts aim to overcome this challenge via automation and by embedding themselves within existing CI/CD workflows.
-
First, data contracts increase the visibility of dependencies on data assets generated or transformed by upstream producers (e.g. application engineers) in the following three ways:
-
For a data contract to be enforceable, both the consumer and producer parties need to agree to the data contract spec before implementing it within the CI/CD pipeline, which serves as a forcing function for communication between both parties.
-
Since an enforceable contract is within the CI/CD process, violations generate failed test notifications within pull requests and notify data asset owners, thus supporting code reviews with relevant information and the people who can help resolve issues.
-
There may be cases where a contract violation is no longer relevant due to evolving needs, implying that a new constraint is needed; thus, changes to the contract spec inform downstream data asset owners, rather than having them find out reactively from the data itself.
-
- Mitigation of Data Quality Issues:
-
Ultimately, the reason for going through the effort of implementing data contracts and coordinating among technical teams is to reduce business-critical data quality issues. But to reiterate, data quality is not about pristine data; it’s about fitness for use by the consumer and their relevant tradeoffs. Furthermore, poor data quality is a people and process problem masquerading as a technical problem. Data contracts serve as a mechanism to improve the way people (i.e. data producers and consumers) communicate about the captured processes of the business (i.e. data). Specifically, resolving data quality issues shifts from being a reactive problem to a change management problem that teams can iterate on as consumer use cases become clearer over time.
While this section highlights the high-level outcomes of data contracts, we encourage you to review Chapter 11, where we go into the specific metrics used to measure the success of a data contract implementation.
Data Contracts vs. Data Observability
Through our hundreds of conversations with companies about data contracts, one question is often asked: “How are data contracts different from data observability, and when would I need data contracts versus observability?” This section aims to answer that question.
According to Gartner, data observability is defined as the following:
“... [The] ability of an organization to have a broad visibility of its data landscape and multi-layer data dependencies (like data pipelines, data infrastructure, data applications) at all times with an objective to identify, control, prevent, escalate and remediate data outages rapidly within acceptable data SLAs.”
The only caveat we will add to this definition is that data observability can’t be preventative in itself, as an event needs to happen for it to be observable, but it can certainly inform preventative workflows such as data contracts. Table 4-3 provides a further comparison of the two:
Data Contracts | Data Observability |
---|---|
Prevent specific data quality issues: Data contracts emphasize preventing changes to metadata that would result in breaking changes in related data assets. | Highlight data quality trends: The main value proposition of data observability is that it gives you visibility into your data system and how that system is changing in real time. |
Included in the CI/CD workflow: Data contract checks are embedded within the developer workflow, specifically CI/CD, so that breaking changes can be addressed before a pull request is merged. | Complements the CI/CD workflow: Observability serves as a measurement of your data processes and thus is not within the developer workflow, but its output will inform which additional tests you need within your CI/CD workflow. |
Informed by business logic: Data contracts are technology to scale communication among data producers and consumers. Among the most important data consumers are business stakeholders, who provide the domain knowledge that enables value generation from data. | Reflects how data captures business logic: While business logic is an input into data contracts, observability instead measures the output of business logic within data systems and how well it aligns with reality (e.g. capturing data drift). |
Targeted visibility: Data contracts should not be on every data asset and pipeline, but rather only on the most important ones (e.g. revenue generating or utilized by executives). This prevents alert fatigue, as the goal of data contracts is not only to alert, but also to encourage the data producer to take action when a contract is violated. | Broad visibility: Data observability should reach into every aspect of your data stack, from ingestion and processing to serving. While thresholds can be refined at later stages of implementation, the broad visibility of observability is what surfaces issues you did not know to look for. |
Alerts before the change: Data contracts shift agreements between data producers and consumers from implicit to explicit, enabling the prevention of metadata changes that would break related data assets for known issues. | Alerts after the change: By its name, to observe means that an event has already taken place, and thus observability can’t be fully preventative. With that said, such workflows after an event are essential for surfacing unknown issues that inform future data contracts. |
The key difference between data contracts and data observability is that contracts emphasize the prevention of known data quality issues, while observability emphasizes the detection of unknown data quality issues. One is not a replacement for the other; rather, data contracts and data observability complement each other. Another way to think about the two is in terms of a flashlight and a laser pointer: both illuminate an area to bring attention to it, yet they serve different purposes. As illustrated in Figure 4-8, data observability is the flashlight, illuminating the entire data system and its workflows; the alternative is being “left in the dark” about your data and waiting to bump into data quality issues. Data contracts, in contrast, are the laser pointer: while the light covers only a small area, its value is in its ability to target and bring attention to a specific point within a system.
Observability and contracts are individually useful for data teams, but using them together enables teams to work more efficiently: they can automate the understanding of their entire data system with observability and use that information to inform enforceable constraints with contracts.
Conclusion
This chapter provided the theoretical foundation of the data contract architecture and discussed the key stakeholders and workflows involved in utilizing data contracts. Specifically, we covered:
-
How collaboration is different in data.
-
The roles of data producers and consumers within the data lifecycle.
-
The current state of reactively resolving data quality issues.
-
A high level overview of the data contract workflow.
-
The various tradeoffs between implementing data contracts throughout the data lifecycle.
-
The maturity curve of implementing data contracts and its outcomes.
In the next chapter, we will move from the theoretical to real-world case studies of implementing the data contract architecture.