Chapter 4. Pipeline Issues and Troubleshooting

As we’ve alluded, the primary objective of any ETL is to deliver business value. This goal, however, is often hindered by fragile systems characterized by prolonged recovery times, leading to excessive resource allocation toward troubleshooting rather than innovation and value creation.

A well-designed ETL pipeline, therefore, must be robust and maintainable, embodying key characteristics such as high data quality, efficient error handling, and effective issue identification—all integral components of system observability. Regardless of your engineering prowess, your systems will fail—not everything is within our control!

Planning for failure means engineering pipelines that:

  • Are easy to maintain and extend—that is, allow for quick error triage and new feature development

  • Provide automated ways to handle errors in real time and recover from failure

  • Incorporate a framework for improvement based on learning and experience

This approach ensures maintainability and seamlessly ties into another critical aspect of ETL systems: scalability.

As we delve deeper, we’ll explore how maintaining a system goes hand in hand with scaling it, addressing challenges and strategies to ensure your ETL pipeline is both resilient and adaptable to evolving business needs.

Maintainability

In a chapter on issues and troubleshooting, we’re mainly addressing the topic of maintainability—quite literally, the ability to maintain the things you’ve already built.

As engineers, we sometimes goal-seek on solutions exclusively. While we’re huge advocates for minimum viable products and the 80/20 principle, when collaborating on systems to drive business value, the ability to maintain and scale (the topic of Chapter 5) those systems is imperative. The inability to maintain a pipeline will come at a direct and indirect cost to your team. While the former is obviously more salient, the latter is more likely to sink your battleship:

Direct cost

Most are quite aware of direct cost—if you want something expensive, you usually have to pony up and make a good case for it. When interest rates were near zero and venture capital money flowed like the Rio Grande, many overlooked the cost of things like cloud computing, storage, and DataOps. Efficiency starts with pattern design. Bad patterns mean low efficiency across an entire organization.

An ETL system with low maintainability will incur a high direct cost in the form of inefficient operations and long runtimes. This not only drives up the bill from your provider of choice, but it also results in slow jobs and delayed data from your team.

Indirect cost

Indirect costs can far outpace direct costs. Creating systems that fail often and require constant triaging can be dismal for resource allocation. Your team’s time and energy are the most valuable commodities you have.

Spending the majority of your working hours fixing DAGs, responding to alerts, and fighting fires might make you look busy, but generally manifests itself as unsustainable and unproductive. Teams that win build efficient systems that allow them to focus on feature development and data democratization. Inefficient systems mean more fire drills, more sleepless nights, and fewer hours spent building things that matter.

The main selling point of SaaS is that the benefit outweighs the cost, namely the cost of developing and maintaining in-house solutions. If you’re on a team with limited resources or experience, the most maintainable solution might be one you buy.

Of course, these outages don’t just impact data engineers—those downstream are also affected. That means a loss of trust from stakeholders, loss of revenue from customers, and even loss of reputation from the general public. The indirect cost of a data failure, especially one that exposes PII, can be catastrophic.

So, why focus on maintainability? Winning teams are built on maintainable systems.

If your goal is to move your organization forward with timely, efficient data, you’ll need to minimize issues and troubleshooting, lowering the direct and indirect costs of operation. In the rest of this chapter, we’ll discuss precisely how you can do that through observability, data quality, error handling, and improved workflows.

Monitoring and Benchmarking

Here, we make the distinction between observability and monitoring/benchmarking. Monitoring and benchmarking your systems is essential, but it’s only a subset of observability, which extends beyond troubleshooting and maintenance. Still, monitoring and benchmarking our systems is required for minimizing pipeline issues and expediting troubleshooting efforts.

Tightly monitored and neatly benchmarked systems are set up quite nicely as “observable” and make it easier to improve the maintainability of our data systems and lower the costs associated with broken data. Without proper monitoring and alerting, your team might not even know things have gone wrong! The last thing we want is for stakeholders to be the ones to discover a data error…or worse, the error to go undiscovered and poor business decisions to result.

We should observe data across ingestion, transformation, and storage, handling errors as they arise (gracefully, if possible) and alerting the team if (when) things break.

Observability isn’t just for troubleshooting, however. It also helps us scale, as we’ll discuss in Chapter 5.

Metrics

Here are some essential measures used to assess the reliability and usefulness of data within an organization. These metrics help ensure that the data being collected and processed meets specific standards and serves its intended purpose effectively:

Freshness

Freshness refers to the timeliness and relevance of data in a system. It’s the measure of how up to date and current the data is compared with the real-world events it represents. Maintaining data freshness is crucial in ensuring that analytics, decision making, and other data-driven processes are based on accurate and recent information. Data engineers work to design and implement systems that minimize latency in data updates, ensuring that stakeholders have access to the most current data for their analyses and operations.

Common freshness metrics for a dataset include:

  • The time elapsed between the most recent timestamp (in your data) and the current timestamp

  • The lag between source data and your dataset

  • The refresh rate, e.g., every minute, hourly, or daily

  • Latency (the total time between when data is acquired and when it’s made available)
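
A minimal sketch of the first metric might compare the newest timestamp in a table against the current time. In the following Python snippet, run_query is a hypothetical helper standing in for your warehouse client, and updated_at is an assumed column name:

    from datetime import datetime, timezone

    # run_query is a hypothetical stand-in for your warehouse client; it should
    # return the newest timestamp as a timezone-aware datetime.
    def check_freshness(run_query, table: str, max_lag_minutes: int = 60) -> float:
        """Return the lag, in minutes, between the newest record and now."""
        newest = run_query(f"SELECT MAX(updated_at) FROM {table}")
        lag_minutes = (datetime.now(timezone.utc) - newest).total_seconds() / 60
        if lag_minutes > max_lag_minutes:
            print(f"{table} is stale: {lag_minutes:.0f} minutes behind the "
                  f"{max_lag_minutes}-minute target")
        return lag_minutes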

Volume

Volume refers to the sheer amount of data that needs to be processed, stored, and managed within a system. It’s a fundamental aspect of data handling. Dealing with data at scale presents challenges, such as efficient storage, quick retrieval, and processing speed. High data volume demands specialized infrastructure and techniques like distributed computing, parallel processing, and data compression.

Volume metrics include:

  • The size of a data lake (gigabytes, terabytes, petabytes)

  • The number of rows in a database

  • The volume of daily transactions in a system

Quality

Quality involves ensuring that data is accurate, consistent, and reliable throughout its lifecycle. Data quality revolves around accuracy, consistency, reliability, timeliness, completeness, security, documentation, and monitoring. Addressing these aspects guarantees high-quality data for informed decision making and analysis.

Here are some sample data quality metrics:

  • Uniqueness: are there duplicate rows in your dataset?

  • Completeness: how many nulls exist? Are they expected?

  • Validity: is data consistently formatted? Does it exist in the proper range, e.g., greater than zero?
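
As a rough illustration of these three checks, the following pandas sketch computes duplicate, null, and invalid-value rates for a small, made-up orders extract:

    import pandas as pd

    def quality_report(df: pd.DataFrame, key: str, price_col: str) -> dict:
        """Compute simple uniqueness, completeness, and validity metrics."""
        return {
            # Uniqueness: share of rows duplicated on the primary key
            "duplicate_rate": float(df.duplicated(subset=[key]).mean()),
            # Completeness: share of nulls in each column
            "null_rate": df.isna().mean().to_dict(),
            # Validity: prices should be greater than zero
            "invalid_price_rate": float((df[price_col] <= 0).mean()),
        }

    # A tiny (made-up) orders extract with one duplicate key and one bad price
    orders = pd.DataFrame({"order_id": [1, 2, 2], "price": [9.99, 0.0, 5.00]})
    print(quality_report(orders, key="order_id", price_col="price"))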

Methods

Monitoring implies the ability to see everything that’s happening across your stack and detect errors in a timely fashion. Data quality means implementing strict measures that improve the quality of the observed data.

The following are patterns and techniques you can apply to directly improve the quality of the data you’re monitoring, hopefully leading to fewer errors and less downtime:

Logging and monitoring

The first step to debugging is to check the logs. But to check logs, there have to be logs to check! Be sure that your systems are logging data verbosely. For systems that you build, make logging mandatory. Define and codify logging best practices, including libraries and what to log. Not only must you be an exceptional data plumber, but you also have to be a data lumberjack, a term we deem much more flattering.
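
One way to codify the practice in Python is the standard library’s logging module, which provides structured, leveled logs with very little setup. The etl.orders logger name and load_orders function below are purely illustrative:

    import logging

    # Timestamps, level, and logger name on every line
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )
    logger = logging.getLogger("etl.orders")

    def load_orders(batch_id: str, row_count: int) -> None:
        logger.info("Loading batch %s with %d rows", batch_id, row_count)
        try:
            ...  # the actual load step would go here
        except Exception:
            # logger.exception() records the full traceback alongside the message
            logger.exception("Failed to load batch %s", batch_id)
            raise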

Lineage

Simple in concept, lineage—the path traveled by data through its lifecycle—is one of the most important ways to observe your data. Having visual and code representations of your pipelines and systems is important for everyday runs, and will save you endless amounts of time in triaging and debugging issues.

For lineage to be useful, it needs to be both complete and granular. Complete in the sense that all systems are observed, including the interconnections between them. Ideally, your lineage will be at the most granular level possible. For tabular data, that’s the column level. Column-level metadata enhances a team’s ability to triage errors, simplifies workflows, and improves productivity…and the experience of working on data. As shown in Figure 4-1, lineage exercises start with the column.

Figure 4-1. A first principles assessment of data origins reveals the complexities of lineage

A good start is implementing systems with self-contained lineage solutions. A more complete lineage system might observe all of your data processes. Managed platforms are good for this, since they’re generally self-contained. Other tools, like Monte Carlo, might provide insight into the entire stack.

Anomaly detection

The trickiest thing about data is that you can’t see it all. Most of our errors surface when stakeholders or analysts report unexpected results, which begs the question, “What gremlins are alive in my data today?”

Anomaly detection is one way to know that, within some threshold, anomalous data does not exist. Anomaly detection systems typically apply basic statistical forecasting to time-series data and flag values that fall outside some confidence interval.

Anomaly detection can be a great way to catch errors that might originate outside your systems—for example, a payments processing team introducing a bug that underreports purchases in Europe. No alarms will sound, but an anomaly detection system pointed at the right table will catch the bug.
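
A minimal version of the idea, assuming a date-indexed pandas series of daily purchase counts, flags any day that falls outside a rolling confidence band:

    import pandas as pd

    def flag_anomalies(daily: pd.Series, window: int = 28, z: float = 3.0) -> pd.Series:
        """Flag days that fall outside a rolling confidence band.

        `daily` is assumed to be a date-indexed series, e.g., daily purchase counts.
        """
        mean = daily.rolling(window, min_periods=window).mean()
        std = daily.rolling(window, min_periods=window).std()
        # True wherever the observation sits more than `z` standard deviations
        # from the trailing mean -- a crude confidence interval
        return (daily - mean).abs() > z * std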

Data diffs

Data engineering contains an additional dimension software engineering does not: data quality. Changing code changes outputs in ways that can be difficult to understand, and, as we noted, the trickiest thing about data is that you can’t see it all.

Data diffs are systems that report on the data changes introduced by changes in code. These tools report row and column changes, primary key counts, and other specific effects. Their primary purpose is to ensure that accurate systems stay accurate.

For data diffing solutions, we recommend tools like Datafold. Newer SQL orchestrators, like SQLMesh, also have data diffing functionality. Data diffing is tied closely to CI/CD and is accented nicely by assertions and tests. We’ll discuss all of the above shortly.
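
Tools like Datafold and SQLMesh handle this for you, but a homegrown sketch of the idea might compare two versions of a table on a shared primary key and count added rows, removed rows, and changed cells:

    import pandas as pd

    def simple_data_diff(prod: pd.DataFrame, dev: pd.DataFrame, key: str) -> dict:
        """Summarize how a code change altered a table's output.

        Assumes `key` is unique in both tables and the columns overlap.
        """
        prod_keys, dev_keys = set(prod[key]), set(dev[key])
        shared = sorted(prod_keys & dev_keys)
        cols = [c for c in prod.columns if c in dev.columns and c != key]
        # Align the shared rows, then count cells whose values differ per column
        p = prod.set_index(key).loc[shared, cols]
        d = dev.set_index(key).loc[shared, cols]
        return {
            "rows_added": len(dev_keys - prod_keys),
            "rows_removed": len(prod_keys - dev_keys),
            "changed_cells_by_column": (p != d).sum().to_dict(),
        }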

Assertions

Assertions are constraints placed on data outputs to validate assumptions about the source data. Different from anomaly detection, assertions are much simpler. For instance, you might say, “Assert that users.purchases only contains prices in the plans.pricing table.” If a value that does not exist in pricing appears in purchases, you know either (a) a new price was introduced or (b) there’s an error in your system.

While manual in nature (assertions require some business context and understanding of what should be), assertions are one of the few ways that we can be entirely confident data is not erroneous (in the way we specify, at least). For assertion solutions, check out libraries like Great Expectations or look for systems that have the built-in ability to define assertions.
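
Continuing the pricing example above, a hand-rolled assertion might run an anti-join and fail the pipeline if any purchase references an unknown price. The run_query helper here is hypothetical:

    # run_query is the same hypothetical helper as before, returning a scalar
    ORPHAN_PRICE_SQL = """
    SELECT COUNT(*)
    FROM users.purchases p
    LEFT JOIN plans.pricing pr ON p.price = pr.price
    WHERE pr.price IS NULL
    """

    def assert_prices_are_known(run_query) -> None:
        orphans = run_query(ORPHAN_PRICE_SQL)
        # Either a new price was introduced upstream or there's a bug in the
        # pipeline; failing loudly stops bad data from propagating downstream
        assert orphans == 0, f"{orphans} purchases reference prices not in plans.pricing"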

Errors

Now that we’ve discussed observation and preventing errors, we come to the simple truth. Regardless of how robust your solution is, there will still be errors! Thus, error handling becomes incredibly important, both for your sanity and that of your team.

Separate from handling errors is recovering from their effects, whether that be lost data or downtime. An important part of recovery is retrospectives, postmortems, and reflections on how to prevent similar errors in the future.

Error Handling

Error handling is how we automate error responses or boundary conditions to either keep our data systems functioning or alert our team in a timely and discreet manner. The following approaches detail some methods for processing errors gracefully and efficiently:

Conditional logic

When building your data pipelines, conditional logic can be useful for unreliable or inconsistent sources. The ability to say “If X then A, else if Y then B” adds a powerful component to your orchestration and transformation tooling. Seek out solutions that allow conditional logic.
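
In its simplest form, conditional logic is just a branching function your orchestrator calls to decide which downstream task to run (Airflow’s branching operators work roughly this way). The task names in this sketch are illustrative:

    def choose_load_path(row_count: int) -> str:
        """Return the name of the downstream task to run, based on the source."""
        if row_count == 0:
            return "skip_load"         # nothing new: short-circuit the pipeline
        elif row_count > 1_000_000:
            return "bulk_load"         # unusually large batch: take the bulk path
        return "incremental_load"      # the normal case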

Retry mechanisms

Systems, even well-built ones, fail. Unforeseen errors can cause one-off API timeouts or other oddities. For that reason, even well-functioning code can produce errors. Retry logic is important in any orchestration tooling. Of course, for these to work, sensible retry settings must be configured. Be sure to incorporate a retry strategy that can handle data oddities or nuances but that isn’t wasteful and doesn’t result in endless runs.
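
Most orchestrators let you configure retries declaratively, but the underlying idea is simple enough to sketch: retry with exponential backoff and a little jitter, then give up and surface the error. The fetch_page call in the usage comment is hypothetical:

    import random
    import time

    def retry(func, attempts: int = 3, base_delay: float = 2.0):
        """Call `func`, retrying with exponential backoff plus jitter."""
        for attempt in range(1, attempts + 1):
            try:
                return func()
            except Exception:
                if attempt == attempts:
                    raise  # out of retries: let the orchestrator see the failure
                time.sleep(base_delay * 2 ** (attempt - 1) + random.random())

    # Usage: wrap a flaky call (fetch_page and api_client are hypothetical)
    # retry(lambda: fetch_page(api_client, page=1), attempts=5)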

Pipeline decomposition

We previously mentioned the concept of modularity in Chapter 3, but breaking pipelines down into “microservices” is an effective way to keep the impact of errors contained. Consider building DAGs and systems that only require absolutely necessary tables and connections.

Graceful degradation and error isolation

Error isolation is enabled through pipeline decomposition—whenever possible, systems should be designed to fail in a contained manner. You wouldn’t, for example, want your product usage data to live upstream of a financial reporting pipeline, or else an unrelated failure might have your CFO delivering late metrics to the board, a situation that will end well for precisely no one.

Graceful degradation is the ability to maintain limited functionality even when a part of the system fails. By isolating errors and decomposing pipelines, we’re effectively enabling graceful degradation. You might experience an error that only one part of the business notices because the rest of your systems work so well. Trust us, this is far better than errors that everyone notices.

Alerting

Alerting should stand as a last line of defense—receiving alerts is necessarily reactive. We see that something bad has happened and drop everything to fix it. While proactive resolution of common errors is best, the unexpected will prevail eventually. Alerts might come in the form of an email or Slack message—we prefer Slack, since it’s highly visible and team members can add comments with helpful context.

When alerting, be cognizant of alert fatigue. Alert fatigue refers to an overwhelming number of notifications diminishing the importance of future alerts. Isolating errors and building systems that degrade gracefully, along with thoughtful notifications, can be powerful mechanisms for reducing alarm fatigue and creating a good developer experience for your team.
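
As a rough example, posting an alert to a Slack incoming webhook takes only a few lines; the SLACK_WEBHOOK_URL environment variable here is an assumption about where you’d store the webhook:

    import os
    import requests

    def alert_slack(message: str) -> None:
        """Post an alert to a Slack channel via an incoming webhook.

        SLACK_WEBHOOK_URL is an assumed environment variable holding the webhook.
        """
        webhook = os.environ["SLACK_WEBHOOK_URL"]
        response = requests.post(webhook, json={"text": message}, timeout=10)
        response.raise_for_status()

    # alert_slack(":rotating_light: orders pipeline failed on task load_orders")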

Recovery

So we now have insight into our systems, we’re actively enforcing quality constraints, and we have dedicated methods for handling errors. The final piece is building systems for recovering from disasters, which might include lost data. The following are a few methods and concepts that will help you bounce back after the inevitable failure:

Staging

We’ve mentioned staging quite a bit thus far, but an additional benefit of staged data is disaster recovery. With Parquet-based formats like Delta Lake and patterns like the medallion architecture, time travel makes it possible to restore data (up to a certain point). While staged data should be treated as ephemeral, it’s an important pattern for redundancy.

Backfilling

It’s likely that you’ll either (a) want data from some period before you began running your pipeline or (b) have lost data that needs to be backfilled. Backfilling is the practice of simulating historical runs of a pipeline to create a complete dataset. For example, Airflow has a backfill command that runs a pipeline for every date between two dates.

When building systems, keep backfills in mind. Systems that are easy to backfill will save you quite a bit of time when something breaks. Seek out tools that support simple backfills, as backfill logic is something that can get very complex, very quickly. Idempotent pipelines will also make your life easier. Check your orchestrator of choice for backfill functionality out of the box.
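
Conceptually, a backfill is just a loop over dates that re-runs an idempotent daily job, which is roughly what Airflow’s backfill command automates for you. A bare-bones sketch, where run_for_day stands in for your daily pipeline entry point:

    from datetime import date, timedelta

    def backfill(run_for_day, start: date, end: date) -> None:
        """Re-run an idempotent daily job for every date from start to end."""
        current = start
        while current <= end:
            run_for_day(current)  # safe to re-run because the job is idempotent
            current += timedelta(days=1)

    # backfill(load_daily_orders, start=date(2024, 1, 1), end=date(2024, 1, 31))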

Improving Workflows

As much as we wish improving processes were entirely within our control, it’s not. As proverbial “plumbers,” we give stakeholders pipelines for internal and external data, which makes our job inherently collaborative.

We’ve mentioned it briefly, but data engineering is truly a question of when things break, not if. Even in the best systems, anomalies arise, mistakes happen, and things fall apart. The Titanic is a lesson for a reason, right?

The deciding factor becomes your ability to adapt and resolve issues…and that’s the point of this chapter! Starting with systems that prioritize troubleshooting, adaptability, and recovery is a great way to reduce headaches down the line. In this section, we’ll provide a few ways for you to continually improve your processes.

Start with Relationships

Consider the following example: semi-regularly, your software team changes a production schema without warning. This not only breaks your daily ETL job from the prod dataset, it also runs the risk of losing data, since your export method lacks change data capture (CDC).

Understandably, this leaves your team pretty frustrated. Not only are you losing time and resources by fixing these issues, but they seem entirely avoidable. Before reacting harshly, ask yourself the question, “Is the software team trying to reduce overall productivity and invoke worse outcomes?” If the answer is yes, we recommend a job search, but in 99.999% of cases the answer will be “No, of course not.”

In fact, if you ask yourself exactly what that team is trying to do, it’s probably the same thing you are—do the best work possible given your resources. Great, now you have empathy for their motivations.

Next is to understand their workflows and communicate your frustrations. From there, you can begin to craft a process to improve efficiency. Here are some ways you can ensure healthy relationships through a structured, pragmatic approach:

SLAs

Service-level agreements (SLAs) are common in the provider space for runtime and uptime guarantees, but they can be used just as effectively within and between teams. If you’re struggling with data quality, consider an SLA that formally defines things like performance metrics, responsibilities, response and resolution times, and escalation procedures so that everyone has a clear understanding of what’s required for incoming data.

By communicating requirements clearly and assigning ownership, SLAs can be a surprisingly effective way to improve the quality of data that’s outside of your control, just by writing a few agreements down on paper.

Data contracts

Data contracts have gained traction in the past few years as an effective way to govern data ingested from external sources. Popularized by dbt, contracts are a type of assertion that checks metadata (usually column names and types) before executing part or all of an ETL pipeline.

We like the term “contract” because it implies an agreement between two parties, like an SLA. We recommend first defining an SLA, even if it’s simply a back-of-the-envelope agreement, then implementing that SLA in the form of data contracts on external assets. If (when) a contract returns an error, the SLA will dictate exactly whose responsibility it is to resolve that error and how quickly it should be expected.
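
The following sketch isn’t dbt’s contract syntax, but it captures the idea: declare the expected columns and types for an external feed, then fail fast if an incoming dataset violates the agreement. The orders schema is illustrative:

    import pandas as pd

    # A minimal "contract": the column names and dtypes we expect from an
    # external feed (this orders schema is made up for illustration)
    ORDERS_CONTRACT = {
        "order_id": "int64",
        "price": "float64",
        "created_at": "datetime64[ns]",
    }

    def enforce_contract(df: pd.DataFrame, contract: dict) -> None:
        """Fail fast if an incoming dataset violates the agreed schema."""
        missing = set(contract) - set(df.columns)
        if missing:
            raise ValueError(f"Contract violated: missing columns {sorted(missing)}")
        mismatched = {
            col: str(df[col].dtype)
            for col, expected in contract.items()
            if str(df[col].dtype) != expected
        }
        if mismatched:
            raise ValueError(f"Contract violated: unexpected types {mismatched}")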

APIs

APIs can be a more formal method of enforcing contracts and SLAs. In a sense, SQL is itself an API, but there is no reason internal data can’t be fetched (or provided) via an API. APIs are just a method of transmitting an expected set of data, but, implemented correctly, they provide an additional layer of standardization and consistency to the source. APIs also come with more granular access control, scalability benefits, and versioning, which can be useful for compliance.

Compassion and empathy

You might be used to these terms appearing in texts where dbt is capitalized and refers to something other than a transformation tool, but they’re just as important in engineering as in psychology. Understanding your coworkers (and partners) and their motivations, pain points, and workflows will allow you to effectively communicate your concerns and appeal to their incentives.

In this digital age, it’s far too easy to take an adversarial approach or assume ill intent, especially from vague video meetings and terse snippets of text. We advocate going the extra mile to understand your coworkers, whether through one-to-ones, long-form written communication, or in-person meetings, where possible.

Align Incentives

To foster meaningful progress, we need to align incentives and outcomes.1 If your team’s only mandate is to “build lots of stuff,” little time will be spent on making that stuff resilient and robust. Setting key performance indicators (KPIs) around common incident management metrics can help justify the time and energy it takes to do the job right:

Number of incidents (N)

Counting the number of incidents over some timeframe will provide you with a window into the frequency of incorrect, incomplete, or missing data.

Time to detection (TTD)

TTD describes the average time it takes for an incident to be detected.

Time to resolution (TTR)

This metric gauges the swiftness with which your systems can resume normal operations after a disruption. It’s a direct measure of your system’s resilience and recovery capabilities.

Data downtime (N × [TTD + TTR])

Using the above three metrics, we can arrive at an average “data downtime.” This summary metric can help you understand the severity of your outages and health of your systems.

Cost

From downtime, you can calculate the cost of these failures. Cost is highly specific to your team and organization. We recommend a bespoke cost calculation, factoring in the specifics of your deployment.
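
Putting the downtime formula together with a (hypothetical) hourly cost figure makes for a quick back-of-the-envelope calculation:

    def data_downtime_hours(incidents: int, avg_ttd_hours: float, avg_ttr_hours: float) -> float:
        """Data downtime = N x (TTD + TTR)."""
        return incidents * (avg_ttd_hours + avg_ttr_hours)

    # e.g., 6 incidents in a quarter, detected in ~4 hours and resolved in ~8
    downtime = data_downtime_hours(incidents=6, avg_ttd_hours=4, avg_ttr_hours=8)  # 72 hours
    cost = downtime * 500  # assuming a (made-up) $500/hour of downstream impact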

Improve Outcomes

Thomas Edison’s famous quote, “I’ve not failed, I’ve just found 10,000 ways that don’t work,” aptly applies to the process of building exceptional data pipelines and the frameworks that support them. The journey toward excellence in this field is marked by a series of educated guesses, experiments, and, crucially, the ability to adapt and correct course when faced with challenges.

If you’ve followed along thus far, you know great systems are built on great frameworks. Here are a few ways to iterate and improve your processes in the wake of failures and adjust your good pipelines to make them great:

Documentation

If your data is staged and backfill-able, the only thing left is to know how to fix it! Be tedious and pedantic when it comes to documenting your systems, processes, and code. While it might feel like a chore, we can guarantee it will be less time-consuming (and stressful!) than trying to reengineer your code during a failure.

Postmortems

In any large event or outage, a postmortem can be valuable for analyzing the failure. A postmortem involves reflecting on what went wrong and performing an analysis to understand why. Postmortems, and reflection in general, are excellent ways to learn, educate, and grow. Ideally, your postmortems will lead to fewer events that require recovery in the first place.

Unit tests

Unit testing is the process of validating small pieces of code (the components of a system) to ensure they produce results as expected. Like any other engineering system, code in a data engineering system should be unit tested. That means any custom code or bespoke systems you create should have tests to ensure they’re producing desired results.

These are specifically different from assertions, since unit tests check the underlying code rather than the output data. Building unit tests into your code is an excellent preventative practice to minimize future errors.

While many platforms are slowly adopting unit tests/assertions, there are a surprising number of data teams operating without them. We advocate their adoption in every data system.
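
For custom transformation code, this can be as simple as a couple of pytest functions exercising a small, pure function; the normalize_email example below is illustrative:

    def normalize_email(raw: str) -> str:
        """The unit under test: a small, pure transformation."""
        return raw.strip().lower()

    def test_normalize_email_strips_and_lowercases():
        assert normalize_email("  Jane.Doe@Example.COM ") == "jane.doe@example.com"

    def test_normalize_email_is_idempotent():
        once = normalize_email("USER@example.com")
        assert normalize_email(once) == once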

CI/CD

Continuous integration/continuous deployment (CI/CD) is a term that refers to how you integrate and deploy your code—that is, how changes (like pull requests) are assimilated, tested, and rolled out to your code base.

In practice, CI/CD for data engineering might include any number of things we’ve already discussed and a few things we haven’t. Unit tests, assertions, linting, and data diffs will ensure a consistent code base that functions well and allows you to seamlessly collaborate with others to build something awesome (at scale).

Simplification and abstraction

This is more of a design pattern than a tool: as an engineer, you should seek to simplify your code and abstract away as much complex logic as possible. This not only makes it easier to collaborate, but it also reduces the likelihood that you will introduce errors.

When you’re writing a piece of code, think to yourself, “If I don’t look at this for six months, how difficult will it be for me to understand it?” If something breaks in six months, you’ll likely be looking at it under some form of pressure, so keep that in mind.

Data systems as code

Data diffs hit on a concept that we’ve yet to discuss—building data systems as code. By versioning and codifying every system, it becomes possible to roll back changes and revert to previous states. In a sense, staging systems with a medallion architecture let us do something similar with time travel.

Building your data systems with software engineering best practices and implementing logic as version-controlled code will drastically improve your observability, disaster recovery, and collaboration. Be wary of any tooling that is difficult or impossible to version control or otherwise manipulate through code, even if only JSON or YAML configs.

With responsibilities defined, incentives aligned, and a full monitoring/troubleshooting toolkit under your belt, you’re ready to begin automating and optimizing your data workflows. Recognizing that some tasks cannot be automated and edge cases will always exist, the art of data engineering is balancing automation and practicality. Now, with robust, resilient systems, we’re ready to scale.

1 This section is adapted from a Monte Carlo guide by Barr Moses, “12 Data Quality Metrics That ACTUALLY Matter”.
