Chapter 1. The Need for Telemetry Pipelines

Water, water, everywhere, nor any drop to drink.

Samuel Taylor Coleridge, The Rime of the Ancient Mariner

Volume. The problem wasn’t originally volume.

In the early days of managing, monitoring, and operating systems, when things were going well, a system was relatively silent. Your system was a black box that you hoped was working or, if you were lucky, emitted a few signals to give you a sense that things were, indeed, all OK.

In the days of fixed, inflexible systems, fragile and fixed management dashboards were de rigueur. As you composed a system from its individual parts, you decided, ideally in parallel, on the metrics that might be important and looked to expose those for the sole purpose of producing fixed operational management dashboards.

The goal was simple: provide just enough information and control to help the poor souls woken up in the depths of the night to complete their blurry-eyed, urgent, and critical tasks to set the system right, to keep it working, and to keep the lights on.

And then our systems stopped being fixed, resolute, static. Our systems became elastic, rapidly changeable, rapidly scaling and evolving, and the volume of system telemetry signals grew in tandem. With increasing volumes of telemetry data came an increase in opportunity to use that data effectively. This is, in a nutshell, what telemetry pipelines do: help you manage, manipulate, make sense of, and extract value from this new flood of telemetry data.

Making Sense of the Madness: A Telemetry Pipeline in Action

Let’s start with a concrete, real-world example. What follows is a tragic story of “find the needle in the haystack” at the worst possible moment—in the heat of a real-world incident.

Optimized for Agility…

Setting the stage, we were working in a regulated bank. Our teams had recently become more agile in our ways of working and were running our systems at scale, in the cloud. This was supposed to be what success felt like. Through a number of architectural shifts and system evolutions, we had developed and were maintaining and scaling our payment systems for a rapidly growing set of customers. We’d won, but rather than a silver lining, there was a tinge of darkness to our clouds.

We’d encouraged small, independent incremental iterations of delivery rather than enormous batch releases. Our systems were changing all the time, but our metrics and dashboards struggled to keep up. Change had become normal, and that posed a problem for our management and monitoring.

We overcame this by bringing our developers and operations people together, eventually establishing the beginnings of a DevOps movement amid new team structures and later bringing in site reliability engineering (SRE) and DevSecOps (development, security, and operations) practices too.

But we were beginning to feel nervous. Our telemetry data and dashboards were lagging dangerously behind. In a world where everything is changing all the time, how do you carefully curate the right feeds of telemetry data to surface useful metrics and control dashboards? How do you keep up with the sheer pace of change?

Then let’s throw scale into the mix, because we were becoming hugely successful.

…Ready for Scale…

If speed of change through agility upped the difficulty level from easy to hard, the flexibility and scale of the cloud shifted everything into “nightmare.” In the cloud, our systems became composed of tens, hundreds, or even thousands of services, each service scaling and changing independently. Multitudes of different runtime services could be involved in a simple cross-border payment transaction, spread out over various deployment locations and using a hybrid of styles from functions to monoliths.

Armies of our people became involved. Each service could be scaled and evolved by its own team, which might have specific skills and needs for monitoring the critical behavior it was responsible for. Production shifted from being complicated to complex as the lines between people and the systems they ran blurred. What used to be one black box became a legion: each independent part important, each critical, each made up of potentially many horizontally scaled resources, and each needing to support its people with effective telemetry data.

Any dream we might have had of designing a system to manage and monitor such a continuously evolving, scaling, and heterogeneous environment disappeared like droplets in the wind. Appropriately enough, the cloud had made things harder to see, manage, control, and have confidence in. Ashby’s law of requisite variety seemed to have settled in: our management and monitoring simply did not have enough variety to keep pace with the systems they were meant to control.

And then the incident hit … and it was painful.

…Ready for Disaster

Two weeks later, after we had conducted a number of interview sessions with the various incident participants, a pattern started to emerge. Although we had invested a fortune in collecting all the metrics we could think of, and even though we had traces galore, logs by the pound, and dashboards by the dozen, we had consistently hit a problem.

Our data was siloed.

Pulling everything together in the heat of the moment required a gargantuan effort, a huge amount of specialized domain knowledge, and a lot of time. Time we didn’t have. This was a P1 incident. Clients had seen the damage. Our reputation had been harmed, and when you’re a regulated bank, that is not acceptable.

We had operated blind because of all the data we had. All the data was likely there for us to respond quickly, but it wasn’t in the right form, or even the right format, to come together into one useful picture. And our incident resolution had really suffered.

Learning from Our Incident

Fortunately, we had a strong culture of learning from incidents. We immediately instigated a series of game days: controlled chaos experiments where we could practice various scenarios proactively and explore the gaps in our silos of telemetry information. Our game days became our flashlight, but then the question was, how do we work with the darkness to be better prepared for the next incident?

We realized that we needed something that could bring the streams of telemetry data together, to help us explore questions like “Where is that message?” and “What is happening now in that system?” and “Where is the payment?” Questions that we could not anticipate. We could see the data, but we couldn’t debug the system quickly. We could see, but we could not observe.

Crafting the Flood: From Seeing to Observing

Observability is the capacity to ask open, unanticipated questions of your system as a black box and to debug production in the heat of the moment. Scattered across the many disparate sources and structures of our telemetry data, all the information was likely there. Somewhere. We had the data, but rather than being a rich stream of information, it was a brutal flood of raw telemetry data residing in many hidden pools throughout the system.

Our telemetry data had all the hallmarks of a big data problem:

Volume

When it comes to telemetry data, “Source everything!” has become a mantra for modern, rapidly evolving, cloud native systems. This means you will have a huge volume of logs, metrics, traces, and other events that could be important right now in a critical system outage or next year when the auditors come calling.

Velocity

The speed at which your telemetry data is created and how quickly it needs to be disseminated, processed, and consumed to be as useful as possible will vary dramatically across the huge volume of telemetry data you will have access to.

Variety

Although initiatives like OpenTelemetry are looking to enable easy interoperability and, where possible, open standards across telemetry data and tooling, not every system will have caught up with that yet, and some may never do so. Across your heterogeneous systems, you are likely to encounter a wide variety of data formats for the logs, metrics, events, and other data that make up your telemetry.

Veracity

While it’s rare for systems to lie in their telemetry data, the quality and accuracy of your data could vary by source, and so veracity can be a factor.

Value

Another dimension to consider is the value of your telemetry data. To complicate matters, this could be the value of a timely, high-urgency metric signal or the latent value of the data when combined from multiple sources and in multiple forms.

We needed to bring together all this data and then work with it to surface the right information in the right places to support our practice incidents, our game-day exercises. All while respecting our stringent security policies. We needed to seize control of our flood of telemetry data and craft it into something really useful that would help us debug production quickly to reduce our mean time to resolution.

We needed a telemetry pipeline.

Mastering the Flood with a Telemetry Pipeline

To support our incident resolution processes, we created a number of telemetry pipelines to bring together, enrich, channel, secure, and surface our telemetry data where our teams needed it. Telemetry pipelines offer tools to source data from multiple locations, condition that data through different processors, and then channel that data to as many destinations as you need to turn your flood into something useful.
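
To make that shape concrete, here is a minimal Python sketch of the source, process, and route structure described above. It is illustrative only: the Pipeline class and its type aliases are invented for this sketch, and a real deployment would use dedicated pipeline tooling rather than hand-rolled code.

from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# A telemetry event is just a bag of fields in this sketch; real pipelines
# model logs, metrics, and traces much more richly.
Event = dict

Source = Callable[[], Iterable[Event]]          # produces events
Processor = Callable[[Event], Optional[Event]]  # transforms, or returns None to drop
Destination = Callable[[Event], None]           # receives surviving events

@dataclass
class Pipeline:
    sources: list[Source]
    processors: list[Processor]
    destinations: list[Destination]

    def run(self) -> None:
        # Pull from every source, pass each event through the processor
        # chain, and fan the survivors out to every destination.
        for source in self.sources:
            for event in source():
                for process in self.processors:
                    result = process(event)
                    if result is None:  # a processor chose to drop this event
                        break
                    event = result
                else:
                    for send in self.destinations:
                        send(event)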

This is what we did. We sourced our data from locations as diverse as Amazon S3 buckets and Splunk. We munged formats so that the data could become useful, enriching it as we went. Then we stripped out the data that we couldn’t distribute, to enforce our security and privacy policies, before surfacing it in all the right destinations to help cut down the time to resolve our incidents.
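
Continuing the hypothetical sketch above, the munging, enriching, and stripping steps map onto small processors like these. The field names, the sample record, and the sensitive-field policy are all invented for illustration; they are not the bank’s actual schemas or rules.

import json
from datetime import datetime, timezone

# Illustrative-only policy: fields this sketch refuses to forward downstream.
SENSITIVE_FIELDS = {"card_number", "account_holder"}

def enrich_with_context(event: Event) -> Event:
    # Add the shared fields responders keep reaching for during incidents.
    event.setdefault("environment", "production")
    event.setdefault("received_at", datetime.now(timezone.utc).isoformat())
    return event

def redact_sensitive(event: Event) -> Event:
    # Strip anything the (hypothetical) privacy policy forbids us to distribute.
    return {key: value for key, value in event.items() if key not in SENSITIVE_FIELDS}

def sample_source() -> list[Event]:
    # Stand-in for a real source such as an S3 bucket or a Splunk export.
    raw_lines = ['{"payment_id": "p-123", "status": "failed", "card_number": "4111111111111111"}']
    return [json.loads(line) for line in raw_lines]

def print_destination(event: Event) -> None:
    # Stand-in for a dashboard, an alerting tool, or a long-term archive.
    print(json.dumps(event, sort_keys=True))

pipeline = Pipeline(
    sources=[sample_source],
    processors=[enrich_with_context, redact_sensitive],
    destinations=[print_destination],
)
pipeline.run()

The same structure scales from this toy example to the real thing: swap the stand-ins for genuine sources and destinations, and the processors become the place where formats are normalized and policies enforced.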

The combination of our new telemetry pipelines and the proof our game days provided left us better prepared than ever for future unknown incidents. We couldn’t be ready for everything, but we could be better prepared, and that counted.

But we’d only just scratched the surface.

Rich, Useful, Timely, Secure

Telemetry pipelines helped us turn our flood of telemetry data into a rich, timely, compliant, secure, and, ultimately, useful set of insights to support one of the harshest environments: incident response.

And incident response was only the beginning.

An Investment, Not a Free Lunch

Creating telemetry pipelines requires mastering some new concepts and tools. Our first challenge was mindset: we had to change our perspective on what telemetry data was. Instead of seeing telemetry data as simply a peripheral artifact, we needed to reframe our view of the data as highly valuable streams that were worth investing in to support critical business functions. We’d viewed telemetry data as a side effect, but under the lens of telemetry pipelines, it became far more important.

We also had to learn new concepts and invest in some new tools to craft our telemetry pipelines. In Chapters 2 and 3, you’re going to explore the concepts, domain language, and types of components that will make up your telemetry pipelines so that you are ready to build your own. Then, in Chapters 4 and 5, you will explore how you can use pipelines to control the cost of your observability and embrace security compliance and regulations while not losing the power and potential of your telemetry data.
