Chapter 4. Containing the Cost

“Show me the money!”

Jerry Maguire

Volume is the problem, but not just because it is hard to navigate and work with. Data, especially in the cloud, costs money. Sometimes lots of money. If a potential $65 million bill doesn’t scare you, then your organization is doing exceptionally well. For the rest of us, cost really matters.

Processors Are the Key

In Chapters 2 and 3, you got a glimpse of how telemetry pipelines can help with cost. The key processors for controlling cost are the deduplicate, route, reduce, sample, filter, and conversion processors.

Deduplicate Where You Can

When it comes to cost, the deduplicate processor is your brutally simple friend. By applying some simple logic, the deduplicate processor can reduce your telemetry data streams significantly without losing any information. This is why the first step in designing a telemetry pipeline must be getting an understanding of your data, so you can effectively determine which processors to target at which components of that data.
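To make the idea concrete, here is a minimal sketch in Python of the kind of logic a deduplicate processor might apply. The event fields, the fingerprinting approach, and the 60-second window are assumptions for illustration, not the configuration syntax of any particular pipeline product.

# A minimal deduplicate sketch: drop events whose key fields have already
# been seen within a short window. Illustrative only; field names are assumptions.
import hashlib
import time

class DeduplicateProcessor:
    def __init__(self, key_fields, window_seconds=60):
        self.key_fields = key_fields   # the fields that define a "duplicate"
        self.window = window_seconds   # how long to remember a fingerprint
        self.seen = {}                 # fingerprint -> last-seen timestamp

    def process(self, event):
        # Fingerprint only the fields that matter for duplication.
        key = "|".join(str(event.get(f, "")) for f in self.key_fields)
        fingerprint = hashlib.sha256(key.encode()).hexdigest()
        now = time.time()
        # Forget fingerprints that have aged out of the window.
        self.seen = {fp: ts for fp, ts in self.seen.items() if now - ts < self.window}
        if fingerprint in self.seen:
            return None                # duplicate: drop it, nothing new is lost
        self.seen[fingerprint] = now
        return event                   # first occurrence: pass it through

Notice that the sketch only works well if you know which fields make two events “the same”—which is exactly why understanding your data has to come first.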

Choose Your Route Carefully

At the simplest end of the scale, you can merely choose where your telemetry data goes. If you want to optimize your spend on Splunk, you can ensure that only the data necessary for Splunk is routed to it. The remaining data could be routed to low-cost storage, such as S3, so that nothing is lost just in case. It’s that simple, sort of.
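As a sketch of what that routing decision might look like, the following Python fragment sends only what the expensive tool needs to Splunk and archives everything to S3. The send_to_splunk and send_to_s3_archive functions are placeholders, and the matching fields are assumptions; a real pipeline would express this as routing rules in its own configuration.

# Placeholder sinks standing in for real destination integrations.
def send_to_splunk(event):
    print("-> splunk:", event)

def send_to_s3_archive(event):
    print("-> s3 archive:", event)

def route(event):
    # Keep the expensive destination lean: only route what Splunk actually needs.
    if event.get("sourcetype") in ("auth", "firewall") or event.get("level") == "ERROR":
        send_to_splunk(event)
    # Everything also lands in low-cost storage, so nothing is lost just in case.
    send_to_s3_archive(event)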

The art here is to ensure that you are still routing something useful to your destinations. A router might not give you the right level of intelligence to create a stream that is ultimately useful to your tooling destinations. You could end up paying for a first-class tool that is utterly hamstrung by third-class data.

As always, and with all of the cost-strategy processors, it’s going to be a trade-off, but at least you get to make that trade-off in your telemetry pipelines.

Reduce, Sample, or Filter It Down (Carefully)

Sampling and filtering lose data. Say it with me: sampling and filtering, by definition, lose data. That sounds bad, and it is; however, both can still be used with care when you really have to. The good news is that a telemetry pipeline gives you the choice to sample or filter your data, if you need to, before the stream hits an expensive destination.

Downsampling or selectively filtering your telemetry data using a sample or filter processor will reduce the number of events you see in the resulting stream.1 For example, a simple sampling strategy is to ignore every second event; this is called 1/2 sampling. Only sending every third event is 1/3 sampling. This type of sampler is very common and is, for perhaps obvious reasons, called a 1/n sample processor.

You can also configure a 1/n sample processor to ignore the sampling when an event matches a particular pattern, so that crucial events are noticed amid the noise and let through with no downsampling applied.
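The following Python sketch shows both behaviors: plain 1/n sampling plus an “always keep” exception for events that match a pattern. The class name, the message field, and the example pattern are assumptions chosen for illustration rather than any product's sampler.

import re

class SampleProcessor:
    def __init__(self, n, keep_pattern=None):
        self.n = n                     # keep 1 out of every n events
        self.counter = 0
        self.keep_pattern = re.compile(keep_pattern) if keep_pattern else None

    def process(self, event):
        # Exception case: crucial events bypass the sampling entirely.
        if self.keep_pattern and self.keep_pattern.search(event.get("message", "")):
            return event
        self.counter += 1
        if self.counter % self.n == 0:
            return event               # the 1 in n that we keep
        return None                    # sampled out

For example, SampleProcessor(n=3, keep_pattern=r"payment.*failed") would pass roughly a third of the stream through while never dropping an event that mentions a failed payment.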

Sampling can be extremely helpful in keeping the flood of data under control, particularly where individual events are less useful and a broad brush is all that is needed. But you have to use it carefully and configure those exception cases so that you don’t accidentally sample out the one important piece of information that you need.

Filtering is more controlled. With a filter processor, you are looking for events that match a specific set of criteria so that, when they match, they can be dropped. You can filter out whole events or just drop single fields, reducing the data in the stream and lowering costs at any of your destinations.
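As a small illustration, this Python sketch drops whole events that match a criterion and strips a single field from everything else. The /healthz path and the debug_payload field are assumptions chosen for illustration, not criteria any pipeline prescribes.

def filter_event(event):
    # Drop whole events that match the criteria (here, noisy health checks).
    if event.get("path") == "/healthz":
        return None
    # Or drop a single field to shrink every event that passes through.
    event.pop("debug_payload", None)
    return event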

In an ideal world, which none of us live in, we wouldn’t sample or filter at all other than to improve the condition of the events using something gentle like deduplication. There is, however, a get-out-of-jail-free card available with telemetry pipelines. If you route your telemetry data to an archive, such as S3, before your sampler—something much cheaper than a full observability tool destination—then you can persist a record of the raw data. That repository becomes an asset should you ever need to suck up the cost and bring that data back into your observability tools.2 It’s not perfect, but it gives you options you would not have had without a telemetry pipeline at your disposal.
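In pipeline terms, the trick is simply one of ordering: the archive route sits before the sampler. This sketch reuses the placeholder sinks and sampler from the earlier sketches in this chapter to show the shape of it; it is an illustration of the ordering, not a real pipeline definition.

def pipeline(event):
    send_to_s3_archive(event)          # the raw copy persists cheaply, whatever happens next
    sampled = sampler.process(event)   # downsampling happens only after the archive route
    if sampled is not None:
        send_to_splunk(sampled)        # only the trimmed stream reaches the expensive tool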

Converting Events or Logs to Metrics

Another useful technique to reduce data volumes and increase insights is converting logs to metrics. Logs are often unstructured and voluminous, but you can derive metric data from them by parsing the log data to extract specific information and then using that information to create metrics. Or you can count specific events within the log data and use that count to create a metric.

As an example, by identifying a specific value within a log message, such as the time it took to serve a request, you can create a new metric. Now you can use this distilled information for your analysis in security information and event management (SIEM) or for visualization in tools like Grafana. This not only reduces the log volume, but also helps extract business insights.
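As a sketch of that idea, the following Python fragment counts requests per path and sums their durations from log lines shaped like “GET /checkout 200 143ms”. The log format, the regex, and the metric names are assumptions; a conversion processor would emit proper metric events rather than module-level dictionaries.

import re

LOG_PATTERN = re.compile(
    r"(?P<method>\w+) (?P<path>\S+) (?P<status>\d{3}) (?P<duration_ms>\d+)ms"
)

request_count = {}      # metric: number of requests seen, per path
duration_sum_ms = {}    # metric: total time spent serving requests, per path

def log_to_metrics(line):
    match = LOG_PATTERN.search(line)
    if not match:
        return
    path = match.group("path")
    # Count specific events within the log data...
    request_count[path] = request_count.get(path, 0) + 1
    # ...and extract a specific value (time to serve the request) as a metric.
    duration_sum_ms[path] = duration_sum_ms.get(path, 0) + int(match.group("duration_ms"))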

Cost, Controlled

By carefully using route, deduplicate, reduce, sample, filter, and conversion processors, you can at least control how much of your telemetry data ends up where. You can also choose to park your data in locations, such as S3, that can be cheap options for storage and, through telemetry pipelines, can be easily reborn into streams for processing and channeling in the future. It’s with this choice and flexibility that telemetry pipelines really help you shine when managing the cost of your observability.

But in addition to cost, there is one other challenge with telemetry data that we’ve only hinted at so far. The elephant in the room. Data isn’t just about cost, it’s about control. Serious control. Serious go-to-jail-if-you-get-it-wrong consequences. We are, of course, talking about compliance.

By extracting metrics from raw events and log data, you can discern more value from your telemetry data assets and see where your money is being spent on the ingress and egress of that data. And this is only the beginning.

The visibility and control your telemetry pipelines give you can help you deliver better business insights and even improve your security posture by making sure the right data is sent to your security tools at the right time. You can accelerate resolution times and improve customer experience as well as help ensure that you meet all required data compliance regulations. Next, it’s time to explore that specific case. Risk and compliance, anyone?

1 See “Adding Processors” for a short description of these processors.

2 Retention management is often built into cloud storage, such as S3 lifecycle policies.
