To whom it may concern,
In the book "Distributed Tracing in Practice", the reader is advised
to not use tracing for very long operations. But the book indicates
that there are options in this scenario. However, so far I can't find
where in the book these options are discussed. If you know where to
look, any pointers would be appreciated :).
For context, I'm thinking about stuff like tracing builds, deploys, etc. In parts of the book this seems to be encouraged, but the builds and deploys I'm working with are (currently) longer than a couple minutes.
Below is the quote I am referring to:
"Distributed request tracing works best when the entire traced
operation takes place in a fairly short (minutes) time span. There are
several reasons for this, such as data retention periods for trace
analyzers and sampling considerations (which we’ll get into later),
but for now let’s just say that it’s not a great fit. If you are
trying to trace operations with an extremely long execution time,
don’t fret, there are options to address those use cases."
Thanks a bunch,
PS -- it's a really good book, good job everyone!
Note from the Author or Editor:
Thanks for the feedback! I went back and looked, and we don't really address this use case in the book (whoops!). I can give you a short explanation of why, cover some things that have changed in the space since we wrote it, and hopefully give you some pointers going forward.
Many tracing systems (especially span-based, off-the-shelf ones) are optimized for request/response RPC transaction flows like you'd associate with HTTP client/server operations. The simplest rationale I can give you is that those flows provide the greatest amount of business value for monitoring: they tend to be the things you care about the most when they break. However, there's nothing inherent in span-based tracing that makes it unsuitable for CI/CD. Implementing it there is broadly similar to implementing it for any other service, although passing trace context between build tools and systems may be more challenging.
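To make the context-passing point concrete, here is a minimal sketch of one way to hand trace context from one build step to the next: put a W3C Trace Context `traceparent` value in an environment variable for the child process. The variable name `TRACEPARENT` and the helpers `new_traceparent`, `child_traceparent`, and `run_step` are my assumptions for illustration, not something from the book or from a particular tool, and a real tracing SDK would normally generate and parse these values for you.

```python
import os
import re
import secrets
import subprocess

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace: random 16-byte trace id, random 8-byte span id, sampled."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent):
    """Keep the parent's trace id and flags, but mint a fresh span id.

    If the incoming value is malformed, start a new trace instead of failing.
    """
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def run_step(cmd, traceparent):
    """Run one build step with the context exported to its environment.

    TRACEPARENT is an assumed variable name; the step has to know to read it.
    """
    env = dict(os.environ, TRACEPARENT=traceparent)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
```

A pipeline runner would call `new_traceparent()` once per build, then `run_step(cmd, child_traceparent(current))` for each tool it launches, so every step that reads the variable can attach its spans to the same trace.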
In general, I'd consider the following when deciding how to trace CI/CD processes (or really, any operation with a long wall clock time):
- "Do I have one big trace, or several smaller ones?" Consider a single deployment that is composed of multiple assemblies or programs. If you're only building what changed as part of a deployment, it stands to reason that certain intermediate steps are going to be cached. It might make more sense to create a shared attribute for whatever logical grouping you wish to consider the 'primary' key, and then create smaller traces, perhaps even per-service traces, that share this attribute. This can also ease the burden of passing trace context between systems, as each service creates its own trace.
- "What am I trying to measure?" Traditional tracing systems are focused on 'golden signals' such as rate/error/duration metrics, but these may not be as valuable for CI/CD applications. The answer to this question can inform your trace design quite heavily. If the primary operation you wish to measure is "how long does it take for a build to deploy to <x> environment", then that should determine which services in your build pipeline generate spans, and how they fit together.
- "Can I re-use existing work?" You probably already have logs for your build and deployment pipeline; it may be worthwhile to evaluate translating them into traces, rather than trying to add tracing into the code itself. Consider tools such as the OpenTelemetry Collector, an open source utility that can be easily extended, and use it to ingest and transform your log files into traces. In general, I'd suggest that CI/CD applications have less temporal urgency (as in, they're a bit lower-priority in terms of analysis), so it's more acceptable to transform logs into traces than to perform white-box instrumentation of your CI/CD code and tools.
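As a rough illustration of the first and third points together, the sketch below turns existing build log lines into span-like records, giving each service its own trace while tagging every span with a shared deployment attribute. Everything specific here is an assumption for the example: the log format, the `Span` record, the `spans_from_log` helper, and the `deployment.id` attribute name are hypothetical, and it uses only the Python standard library rather than a real tracing SDK or the OpenTelemetry Collector.

```python
import re
import secrets
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <service> <step> <start|end>"
LINE_RE = re.compile(r"^(\S+) (\S+) (\S+) (start|end)$")


@dataclass
class Span:
    trace_id: str
    name: str
    start: datetime
    end: datetime
    attributes: dict = field(default_factory=dict)

    @property
    def duration_s(self):
        return (self.end - self.start).total_seconds()


def spans_from_log(lines, deployment_id):
    """Build one trace per service; all spans share the deployment attribute."""
    trace_ids = {}   # service -> trace id (one small trace per service)
    open_steps = {}  # (service, step) -> start time of a step still running
    spans = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not step markers
        timestamp, service, step, event = match.groups()
        when = datetime.fromisoformat(timestamp)
        if event == "start":
            open_steps[(service, step)] = when
            continue
        started = open_steps.pop((service, step), None)
        if started is None:
            continue  # an "end" with no matching "start"
        trace_id = trace_ids.setdefault(service, secrets.token_hex(16))
        attributes = {"service.name": service, "deployment.id": deployment_id}
        spans.append(Span(trace_id, step, started, when, attributes))
    return spans


LOG = [
    "2024-01-01T10:00:00 api build start",
    "2024-01-01T10:00:00 web build start",
    "2024-01-01T10:04:30 api build end",
    "2024-01-01T10:05:00 api deploy start",
    "2024-01-01T10:07:00 web build end",
    "2024-01-01T10:12:00 api deploy end",
]
spans = spans_from_log(LOG, "deploy-42")
for span in spans:
    print(span.attributes["service.name"], span.name, span.duration_s)
```

Querying your tracing backend for `deployment.id` would then pull back every per-service trace for one deployment, without any of the services having had to pass trace context to each other.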
I hope this is useful information, and regret that we didn't add this advice to the text.