I recently sat down with Ben Sigelman, cofounder and CEO of LightStep, to discuss important considerations when adopting distributed tracing tools and the realities of monitoring complex, distributed systems. Here are some highlights from our talk.

What are distributed tracing tools and what problem(s) do they solve?

Good monitoring is and always has been about telling clear stories. When your system had three components, you could understand application behavior by looking at single components in isolation. With the transition to microservices, individual end-user requests often traverse dozens if not hundreds of system components; tools that consider each microservice in isolation simply do not tell clear stories anymore, and that’s a massive operational problem.

Simply put, distributed tracing tells these stories about transactions in distributed systems; by following the transaction as it propagates from service to service, distributed tracing tools illuminate the relationship between user-visible behavior at the top of the stack and the complex mechanics of the distributed systems underneath.
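
As a rough illustration of what "following the transaction" can mean in practice, here is a minimal sketch in Python. It assumes each service copies a trace ID and the caller's span ID into the headers of its outgoing calls. The header names (`x-trace-id`, `x-span-id`), the three services, and the in-memory `COLLECTED` list are all hypothetical stand-ins, not the format of any particular tracing system.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one transaction
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # span_id of the caller, None at the root
    service: str
    operation: str


# Stand-in for a tracing backend; a real system would ship spans elsewhere.
COLLECTED: List[Span] = []


def start_span(service: str, operation: str, headers: Dict[str, str]) -> Span:
    """Continue the trace described by the incoming headers, or start a new one."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    parent_id = headers.get("x-span-id")
    span = Span(trace_id, uuid.uuid4().hex[:16], parent_id, service, operation)
    COLLECTED.append(span)
    return span


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Headers a service attaches to its downstream calls (hypothetical names)."""
    return {"x-trace-id": span.trace_id, "x-span-id": span.span_id}


def inventory_service(headers: Dict[str, str]) -> None:
    start_span("inventory", "reserve_item", headers)


def checkout_service(headers: Dict[str, str]) -> None:
    span = start_span("checkout", "place_order", headers)
    inventory_service(outgoing_headers(span))


def frontend() -> str:
    span = start_span("frontend", "GET /buy", {})
    checkout_service(outgoing_headers(span))
    return span.trace_id
```

Calling `frontend()` leaves three spans in `COLLECTED` that share one trace ID and form a parent-child chain, which is the raw material a tracing tool needs to tell the story of a single request.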

What are some of the considerations and challenges involved in adopting distributed tracing tools?

There are three main considerations.

First, is it necessary? Organizations that are happily using legacy monolithic architectures may see some modest benefits by adopting distributed tracing, but generally I don’t recommend it until an organization is starting a transition to microservices. That said, it is vitally important to think about distributed tracing before it’s too late. If your organization moves to microservices and has an underdeveloped distributed tracing story, devs and devops alike will be unable to make sense of even basic system behavior and misbehavior.

The second consideration is instrumentation. As a project originator I am admittedly biased, but OpenTracing (opentracing.io) was specifically designed to reduce the cost of distributed system instrumentation for all parties—open-source software maintainers, tracing system authors, and application programmers.

Your goal is to minimize the amount of time you spend writing instrumentation while maximizing coverage of your distributed system. Using a standard set of technologies in your microservices architecture will reduce the number of integrations you need to think about, so that’s a best practice (and not just for the sake of monitoring). Beyond that, try to avoid coupling instrumentation to specific monitoring tools. OpenTracing helps with this by focusing on application semantics, and less on the details of wire protocols and proprietary terminology.
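
To illustrate the point about not coupling instrumentation to a specific monitoring tool, here is a small Python sketch. The `Tracer` interface, `NoopTracer`, `RecordingTracer`, and `charge_card` are hypothetical names invented for this example. This is not the OpenTracing API, only the general shape of the idea: application code depends on a small interface, and the backend behind it can be swapped.

```python
import time
from contextlib import contextmanager
from typing import ContextManager, Dict, List, Protocol, Tuple


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface the application codes against."""

    def span(self, operation: str, **tags: str) -> ContextManager[None]: ...


class NoopTracer:
    """Used when tracing is disabled; instrumentation becomes nearly free."""

    @contextmanager
    def span(self, operation, **tags):
        yield


class RecordingTracer:
    """Stand-in for a real backend: keeps finished spans in memory."""

    def __init__(self) -> None:
        self.finished: List[Tuple[str, Dict[str, str], float]] = []

    @contextmanager
    def span(self, operation, **tags):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the operation name, its tags, and elapsed seconds.
            self.finished.append((operation, tags, time.perf_counter() - start))


def charge_card(tracer: Tracer, order_id: str) -> str:
    """Application code: instrumented once, unaware of which backend is in use."""
    with tracer.span("charge_card", order_id=order_id):
        return f"charged:{order_id}"
```

Because `charge_card` only knows about the `Tracer` interface, moving from one tracing backend to another means changing which object is passed in, not rewriting the instrumentation.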

The third, final, and most important consideration is use cases and workflows. It’s exciting to see the first timing diagram in your system, but that is the beginning, not the end. How is your team going to find the right traces during a production emergency? How are you going to apply these insights beyond vanilla critical path analysis? Sit down with an expert (I’m always happy to get emails out of the blue on this front!) and understand what you’re building. The instrumentation is the cost, not the benefit, so don’t forget to think about workflows and value.

What do people have to consider when monitoring complex, distributed systems (such as microservice architectures)?

Everything. Honestly, it’s a nightmare if you’re using tools from five years ago. As I mentioned earlier, those tools do not tell clear stories about distributed transactions. But even if they did, they become expensive in plain dollar terms once you move to microservices. Since the monitoring data volume increases superlinearly with the number of system components, ROI gets precarious.

Furthermore, the number of possible root causes in your system grows with the number of services. Monitoring used to be about enumerating root causes; this is no longer feasible.

In my worldview, monitoring in complex distributed systems has two core components:

  1. Enumeration and precise statistical monitoring of important symptoms—things like user-visible latency, internal and external SLAs, and so on.
  2. Automated root-cause analysis for anomalies observed in those statistics. You need tools that help an operator root-cause an anomaly directly. It’s not sufficient to dump them into a giant, undifferentiated log aggregator or a dashboard with 150 graphs that suggest correlated failure without getting into causation.
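
The first component can be sketched in a few lines of Python. The example below uses the nearest-rank percentile definition (one of several in common use) and a hypothetical `breaches_slo` helper. The sample data and threshold in the usage note are synthetic, chosen only to show why a mean can hide a symptom that a high percentile exposes.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that is >= pct% of all samples."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def breaches_slo(samples: Sequence[float], pct: float, threshold_ms: float) -> bool:
    """True if the chosen percentile of latency exceeds the threshold."""
    return percentile(samples, pct) > threshold_ms


def summarize(samples: Sequence[float]) -> dict:
    """Mean alongside median and 99th percentile, for comparison."""
    return {
        "mean": mean(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
    }
```

With 98 synthetic requests at 50 ms and 2 at 2000 ms, `summarize` reports a mean of 89 ms and a median of 50 ms, while the 99th percentile is 2000 ms. An alert on the mean or median against a 500 ms threshold would stay quiet; `breaches_slo(samples, 99, 500)` would not.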

How do you optimize the performance of these kinds of systems?

The first thing is just to be able to measure it in a principled way. If you’re already doing that, you’re better off than most in our industry. Performance monitoring must focus on high-percentile latency, where the scary problems show up before they’re emergencies. It’s also mandatory to understand the critical path for business-critical transactions. The easiest way to do this is via distributed tracing, though centralized logging with correlation IDs can help, too. Finally, it’s impossible to think about microservice performance in a dev environment. Modern systems are non-linear, and unfortunately there’s still no substitute for production when it comes to real-world performance; this just underlines the importance of CI/CD and automated release qualification and rollback.
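
To make "critical path" a little more concrete, here is a deliberately simplified Python sketch. It assumes a trace is available as a list of spans with start and end times, and it approximates the critical path as the chain of last-finishing children starting from the root. Real tools handle overlapping and asynchronous work more carefully. The `TimedSpan` type and the span names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TimedSpan:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float


def critical_path(spans: List[TimedSpan]) -> List[str]:
    """Follow, from the root, the child that finishes last at each level.

    This is a simplification: it assumes the parent waits on its
    last-finishing child, which is typical for synchronous fan-out.
    """
    children: Dict[Optional[str], List[TimedSpan]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    roots = children.get(None, [])
    if len(roots) != 1:
        raise ValueError("expected exactly one root span")

    path: List[str] = []
    current: Optional[TimedSpan] = roots[0]
    while current is not None:
        path.append(current.name)
        kids = children.get(current.span_id, [])
        current = max(kids, key=lambda s: s.end_ms) if kids else None
    return path
```

For a made-up trace in which a root request calls `auth` (0 to 20 ms) and `checkout` (20 to 115 ms), and `checkout` in turn calls `inventory` (25 to 40 ms) and `payment` (25 to 110 ms), the function returns the root, then `checkout`, then `payment`. Those are the spans worth optimizing first; speeding up `inventory` would not shorten the request.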

You’re speaking at the O'Reilly Velocity Conference in San Jose this June about distributed tracing. What presentations are you looking forward to attending while there?

I’m looking forward to “Lyft's Envoy: Experiences operating a large service mesh,” “Google Cloud Spanner: Global consistency at scale,” “The verification of a distributed system,” “Distributed tracing and the future of chargeback and capacity planning,” and “The road to chaos.” And the keynotes look awesome.

Article image: Fair (source: FrankWinkler).