Distributed Tracing in Practice
by Austin Parker, Daniel Spoonhower, Jonathan Mace, Ben Sigelman, Rebecca Isaacs
Chapter 11. Beyond Individual Requests
You’ve already seen how traces capture useful information about the end-to-end behavior of individual requests. This includes the time taken by each individual RPC, how much data was transferred at each hop, timeouts, and error responses. By inspecting a single trace carefully, you can often explain why the request took the time that it did. For example, you might see that a particular request missed in the cache. Perhaps a service returned an exceptionally large response record that took a long time to serialize and deserialize. Maybe there’s a straggler in a large RPC fanout that responds many milliseconds after its peers. Perhaps the trace reveals the dreaded staircase pattern, where RPCs that should run in parallel are in fact executing serially.
Any one of these situations would reveal something important about that particular trace, but as the span timing diagram in Figure 11-1 illustrates, it’s hard to interpret these behaviors in isolation. What you can’t tell from individual traces is how often the situation occurs, and in response to which types of requests. Therefore, should you—the service operator or owner—take some action to fix the problem, or is it a one-off that is unlikely to happen again in your lifetime? Which of the suspicious-looking parts of a trace are actually unusual? By comparing a single trace to an aggregate, or one aggregate set to another, you can learn contextual information that helps answer such questions. ...
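To make the idea of comparing a single trace to an aggregate concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method taken from any tracing system: the operation names, the durations, the `percentile_rank` and `unusual_spans` helpers, and the 0.95 threshold are all hypothetical. Each span duration in one trace is ranked against historical durations for the same operation, and only the spans that rank near the top are flagged.

```python
def percentile_rank(value, population):
    """Return the fraction of historical values that are <= value."""
    if not population:
        raise ValueError("population must not be empty")
    return sum(1 for v in population if v <= value) / len(population)


def unusual_spans(trace, history, threshold=0.99):
    """Flag operations whose duration ranks at or above `threshold`.

    `trace` maps operation name -> duration (ms) in the trace under study.
    `history` maps operation name -> durations (ms) from earlier traces.
    Both shapes are assumptions made for this sketch.
    """
    flagged = {}
    for operation, duration_ms in trace.items():
        past = history.get(operation)
        if not past:
            continue  # no aggregate to compare against
        rank = percentile_rank(duration_ms, past)
        if rank >= threshold:
            flagged[operation] = rank
    return flagged


# Hypothetical aggregate: cache lookups are uniformly fast, while the
# database query is slow roughly three times out of ten.
history = {
    "cache.get": [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.2, 1.1, 1.0],
    "db.query": [20.0, 25.0, 22.0, 180.0, 21.0, 190.0, 24.0, 23.0, 175.0, 26.0],
}

# Hypothetical single trace: both spans look slow when viewed in isolation.
trace = {"cache.get": 9.5, "db.query": 178.0}

flagged = unusual_spans(trace, history, threshold=0.95)
print(flagged)  # {'cache.get': 1.0}
```

In this made-up data, the 178 ms database query looks alarming in isolation but ranks at only 0.8 against its history, so it is a recurring behavior rather than a one-off. The 9.5 ms cache lookup is slower than anything previously recorded for that operation, which makes it the part of the trace that is actually unusual.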