Chapter 9. Restoring Baseline Performance

In the previous chapter, we discussed approaches to improving baseline performance, usually with the goal of a better user experience, lower costs, or both. In this chapter, we’ll consider how distributed tracing can help when a change (intentional or not) has degraded performance and you need to restore it to its previous level quickly.

How your organization approaches problems like this may vary, but most organizations follow some sort of incident response plan. Such a plan covers how an incident (a partial or complete interruption in service, or a significant performance degradation) is identified, how team members are notified, how they respond, and, once the incident is over, what follow-up is required. While there are other types of incidents besides those related to performance (for example, security breaches), we will frame many of the approaches here in terms of incident response.

In this chapter we will also focus on performance from the perspective of a single service. Most developers are responsible for at most a small number of services, and so it’s natural to frame performance issues in terms of the performance of a single service. Of course, what ultimately matters is the overall application performance as perceived by your users; much of this chapter will discuss how to relate application performance to the performance of individual services.

As the focus ...
