Improving reliability over time is only possible if you start from a known baseline and can track progress. “Outalator,” our outage tracker, is one of the tools we use to do just that. Outalator is a system that passively receives all alerts sent by our monitoring systems and allows us to annotate, group, and analyze this data.
Systematically learning from past problems is essential to effective service management. Postmortems (see Chapter 15) provide detailed information for individual outages, but they are only part of the answer. They are written only for incidents with a large impact, so issues that are individually small but frequent and widespread fall outside their scope. Similarly, postmortems tend to provide useful insights for improving a single service or set of services, but they may miss opportunities that would have a small effect in any individual case, or a poor cost/benefit ratio when judged for one service alone, yet a large horizontal impact across many services.[1]
We can also get useful information from questions such as, “How many alerts per on-call shift does this team get?”, “What’s the ratio of actionable/nonactionable alerts over the last quarter?”, or even simply “Which of the services this team manages creates the most toil?”
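As a rough illustration of how such questions could be answered once alerts are collected in one place, the sketch below computes three simple aggregates over a list of alert records. It is not Outalator's actual data model or API: the `Alert` record, its field names, and the three helper functions are all hypothetical, and the per-service alert count is only a crude proxy for toil.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Alert:
    """One alert record. Every field name here is a hypothetical assumption."""
    team: str
    service: str
    shift_id: str                      # on-call shift that received the alert
    actionable: Optional[bool] = None  # None means nobody has annotated it yet


def mean_alerts_per_shift(alerts: Iterable[Alert], team: str) -> float:
    """Average alert count over the shifts that received at least one alert.

    Shifts with zero alerts never appear in the data, so they are not counted.
    """
    per_shift = Counter(a.shift_id for a in alerts if a.team == team)
    return sum(per_shift.values()) / len(per_shift) if per_shift else 0.0


def actionable_fraction(alerts: Iterable[Alert], team: str) -> Optional[float]:
    """Share of annotated alerts marked actionable; None if none are annotated."""
    labeled = [a.actionable for a in alerts
               if a.team == team and a.actionable is not None]
    return sum(labeled) / len(labeled) if labeled else None


def alerts_by_service(alerts: Iterable[Alert], team: str) -> List[Tuple[str, int]]:
    """Services owned by the team, ordered from most to fewest alerts."""
    return Counter(a.service for a in alerts if a.team == team).most_common()
```

Unannotated alerts are excluded from the actionable fraction rather than counted as nonactionable, so the result reflects only alerts a human has actually reviewed. Whether that is the right choice depends on how consistently a team annotates its alerts.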
At Google, all alert notifications for SRE share a central replicated system that tracks whether a human has acknowledged receipt of the notification. If ...