Incident Metrics in SRE

Abstract

It is commonplace to measure the improvement that follows a process change, a product purchase, or a technological change. In reliability engineering, the statistics measured are often mean time to recovery (MTTR) or mean time to mitigation (MTTM), and they are sometimes used to evaluate improvements or track trends.

In this report, I use a simple Monte Carlo simulation process (which can be applied in many other situations), together with statistical analysis, to demonstrate that these statistics are poorly suited for decision making or trend analysis in the context of production incidents. In their place, I propose better ways to make the same kind of measurement in some contexts.
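To make the idea of a Monte Carlo check concrete, here is a minimal sketch in Python. It is not the procedure used in the report. It assumes, purely for illustration, that incident durations follow a lognormal distribution with made-up parameters, and the function name, sample sizes, and 10% threshold are likewise hypothetical. It draws a "before" and an "after" set of incidents from the same distribution, so nothing has actually changed, and counts how often MTTR nevertheless appears to improve.

```python
import random
import statistics


def apparent_improvement_rate(n_incidents=100, trials=2000,
                              threshold=0.10, seed=0):
    """Return the fraction of trials in which MTTR appears to drop by at
    least `threshold`, although both samples share one distribution."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        # Hypothetical incident durations in minutes. The lognormal shape
        # and its parameters (mu=4.0, sigma=1.2) are assumptions.
        before = [rng.lognormvariate(4.0, 1.2) for _ in range(n_incidents)]
        after = [rng.lognormvariate(4.0, 1.2) for _ in range(n_incidents)]
        mttr_before = statistics.fmean(before)
        mttr_after = statistics.fmean(after)
        # Relative reduction in the mean, with no real change applied.
        if (mttr_before - mttr_after) / mttr_before >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rate = apparent_improvement_rate()
    print(f"Apparent >=10% MTTR improvement in {rate:.0%} of trials")
```

Under these assumed parameters, a sizable share of trials shows an "improvement" that is pure sampling noise. That is the kind of effect a simulation like this is meant to expose. Real incident data may be distributed quite differently.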

Introduction

One of the key responsibilities of a site reliability engineer (SRE) is to manage incidents in the production system(s) they are responsible for. During an incident, SREs contribute to debugging the system, choosing the right immediate mitigation, and organizing the incident response if it requires broader coordination.

But the responsibility of an SRE is not limited just to managing incidents. Some of the work involves prevention, such as devising robust strategies for performing changes in production or automatically responding to problems and reverting the system to a known-safe state. The work also includes mitigation, such as better processes for communication, improvements in monitoring, or development of tooling that provides assistance during ...
