Reducing Mean Time to Detection for High-Severity Incidents
Introduction
High-severity incident (SEV) is a term used at companies like Amazon, Dropbox, and Gremlin. Common types of SEVs are availability drops, product feature issues, data loss, revenue loss, and security risks. SEVs are measured based on a high-severity scale; they are not low-impact bugs. They occur when coding, automation, testing, and other engineering practices create issues that reach the customer. We define time to detection (TTD) as the interval from when an incident starts to the time it was assigned to a technical lead on call (TL) who is able to start working on resolution or mitigation.
Based on our experiences as Site Reliability Engineers (SREs), we know it is possible for SEVs to exist for hours, days, weeks, and even years without detection. Without a focused and organized effort to reduce mean time to detection (MTTD), organizations will never be able to quickly detect and resolve these damaging problems. It is important to track and resolve SEVs because they often have significant business consequences. We advocate proactively searching for these issues using the specific methodology and tooling outlined in this book. If the SRE does not improve MTTD, it is unlikely that they will be able to detect and resolve SEV 0s (the highest and worst-case severity possible) within the industry-recommended 15 minutes. Many companies that do not prioritize reducing MTTD will identify SEVs only when customers ...
Get Reducing MTTD for High-Severity Incidents now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.