Incident Metrics in SRE

by Stepan Davidovic
March 2021
Content level: Intermediate to advanced
34 pages
52m
English
O'Reilly Media, Inc.

Overview

Site reliability engineers often use MTTx metrics to evaluate improvements or track trends. But is either MTTR (mean time to recovery) or MTTM (mean time to mitigation) ideal for decision making or trend analysis when it comes to production incidents? This report not only demonstrates how and why MTTx metrics come up short but also proposes ways to think about metrics differently to get the answers you want.

Google SRE Stepan Davidovic uses a Monte Carlo simulation to show that MTTx metrics are poorly suited for decision making or trend analysis in the context of production incidents. Applying these metrics is trickier than it seems and can be dangerously misleading in many practical scenarios. With this report, you'll explore alternative methods for achieving these measurements.

  • Work with a simple model of the incident lifecycle and timings using empirical datasets
  • Use an analytical approach to get a clear picture of what your incident durations look like
  • Focus on narrow questions of the incident lifecycle rather than analyze incident statistics using MTTx
  • Explore alternative methods for achieving your measurements


Publisher Resources

ISBN: 9781098103163