Book description
Site reliability engineers often use MTTx metrics to evaluate improvements or track trends. But is either MTTR (mean time to recovery) or MTTM (mean time to mitigation) ideal for decision making or trend analysis when it comes to production incidents? This report not only demonstrates how and why MTTx metrics come up short but also proposes ways to think about metrics differently to get the answers you want.
Google SRE Stepan Davidovic uses a Monte Carlo simulation to show how MTTx metrics are poorly suited for decision making or trend analysis in the context of production incidents. Applying these metrics is trickier than it seems and can be dangerously misleading in many practical scenarios. With this report, you'll explore alternative methods for achieving these measurements.
- Work with a simple model of the incident lifecycle and timings using empirical datasets
- Use an analytical approach to get a clear picture of what your incident durations look like
- Focus on narrow questions of the incident lifecycle rather than analyze incident statistics using MTTx
- Explore alternative methods for achieving your measurements
Product information
- Title: Incident Metrics in SRE
- Author(s): Stepan Davidovic
- Release date: March 2021
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781098103156