Chapter 30. Methodological Debugging
Avishai Ish-Shalom and Nati Cohen
SREs often debug in production, under stress and flooded with information. Debugging can seem like a mysterious, innate talent, but fortunately it is not: you can follow a structured, methodological process to pinpoint the problem while avoiding mistakes and cognitive biases.
Triage: When dealing with an incident, we must first make some meta-decisions quickly: What’s the business impact? Are we handling this incident now, or can it be deferred? Do we have time to debug it, or should we employ an emergency failover procedure? Many of the answers are unknown before you begin debugging. Triage is a short phase for answering these questions quickly, before launching into a possibly long debugging process. You can also go back to it at any point: think of triage as a fail-fast step that you can return to whenever you have more data.
Operational definition: To solve an issue we decide is worth pursuing, we need to define it precisely and measurably (“it’s slow” doesn’t cut it). An operational definition has two main parts: a method of measurement (i.e., From where? With which tools? When? In which environment?) and an expected result of that measurement (e.g., “p99 of transaction X is consistently over ...
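As a rough illustration of the two parts of an operational definition, here is a minimal Python sketch. It is not taken from the chapter: the class name, the field names, the 500 ms threshold, and the reading of “consistently” as “at least N measurement windows breach the threshold” are all assumptions made for the example. The percentile uses the simple nearest-rank method; real monitoring tools may interpolate differently.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


@dataclass(frozen=True)
class OperationalDefinition:
    """Hypothetical container for a measurable problem statement."""
    # Part 1: the method of measurement.
    measured_from: str   # e.g. "load balancer access logs"
    tool: str            # e.g. "log query"
    environment: str     # e.g. "production"
    # Part 2: the expected result of that measurement.
    pct: float                  # which percentile we look at
    threshold_ms: float         # illustrative value, not from the source
    min_breaching_windows: int  # our assumed meaning of "consistently"

    def is_reproduced(self, windows: Sequence[Sequence[float]]) -> bool:
        """True if enough measurement windows exceed the threshold."""
        breaching = sum(
            1 for w in windows if percentile(w, self.pct) > self.threshold_ms
        )
        return breaching >= self.min_breaching_windows


# Usage with made-up latency samples (milliseconds).
definition = OperationalDefinition(
    measured_from="load balancer access logs",
    tool="log query",
    environment="production",
    pct=99,
    threshold_ms=500,
    min_breaching_windows=2,
)
calm_window = [100.0] * 99 + [900.0]          # a single outlier
slow_window = [100.0] * 98 + [900.0, 950.0]   # two slow requests out of 100
problem_present = definition.is_reproduced([slow_window, slow_window, calm_window])
```

Writing the definition down this way makes it repeatable: anyone on the team can rerun the same measurement and get a yes-or-no answer about whether the problem is still present, which “it’s slow” never provides.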