Incident Analysis

AUDIENCE

Whole Team

We learn from failure.

Despite your best efforts, your software will sometimes fail to work as it should. Some failures will be minor, such as a typo on a web page. Others will be more significant, such as code that corrupts customer data, or an outage that prevents customer access.

Some failures are called bugs or defects; others are called incidents. The distinction isn’t particularly important. Either way, once the dust has settled and things are running smoothly again, you need to figure out what happened and how you can improve. This is incident analysis.

NOTE

The details of how to respond during an incident are out of the scope of this book. For an excellent and practical guide to incident response, see Site Reliability Engineering: How Google Runs Production Systems [Beyer2016], particularly Chapters 12–14.

Get The Art of Agile Development, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.