Chapter 20. Managing Incidents
As we explored in Chapter 19, the purpose of on-call is to be aware of your systems so you can keep them healthy. But as much as you strive to reduce risk, failure will happen; there will be incidents. Incident management begins when you detect a problem during an on-call rotation, but management often extends beyond on-call when other subject matter experts and teams are required for issue resolution. The aim of incident management is to minimize the impact of an incident.
You, as an individual, need the kinds of tools, techniques, and practices that will not only get you through an incident with minimal suffering but will also help you feel prepared ahead of time and able to react effectively when an incident occurs. You need good, clear communication across teams so that the appropriate subject matter experts can share their knowledge and minimize time to resolution. And you need a way to capture and apply what you learned from the incident to improve overall production, reduce future impacts to customers, and reduce the team's toil.
In this chapter, I share the framework for collaborative and sustainable incident management, from identifying incidents to conducting post-incident reviews and identifying the actions required to improve the live environment.
Note
I am assuming your team has incident management and that you'll have some framework to which you can apply what I'm sharing to improve your experience. If your team doesn't currently ...