Chapter 11. Building Resilient On-call Teams

The most critical responsibility in a production environment is on-call and the management of events that impact customers. As an organization scales, on-call and incident management often become a team’s breaking point because too little is invested in the production experience early in the software development lifecycle. It may not seem to make sense to spend money on operations when the service isn’t yet in operation. Yet, over time, this pattern of underspending becomes ingrained behavior. Burdened with the reactive work of responding to pages, you don’t have the time or energy to repair underlying infrastructure, software, or service problems effectively. Even when you are not on-call, you may find yourself trying to focus on project work while also working to prevent larger impacts or bracing to be interrupted by an incident.

In this chapter, I examine on-call ...
