Being on-call means being available during a set period of time, and being ready to respond to production incidents during that time with appropriate urgency. Site Reliability Engineers (SREs) are often required to take part in on-call rotations. During on-call shifts, SREs diagnose, mitigate, fix, or escalate incidents as needed. In addition, SREs are regularly responsible for nonurgent production duties.
At Google, being on-call is one of the defining characteristics of SRE. SRE teams mitigate incidents, repair production problems, and automate operational tasks. Since most of our SRE teams have not yet fully automated all their operational tasks, escalations need human points of contact—on-call engineers. Depending on how critical the supported systems are, or the state of development the systems are in, not all SRE teams may need to be on-call. In our experience, most SRE teams staff on-call shifts.
On-call is a large and complex topic, saddled with many constraints and a limited margin for trial and error. Chapter 11 of our first book (Site Reliability Engineering), “Being On-Call,” already explored this topic. This chapter addresses specific feedback and questions we received about that chapter. These include the following:
“We are not Google; we’re much smaller. We don’t have as many people in the rotation, and we ...