Effective SRE: On-call Best Practices
Topic: System Administration
As the world has moved online, we’ve grown to expect that everything works around the clock. Organizations are investing more and more in maintaining their systems, searching high and low for ways to make them more reliable. Yet a key resource is hiding in plain sight: your people. Your developers and operators are the last line of defense when your systems go down, yet the practice of on-call gets barely any thought.
Designing on-call looks deceptively easy and is often done ad hoc. But ineffective on-call design can lead to slower incident response and diminished well-being for those on-call, including burnout and attrition. Effective and sustainable on-call, on the other hand, yields substantive benefits and helps operators learn about their systems and improve how they support them.
Experts Jaime Woo and Emil Stolarsky guide you through the key components of on-call, from training, scheduling, and rotations to incident response and evaluation. On-call is an accepted part of an operator's life, and being intentional about it is the best way to ensure that the team stays healthy and sustainable. Join in to learn how it’s done.
What you'll learn and how you can apply it
By the end of this live online course, you’ll understand:
- When you need to establish on-call
- How to create playbooks
- How to create healthy on-call rotations
- How companies can fight pager fatigue
- Whether or not to compensate workers for being on-call
- The tools you can use to help with on-call
And you’ll be able to:
- Follow best practices for on-call to build a healthy and sustainable culture
- Evaluate if your on-call process is “working”
- Spot the signs of a poor on-call process
- Manage stakeholders around on-call culture
This training course is for you because...
- You’re a developer, operator, or manager involved with on-call.
- You work with systems that require on-call support.
- You want to become an effective and supportive manager for your people.
Prerequisites
- Experience running software in production environments
- Familiarity with on-call processes
- Read “Being On-Call” (chapter 11 in Site Reliability Engineering)
About your instructors
Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He started his career as a molecular biologist before working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function.
Emil Stolarsky is a site reliability engineer. Previously, he worked on caching, performance, and disaster recovery at Shopify and the internal Kubernetes platform at DigitalOcean. He’s the program cochair for SREcon EMEA 2019 and SREcon Americas West 2020 and contributed a chapter to the O’Reilly book Seeking SRE.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
On-call processes (55 minutes)
- Presentation: An overview of on-call at companies of different sizes; what starting on-call looks like; training for on-call; introduction to incident response
- Group discussion: Where on the on-call journey is your company?
- Hands-on exercise: Practice using a Wheel of Misfortune
Break (5 minutes)
On-call culture (55 minutes)
- Presentation: The need to balance on-call scheduling; anti-patterns—how poor on-call affects your operators; evaluating your on-call; the politics of on-call
- Group discussion: Stress and on-call
- Hands-on exercise: Improve your on-call experience
Wrap-up and Q&A (5 minutes)