SRE Best Practices: Implementing Automation to Reduce Toil
Topic: System Administration
Homeowners understand the unexpected amount of work it takes to keep a house running smoothly. For site reliability engineers, our home is our production environment. As systems and services scale, the amount of work it takes to just keep them running tends to scale up too. In SRE, this work—repeatable and not requiring human judgment—is known as toil. It’s the kind of work necessary for keeping the lights on that doesn’t meaningfully improve the services we’re responsible for. However, it’s critical for the health and success of SRE teams that toil is managed effectively.
The primary tool for eliminating or reducing toil is automation. But automation is a tricky beast. Automate too aggressively and it can become a cause of incidents and overall pain. Ignore it and your team will drown in work, unable to complete projects. The solution is to take a mindful approach, starting with checklists and moving gradually to fully automated services. That journey isn’t easy, but it’s the key to developing an effective SRE team.
Join Incident Labs’ Emil Stolarsky and Jaime Woo to learn how to identify toil and manage it effectively. You’ll explore the techniques companies like Google, Facebook, and Microsoft use for managing their toil with automation as well as strategies for bringing these ideas back to your organization to unlock time for your team.
What you'll learn and how you can apply it
By the end of this live online course, you’ll understand:
- What toil is, how much toil is acceptable, and how to reduce the amount of toil
- Where and why you want to automate
- Strategies to approach meaningful automation to manage toil
And you’ll be able to:
- Communicate to stakeholders the destructive nature of toil
- Assess how much toil is acceptable and where it can be reduced
- Assess the benefits of automation and how it can contribute to a safer deployment environment
- Determine good candidates for automation (and what form of automation to use) for an organization of your size while preserving human judgment
This training course is for you because...
- You’re concerned that toil will overwhelm your schedule.
- You want to encourage automation where possible to prevent repeatable, automatable work from swallowing up your SRE team.
- You’re aware that sometimes toil can be seen as a source of pride and need tips to negotiate this thinking effectively.
Prerequisites
- Experience running software in production environments
- A basic understanding of DevOps or SRE
About your instructors
Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He started his career as a molecular biologist before working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function.
Emil Stolarsky is a site reliability engineer. Previously, he worked on caching, performance, and disaster recovery at Shopify and the internal Kubernetes platform at DigitalOcean. He’s the program cochair for SREcon EMEA 2019 and SREcon Americas West 2020 and contributed a chapter to the O’Reilly book Seeking SRE.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Understanding and limiting toil (55 minutes)
- Group discussion: Where do you have sources of toil?
- Presentation: Understanding toil; the dangers of too much toil; creating a toil budget; strategies for limiting toil
Break (5 minutes)
Automation (60 minutes)
- Group discussion: Pets versus cattle; culture and automation
- Presentation: When to automate and when not to; DevOps and the value of automation; the journey to effective automation; designing interfaces for automation