Reducing the Impact of Service Outages with Generic Mitigations: A Philosophy of Duct-Tape-Based Outage Resolution
This webcast has been postponed.
Your service should have at least one or two generic mitigations. If it doesn’t, you’re in for a bad time. If it does—treasure them, maintain them, and use them, lest they rot beneath your feet.
A mitigation is any action you might take to reduce the impact of a breakage: SSHing into an instance and clearing the cache, for example, or turning off the machines to close down a vulnerability. A generic mitigation is one that is useful in addressing a wide variety of outages. In this talk you will learn how to distinguish between specific and generic mitigations, and how to identify which generic mitigations your service might need. You’ll also learn why you need to build generic actions you can trust to “make it stop!”
Jennifer Mace is a Site Reliability Engineer at Google. She draws on years of experience addressing production problems before they begin and providing big red buttons to safely stop problems when they are detected.