Skip to Content
Learning from failure: Why a total site outage can be a good thing
conference

Learning from failure: Why a total site outage can be a good thing

by Alex Elman
October 2019
Intermediate
34m
English
O'Reilly Media, Inc.
Closed Captioning available in German, English, Spanish, French, Japanese, Korean, Portuguese (Portugal, Brazil), Chinese (Simplified), Chinese (Traditional)

Overview

Although an outage is a terrifying prospect, you should embrace it as an opportunity. Failure can expand and improve your understanding of your systems.

Three years ago, Indeed suffered one of the worst outages in its history. No single fault or failure caused this outage. Rather, it was a complex interaction of bugs, design decisions, capacity loss, and poor situational awareness during incident response. Indeed learned valuable lessons from this event. It identified ways to make the systems more resilient and improved the approach to the incident lifecycle within the engineering culture.

Alex Elman uses the narrative of this incident to demonstrate how a site-wide outage can inform increased resilience and reduced operational complexity. Learning from failure is a feedback loop rather than a one-off process. He applies Indeed’s outage as a practical example of what an iteration of this loop can look like. He shares with other SREs the success that has risen from this failure. Indeed hasn’t had a global site outage in the three years since this event.

Alex begins with a discussion of failure to set the stage for delivering the incident background, then discusses incident response and situational awareness. He explains conducting incident postmortems and learning from failure and designing for reliability, including resilience patterns such as circuit breaking and graceful degradation. Finally, he gets into resilience testing, running chaos tests, and closing the feedback loop, leaving some time for a question and answer session.

This session was recorded at the 2019 O'Reilly Velocity Conference in San Jose.

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Watch now

Unlock full access

More than 5,000 organizations count on O’Reilly

AirBnbBlueOriginElectronic ArtsHomeDepotNasdaqRakutenTata Consultancy Services

QuotationMarkO’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.
Julian F.
Head of Cybersecurity
QuotationMarkI wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.
Addison B.
Field Engineer
QuotationMarkI’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.
Amir M.
Data Platform Tech Lead
QuotationMarkI'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.
Mark W.
Embedded Software Engineer

You might also like

Unix® System Management Primer Plus

Unix® System Management Primer Plus

Jeff Horwitz
A Better Way to Manage Risk

A Better Way to Manage Risk

Robert Kaplan, Anette Mikes
Be a Brave, Big-Hearted Rebel at Work

Be a Brave, Big-Hearted Rebel at Work

Lois Kelly, Carmen Medina

Publisher Resources

ISBN: 0636920338451