This book originated from an argument during my first year as Amazon’s “Master of Disaster,” as I began applying the incident management and operations practices I learned as a firefighter to improve Amazon’s overall reliability and resiliency. I vividly remember facing a room full of scowling engineers and managers who were saying, “I get that these ideas work for firefighters, but do you really think they can work at internet speed?”
The answer, of course, was yes. The systems and best practices developed over decades of managing complex emergency incidents, where seconds count and lives are on the line, work just as well for managing complex incidents in technology organizations. Over the next few years, my team and I used these techniques and systems to help transform the culture and technology of what is now one of the greatest engineering and operations organizations in the world.
When I left Amazon, it was clear to me that as the world became increasingly connected and distributed, people would come to depend on the technology we build and the systems we run as part of their daily lives. It was also clear to me and a group of passionate peers that there were too few people with the knowledge and experience to build and run these systems at scale. My friend Artur Bergman helped me found the O’Reilly Velocity Performance & Operations conference to organize, develop, and spread our emerging and critical professional discipline.
As Velocity grew, I started sharing my work with friends and mentors in the Fire Service. I am fortunate to have worked with and been trained by some of the most experienced and respected incident management experts in the world, and I asked them to help build and expand on what I had started.
I convened the first “Web Ops/Fire Ops” summit on a beautiful day at Artur’s loft in San Francisco. Attending from “the internet” were Artur Bergman (Fastly/Wikia), John Adams (Twitter), Jonathan Heiliger (Facebook), Pedro Canahuati (Facebook), Simon Wistow (Fastly), and Chris Brown (Amazon/Chef/Microsoft). Attending from the “Fire Ops” side were the authors of this book: Chris Hawley, Rob Schnepp, and Ron Vidal.
After a few hours of sharing backgrounds, “war stories,” and a lot of laughter, it became clear to everyone that there was both the need and opportunity for a tech-oriented incident management training program. Chris, Rob, and Ron formed Blackrock Partners and began consulting with large companies on how to improve their operations. Since then they have worked with dozens of tech companies, trained thousands of new responders, and reviewed hundreds of incidents as they help companies “work like a fire department at internet speed.”
This book is the first publicly released product of their exceptional work and is the essential foundation for building technology and organizations that people can depend on. I hope you use it.
As we say in the fire department, “See you at the big one!”