Chapter 13. How to Make Failure Beautiful: The Art and Science of Postmortems

Jake Loomis

As an engineer at Yahoo! during the early days of the dot-com boom, I lived in a world where features were king and eyeballs were gold. Engineers could do whatever they wanted with the production site, and the customers were often the first line of QA. Customers didn't expect the Internet to always work, and they joked about the World Wide Wait. It wasn't until real revenue started to flow in that Internet sites were forced to grow up. Downtime meant actual dollars being lost, and things such as email became critical to people's everyday lives.

But, like newly graduated teenagers, these sites knew they needed to grow up without knowing how to do it. Sites such as Twitter, with a history of downtime, know they need better uptime if they are going to continue to succeed after their initial burst of new users. Could a "fast-moving" Internet site really have uptime similar to that of "slow-moving" utilities such as power, phone, or cable? Change was the riskiest thing you could do to a system, and Internet sites often changed production daily. On top of that, many of the successful sites were growing at an unprecedented rate, and the Internet technologies they were built on were new and unproven. Whether it was hardware solutions continually chasing Moore's Law or novel software solutions bludgeoned into handling millions more customers than they were ever designed for, sites were built on unstable ground. ...
