Chapter 1. Introduction
Make no mistake—the coming N weeks are going to be personally and professionally stressful, and at times we will race to keep ahead of events as they unfold. But we have been preparing for crises for over a decade, and we’re ready. At a time when people around the world need information, communication, and computation more than ever, we will ensure that Google is there to help them.
Benjamin Treynor Sloss, Vice President, Engineering, Google’s Site Reliability Engineering Team, March 3, 2020
Failure is inevitable (kind of depressing, we know). As scientists and engineers, you look at problems over the long term and design systems to be optimally sustainable, scalable, reliable, and secure. But you design those systems with only the knowledge you currently have, and you implement solutions without complete knowledge of the future. You can’t always anticipate the next zero-day event, viral media trend, weather disaster, configuration management error, or shift in technology. Therefore, you need to be prepared to respond when these things happen and affect your systems.
One of Google’s biggest technical challenges of the decade was brought on by the COVID-19 pandemic. The pandemic created a series of rapidly emerging incidents that we needed to mitigate in order to continue serving our users. We had to aggressively boost service capacity, pivot our workforce to be productive at home, and build new ways to efficiently repair servers despite ...