Chapter 9. Resilience

Safety work is today recognized as an economic necessity. It is the study of the right way to do things.

Robert W. Campbell, addressing the Third National Safety Council Congress & Expo (1914)

Late one September night, at just after two in the morning, a portion of Amazon’s internal network quietly stopped working.1 This event was brief, and not particularly interesting, except that it happened to affect a sizable number of the servers that supported the Amazon DynamoDB service.

Most days, this wouldn’t be such a big deal. Any affected servers would just try to reconnect to the cluster by retrieving their membership data from a dedicated metadata service. If that failed, they would temporarily take themselves offline and try again.

But this time, when the network was restored, a small army of storage servers simultaneously requested their membership data from the metadata service, overwhelming it so that requests—even ones from previously unaffected servers—started to time out. Storage servers dutifully responded to the timeouts by taking themselves offline and retrying (again), further stressing the metadata service, causing even more servers to go offline, and so on. Within minutes, the outage had spread to the entire cluster. The service was effectively down, taking a number of dependent services down with it.

To make matters worse, the sheer volume of retry attempts—a “retry storm”—put such a burden on the metadata service that it even became entirely ...
