Chapter 9. Resilience

A distributed system is one in which the failure of a computer you didn’t even know about can render your own computer unusable.1

Leslie Lamport, DEC SRC Bulletin Board (May 1987)

Late one September night, at just after two in the morning, a portion of Amazon’s internal network quietly stopped working.2 This event was brief, and not particularly interesting, except that it happened to affect a sizable number of the servers that supported the DynamoDB service.

Most days, this wouldn’t be such a big deal. Any affected servers would just try to reconnect to the cluster by retrieving their membership data from a dedicated metadata service. If that failed, they would temporarily take themselves offline and try again.

But this time, when the network was restored, a small army of storage servers simultaneously requested their membership data from the metadata service, overwhelming it so that requests—even ones from previously unaffected servers—started to time out. Storage servers dutifully responded to the timeouts by taking themselves offline and retrying (again), further stressing the metadata service, causing even more servers to go offline, and so on. Within minutes, the outage had spread to the entire cluster. The service was effectively down, taking a number of dependent services down with it.

To make matters worse, the sheer volume of retry attempts—a “retry storm”—put such a burden on the metadata service that it even became entirely unresponsive to requests ...
