Chapter 2. The Reactive Deployment
Failure is always an option; in large-scale data management systems, it is practically a certainty.
Alvaro, Rosen, and Hellerstein, Lineage-driven Fault Injection
The way applications are deployed is changing just as rapidly as the development tools and processes used to build them. Microservices are deployed as systems onto fleets of nameless cattle servers. Unlike a set of named pet hosts that you care for and upgrade individually, cattle are immutable and replaceable. System security updates? A new kernel? No problem. Introduce new, updated instances into the cluster fleet. The workload is migrated off the older, unpatched instances onto the newly minted ones, and the outdated nodes are terminated once they are no longer running any work.
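The replace-rather-than-patch upgrade described above can be sketched in a few lines. This is a minimal, hypothetical model, not any particular orchestrator's API: `Instance`, `roll_fleet`, and the image strings are illustrative names, and "migrating the workload" is simplified to moving a list of workload labels from one object to another.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical model of an immutable ("cattle") fleet: instances are never
# patched in place; an upgrade replaces them with fresh ones.
_ids = count(1)


@dataclass
class Instance:
    image: str                                   # assumed: a machine image or container tag
    workloads: list = field(default_factory=list)
    id: int = field(default_factory=lambda: next(_ids))
    running: bool = True


def roll_fleet(fleet, new_image):
    """Replace every instance built from an old image with a fresh one.

    For each outdated instance: launch a replacement from new_image,
    migrate its workloads to the replacement, then terminate the idle
    old instance. Instances already on new_image are left untouched.
    """
    new_fleet = []
    for old in fleet:
        if old.image == new_image:
            new_fleet.append(old)               # already current
            continue
        replacement = Instance(image=new_image)  # newly minted instance
        replacement.workloads.extend(old.workloads)
        old.workloads.clear()                   # old instance is now idle...
        old.running = False                     # ...so it is safe to terminate
        new_fleet.append(replacement)
    return new_fleet


fleet = [Instance("kernel-old", ["orders"]), Instance("kernel-old", ["billing", "search"])]
upgraded = roll_fleet(fleet, "kernel-new")
```

Note that no instance is ever modified to run the new image; the upgraded fleet consists entirely of new instances, which is what makes the old ones disposable.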
The physical world into which you deploy your applications, however, hasn’t changed much by comparison. Hardware fails. The mean time between failures may be longer than it once was, but mechanical failure is still inevitable. Processes will still die for numerous reasons. Networks can and will partition. Failure cannot be avoided. You must, instead, embrace failure and keep your services available despite it, even if this means operating in a degraded mode. Let it crash! Your systems must be capable of surviving failures. Instead of attempting to repair nodes when they fail, you replace the failing resources with new ones.
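The replace-don't-repair policy can be sketched as a simple reconciliation loop. This is a minimal sketch under stated assumptions: `Node`, `reconcile`, and `simulate` are hypothetical names, a node's `healthy` flag stands in for whatever health check a real platform uses, and crashes are modeled as random events. Failed nodes are never fixed; they are dropped, and fresh nodes are launched until the fleet is back at its desired size.

```python
import random


class Node:
    """Hypothetical worker node; 'healthy' becomes False when it crashes."""
    _next_id = 0

    def __init__(self):
        Node._next_id += 1
        self.id = Node._next_id
        self.healthy = True


def reconcile(nodes, desired):
    """Discard failed nodes and launch fresh replacements; never repair in place."""
    survivors = [n for n in nodes if n.healthy]
    while len(survivors) < desired:
        survivors.append(Node())          # brand-new resource, not a patched one
    return survivors


def simulate(rounds, desired=5, crash_rate=0.2, seed=42):
    """Randomly crash nodes each round, then reconcile back to the desired size."""
    rng = random.Random(seed)
    nodes = reconcile([], desired)
    for _ in range(rounds):
        for n in nodes:
            if rng.random() < crash_rate:
                n.healthy = False         # let it crash
        nodes = reconcile(nodes, desired)
    return nodes
```

The loop does not care why a node failed; any unhealthy node is simply replaced, which is what lets the fleet stay at full strength while individual nodes come and go.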
Consider Chaos Monkey, a service that randomly terminates services in applications to continuously test ...