Chapter 4. Vary Real-World Events
Every system, from simple to complex, is subject to unpredictable events and conditions if it runs long enough. Examples include increases in load, hardware malfunctions, the deployment of faulty software, and the introduction of invalid data (sometimes known as poison data). We cannot exhaustively enumerate every event or condition we might want to consider, but common ones fall under the following categories:
- Hardware failures
- Functional bugs
- State transmission errors (e.g., inconsistency of states between sender and receiver nodes)
- Network latency and partitions
- Large fluctuations in input (up or down) and retry storms
- Resource exhaustion
- Unusual or unpredictable combinations of interservice communication
- Byzantine failures (e.g., a node believing it has the most current data when it actually does not)
- Race conditions
- Downstream dependency malfunctions
Perhaps most interesting are the combinations of events listed above that cause adverse systemic behaviors.
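To get a feel for how quickly combinations accumulate, the following is a minimal sketch that enumerates unordered combinations of the categories listed above. The short labels and the `event_combinations` helper are illustrative assumptions, not part of any chaos engineering tool.

```python
from itertools import combinations

# Labels paraphrase the categories listed above; the exact wording is illustrative.
EVENT_CATEGORIES = [
    "hardware failure",
    "functional bug",
    "state transmission error",
    "network latency or partition",
    "input fluctuation or retry storm",
    "resource exhaustion",
    "unusual interservice communication",
    "Byzantine failure",
    "race condition",
    "downstream dependency malfunction",
]


def event_combinations(categories, size=2):
    """Return every unordered combination of `size` distinct categories."""
    return list(combinations(categories, size))


pairs = event_combinations(EVENT_CATEGORIES)
triples = event_combinations(EVENT_CATEGORIES, size=3)
print(len(pairs), len(triples))  # 45 120
```

Even with only ten broad categories there are 45 pairs and 120 triples, so trying every combination is rarely practical and some selection is needed.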
It is not possible to prevent threats to availability, but it is possible to mitigate them. In deciding which events to induce, estimate the frequency and impact of each event and weigh them against the cost and complexity of inducing it. At Netflix, we turn off machines because instance termination happens frequently in the wild and turning off a server is cheap and easy. We simulate regional failures, even though doing so is costly and complex, because ...