Chaos engineering

System downtime can have a massive impact on business revenue, so many organizations are adopting Chaos Engineering to gain confidence that their systems are fault-tolerant, that is, built to anticipate and mitigate a variety of software and hardware failures. Many are implementing internal "failure as a service" systems, such as Failure Injection Testing (FIT) [6] and Simian Army [7] at Netflix and uDestroy at Uber, and commercial offerings such as https://gremlin.com are also available.

These systems advocate treating Chaos Engineering as a scientific discipline:

  1. Form a hypothesis: What do you think could go wrong in the system?
  2. Plan an experiment: How can you recreate the failure without impacting users?
  3. Minimize the blast ...
