Chapter 1. Experimenting with Failure

Chaos engineering is the practice of continual experimentation to validate that our systems operate the way we believe they do. These experiments help uncover systemic weaknesses or gaps in our understanding, informing improved design and processes that help the organization gain more confidence in how its systems behave. The occurrence of failure is part of the normal condition of how our systems operate. Chaos engineering offers engineers a practical technique for proactively uncovering unknown failure modes within a system before they manifest as customer-facing problems.

The Foundation of Resilience

So what do we mean by resilience? According to Kazuo Furuta, “Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances so that it can sustain required operations under both expected and unexpected conditions.”1

Resilience represents the ability not only to recover from threats and stresses but also to perform as needed under a variety of conditions and to respond appropriately to both disturbances and opportunities.

However, we commonly see resilience reduced to robustness in the information security dialogue (though it is far from the only domain felled by this mistake). A focus on robustness leads to a “defensive” posture rather than an adaptive one: a system that, like a reed bending in the wind, is designed around the fact that events or conditions will occur that negatively impact it. As a result, the status quo in information security is to aim for perfect prevention, defying reality by attempting to keep incidents from happening in the first place.

Robustness also leads us to prioritize restoring a compromised system back to its prior version, despite that version being vulnerable to the conditions that fostered the compromise.2 This delusion drives us toward technical controls rather than systemic mitigations, which creates a false sense of security that facilitates risk accumulation in a system that is still inherently vulnerable.3 For instance, if a physical barrier to flooding is added to a residential area, more housing development is likely to occur there, resulting in a higher probability of catastrophic outcomes if the barrier fails.4 In information security, an example of this false sense of security is found in brittle internal applications left to languish with insecure designs due to the belief that a firewall or intrusion detection system (IDS) will block attackers from accessing and exploiting them.

Things Will Fail

Detecting failures in security controls early can mean the difference between an unexploited vulnerability and having to announce a data breach to your customers. Resilience and chaos engineering embrace the reality that models will be incomplete, controls will fail, mitigations will be disabled—in other words, things will fail. If we architect our systems to expect failure, proactively challenge our assumptions through experimentation, and incorporate what we learn as feedback into our strategy, we can learn more about how our systems actually work and how to improve them.

Failure refers to when systems, including any people and business processes involved, do not operate as intended.5 For instance, a microservice failing to communicate with a service it depends on would count as a failure. Similarly, failure within security chaos engineering (SCE) occurs when security controls do not achieve their security objectives. Revoked API keys being accepted, firewalls failing to enforce denylists,6 vulnerability scanners missing SQL injection (SQLi) flaws, or intrusion detection systems not alerting on exploitation are all examples of security failure.
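
To make this concrete, the following sketch shows what a minimal security chaos experiment for the first of those failures might look like: present a deliberately revoked API key and verify that the service rejects it. This is a sketch under stated assumptions, not a definitive implementation; the endpoint URL, header, and key value are hypothetical placeholders, and it assumes Python with the requests library.

    # A minimal security chaos experiment sketch: verify that a revoked API key
    # is actually rejected. The URL, header, and key below are hypothetical.
    import requests  # assumes the third-party requests library is installed

    API_URL = "https://api.example.internal/orders"   # hypothetical service endpoint
    REVOKED_KEY = "key-revoked-for-experiment"        # a credential you deliberately revoked

    def revoked_key_is_rejected() -> bool:
        """Return True if the service refuses the revoked credential."""
        response = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {REVOKED_KEY}"},
            timeout=5,
        )
        # Anything other than 401/403 means the control did not behave as believed.
        return response.status_code in (401, 403)

    if __name__ == "__main__":
        if revoked_key_is_rejected():
            print("Hypothesis held: the revoked key was rejected.")
        else:
            print("Security failure found: the revoked key was accepted.")

If the hypothesis does not hold, the failure has been discovered under controlled conditions rather than during a customer-facing incident.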

Instead of seeking to stop failure from ever occurring, the goal in chaos engineering is handling failure gracefully.7 Early detection of failure minimizes the blast radius of incidents and also reduces postincident cleanup costs. Engineers have learned that detecting service failures early—like excessive latency on a payment API—reduces the cost of a fix, and security failure is no different.

Thus we arrive at two core guiding principles of SCE. First, expect security controls to fail and prepare accordingly. Second, do not attempt to completely avoid incidents but instead embrace the ability to quickly and effectively respond to them.

Under the first principle, system architecture must be designed under the assumption that security controls will fail and that users will not immediately understand (or care about) the security implications of their actions.8 Under the second principle, as described by ecological economics scholar Peter Timmerman, resilience can be thought of as the building of “buffering capacity” into a system to continually strengthen its ability to cope in the future.9 Accepting that compromise and “user error” will happen, and focusing on ensuring systems handle incidents gracefully, are essential. Security must move away from defensive postures toward resilient postures and let go of the impossible standard of perfect prevention.

Benefits of SCE

Practicing SCE and leveraging failure yield numerous benefits. Among these are reduced remediation costs, less disruption to end users, and lower stress during incidents, as well as improved confidence in production systems, a deeper understanding of systemic risk, and stronger feedback loops.

SCE reduces remediation costs by better preparing teams to handle incidents through the repeated practice of recovering from unexpected events. Security teams may have incident response plans available somewhere in their knowledge base, but practicing the process is a more reliable way to gain comfort in the ability to efficiently recover from incidents. Think of it as your team developing muscle memory.

SCE also reduces disruption to end users.10 Each scenario tested generates feedback that can inform design improvements, including changes that minimize impacts to users. For example, simulating a distributed denial-of-service (DDoS) attack that causes an outage in an ecommerce application could lead to the adoption of a content delivery network (CDN), substantially reducing end-user disruption in the event of a subsequent DDoS attempt.

Reducing the stress of being on call and responding to incidents11 is another benefit of SCE. The repeated practice of recovering from failure minimizes fear and uncertainty, transforming an incident into a problem with known processes for solving it. This muscle memory gives teams more confidence that they can work through the problem and ensure system recovery based on their developed expertise.

Few organizations are highly confident in the security of their systems, largely due to an inability to track and objectively measure success metrics on a continual basis.12 SCE can improve your ability to track and measure security through controlled and repeatable experimentation. SCE can also boost confidence by informing your organization of its preparedness for unexpected failures. Feedback from these scenarios can be used to improve the system’s resilience over time. After repeated experimentation, your team gains a clearer picture of the system, the efficacy of its security processes, and its ability to recover from unforeseen surprises.
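
As a rough illustration of how such experimentation can become a repeatable, measurable feedback loop, the sketch below runs a set of named experiments and appends each outcome, with a timestamp, to a results file so that success rates can be reviewed over time. It assumes Python; the experiment names, CSV file, and record fields are illustrative assumptions rather than a prescribed format, and each placeholder lambda would wrap a real check such as the revoked-key experiment shown earlier.

    # A sketch of recording repeatable experiment outcomes to track security
    # success metrics over time. File name and field names are illustrative.
    import csv
    import datetime
    from typing import Callable, Dict

    def run_experiment(name: str, hypothesis_holds: Callable[[], bool]) -> Dict[str, str]:
        """Run one experiment and return a record of its outcome."""
        return {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "experiment": name,
            "outcome": "pass" if hypothesis_holds() else "fail",
        }

    def append_results(records, path="security-chaos-results.csv"):
        """Append outcome records to a CSV so trends can be reviewed later."""
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["timestamp", "experiment", "outcome"])
            if f.tell() == 0:  # write the header only for a brand-new file
                writer.writeheader()
            writer.writerows(records)

    if __name__ == "__main__":
        experiments = {
            "revoked-api-key-rejected": lambda: True,    # placeholder outcome
            "firewall-denylist-enforced": lambda: True,  # placeholder outcome
        }
        results = [run_experiment(name, check) for name, check in experiments.items()]
        append_results(results)

Over repeated runs, the accumulated records become a simple, objective view of whether security controls behave as expected and whether that behavior is improving.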

1 Kazuo Furuta, “Resilience Engineering,” in Joonhong Ahn, Cathryn Carson, Mikael Jensen, Kohta Juraku, Shinya Nagasaki, and Satoru Tanaka (eds), Reflections on the Fukushima Daiichi Nuclear Accident (New York: Springer, 2015), sec. 24.4.2.

2 A. X. Sanchez, P. Osmond, and J. van der Heijden, “Are Some Forms of Resilience More Sustainable Than Others?” Procedia Engineering 180 (2017): 881–889.

3 This is known as the safe development paradox: the anticipated safety gained by introducing a technical solution to a problem instead facilitates risk accumulation over time, leading to larger potential damage in the event of an incident. See R. J. Burby, “Hurricane Katrina and the Paradoxes of Government Disaster Policy: Bringing about Wise Governmental Decisions for Hazardous Areas,” The Annals of the American Academy of Political and Social Science 604, no. 1 (2006): 171–191.

4 C. Wenger, “The Oak or the Reed: How Resilience Theories Are Translated into Disaster Management Policies,” Ecology and Society 22, no. 3 (2017).

5 Pertinent domains include disaster management (e.g., flood resilience), climate change (e.g., agriculture, coral reef management), and safety-critical industries like aviation and medicine.

6 We will use the terms “allowlist” and “denylist” throughout the report. See M. Knodel and N. ten Oever, “Terminology, Power, and Inclusive Language,” Internet Engineering Task Force, June 16, 2020, https://oreil.ly/XsObA.

7 See Bill Hoffman’s tenets of operations-friendly services in J. R. Hamilton, “On Designing and Deploying Internet-Scale Services,” LISA 18 (November 2007): 1–18.

8 “End users” and “system admins” are continually featured as “top actors” involved in data breaches in the annual editions of the Verizon Data Breach Investigations Report.

9 P. Timmerman, “Vulnerability, Resilience, and the Collapse of Society,” Environmental Monograph 1 (1981): 1–42.

10 Vilas Veeraraghavan, “Charting a Path to Software Resiliency,” Medium, October 2018, https://oreil.ly/fimSI.

11 As an example, see key findings in Jonathan Rende, “Unplanned Work Contributing to Increased Anxiety,” PagerDuty, March 10, 2020, https://oreil.ly/jKewe; see also S. C. Sundaramurthy, et al., “A Human Capital Model for Mitigating Security Analyst Burnout,” Eleventh Symposium on Usable Privacy and Security (SOUPS 2015) (Montreal: USENIX Association, 2016), 347–359.

12 Discussing the myriad issues with risk measurement in information security is beyond the scope of this report, but for a starting perspective on it, we recommend reading Ryan McGeehan’s “Lessons Learned in Risk Measurement,” Medium, August 7, 2019, https://oreil.ly/toYaM.
