The Case for Security Chaos Engineering

Definition of security chaos engineering: The identification of security control failures through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production.[1]

Information security is broken. Our users and our customers, who make up our world, are entrusting us with more and more of their lives, and we are failing to keep that trust. Year after year, the same sorts of attacks succeed, and their impact grows. Meanwhile, the security industry keeps chasing shiny new tech and achieves, at best, incremental improvement along the way.

A fundamental shift in both philosophy and practice is necessary. Information security must embrace the reality that failure will happen. People will click on the wrong thing. Security implications of simple code changes won’t be clear. Mitigations will accidentally be disabled. Things will break.

By accepting this reality, information security can move from trying to build the perfect secure system to continually asking questions like “How will I know this control continues to be effective?”, “What will happen if this mitigation is disabled, and will I be able to see it?”, or “Is my team—including executives making critical decisions—ready to handle this sort of incident tomorrow?”

Hope isn’t a strategy. Likewise, perfection isn’t a plan. The systems we are responsible for are failing as a normal function of how they operate, whether we like it or not, whether we see it or not. Security chaos engineering is about increasing confidence that our security mechanisms are effective at performing under the conditions for which we designed them. Through continuous security experimentation, we become better prepared as an organization and reduce the likelihood of being caught off guard by unforeseen disruptions. These practices better prepare us (as professionals), our teams, and the organizations we represent to be effective and resilient when faced with security unknowns.

The practice of building software has advanced to the point where the systems we build are impossible to model mentally in their totality. Our systems are now vastly distributed and operationally ephemeral. Transformational technology shifts such as cloud computing, microservices, and continuous delivery (CD) have each brought new advances in customer value but have in turn created a new series of challenges. Primary among those challenges is our inability to understand everything in our own systems.

This report does not consist of incremental solutions for how to fix information security. We are reassessing the first principles underlying organizational defense and pulling out the failed assumptions by their roots. In their place, we are planting the seeds of the new resistance, and this resistance favors alignment with organizational needs and seeks proactive, adaptive learning over reactive patching.

The Greatest Teacher

Pass on what you have learned. Strength, mastery, hmm…but weakness, folly, failure also. Yes: failure, most of all. The greatest teacher, failure is. Luke, we are what they grow beyond. That is the true burden of all masters.

Jedi Master Yoda, The Last Jedi

Traditional defensive security philosophy is anchored to the avoidance of failure—preventing the inevitable data breach. Failure is seen as the axis of evil. We propose that failure is the greatest teacher we have in defensive security; it teaches us valuable lessons that inform us about how we can become better prepared for incidents.

If we have a poor understanding of how our systems are behaving, how can we drive good security in those systems? The answer is through planned, empirical experimentation. This report applies chaos engineering to the field of information security. We call this security chaos engineering (SCE). SCE is the way forward for information security and will facilitate the adaptation of defensive security to meet the requirements of modern operations.

SCE serves as a foundation for developing a learning culture around how organizations build, operate, instrument, and secure their systems. The goal of these experiments is to move security in practice from subjective assessment into objective measurement. Chaos experiments allow security teams to reduce the “unknown unknowns” and replace “known unknowns” with information that can drive improvements to security posture.

By intentionally introducing a failure mode or other event, security teams can discover how well instrumented, observable, and measurable their systems truly are. Teams can validate critical security assumptions, assess abilities and weaknesses, then move to stabilize the former and mitigate the latter.

SCE proposes that the only way to understand the uncertainty in how our systems behave is to confront it objectively by introducing controlled signals. Injecting a controlled signal, such as a security failure, into the system makes it possible to measure your team’s ability to respond to incidents. It also yields proactive insight into how effective the technology is, how well aligned runbooks or security incident processes are, and much more. As a practice, this helps teams better understand attack preparedness by tracking and measuring experiment outcomes over time.
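To make the idea of a controlled signal concrete, here is a minimal sketch in Python. It is not taken from the report or from any real chaos engineering tool: the names run_experiment, ExperimentResult, detection_rate, and FakeEnvironment are all hypothetical, and the "system" is a simulated firewall rule with a simulated alert pipeline. The sketch injects one failure, polls for detection a fixed number of times, always rolls the change back, and records the outcome so that results can be compared across runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExperimentResult:
    name: str
    detected: bool
    seconds_to_detect: Optional[float]  # None when the failure was never detected


def run_experiment(
    name: str,
    inject: Callable[[], None],
    check_detected: Callable[[], bool],
    rollback: Callable[[], None],
    max_checks: int = 5,
    seconds_per_check: float = 60.0,
) -> ExperimentResult:
    """Inject one controlled failure, poll for detection, and always roll back.

    Time is nominal here: each poll counts as `seconds_per_check` seconds.
    A real harness would wait between polls; this sketch does not sleep.
    """
    inject()
    try:
        for attempt in range(1, max_checks + 1):
            if check_detected():
                return ExperimentResult(name, True, attempt * seconds_per_check)
        return ExperimentResult(name, False, None)
    finally:
        rollback()  # restore the control even if a check raises


def detection_rate(results: List[ExperimentResult]) -> float:
    """Fraction of experiments in which the injected failure was detected."""
    if not results:
        return 0.0
    return sum(r.detected for r in results) / len(results)


class FakeEnvironment:
    """Hypothetical stand-in for a real system: one firewall rule, one alert."""

    def __init__(self, alert_after_checks: Optional[int]):
        self.rule_enabled = True
        self.alert_after_checks = alert_after_checks  # None means no alert ever fires
        self._checks = 0

    def disable_rule(self) -> None:
        self.rule_enabled = False
        self._checks = 0

    def enable_rule(self) -> None:
        self.rule_enabled = True

    def alert_fired(self) -> bool:
        self._checks += 1
        return (
            not self.rule_enabled
            and self.alert_after_checks is not None
            and self._checks >= self.alert_after_checks
        )


if __name__ == "__main__":
    monitored = FakeEnvironment(alert_after_checks=2)
    unmonitored = FakeEnvironment(alert_after_checks=None)
    results = [
        run_experiment("rule-disabled (monitored)", monitored.disable_rule,
                       monitored.alert_fired, monitored.enable_rule),
        run_experiment("rule-disabled (unmonitored)", unmonitored.disable_rule,
                       unmonitored.alert_fired, unmonitored.enable_rule),
    ]
    for r in results:
        print(r)
    print("detection rate:", detection_rate(results))
```

In a real environment, the inject, check_detected, and rollback callables would wrap whatever tooling the team already uses. The point of the structure is that each run produces a recorded, comparable outcome rather than a subjective impression.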

This report shares the guiding principles of SCE so you can begin harnessing experimentation and failure as a tool for empowerment—and so you can transform security from a gatekeeper getting in the way of business to a valued advisor that enables the rest of the organization.

1 Aaron Rinehart, “Security Chaos Engineering: A New Paradigm for Cybersecurity,” Opensource.com, January 24, 2018, https://oreil.ly/gVMJk.
