Chapter 6. Microsoft Variation and Prioritization of Experiments

At Microsoft we build and operate our own Chaos Engineering program for cloud infrastructure at scale. We find that experiment selection in particular has an outsized impact on the way you apply Chaos Engineering to your system. I’ll start with examples of failure scenarios from real production systems, which illustrate how a variety of real-world events can affect your own. I’ll then propose a method for prioritizing experimentation across your services, followed by a framework for considering the variation of different experiment types. My goal in this chapter is to offer strategies you can apply in your engineering process to improve the reliability of your products.

Why Is Everything So Complicated?

Modern software systems are complex. There are hundreds, often thousands, of engineers working to enable even the smallest software product. There are thousands, maybe millions, of pieces of hardware and software that make up a single system that becomes your service. Think of all those engineers working for hardware providers like Intel, Samsung, Western Digital, and other companies designing and building server hardware. Think of Cisco, Arista, Dell, APC, and all other providers of network and power equipment. Think of Microsoft and Amazon providing you with the cloud platform. All of these dependencies you accept into your system explicitly or implicitly have their own dependencies in turn, all the way down ...
