Chapter 3. Implementing SRE
When it comes time to implement a new SRE team, the main factor shaping the plan is whether you are starting “fresh” (a “greenfield” project) or taking a “brownfield” approach and migrating an existing team. In either scenario, the amount of cultural change needed can be daunting.
Even before a team is formed, one must prioritize the work to be done. A guide for figuring out where to start is the Hierarchy of Reliability. Because this hierarchy will guide the changes that follow, let’s start by explaining what it is.
Hierarchy of Reliability
In late 2013, an SRE from Google, Mikey Dickerson, was asked to help the struggling HealthCare.gov. (In the terms introduced above, this was a “brownfield” scenario: a number of pieces were already in place, but the site as a whole was not functioning as desired.) There was intense time pressure to get things working quickly. He needed a simple, straightforward way to explain “reliability,” so he borrowed from a theory in psychology, Maslow’s Hierarchy of Needs. The Hierarchy of Reliability that Dickerson used to help HealthCare.gov is shown in Figure 3-1.
The idea is that topics at the bottom are more “basic,” and they gradually get more advanced as you progress up the pyramid. But each topic (or “level,” ...