Chapter 4. SRE Practices
Once you’ve established an SRE team and have a grasp on the principles, it’s time to develop a set of practices. A team’s practices are shaped by what its members can do, what they know, what tools they have, and what they’re comfortable doing with all of these.
What teams do is initially based on their charter and their environment. Often, it defaults to “everything the dev team is not doing,” which can be dangerous. By focusing a team on a subset of operational duties, you enable it to build a flywheel of capabilities that build on each other over time. If the team is simply thrown into the deep end with an undefined scope, toil and frustration will certainly result. Another common antipattern is to add SRE work onto an already overburdened team.
What the team knows can be expanded via education, either self-imposed or centrally organized. Teams should be encouraged to hold regular peer education sessions—for example, a weekly hour where any question about production is welcome, from either new or veteran team members. If a question is answerable by someone, a teaching session can result. If nobody knows the answer, it can turn into a collaborative investigation. In our experience, these sessions are highly valuable for everyone on the team. Junior team members learn new things, seniors get a chance to spread their knowledge, and often something new is discovered that nobody knew about. Similarly, Wheel of Misfortune exercises (also known as tabletop exercises), in which team members meet in an informal setting to discuss their roles and responses during a particular emergency scenario, are extremely helpful for getting people more comfortable with touching production in a stress-free environment. Reliving a recent outage is an easy way to start. If one team member can play the part of the Dungeon Master and present the evidence as it played out in real life, other team members can talk through what they would have done and/or directly use tooling to investigate the system as it was during the event.
Teams should also be encouraged to learn more from development teams about the systems they are operating. This is not only a good exercise to better understand the existing system but also an opportunity to directly introduce new instrumentation, discuss and plan changes to the system such as performance improvements, or address scalability or consistency concerns. These conversations tend to be highly valuable in developing trust between teams.
A team’s capabilities can also be expanded by introducing third-party tooling, adopting open source tools, or writing tools of its own.
Where to Start?
When adding capabilities to a team, where do you begin? The problem space of reliability and SRE is vast, and not all capabilities are appropriate at the same time. We suggest starting with a set of practices that allow a team to learn what to work on next. Abstractly, we refer to a model called plan–do–check–act (PDCA). By basing each next step on how the system is currently working, that step will always be relevant. We explain later in this chapter how to build a platform of these capabilities and where to start. This set of early capabilities will form a flywheel, so your teams won’t have to guess at what to build or adopt next—it will develop naturally from their observations of the system.
Where Are You Going?
It’s important to set your goals appropriately. Not all systems need to be “five 9s” and super reliable. We recommend classifying your services and apps based on their reliability needs and setting levels of investment accordingly. As we mentioned previously, remember that each nine costs 10 times as much as the previous nine, which is to say that 99.99% costs 10 times more than 99.9%. While this statement is difficult to prove exactly, the principle is true. Therefore, setting targets blindly or too broadly without consideration can be expensive and run efforts aground. Forcing excessive reliability targets onto systems that don’t need them is also a good way to cause teams to lose top talent. Don’t aim for the moon if you just need to get to low earth orbit.
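To make that cost curve concrete, it helps to translate availability targets into allowed downtime: each additional nine cuts the annual downtime budget by a factor of ten. The following is a minimal sketch of that arithmetic in plain Python; the targets shown are purely illustrative.

```python
# Translate availability targets into allowed downtime per year.
# The targets listed here are illustrative; use the ones relevant to your services.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.5%} availability -> "
          f"{allowed_minutes:8.1f} minutes of downtime per year")

# Prints roughly:
# 99.00000% availability ->   5256.0 minutes of downtime per year
# 99.90000% availability ->    525.6 minutes of downtime per year
# 99.99000% availability ->     52.6 minutes of downtime per year
# 99.99900% availability ->      5.3 minutes of downtime per year
```

Seen this way, the jump from 99.9% to 99.99% means defending a budget of less than an hour of downtime per year, which is where the tenfold cost increase comes from.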
Make sure your path to success is one of “roofshots,” making incremental progress toward your goals. Don’t expect to achieve it in one large project or revolution. Incremental improvement is the name of the game here.
As you spin out new practices within your team, make sure you record the benefits you’re gaining. These gains should be promoted within the team and to stakeholders or other peer teams. Peer recognition is very important and includes praising members in a team standup, putting people on stage to retell how they avoided a catastrophe, publishing near-misses in a newsletter, and spelling out for the larger organization what might have happened if it weren’t for preventative measures. It’s important to celebrate this type of work, especially in an environment that hasn’t done so in the past. Verbal and written praise can also be coupled with monetary bonuses or gifts. Even a small gift can go a long way.
How to Get There
Don’t try to have a detailed long-term (e.g., three-year) plan. Instead, focus on knowing the direction of travel. Know your north star, but generate your next steps as you accomplish your last ones. Once you’ve established your direction of travel, you don’t need to “blow up” your existing teams and processes that don’t align with the new model. Instead, try to “steer the ship” in the right direction.
We think of this as the Fog of War approach, wherein you know your destination but are ready for any hiccups along the way. Short-term planning and agility are essential here, especially early on, when quick wins and immediately demonstrable impact can have major positive effects on a fledgling program and the morale of the team. Give yourself achievable goals that solve today’s problems, while starting to build generic, reusable capabilities that multiple teams can use. By building out a platform that delivers these capabilities, you can scale the impact of your investment. We expand on this concept of platform and capabilities later in this chapter.
Not every product development team within an organization is equal in terms of their needs and their current capabilities. As you introduce SRE to an enterprise, you should strive to be flexible in your engagement models. By meeting product teams where they are, you can solve today’s problems while also introducing org-wide norms and best practices. As an SRE team gets off the ground, they can feel oversubscribed if many teams are looking for their help. By developing a clear engagement “menu,” you can avoid one-off engagements or other unsustainable models. There are several types of engagement models: embedded, consulting, infrastructure, etc. These are described well in a blog post by Google’s Customer Reliability Engineering (CRE) team, as well as in Chapter 32 of the SRE book.
For SRE adoption, reporting structure is important to clarify early on. We recommend an independent organization, with SRE leaders having a “seat at the table” with the executive team. By separating the SRE leadership structure from the product development one, it’s easier for SRE teams to maintain focus on the core goals of reliability, without direct pressure from teams that are more motivated by velocity and feature delivery. However, take care when doing this not to build an isolated “Ops” silo because it’s critical that SREs work closely with other parts of the enterprise. Development teams should invest in these common SRE teams in such a way that the value derived from them is greater than what those teams would get by building out the SRE function from within their own ranks.
What Makes SRE Possible?
What makes SRE possible? Is it just a series of practices like SLOs and postmortems? Not exactly. Those are actually products of the culture that made SRE work to begin with. Therefore, a successful adoption of SRE should not just mimic the practices but must also embrace a compatible culture.
This culture is rooted in the trust and safety of the team itself. The team must feel psychologically safe when they’re put in the high-pressure position of control over major systems. They must be able to say “no” to their peers and leaders without fear of retribution. They must feel their time is valued, their opinions heard, and their contributions recognized. Most of all, SREs should not be made to feel “other” or “less” than their counterparts in a development organization. This is a common pitfall, based on the historically rejected models of Dev versus Ops.
A well-known example of this is blameless postmortems. By writing down “what went wrong,” a team is able to collaboratively determine contributing factors that result in outages, which might be either technical or procedural. Often, when mistakes are made by humans, it can be tempting to cite “human error,” but this has been shown to be somewhat meaningless and not an effective way to improve a system. Instead, SRE promotes blamelessness. An easy way to think of this is that the system should make it difficult for a human to make a mistake. Automation and checks should be in place to validate operator input, and peer reviews should be encouraged to promote agreement and collaboration. You know you have blameless postmortems when people freely include their names in reports for situations in which they made mistakes—when they know there will be no shaming, no demotion, and no negative performance reviews due to simple mistakes that could happen to anyone. If you see postmortems referring to “the engineer” or “Person 1,” you may consider this a good blameless practice, but this could actually be due to underlying cultural problems that must be addressed directly. If names are redacted and replaced with “the engineer” or “Person 1” on paper, but blame is still cast on the engineer outside the context of the postmortem, the culture of blame has not been addressed. You should definitely not automate the process of explicitly redacting names from logs or documents—this does not solve the cultural problem, and it just makes documents harder to read and understand. Rather than superficially redacting names, address the culture underneath to move toward blamelessness.
One sign of bad culture is watermelon metrics: green on the outside, but red inside. These are metrics reflecting the efforts of a team that are contrived to look good but in reality hide real flaws. This is Goodhart’s Law in action: any measurement that becomes a target ceases to be a good measure. For example, metrics like the number of support tickets or overall mean time to resolve (MTTR) can often be abused, either intentionally or by those with good intent who don’t realize their mistake. By measuring the activity of a team, we make that activity the goal, not the customer outcomes. Instead, a team should be able to define their own success metrics that are directly representative of things such as customer happiness, system stability, and development velocity.
SRE should not just be a “20% time” role but, rather, a dedicated title and position within your organization. There should be a job ladder with published transfer requirements and promotion expectations. Leveling and pay should be equitable between teams. An engineer who transfers between SRE and development should not feel any significant effects either way.
A good way to know if an established SRE team is succeeding is by looking at transfers into and out of SRE. By ensuring that transfers are routine and free of any sort of bureaucracy or limitations, you’ll quickly learn if people feel “stuck” in SRE or if it is a desirable role. By observing the rate of voluntary transfers into SRE from Development, you can find out whether it’s working or not.
SREs must know that their time is valued, especially when their job demands exceed “normal hours.” An example of this at Google is that of time-in-lieu: when an SRE must be available outside of normal hours (aka “on-call”), they should be compensated. Some teams at Google allow on-call engineers to choose between monetary compensation and time off, at some percentage of on-call hours, often with an agreed-to cap. There should not be more demands on a team than what can be delivered by that team, so it’s important to ensure the on-call pool is of sufficient size. A common mistake is to make the on-call pool consist of only SREs. This is an artificial limitation. On-call pools should also be staffed on an opt-in basis. As soon as a team feels their time is being abused, it’s a swift downward spiral.
Another cultural touchpoint is that of planning and goal setting. Because SREs are closest to the problems of production, they tend to have a good sense of what’s most important, what’s burning, what’s causing the most pain. By allowing an SRE team to set their own priorities and roadmap, you empower that team, and they will be much more effective and happier. Management should follow the practice of developing an agreed-upon, shared understanding of expected outcomes. Does the business need to move faster? Do users need their results faster? A common antipattern here is Taylorism: the model of leaders independently setting and prioritizing detailed plans and tasks, then assigning them to the workers.
Building a Platform of Capabilities
An SRE team can build a platform to deliver capabilities to their partner teams, ideally scaling their contribution to the entire organization over time. By introducing resilience mechanisms into shared services, practices, norms, and code, these teams can develop a shared platform made of automation, code, shared libraries, pipelines, procedures, norms, documentation, playbooks, and, yes, even that special undocumented knowledge that lives only in people’s heads. Instead of each team attempting to create their own best practices, these can be baked into the platform. Products can be built from scratch on the platform (so-called “digital natives”) or can be ported onto the platform. As the platform’s capabilities increase over time, and the team becomes more confident and comfortable with its operational characteristics, increasingly critical workloads can be ported over. By adopting this model of encoding capabilities into a platform, the SRE team can scale their impact by applying capabilities to many services together. The platform is an internal product and should be governed like one, treating service teams as customers, taking feature requests, and tracking defects (see Figure 4-1).
As a team builds a platform, the question arises, “What to build first?” By onboarding low-risk services first, you can keep the initial set of capabilities to an MVP, or minimum viable product. Over time, you’ll add more capabilities. But which ones are next? There are two sources: your developers and your environment. That is, build what they ask for, e.g., “We need a message bus!” and build what you know they’ll need, e.g., “There has to be a scalable service discovery system or else this will never work.”
The environmental capabilities often come down to:

- DevOps improvements, such as enhancing the software development life cycle (SDLC) and getting more code out, faster and safer
- Reliability engineering improvements: minimizing risk from the errors that do creep in
For reliability engineering improvements, we recommend developing “the virtuous cycle” within your teams. If you’re not sure what to improve, you can learn by looking at your outages and doing the following:
- Institute SLOs.
- Formalize incident response.
- Practice blameless postmortems and reviews.
- Use risk modeling to perform prioritization.
- Burn through your reliability backlog based on error budget or other risk tolerance methods (a minimal error-budget sketch follows this list).
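As a concrete illustration of that last step, an error budget falls directly out of an SLO target. The sketch below assumes a request-based SLO and uses made-up traffic numbers; it simply shows the arithmetic a team might use to decide whether to keep shipping features or to burn down the reliability backlog.

```python
# Minimal error-budget arithmetic for a request-based SLO.
# The SLO target and traffic numbers here are hypothetical.

slo_target = 0.999            # 99.9% of requests should succeed
total_requests = 50_000_000   # requests served in the measurement window
failed_requests = 32_000      # requests that violated the SLO

error_budget = (1 - slo_target) * total_requests  # allowed failures: 50,000
budget_consumed = failed_requests / error_budget  # fraction of budget spent

print(f"Error budget: {error_budget:,.0f} failed requests")
print(f"Budget consumed: {budget_consumed:.0%}")  # 64% in this example

if budget_consumed >= 1.0:
    print("Budget exhausted: prioritize the reliability backlog over new features.")
```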
Let this cycle be your flywheel to spin out new capabilities. For example, if you have an outage in which a deployment introduced a bug that crashed every server in the fleet, you’ll want to develop a way to reduce that risk, possibly through something like blast radius reduction, using canary releases, experiment frameworks, or other forms of progressive rollouts. Similarly, if you find that a memory leak is introduced, you might add a new form of load test to the predeployment pipeline. Each of these is a capability that is added to your platform, which can provide benefit and protection for each service running on the platform. One-off fixes become rare as generic mitigation strategies show their value.
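To give a feel for what such a capability can look like in code, here is a minimal sketch of a progressive rollout loop. The functions set_traffic_split and observed_error_rate are hypothetical stand-ins for your deployment and monitoring systems, and the stage percentages, error threshold, and soak time are assumptions you would tune per service.

```python
import time

# Hypothetical hooks; replace with whatever your deployment and
# monitoring systems actually provide.
def set_traffic_split(version: str, percent: int) -> None:
    print(f"routing {percent}% of traffic to {version}")

def observed_error_rate(version: str, window_minutes: int) -> float:
    return 0.0  # placeholder: read this from your monitoring system

def progressive_rollout(version: str,
                        stages=(1, 5, 25, 50, 100),
                        max_error_rate=0.001,
                        soak_minutes=30) -> bool:
    """Shift traffic to `version` in stages, rolling back if errors spike."""
    for percent in stages:
        set_traffic_split(version, percent)
        time.sleep(soak_minutes * 60)  # let metrics accumulate at this stage
        if observed_error_rate(version, soak_minutes) > max_error_rate:
            set_traffic_split(version, 0)  # roll back; only `percent`% was exposed
            return False                   # surface this as an incident
    return True                            # fully rolled out

# With real hooks in place, this would gate every deployment:
# succeeded = progressive_rollout("my-new-release")
```

The same loop shape accommodates experiment frameworks or feature flags by swapping out the hooks, which is what makes it a platform capability rather than a one-off fix.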
Leadership
Of course, to build such a platform, you need to devote engineering hours, which might otherwise be used to develop features. This is where influence is needed, all the way up the chain. When development talent is used for both features and stability, a trade-off must be made. It’s important to make sure that the people making this trade-off have the big picture in mind and have the appropriate incentives in place. We are increasingly seeing the role of Chief Reliability Officer, someone senior within your organization who has a seat at the table for strategic reliability decisions (this might be a familiar concept for fans of the book A Seat at the Table by Mark Schwartz [IT Revolution Press]). While this is a common job role for successful SRE adoption, it’s not a common job title, and it is frequently an additional hat that an existing executive is wearing.
Knowing If It Is Working
A well-run organization that understands and values reliability will exhibit a few observable traits. First is the ability to slow or halt feature delivery in the face of reliability concerns. When shipping velocity is the only goal, reliability and other nonfunctional demands will always suffer. Do reliability efforts always get deprioritized in favor of features? Are projects proposed but never finished due to “not enough time”? An important caveat here is that this should not be seen as slowing down the code delivery pipeline—you should keep your foot on the gas.
Another indicator of success is when individual heroism is no longer praised but instead is actively discouraged. When the success of a system rests on the shoulders of a small set of people, teams create an unsustainable culture of heroism that is bound to collapse. Heroes are incentivized to keep sole control of their knowledge and unmotivated to systematically prevent the need for that knowledge. This is similar to the character Brent in The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford (IT Revolution Press). Not only is it inefficient to have a Brent, it can also be downright dangerous. A team has to actively discourage individual heroism while maintaining collective responsibility, because heroism can feel like a rational approach in short-term thinking.
Another sign of a well-functioning team is that reliability efforts are funded before outages, as part of proactive planning. In poorly behaved teams, we see that investment in reliability is made only in reaction to an outage or series of outages. Although such an increase might be necessary, the investment needs to be maintained over time, not just treated as a one-off and abandoned or clawed back “once things get better.”
To illustrate this further, consider a simplification of your organization’s approach to reliability as two modes: “peacetime” and “wartime.” Respectively, “things are fine” or “everybody knows it’s all about to fall apart.” By considering these two modes distinctly, you’re able to make a choice about investment. During wartime, you spend more time and money on hidden features of your platform, infrastructure, process, and training. During peacetime, you don’t abandon that work, but you certainly invest far less.
However, who decides when a company is in wartime? How is that decision made? How is it communicated throughout the company in ways that don’t cause panic or attrition? One method is to use priority codes, such as Code Yellow or Code Red. These are two organizational practices that aid teams in prioritizing work. Code Yellow implies a technical problem that will become a business emergency within a quarter. Code Red implies the same within days, or it is used for an already-present threat. These codes should have well-defined criteria that are understood and agreed to by your entire leadership team. Their declaration must be approved by leadership for the intended effect to take place. Declaring such a code should result in changed team priorities, potentially the cessation of existing work (as in the case of a Code Red), the approval of large expenditures, and the ability to pull other teams in to help directly. Priority codes are expensive operations for an enterprise, so you should make sure there are explicit outcomes. These should be defined from the outset as exit criteria and clearly articulated upon completion. Without this, teams will experience signal fatigue and no longer respond appropriately.
Choosing to Invest in Reliability
What other, less dramatic changes might be under the purview of such a reliability leader? Chief among them are policy and spending. Setting organization-wide policy tends to be inconsistent at best when driven from the bottom up. It’s far more effective to have a leadership role in place to vet, dedupe, approve, and disseminate these policies as they’re proposed by teams. Similarly, spending company funds on staffing, hardware, software, travel, and services is often done in a hierarchical manner.
One has to consider the value of reliability within the organization before building out a structure like the one described earlier. For this to make sense, the organization must treat reliability not as a cost center but as an investment and even a product differentiator. The case to be made is that reliability is the hidden, most important product feature. A product that is not available for use, too slow, or riddled with errors is far less beneficial to customers, regardless of its feature set. Setting this direction must be done at the executive level to set a consistent tone, especially if this is a new orientation.
One simple argument for this is that reliability can be a proxy for concepts that are better understood, like code quality. If a system introduces user-visible failures, applying reliability practices such as gradual change can make the system appear, to your end customers, to have fewer errors, even before the underlying code quality issues are addressed. For example, by rolling out a broken change to only 1% of customers, 99% of those customers don’t experience the problem. This makes the system appear 100 times better than it actually is and reduces support costs and reputational damage.
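Stated as a rough rule of thumb (the 1% figure above is just an example):

$$
\text{user-visible error rate} \approx \text{exposure fraction} \times \text{defect error rate},
\qquad \text{e.g., } 0.01 \times E = \frac{E}{100}.
$$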
Making Decisions
By setting up reliability as an investment into a stronger product, you’re able to make longer-term plans that have far greater impact. Traditional models treat IT as a cost center, focusing entirely on reducing that cost over time. At the end of the day, it doesn’t matter that the service is cheap if it’s not up. You can still apply cost reduction, but you should consider it after you’ve achieved reliability goals. If you’re finding that the cost of maintaining your stated reliability goals is too high, you can explicitly redefine those goals—i.e., “drop a 9”—and evaluate the trade-offs that result.
To achieve all these goals, you’ll likely need to persuade some governing board, group of decision makers, or executives. You’ll need their buy-in to staff and maintain a team over time, provision resources, and train and further develop team members. This should be seen as a long-term investment and explicitly funded accordingly, not as a hidden line-item in some other budget.
Staffing and Retention
Staffing and role definition can also present antipatterns. When building out an SRE team, it can be tempting to hire an SRE from the outside to impose order on the existing team. This can actually result in wasted effort, with the hired SRE often failing to understand the nuances of the team or the technology already in place and falling back on previously used methods, without knowing whether they’re in fact reasonable in the new job.
We suggest growing existing teams into SRE teams instead. Simply renaming them isn’t effective, but providing a structured learning path and an environment to grow and thrive can certainly work. There are cases where the transition might fail, of course. If individuals are not set up to succeed and instead are expected to immediately turn into a senior SRE just by “reading the book,” they can become frustrated and look for employment elsewhere. Similarly, some engineers don’t see the reason for change, aren’t incentivized, or otherwise are highly resistant to adopting a new role. By providing paid education, time and room to learn, and the context to help your team understand why the change is needed, you can successfully transition a team into the SRE role. This takes time, effort, and patience. In cases where it doesn’t stick, it’s important to conduct an exit interview, specifically to address the transition and what did or didn’t work for an individual. You may uncover flaws in your plans or discover that it isn’t being executed in the way you intended. Finally, as you ask teams to do more complex work that has higher impact, note that this is, literally, higher-value work and the team should be compensated for it. That is, as your team starts acting like SREs, you should pay them like SREs, or else they’ll move to somewhere that does. If you pay teams to learn high-value skills and they leave to use those skills elsewhere, you have only yourself to blame.
Upskilling
When growing and transitioning existing staff into SREs, it is critical to build an upskilling plan. This includes both the what and the how—that is, what skills are needed in the role and how you’ll go about enabling staff to acquire those skills. Tools like skills gap analyses and surveys are particularly useful here to check assumptions about the foundational skills that are required for the job. These skills, often not talked about specifically in SRE literature, are nevertheless essential to allow SREs to scale their contributions organization-wide. For example, it is not unheard of for traditional operations teams to be unfamiliar with software engineering fundamentals such as version control, unit testing, and software design patterns. Ensuring that these baselines are a part of your upskilling plan and that they are tailored to each learner profile is crucial, not just to establish a critical mass of skill on the team but to provide a smooth on-ramp for individuals into the new expectations of their role (and thus help reduce team churn).