Stone wall pattern (source: TheDigitalArtist via Pixabay)

Curious about site reliability engineering (SRE)?

The following overview is for you. It covers some of the basics of SRE: what it is, how it’s used, and what you need to keep in mind before adopting SRE methods.

This information comes from SRE experts and SRE material available on O'Reilly's online learning platform.

What is SRE?

In the book Site Reliability Engineering, contributor Benjamin Treynor Sloss—the originator of the term “Site Reliability Engineering”—explains how SRE emerged at Google:

SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.

SRE arose partially as a response to the division between product development and operations teams. Treynor Sloss explains this division in Site Reliability Engineering:

At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension.

But what would happen if these teams weren’t “fundamentally in tension”? How might that improve product development, operations, and the business itself? Treynor Sloss continues in Site Reliability Engineering:

Conflict isn’t an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.

The attributes of SRE

“There are a lot of attributes SRE would share with any engineering discipline: pragmatic, objective, articulate, expressive,” says Theo Schlossnagle, founder of Circonus. “However, one that sets itself apart is a desire to straddle layers of abstraction.”

This means site reliability engineers need a holistic understanding of the systems and the connections between those systems. “SREs must see the system as a whole and treat its interconnections with as much attention and respect as the components themselves,” Schlossnagle says.

Beyond understanding systems, site reliability engineers are responsible for specific tasks and outcomes. These are outlined in the following seven principles of SRE, written by the contributors to The Site Reliability Workbook.

1. Operations is a software problem — “The basic tenet of SRE is that doing operations well is a software problem. SRE should therefore use software engineering approaches to solve that problem.”

2. Manage by Service Level Objectives (SLOs) — Maintaining 100% availability isn’t the goal of SRE. “Instead, the product team and the SRE team select an appropriate availability target for the service and its user base, and the service is managed to that SLO. Deciding on such a target requires strong collaboration from the business.”

3. Work to minimize toil — Toil is tedious, manual work. SRE doesn’t accept toil as the default. “We believe that if a machine can perform a desired operation, then a machine often should. This is a distinction (and a value) not often seen in other organizations, where toil is the job, and that’s what you’re paying a person to do.”

4. Automate this year’s job away — Automation goes hand-in-hand with reducing toil by “determining what to automate, under what conditions, and how to automate it.”

5. Move fast by reducing the cost of failure — The later a problem is discovered, the harder it is to fix. SRE addresses this issue. “SREs are specifically charged with improving undesirably late problem discovery, yielding benefits for the company as a whole.”

6. Share ownership with developers — SRE aims to reduce boundaries. “Ideally, both product development and SRE teams should have a holistic view of the stack—the frontend, backend, libraries, storage, kernels, and physical machine—and no team should jealously own single components.”

7. Use the same tooling, regardless of function or job title — In SRE, you can’t have different teams using different sets of tools. “There is no good way to manage a service that has one tool for the SREs and another for the product developers, behaving differently (and potentially catastrophically so) in different situations. The more divergence you have, the less your company benefits from each effort to improve each individual tool.”
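
To make the second principle more concrete, here is a minimal Python sketch of how an availability target translates into an error budget, that is, the amount of unreliability the SLO tolerates. It is an illustration only, not code from any of the books cited here. The class name `AvailabilitySLO`, the 30-day window, the 99.9% target, and the request counts are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical helper: an availability target over a fixed window."""

    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int = 30  # assumed window length, not prescribed by SRE

    def __post_init__(self) -> None:
        # A 100% target leaves no error budget at all, so reject it here.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def allowed_downtime_minutes(self) -> float:
        """Minutes of full outage the window can absorb while meeting the SLO."""
        return (1.0 - self.target) * self.window_days * 24 * 60

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed_failures = (1.0 - self.target) * total_requests
        return 1.0 - failed_requests / allowed_failures


slo = AvailabilitySLO(target=0.999, window_days=30)
print(round(slo.allowed_downtime_minutes(), 1))  # 43.2
print(round(slo.budget_remaining(total_requests=1_000_000, failed_requests=250), 2))  # 0.75
```

With these assumed numbers, a 99.9% target over 30 days leaves about 43 minutes of tolerated downtime, and 250 failures out of a million requests spends a quarter of the budget. A team might treat a negative remaining budget as a signal to slow releases, though how to act on the number is a policy decision the product and SRE teams make together.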

What’s the relationship between SRE and DevOps?

SRE and DevOps are often discussed together, but they’re not the same thing.

“SRE is a very specific functional role within an organization that ascribes to the DevOps philosophy,” says Schlossnagle. “DevOps is a ‘way’ in which an organization should operate, while SRE is simply a single unit in an organization that adheres to that ‘way’.”

The contributors to The Site Reliability Workbook further explain the relationship between DevOps and SRE:

DevOps is in some sense a wider philosophy and culture. Because it effects wider change than does SRE, DevOps is more context-sensitive. DevOps is relatively silent on how to run operations at a detailed level …

SRE, on the other hand, has relatively narrowly defined responsibilities and its remit is generally service-oriented (and end-user-oriented) rather than whole-business-oriented. As a result, it brings an opinionated intellectual framework (including concepts like error budgets) to the problem of how to run systems effectively …

Or, to put it another way, SRE believes in the same things as DevOps but for slightly different reasons.

Where DevOps ends and SRE begins is open to interpretation. Perhaps the best illustration of this comes from Seeking SRE, Chapter 12, which editor David N. Blank-Edelman included to demonstrate 35 different viewpoints on the DevOps-SRE relationship.

What organizations need to know before adopting SRE

Implementation of SRE requires planning and consensus.

For example, a company that wants to implement SRE needs to be prepared to make agreed-upon customizations that map to its specific business. “The most important thing an organization should consider before moving to SRE is to define what SRE is to them and how it will work for them with a strong position statement that is widely adopted outside the new group,” Schlossnagle says.

SRE requires full buy-in from the company. “Once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support,” writes Treynor Sloss in Site Reliability Engineering.

Here are two examples from Blank-Edelman in Seeking SRE that illustrate the kind of commitment an organization should expect to make. First, for SRE to take root, workplace toil needs to be acknowledged and addressed:

The SRE model of working—and all of the benefits that come with it—depends on teams having ample capacity for engineering work. If toil eats up that capacity, the SRE model can’t be launched or sustained. An SRE perpetually buried under toil isn’t an SRE, they are just a traditional long-suffering SysAdmin with a new title.

Second, organizations should consider how SRE teams are funded:

If you want your move to SRE to be more than a change in job titles, you are going to need to make the case that the SRE team should be funded to do engineering work and attached to teams who own the service lifecycle, from inception to decommissioning.

As you can see, a move to SRE cannot be in name only. It needs commitment and follow-through to succeed.

Learn more about SRE

Ready to take the next step? Check out these resources.

Site Reliability Engineering — In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world.

The Site Reliability Workbook — The Google engineers who wrote Site Reliability Engineering return with this companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

Seeking SRE — The more than two dozen chapters in Seeking SRE bring you into the important conversations going on in the SRE world right now.

How to make a lion bulletproof: Setting up site reliability engineering (SRE) in a global financial organization — Janna Brummel and Robin van Zijll share lessons learned from a year of doing SRE.

Architecting a postmortem — Will Gallego walks you through the structure of postmortems used at large tech companies and debunks myths regularly attributed to failures.

Principia SLOdica: A treatise on the metrology of service level objectives — Jamie Wilkinson offers an overview of SLOs and demonstrates how to implement them in your own projects.
