Chapter 1. Defining “SRE”

Site Reliability Engineering.

Even when the acronym is spelled out, confusion often remains. The “E” can stand for the practice (“Engineering”) or the people (“Engineers”)—we’ll use it to mean both. The “R” generally stands for “Reliability,” but we’ve heard people use “Resilience” instead. And the original interpretation of the “S” (“Site,” as in “website”) has expanded over time to include “System,” “Service,” “Software,” and even more widely “online Stuff.”

In general, SREs work across the realm of “Anything” as a Service, whether that is Infrastructure (IaaS), Networking (NaaS), Software (SaaS), or Platforms (PaaS)—anywhere the fundamental customer expectation is that the online service can and must be reliable.

The use of service level indicators (SLIs) and service level objectives (SLOs) as meaningful indicia of service health is one of the distinguishing characteristics of SRE practice. It is important to recognize that SLOs are symptoms of a healthy relationship between the reliability (SRE) team and the feature team, not a compliance exercise dictated by management. In the pursuit of greater reliability, SREs will focus on bringing as many components of the greater system space as possible into a resilient, predictable, consistent, repeatable, and measured state. Major areas of expertise can include:

  • Release engineering

  • Change management

  • Monitoring and observability

  • Managing and learning from incidents

  • Self-service automation

  • Troubleshooting

  • Performance

  • The use of deliberate adversity (chaos engineering)

As Stephen Thorne puts it:

[SREs] … have the skills and the mandate to apply engineering to the problem space. [A] well functioning SRE team must do […] operations mindfully and with respect to their actual goal, [helping] the entire organisation take appropriate risks.

SREs (engineers) can be deployed to focus on infrastructure components, as short-term consultants for feature-oriented teams, or as long-term “embedded” teams working with their feature-oriented counterparts.

Depending on the size and organizational structures present within a company’s engineering organization, SRE may be visibly manifested in distinct roles and teams with distinct management, or SRE principles and approaches may be evangelized through portions of the engineering team(s) by motivated individuals without explicit role recognition. SRE will look different when instantiated in organizations of 50, 500, or 5,000 engineers. This context is important, but often missing when writers or speakers are discussing how their companies implement SRE.

Digging Into the Terms in These Definitions

While it can be helpful to have pithy definitions to refer to, it is important to build and share an understanding of the key terms within those definitions. Let’s explore them in a bit more detail.

Production Feedback Loops

Everyone knows and loves feedback loops—at least in theory. Often, feedback processes and systems don’t get the care, feeding, and attention that they need to be effective. Feedback loops are, at their core, about communication within a sociotechnical system: communication on a technical level between threads, processes, servers, and services; and communication on a social level between individuals, teams, companies, regions, or any other level of distinction.

Inadequate feedback and communication channels lead to scenarios such as the classical divide between (feature) developers and operations. Jennifer Davis and Ryn Daniels explain in Effective DevOps (O’Reilly) that people naturally shift to focus more and more narrowly on the areas that they are interested in and/or are rewarded and evaluated on. Feature developers are evaluated on their success at creating and delivering “features.” In the classical dev/ops split, operators or SysAdmins are evaluated on their success at keeping systems running and stable. Because of these different incentives, the teams are pushed into conflict as each contends for the primacy of “its” goal.

SREs have an intermediary role, and part of their effectiveness comes from having a dedicated purpose that includes establishing and maintaining feedback loops from operations to the feature developers. If services are not working well and the developers don’t know about it, then either the right feedback mechanisms have not been built or the mechanisms have been built but inadequately socialized with or adopted by the dev teams.

Data-Informed

It is critical that these feedback loops be automated in order to scale. Scale is further enabled by relying on data rather than opinion. Measurements are inevitably artifacts of their time and environment, constrained by the technologies that are used to obtain them. Changes in the environment or better understandings of the dynamics of a system can lead to valid technical arguments about whether a measurement is accurate or effective in a particular context. Continually improving the measurements to adequately inform product decisions is one of the benefits of having a standing SRE team. As noted by Lord Kelvin:

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.

Appropriate Level (of Reliability)

A simple assumption is that a service should “always” be available. In the Western world and throughout many of the major cities around the globe, consumers are accustomed to a continuous supply of electricity, water, and “the internet.” The suppliers of those services put a significant amount of work into making them “always” available, but if you look closely at the long-term availability there are frequently outages. Often the outages are unnoticed by the end consumers, but when they are prolonged—caused, for example, by major natural disasters such as hurricanes—the loss of usual services becomes a headline issue.

In the mid-third century B.C., philosophers in China captured the paradox of trying to make a service never have an outage. The Chinese phrasing of the issue is “a one-foot stick, every day take away half of it, in a myriad ages it will not be exhausted”3—and this applies directly to reducing outages.

If a service is allowed 500 units of outage in a given measurement period, holding that same cumulative outage budget as longer and longer measurement periods are considered demands progressively greater effort, because the allowed outage per unit of time keeps shrinking.
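The arithmetic behind this can be sketched quickly. Treating the “units” as minutes (an assumption made here purely for illustration), a fixed 500-minute outage budget translates into an ever-higher required availability percentage as the window grows:

```python
# Illustrative only: a fixed outage budget of 500 minutes implies an
# ever-higher availability percentage as the measurement window grows.
MINUTES_PER_DAY = 24 * 60

def required_availability(outage_minutes: float, window_days: int) -> float:
    """Availability fraction needed to stay within the outage budget."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 1 - outage_minutes / total_minutes

for days in (30, 90, 365):
    pct = required_availability(500, days) * 100
    print(f"{days:>3} days: {pct:.3f}% availability required")
```

Over 30 days, 500 minutes of outage is roughly 98.8% availability; over a year, the same 500 minutes demands about 99.9%, which is a materially harder engineering target.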

Raising a service’s reliability target also requires that every one of its dependencies support the higher target.

Determining the appropriate level of availability for a service based on the nature of the service, the users, and the costs involved is a business decision, not a technical one. The SRE team(s) provide mechanisms to track and manage the outage “budget.” The usual industry terms in this realm are:

Service level indicator (SLI)

What you measure and where the measurement is taken

Service level objective (SLO)

The goal or threshold of acceptable values for the SLI within a given time period

Identifying good SLIs and establishing meaningful SLOs can be difficult and nuanced. It is a perennial topic of discussion on Twitter and in various conferences. For now, we’ll refer you to the much longer expositions of this topic in Chapter 4 of Site Reliability Engineering and Chapters 2 and 3 of The Site Reliability Workbook, edited by Betsy Beyer et al. (O’Reilly).

When failures do happen, SREs are frequently at the forefront of remediation because of the system-wide perspective they bring to the incident response team. Their commitment to learning from failures and to reducing outage durations gives them an explicit concern with making both incident response and the learning that follows it as efficient as possible.

Sustainable

A site needs to have an appropriate target of availability, based on an analysis of the business costs and benefits. Part of those costs are the human aspects of stress and disruption involved in developing and maintaining the desired level of availability. Traditional “ops” roles have often romanticized the heroic on-call first responder who single-handedly gets the systems back online by carrying bits from one rack to another at speeds typically only achieved by The Flash.

Generally, the SRE community deplores the need for heroic measures and strives instead for response patterns and system capabilities that do not require extraordinary efforts. This leads to valuing low-noise, actionable alerting, teams that are sized for reasonable on-call rotations, automated response and remediation, and self-service platforms for feature teams to be able to perform their appropriate work without interrupting the SREs’ development work.

“Sustainable” also drives the emphasis on blameless postmortems to learn from the failures that do happen so that the systemic defects that led to a failure can be addressed in both current and future services.

Reliability-Focused Engineering Work

To be considered an SRE team, the team needs to be working on projects that will “make tomorrow better than today.” It needs to be fixing reliability problems in the product codebase as well as building tools and systems that will contribute to the reliability of the systems that it supports.

In some cases this may involve building out continuous integration/continuous delivery (CI/CD) pipelines for the organization, but in many cases SREs take that level of automation for granted and are able to focus elsewhere: on fixing design and code choices that degrade reliability or working on monitoring/alerting/observability or capacity modeling/forecasting or load balancing or chaos engineering or dozens of other areas appropriate to a given organization’s needs.

Continuous Improvement

Especially in the consumer-facing online internet service world, nothing stands still for long. Services add new features daily, if not multiple times per day. User expectations rise inexorably, and they demand more, better, faster. Ongoing investment is required to meet these expectations.

Organizational Model

Effective and successful teams don’t happen by accident. The discussions and agreements around SLOs take time and conscious effort to negotiate and track. Keeping teams from being consumed by the ever-increasing demands of users, developers, and services so that they can do the necessary design and development to engineer solutions also requires a nontrivial organizational commitment to reliability and SRE.

Companies in which SRE teams are successful are ones that have made reliability a priority. They staff their SRE team(s) appropriately for sustainable on-call responsibilities and long-term engineering output. They support the engineering project work balance of SRE teams by pushing back on the forces of entropy (interruption) that would otherwise erode the team’s ability to deliver productive product output.

Where Did SRE Come From?

Site Reliability Engineering is, first and foremost, an outgrowth of the “always-on” world of online services. When customers or users can immediately detect service-impacting events, and when delivering either a fix or a new experience has blurred from a discrete “new version” delivery into a continuous process, the critical measures for a service become the time to detect (TTD) problems and the time to respond (TTR) to or mitigate (TTM) such events. SRE can trace some of its thought-pattern lineage back to various historical precursors, but the term was first applied to a designated role at Google around 2003, when Ben Treynor Sloss and his team recognized that traditional approaches could not effectively scale to handle the massive growth of pervasive online services and began to apply software engineering approaches to the previously heavily manual processes of system operations. Besides being manual, these processes were frequently “bespoke,” with custom work being done for each system rather than following a more “factory” model of churning out hundreds or thousands of identical, largely interchangeable commodity systems (and systems to manage those systems).

Case Study: SRE at Google

According to Treynor Sloss in his keynote talk at SREcon14, the origin of SRE as a role at Google dates back to his assignment to run the “production” team in 2003. At the time, that team consisted of seven people. He was no more interested in the flawed approach of using ever more human labor (toil) to prop up badly functioning systems than any other developer, so he undertook the task of avoiding the historical divide between dev and ops by designing aligned incentives for both groups through objective data (SLOs). With a commitment to reliability that was backed organizationally from the top, Google set in place incentive frameworks that supported a balanced approach to reliability and new, shiny features: release control by SLO,4 hiring and workload management practices that kept SREs from succumbing to operational overload,5 and the outage-related goals of minimizing impact and learning from each event to prevent repeat occurrences.6

At the time that Treynor Sloss was creating the SRE team, Google also had a team that was known as “cluster ops” to tend to problems in the cluster systems that provided the foundational environment for all of the other Google services to function. Over time, the cluster ops team was upleveled to the SRE functions and eventually merged into the SRE team.

Google’s SRE team has grown along with the size and scope of Google engineering: by late 2018 it included over 2,500 people. The SRE team at Google is also supported by technical writing experts and technical program managers, who have the same engagement/disengagement prerogatives that SRE teams have with feature teams. These types of specialized roles become important scale enablers as the teams grow in size.

What’s the Relationship Between SRE and DevOps?

There have been lots of opinions expressed across various online and print media comparing and contrasting SRE and DevOps. Donovan Brown distilled one of the most widely accepted definitions of DevOps as “the union of people, process, and products to enable continuous delivery of value to our end users.” Somewhat more expansively, Ryn Daniels and Jennifer Davis wrote:

Devops is a cultural movement that changes how individuals think about their work, values the diversity of work done, supports intentional processes that accelerate the rate by which businesses realize value, and measures the effect of social and technical change. It is a way of thinking and a way of working that enables individuals and organizations to develop and maintain sustainable work practices. It is a cultural framework for sharing stories and developing empathy, enabling people and teams to practice their crafts in effective and lasting ways.

Drawing out the distinction between the most common DevOps practices and SRE, Jayne Groll wrote:

DevOps focuses on engineering continuous delivery to the point of deployment; SRE focuses on engineering continuous operations at the point of customer consumption.

The formal definition of DevOps as reflected in Brown’s description and in his more expansive blog posts exploring the topic differs from general industry practice, which limits the focus of “DevOps engineers” to the “continuous delivery” part of the software lifecycle (as noted in the preceding quotation). Having feature developers on call for incident response for their services in production use may be a part of the “we do DevOps” picture, or it may not.

The priority for SRE teams is on the “delivery of value to end users” portion of Brown’s definition. For an online service, value can’t and won’t be delivered if end users can’t rely upon accessing it—hence the importance of identifying and tracking service reliability. By focusing on value delivery, SRE provides a complement to teams that focus on developer productivity and CI/CD pipelines.

How Do I Get My Company to “Do SRE”?

You can’t buy SRE in a box or off the shelf.7 One of the areas that companies struggle with is the paradox of the underlying simplicity of the principles and the difficulty of applying them. It’s like the tagline for the game Othello says: “A minute to learn…a lifetime to master.”

1 Hat tip to Laura Nolan for this wording.
Also note that the skills and capabilities to troubleshoot production problems and feed that learning back into making things better can and do exist in teams where reliability may be a shared mandate. The relative balance of concerns between reliability and “other things” will affect the effectiveness of the execution.

2 Hat tip to David Blank-Edelman and the Azure SRE leadership team for this wording.

3 Interestingly, the Greek philosopher Zeno had posed his “Achilles and the tortoise” paradox some two centuries earlier, which is an alternate formulation of the same puzzle.

4 As observed by William Gibson, the implementation of these principles was unevenly distributed.

5 The principle was “Drop urgent, nonimportant tasks if you can’t make time for important, nonurgent tasks.”

6 Private communications indicate that this “fully formed” picture of SRE did take some time to evolve and become generally instantiated throughout the organization. The fits and starts, the blind alleys, have been somewhat lost in the mists of time.

7 Much as “people sell devops but you can’t buy it”.
