Chapter 16. SRE Organizational Evolutionary Stages

When I talk with managers of new SRE teams, I find that they are quite curious about how their team or organization can be expected to change over time. They have a very clear understanding of what their team is doing now and how it relates to the rest of the organization, but the future is not nearly as clear.

The best conceptual framework I’ve seen for SRE team evolution comes from “The Evolution of Site Reliability Engineering”, a talk given at SREcon Asia 2018 by Benjamin Purgason, formerly of LinkedIn. In this talk, Ben drew upon his experience leading numerous teams through a set of stages, really nailing five possible stages SRE teams can go through over time (though not in a specific linear order). This chapter draws heavily (with permission) from that talk, with my added commentary. The original talk is also worth watching for the extended examples taken from Ben’s experience at LinkedIn.

Stage 1: The Firefighter

As you know by now, people often come to SRE after having some bad or worrisome experiences with reliability. Maybe they have had a bunch of outages, or perhaps another company in the same field has made it into the press because of a significant spate of downtime. A slightly cheerier scenario is that engineering management comes to the stark realization that there is no way they are going to be able to accelerate their development velocity without things breaking more often. Neither going slower ...
