Chapter 4. Talking About SRE (SRE Advocacy)
Welcome to a highly meta chapter in which we are going to be talking about talking about SRE.1 Effective SRE advocacy turns out to be crucial for both individuals and organizations, for reasons we will see in a minute. In this chapter, we’re going to explore the craft of SRE advocacy and what resources you have available to hone it for both vantage points. We will talk about the kinds of stories that matter and how to pick the most effective ones, and I will offer a whole slew of tips on the subject. This chapter is about the stories we tell ourselves.
Why It Matters, Even Early in Your Experience with SRE
Without fail, you will at some point have to explain to others what SRE is and why it matters. That point invariably comes much sooner than you would expect,2 which is why this chapter is right up front with the rest of the introductory material.
When I say explain it to others, I mean people who run the gamut from your grandparents to your CEO. Sometimes you will just be satisfying someone’s curiosity at a cocktail party, but it’s equally likely you will be in a position of singing for your supper as you have to justify SRE’s presence in your organization and why someone should continue to pay for it.3 It is also very common to have to explain to another group in your org what SRE is and why it would be in their best interest to engage with you. I don’t believe I am exaggerating when I say that the survival of SRE at a company is predicated on the strength of its advocacy.
I believe that survival is probably a strong enough motivator for you to pay attention to advocacy, but I will throw one more reason into the mix: identity. I’m convinced that the stories we tell ourselves are a major way identity is formed. I’m very confident that what SRE means personally and organizationally matters to you (or you are in the wrong book, my friend).
When It Matters
So when is SRE advocacy particularly important? I’ve already mentioned a few scenarios, but let’s explore this question a little further. In my experience, SRE advocacy is crucial in the personal context when dealing with hiring and career changes. Not only do you have to sell yourself and why you personally will improve your potential employers’ lives, but part of the sell has to include why your SRE approach/mindset can make a difference.
If you are applying for a job at an organization that already has an SRE story (weak or strong), this is your moment to determine whether your SRE story matches theirs. I’d strongly suggest that you pay very close attention to how well they are able to articulate their SRE story in the hiring process. If they can’t be clear or consistent in their description to you of SRE and its purpose, that will give you a really good signal about how effective they are at expressing this to others in the organization.
Organizationally, I’ve seen SRE advocacy be crucial in two settings: during the early stages of SRE and when attempting an expansion of influence. The first setting is pretty obvious; if you are trying to establish SRE in an organization, there’s a lot of education and justification that has to take place. We’ll talk about this more in this chapter, but how SRE gets framed in discussions can have a significant impact on how people will treat it going forward.
The second setting where advocacy is crucial is the expansion phase. I might say it like this: “Cool, you’ve been able to set up a new SRE group. Now you have to get others to play with you. How are you going to do that?”4 Effective advocacy is how you are going to do that.
I realize that everything I have said so far in this chapter makes it sound like advocacy is a do-or-die proposition—it’s true, I actually believe that. The good news is that this entire book is chock-full of ideas, resources, and guidance to support your advocacy efforts. Pretty please use everything you can from it to support those efforts.
Get Your Story (and Audience) Straight
For me, SRE advocacy (well, any advocacy) starts with a story I want to tell.5 The reason why I start with a story is that humans are wired to be story-receiving machines. It is one of the best ways we know to relate complex, multivariate information. It may not be obvious at first glance, but that is exactly what the concept/definition of SRE is—a collection of some pretty sophisticated ideas bound together. And when you begin to discuss it, different people zero in on different aspects of it based on their background and prior experience.
This is why I came up with a precisely vague6 definition that I could hang all of this on to frame the conversation. This is the definition of SRE we saw in Chapter 1; let me repeat it here for convenience:
Site reliability engineering is an engineering discipline devoted to helping organizations sustainably achieve the appropriate level of reliability in their systems, services, and products.
As I mentioned before, whenever I speak to people using this definition, I usually ask them to pull out the keywords they notice (like reliability and appropriate). Each one of those words is a door into an entire room’s worth of discussion; letting listeners pick the words means they get to open the doors that stand out to them. I have used this definition over and over again because it works well to draw people into a discussion.
My talks with groups are usually one part me passing air over my larynx, two parts open discussion/Ask Me Anything. I try to get a sense of my audience’s situation and how I can help. To my surprise, over time, something weird happened with the talks that used this interactive definition—something I didn’t expect.
I started to notice a pattern. While it wasn’t completely predictive,7 I found that different audiences would pull out different words based on their current organizational challenges. During the Q&A period after the talk, the people who felt underwater more often than not had noticed the words sustainably and appropriate. The groups that felt they hadn’t yet reached the level of credibility with others they desired wanted to talk about the word discipline. Those who desired this credibility from partner development teams often wanted to dig into engineering. Groups coming off a series of outages couldn’t get enough of reliability, and so on. I wouldn’t say this approach is solidly diagnostic, but I have come to find it practically useful for helping me shape how I talk about the subject.
Why do I bring this up? For me it is a clear reminder that when we tell someone a story about SRE as part of our advocacy efforts, different people will hear different things in that story based on their background and their current needs. I know that “consider your audience” isn’t remotely new advice, but it never hurts to have a reminder.
If you can (with authenticity and proficiency) speak using the terms and register of your audience, do so. If you speak the language of the finance people in your organization and you are talking to an audience of those people, by all means use finance terms in a nongratuitous fashion.
That being said, it is possible to take this too far. If, during your talk preparation, you get even a whiff of constructing a “buzzword bingo card,” dial it back. Or maybe when you say certain things, they ring hollow to your ear. You get to decide what works for you. I’ve learned that there are certain business-speak words that set me on edge when I hear them, so they almost never pass my lips.8 Ideally you can find the right intersection between the words you like to speak and the words the audience is accustomed to hearing.
Some Story Ideas
A moment ago, I gave an example of a story you can tell about SRE, namely the “What is it?” or definitional story. It’s a good place to start, but there’s a whole panoply of stories that can and should be told about SRE, depending on what you hope to accomplish in the telling. Here’s a list of some other ideas off the top of my head:
- Efficacy
A story about how a partner group was suffering with reliability issues, SRE got involved and helped with X, Y, Z, and now they are in a better place, as shown by…
- Reputation
A story about how famous company X adopted SRE.9
- Possibility
A story about how comparable company X adopted SRE (how it went well, how it had issues but then overcame them, etc.). If they can do it, surely we can too…
- Surprise
A story about an outage and the surprising result or finding uncovered by SRE as part of their deftly run postincident review process.
- Transformation
This is what things used to be like for us, but now, N months later, we are in this better place.
- Day in the life
Here’s what happened during a sample day/week, including a selection of the things we did to contribute to company-wide or partner team success.
- Mystery/puzzle
X was a situation that made no sense; here’s how we solved the mystery step by step.
- Expert at work
Here’s how an expert approached a problem, how they thought about it, steps they took, etc.
There are many other ideas for stories you can tell at work. If none of the listed ideas inspire you, perhaps look back through the years of SREcon session videos (see Appendix C); I’m pretty sure you will find some compelling seeds for your own talks.
As a tip on this topic, I recommend you collect stories as you go. The life of an SRE is fortunately or unfortunately never dull. On a daily basis, we find ourselves in situations that make for good stories to tell others. Be it an outage, a meeting with an aha moment where someone has an interesting take on the subject, a tech problem where the answer to a question led to an even better question—all of these are great story fodder. I highly recommend you do this: keep notes on these things as they come up, either in a running file/online document or in a (gasp) paper notebook.
Other People’s Stories
One important caveat about collecting stories: you must get explicit permission to retell them from both the people involved and the organization. Many organizations have explicit policies and processes around public presentations. If you plan to tell these stories publicly, be sure to get the proper clearance to do so.10
One variation related to other people’s stories that is a little harder, but orders of magnitude more effective: don’t just collect people’s stories to retell, collect the people instead. It is great to tell someone else’s story, but often, if you can have that person do the telling, it will be a kerjillion times more effective and impactful.11 Even if they can’t be a part of your presentation every time you give it, you might be able to record a video of them speaking and replay that.
Secondary Stories
Just a quick note about stories like the ones we’ve been discussing: all of them have room for an ulterior motive or two. Because stories can be such good carriers of information, there’s bandwidth not only for the main purpose of the telling but also for secondary stories. I’ll pick an idea from the previous story ideas at random to demonstrate.
Let’s say you’ve been frustrated with the lackluster postincident reviews your organization has been doing lately. Maybe they have been a bit on the perfunctory side; perhaps it is clear to you that there’s more to be learned in the process. One sign of this is that the last three have all been attributed to “human error” as the final conclusion.
Next time you get called upon to talk about a past outage to management, perhaps you could choose the “surprise” idea from the list. In that telling (and I know you see this coming), you could be sure to construct a cliffhanger midstory that includes something like “Originally, we were going to attribute this to human error, but something about that didn’t sit right…” At the end of the story, you could posit the question, “What else could we learn if we didn’t prematurely conclude our investigations and attribute failures to human error?” or make some other not-so-subtle statement.12
In a similar category of “a good device, but try not to be too heavy-handed,” it can be useful to find a story from your own experience where you successfully modeled the behavior you would like to see the organization adopt. “Here’s how I failed and leveled up based on that experience” can be popular because everyone loves a good failure story. It has the plus of coming across as authentic (perhaps the best story is your own story) without being too preachy if handled properly. One suggestion: have a colleague review your presentation before you give it. These kinds of stories can have a “devil is in the details” trap. It can be hard to decide how detailed your recollection needs to be to get your point across. Other people are likely a better judge of this than you are, hence the suggestion to have someone else review it first.
The Challenges the Stories Present
The stories we deal with in SRE advocacy are sometimes harder to tell than you might expect. Let’s talk about a few challenges that get in the way:
Challenge 1: Difficult stories
One very specific challenge we have when constructing stories for SRE advocacy is that sometimes, we have to tell the story of the dog that didn’t bark.13 Often we have to describe situations where the value of our work is seen in what didn’t happen—the systems that didn’t go down, the outages we didn’t have, the data loss that was prevented, and so on. Telling a compelling story of a negative or of things functioning the way they were designed to is almost always harder than describing some crisis that did happen.
So how do we handle this challenge? For me, the answer centers on contrast. That’s the key element that lets us make sense of photographic negatives. Our task in this scenario is to bring into sharp relief an object (like your system and how it operates) against a background (the load, the behavior of your dependencies, the conditions that would have taken it down in the past, the sociotechnical context, etc.). Sometimes we can begin with a description of a related outage, stopping at the point in the story when the problem is no longer happening and explaining what you changed and its positive results.14
In the sidebar “Resilience Engineering Again?”, I note that we should be taking the opportunity to discuss questions like “What contributed to things going well? And how could things have gone worse?”15 Here’s that opportunity.
Challenge 2: How the stories develop
Another challenge you are going to encounter sooner or later, especially if you are speaking to Western audiences,16 is that reliability work is very seldom linear in nature. The dragon we slay once doesn’t usually stay slayed. Don’t expect your SRE stories to be linear, either. At some point, it will become abundantly clear to you that the shape of the work is a lot messier. Sometimes we have loops (recall the nurture feedback loops from Chapter 2), sometimes our reliability zigs and zags, maybe you had a bad month due to seasonal traffic, and so on.
We very seldom get a full picture that looks like a perfectly straight line from bad to better. If you were to zoom out from the complete graph, you would more likely get something that looked like a child’s crayon picture. This complicates the story we want to tell. But that’s OK—it’s just the existential truth we have chosen to live with.
There are two ways I know to handle this concern: either elide the issue in your head and make peace with the gross simplification you are about to engage in (ideally, disclosing it to your audience) or be very clear that you are describing a select slice of or window into the larger picture. I believe the longer you are in the SRE realm, the more striking the nonlinearity of our basic reality becomes. My hope for you is that your skill at translating this reality for others to understand grows at the same rate as your awareness.
Challenge 3: Conveying the right lessons
Be very cautious about emphasizing “heroic effort” stories because they can have unintended negative consequences. It can be very tempting, especially in situations where you are craving external respect and recognition, to lean into narratives where a person on the team rappelled down the side of the building and then valiantly fought the blaze for 30 hours straight without food or sleep until it was vanquished.
All of that may have happened, but glorifying “hero culture” will lead you to construct a culture and organizational expectations that are unhealthy and unsustainable. When I hear “30 hours without food or sleep,” I hear it as a failure in the organization’s incident response procedures, not something to be celebrated. “Worked straight through the weekend/holiday/night,” “woke up the entire team,” and “80-hour work week” are similar red flags that should be approached as problems, not evidence of commitment or dedication. If you do need to say these things during a readout of an incident, be sure to emphasize fixing them in your postincident review along with the rest of the repair items you might have.
To understand this topic better, I highly recommend you watch one of the most powerful talks I’ve seen: “The Cult(ure) of Strength” by Emily Gorcenski. It was one of two tech conference talks where I cried. In this session, Gorcenski did an excellent job of capturing the broken thinking that leads us into the “hero culture” trap.17
Challenge 4: Picking the right main character
Another people-related tip: don’t forget the people when telling stories for SRE advocacy. An SQL server is not the only important character in your outage story. Another existential truth when it comes to SRE is that all of our systems are sociotechnical. Large, complex systems do not run in isolation. They run in a context that includes people, so if your story consists entirely of things with blinking lights that go beep-boop, it is almost certainly incomplete.
One Last Tip
To end this chapter, let me offer one last tip that is true for all sorts of advocacy and public speaking, not just SRE advocacy. I’ve had the good fortune to be able to give many talks and presentations over the years. I have learned that my best talks are those that changed me during the preparation or presentation. I wish the same experience for you at some point. Get in touch; I’d love to hear that story from you.
1 You know what you must do. Find someone who does not have this book yet and talk about this chapter. The fate of the meta metaverse is in your hands.
2 Like “day one” sooner. In addition to wanting to prepare you for this situation, I strongly feel that the process of talking about SRE with other people will immediately strengthen your own understanding of SRE, which is yet another reason to think about this early in the book. It also lives here because advocacy has a foot in both individual and organizational contexts.
3 There’s plenty more discussion on handling the business aspects of SRE in Chapter 13.
4 Remember that relentlessly collaborative thing?
5 Later in this book, I describe storytelling as a core skill to have as an SRE—here’s one context where that is very clearly the case.
6 By precisely vague, I mean the definition is intentionally vague enough to encompass a wide range of work toward reliability without being so vague as to be applicable to all engineering.
7 I will be the first to admit I could be making this up; I’m not immune from the human trait of identifying patterns where they don’t really exist.
8 I’m a little hesitant to reveal them publicly for fear of them showing up in some sort of torture scenario, but here’s one: learnings. I can’t stand the word learnings. Which are your sandpaper words?
9 Full disclosure: this is my least favorite out of the bunch. Your company and famous company X are almost always going to be very different entities on the inside—what works for them may not work for your company (see the stories of people not becoming Google even though they followed everything in the SRE book). That being said, sometimes management wants to be reassured by SRE’s bona fides based on a famous company they hope to emulate. Use with caution if you have to use this at all.
10 You didn’t hear it from me, but it is in your best interest to build a good direct relationship with whomever clears materials for external publishing in your organization. If you gain a reputation for being extra careful around these rules and extra easy to work with, that will often smooth your path to future approvals.
11 I want to acknowledge that this path can be fraught with peril; for example, in cases where you are a much better speaker than your special guest. This then becomes a speaker preparation and coaching problem (or a video-editing problem), which in almost all cases can be overcome. I assert that hearing someone’s experience firsthand is ultimately going to be more impactful, even if the speaker isn’t a pro. I can coach someone to be a better speaker, but I can’t coach someone into having the original experience.
12 Want someone else’s premade story with exactly this conclusion? Check out Nick Stenning’s superb 2019 SREcon EMEA talk, “Building Resilience: How to Learn More from Incidents”.
13 Arthur Conan Doyle reference, although have you heard of the preparedness paradox? Might not want to read that Wikipedia article; it may make you sad.
14 Ironically, this is an example of counterfactual reasoning (i.e., using something that didn’t happen to explain something that did), which I will warn about in Chapter 10.
15 Pretty sure at least one of these was a John Allspaw talk, so credit to him for giving such good talks that the questions stick in my head even past my precise memory of when I heard them.
16 Other cultures don’t necessarily expect their stories to follow a linear structure.
17 It’s also the talk that stopped me from ever again using the term war stories to refer to an outage story.