Anyone can do site reliability engineering (SRE). Sure, Google pioneered the practice, but you don’t have to work for a tech giant to use SRE to increase reliability and improve system performance. At Google’s 2019 Cloud Next conference, I sat down with Stephen Thorne, site reliability engineer on Google’s customer reliability engineering team and co-author of The Site Reliability Workbook, to talk about how organizations, both large and small, can use SRE to reduce operational costs, improve reliability, and create productive cross-functional teams.
During the interview, we covered strategies for getting started with SRE, including how to get buy-in from the whole team, from management to operations. We also talked about potential hurdles to implementing SRE; why postmortems should always be blameless; what success looks like for an SRE team; and best practices for reducing toil, measuring reliability, moving to the cloud, and more.
Here are some highlights from our conversation:
Getting buy-in from management
When we're talking about how to get management buy-in, we see SRE as providing value. The value that SRE provides to a business might come in various forms. It might be that you're currently having problems with reliability, with your operational load, or with your operational costs. There is something you need to do in order to scale up and be more effective in your environment. SRE allows you to say, "Are we reliable enough? And if we're not, what are we going to do about it?"
Biggest roadblock to doing SRE
One of the things I see being a significant barrier is the psychological safety required to be confident working in production, and to be responsible for both engineering and production. At Google, we have this locked down. We've got the concept of blameless postmortems, but it's not just that. You have this confidence that if you're toiling too hard, you can go to your leadership and say, "Help." And your leadership will say, "Absolutely. That's a problem. We'll help you drive that down."
But in another organization, you might go to leadership and say, "Help, we have too much toil right now." They might say, "Okay, so you're going to work harder, aren't you?" I think one of the things we have at Google, which I would love to see in more organizations, is the implementation and the feeling of psychological safety: the sense that if you have problems, it's not your fault that you have them. You can go to leadership, you can go to your peers, you can go to your development teams, and you can say, "Let's work together to make this a better place for everyone."
Keeping the postmortem blameless
The reason you really want a blameless postmortem is that as soon as you blame a system, or a human, or a thing that happened, you stop looking for all of the other causes of what went wrong.
Signs of a successful SRE team
What you want to see from a successful site reliability engineering team is that they know how reliable their system is. They have a plan for how to improve it over time, or reduce their toil over time; they're delivering on that plan; and those deliverables are actually causing a change.
So, a successful SRE team is able to demonstrate the impact of the work they're doing. If you have an SRE team that was running a reliable service last year and running a reliable service this year, but can't tell you what projects they completed in that time that actually had a measurable impact, it's like, what are we doing here?
Is it possible to automate yourself out of a job?
If you find SREs who have actually managed to automate themselves out of a job, you have struck gold. Because they now know how to do this for other teams, they know how to scale up their work, and you should grab onto these people with both hands and say, "You are the best SREs we have right now. Help everyone else achieve your success."
This post is a part of a collaboration between O’Reilly and Google. See our statement of editorial independence.