Chapter 1. First Things First
Welcome to the book! Let’s level-set on what site reliability engineering (SRE) is and where it comes from.
What Is SRE?
You can find a number of definitions of site reliability engineering out in the world. Here is the best one I have been able to construct over the years:
Site reliability engineering is an engineering discipline devoted to helping organizations sustainably achieve the appropriate level of reliability in their systems, services, and products.
When I present this definition to an audience, I usually say that it contains at least three words that, if understood correctly, will lead you to a decent grasp of SRE. If given the chance, I will ask that audience, “Which three words do you think are the most important in this definition?” Please feel free to pause here, reread the definition above, and answer that question for yourself before you read any further.
I ask this question not just because I like audience interaction, but also because it offers diagnostic insight into the crowd itself. In Chapter 4, I will go deeper into what you can learn from this diagnostic. But in the meantime, let’s look at the three words I would choose first if asked this question.
Reliability
Pretty easy first guess, right? Reliability is central to everything we do in SRE (heck, it’s right in the name). One way to stress the importance of reliability is to note that an organization can spend millions in the ...