Chapter 2. How to Think About Reliability

The tech industry has a habit of becoming enamored with certain terms, phrases, or philosophies and overusing them until they become meaningless marketing jargon. One well-known recent example is the term DevOps, which was coined to describe a certain approach to getting things done. As originally formulated, DevOps was intended to be a philosophy that could help shorten development release cycles and provide quicker feedback loops, but today it’s often used as a job title or assigned to a category of vendor tools. Another closely related example is the term Site Reliability Engineering, and along with it, the word reliability itself.

Reliability has far too often come to mean only availability in the tech world. Although availability and reliability are closely linked, availability doesn’t tell the whole story. Words like reliability, robustness, and resilience have all, unfortunately, strayed from their original meanings when used to talk about computer services. Common terms like uptime and downtime further complicate the matter, because when people say, “Is it up?” they don’t always mean, “Is the binary running?” Much more often, they mean something more nuanced.

The truth of the matter is that none of these things is new. Reliability engineering as a discipline is not a new invention or idea. SLOs are an approach that is most often tied to the tech world, but building systems and ...
