Chapter 5. The Reliability Stack
Alex Hidalgo
Think about your favorite digital media streaming service. You’ve settled down on the couch to watch a movie and you click a button on your remote. Most of the time, the movie buffers for a few seconds and then starts playing.
But what if the movie takes a full 20 seconds to buffer? You’d probably be a little annoyed in the moment, but ultimately, the rest of the movie streams just fine. Even with this little bit of failure, this service has still acted reliably for you, since the majority of the time it doesn’t take anywhere near 20 seconds.
What happens if it takes 20 seconds to buffer every single time? Now things go from momentarily annoying to fully unreliable. With the plethora of digital media streaming services available, you might choose to abandon this service and switch to a different one.
Nothing is ever perfect and nothing can ever be 100% reliable. This is not only the way of the world; it also turns out that people are totally fine with it! No one actually expects computer systems to run perfectly all the time; we just need them to be reliable enough often enough.
How do we figure out the right level of reliability? This is where the reliability stack comes into play. It’s made up of three components: SLIs (service level indicators), SLOs (service level objectives), and error budgets.
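To make the three pieces concrete, here is a minimal sketch built on the streaming example above. It is an illustration only, not the chapter's own method. The 5-second threshold, the 99% target, and the names `reliability_report` and `ReliabilityReport` are all assumptions chosen for the example. The SLI is the fraction of playback starts that buffered quickly enough, the SLO is the target for that fraction, and the error budget is the number of slow starts the target still permits.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityReport:
    sli: float               # fraction of playback starts that were "good"
    slo: float               # target fraction of good playback starts
    budget_remaining: float  # share of the error budget still unspent


def reliability_report(buffer_seconds, threshold_seconds=5.0, slo=0.99):
    """Summarize buffering times as an SLI, an SLO, and an error budget.

    threshold_seconds and slo are hypothetical values for illustration;
    a real service would pick them based on what its users tolerate.
    """
    if not buffer_seconds:
        raise ValueError("need at least one measurement")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    total = len(buffer_seconds)
    good = sum(1 for s in buffer_seconds if s <= threshold_seconds)
    bad = total - good

    sli = good / total                # what we measured
    allowed_bad = (1.0 - slo) * total  # bad events the SLO permits
    budget_remaining = 1.0 - bad / allowed_bad  # negative once overspent

    return ReliabilityReport(sli=sli, slo=slo, budget_remaining=budget_remaining)


if __name__ == "__main__":
    # 995 quick starts and 5 slow ones: annoying now and then, still reliable.
    occasional = [2.0] * 995 + [20.0] * 5
    print(reliability_report(occasional))

    # Every start takes 20 seconds: the budget is badly overspent.
    always_slow = [20.0] * 1000
    print(reliability_report(always_slow))
```

In the first case the measured SLI sits above the target and about half of the budget is left, which matches the intuition that an occasional 20-second wait is tolerable. In the second case the remaining budget is deeply negative, which is the numeric version of "fully unreliable."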
At the base of the reliability ...