Live online training

Implementing Service Level Objectives

How to make SLIs, SLOs, and error budgets work for you

Topic: System Administration
Alex Hidalgo

Service-level objectives (SLOs), the bedrock upon which the discipline of site reliability engineering (SRE) was built, have never been more popular. But practical advice that helps you get started can be hard to find. And while the concepts are easy to learn, putting them into practice takes much more work than most people realize.

Expert Alex Hidalgo introduces you to an SLO-based approach to reliability and walks you through real-world example applications—showing you how to get started on your SLO journey right away. Learn how to do SLOs the right way to get the data you need to make better decisions, understand your services better, increase your release cadence, and end up with happier customers.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • What SLIs, SLOs, and error budgets are
  • Why this philosophy is essential to adopting site reliability engineering
  • How this approach can lead to happier engineers, happier users, and a happier business

And you’ll be able to:

  • Pick meaningful SLI measurements
  • Choose good SLO targets
  • Use error budgets to drive decision making
  • Increase your release cadence
  • Report on reliability to leadership in a more cohesive manner

This training course is for you because...

  • You’re an engineer on the front lines and care about the reliability of your service.
  • You’re a product manager who wants to see a quicker release cadence.
  • You’re a member of leadership who wants to see better reporting on the reliability of your products and services.

Prerequisites

  • A basic understanding of web-based computer services, including the concepts of microservices, APIs, load balancers, databases, and other common pieces of modern computer service architectures

About your instructor

  • Alex Hidalgo is a Site Reliability Engineer and author of the upcoming Implementing Service Level Objectives (O'Reilly Media, September 2020). During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Reliability (50 minutes)

  • Presentation: The reliability stack, an overview of how SLO-based approaches to reliability work and how all the parts fit together
  • Group discussion: How do you currently think about reliability? What does reliability mean? How do users think about reliability differently than engineers?
  • Break (5 minutes)

Meaningful SLIs, good SLOs, and effective error budgets (50 minutes)

  • Presentation: Developing meaningful SLIs; thinking about risk, including what other engineering disciplines have already figured out about it; choosing good SLOs, with the math and basic statistics behind choosing good targets; how to use error budgets
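
To give a feel for the arithmetic this session deals with, here is a minimal sketch of how an SLI, an SLO target, and an error budget relate. It is not taken from the course materials: the function names `sli` and `error_budget_remaining` are hypothetical, and it assumes a simple ratio-style SLI (good events divided by total events) measured over a single window.

```python
def sli(good_events: int, total_events: int) -> float:
    """Ratio-style SLI: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(
    good_events: int, total_events: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent in this window.

    slo_target is a fraction such as 0.999 (assumed, illustrative).
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    # The budget is the number of bad events the target tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 500 of them bad.
    print(f"SLI: {sli(999_500, 1_000_000):.4%}")
    remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
    print(f"Error budget remaining: {remaining:.1%}")
```

With a 99.9% target, one million requests allow about 1,000 bad ones, so 500 bad requests leave roughly half the budget unspent. How much budget remains is the kind of signal that can inform decisions such as whether to keep shipping or to prioritize reliability work.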

Wrap-up and Q&A (15 minutes)