Live Online training

Systems design for site reliability engineers

How to build a reliable system in three hours

Salim Virji

Distributed systems form the foundation of most modern computing infrastructure and much of application development, whether on-premises or mobile. Software built on distributed systems comes with distinct failure modes. To build reliable systems, you must understand how to assess these failure modes and design for them.

In this hands-on three-hour course, Salim Virji walks you through the fundamentals of systems design and evaluation, helping you build the skills necessary to design, improve, and scale your own system or application using SRE best practices developed at Google.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How to design a software system to meet a service-level objective (SLO)
  • How to incrementally improve a system
  • How to identify single points of failure (SPOFs) in a large software system
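As a rough illustration of what designing to an SLO implies, the sketch below converts an availability target into the downtime a system can absorb over a measurement window. The function name, the 30-day window, and the sample targets are assumptions chosen for illustration; they are not taken from the course material.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability a window can absorb while still meeting the SLO.

    `slo` is an availability target expressed as a fraction, e.g. 0.999.
    The 30-day default window is an illustrative assumption.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    # Each extra "nine" cuts the allowed downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} availability -> {minutes:.1f} minutes per 30 days")
```

A tighter target leaves less room for failures and maintenance, which is one reason an SLO tends to shape the design rather than follow from it.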

And you’ll be able to:

  • Make required resource estimates to create a bill of materials
  • Incrementally scale a system
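A resource estimate of this kind might look like the following minimal sketch. Every number in it (peak request rate, per-server throughput, object count, object size, disk size, replication factor, spare count) is a hypothetical input chosen for illustration, not a figure from the course.

```python
import math


def servers_needed(peak_qps: float, qps_per_server: float, spare: int = 2) -> int:
    """Servers required to carry peak load, plus `spare` extras.

    The spare servers are an assumed allowance for failures and maintenance.
    """
    if peak_qps < 0 or qps_per_server <= 0:
        raise ValueError("peak_qps must be >= 0 and qps_per_server must be > 0")
    return math.ceil(peak_qps / qps_per_server) + spare


def disks_needed(object_count: int, avg_object_bytes: int,
                 disk_bytes: int, replicas: int = 3) -> int:
    """Disks required to hold every object `replicas` times."""
    if disk_bytes <= 0 or replicas < 1:
        raise ValueError("disk_bytes must be > 0 and replicas must be >= 1")
    total_bytes = object_count * avg_object_bytes * replicas
    return math.ceil(total_bytes / disk_bytes)


if __name__ == "__main__":
    # Hypothetical workload for an image-serving system.
    bill_of_materials = {
        "frontend servers": servers_needed(peak_qps=10_000, qps_per_server=800),
        "4 TB disks": disks_needed(
            object_count=100_000_000,       # 100 million images
            avg_object_bytes=2_000_000,     # 2 MB each
            disk_bytes=4_000_000_000_000,   # 4 TB per disk
        ),
    }
    for item, quantity in bill_of_materials.items():
        print(f"{item}: {quantity}")
```

Rounding up and adding spares keeps the estimate conservative; a real bill of materials would also account for network, memory, and growth over time.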

This training course is for you because...

  • You’re a site reliability engineer (SRE) or work in a related discipline, such as DevOps, systems engineering, or system administration.
  • You manage SREs.
  • You want to develop an understanding of practical distributed systems.


Prerequisites:

  • Familiarity with "boxes and arrows" diagrams
  • A working knowledge of orders-of-magnitude math (e.g., how many copies of a 1 MB file can a 1 TB drive hold?)
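The 1 MB file on a 1 TB drive question can be worked through as below. This is a small sketch that ignores filesystem overhead, and it shows both decimal (SI) and binary unit conventions, since the question does not say which one is meant.

```python
# Decimal (SI) units: 1 MB = 10^6 bytes, 1 TB = 10^12 bytes.
MB = 10**6
TB = 10**12

# Binary units, for comparison: 1 MiB = 2^20 bytes, 1 TiB = 2^40 bytes.
MIB = 2**20
TIB = 2**40

copies_decimal = TB // MB    # 10^12 / 10^6 = 10^6
copies_binary = TIB // MIB   # 2^40 / 2^20 = 2^20

if __name__ == "__main__":
    print(f"decimal units: {copies_decimal:,} copies")
    print(f"binary units:  {copies_binary:,} copies")
```

Either way the answer is about a million copies, and that order of magnitude is what matters for this kind of estimate.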


About your instructor

  • Salim Virji is a site reliability engineer at Google, where he has built distributed systems that enable planet-scale storage and datacenter-size compute loads.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Identify the problem (50 minutes)

  • Lecture: Problem statement (building an image-serving application); terminology and concepts; service-level objectives
  • Hands-on exercise: Design a distributed system
  • Q&A
  • Break (10 minutes)

The solution has limitations. Let’s improve it (50 minutes)

  • Lecture: How to quantitatively assess the failure domains in a distributed system; how to provide defense in depth so that failures are isolated
  • Group discussion: Where are the failure domains?
  • Hands-on exercises: Identify failure domains; make the design tolerant to failure; make a highly available image-serving system
  • Q&A
  • Break (10 minutes)
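One common way to put numbers on failure domains is to combine component availabilities, as in the sketch below. It assumes failures are statistically independent, which holds only when replicas sit in separate failure domains; replicas that share a rack, power feed, or software release will do worse than this model predicts. The function names and sample availabilities are illustrative assumptions.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)


def replicated_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` copies when any one copy can serve.

    Assumes the copies fail independently of each other.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Hypothetical image-serving path: load balancer -> frontend -> storage.
    single_path = serial_availability([0.999, 0.99, 0.99])
    print(f"one unreplicated path: {single_path:.4%}")

    # Replicating the two weaker tiers across independent failure domains.
    frontend = replicated_availability(0.99, 3)
    storage = replicated_availability(0.99, 3)
    improved = serial_availability([0.999, frontend, storage])
    print(f"with 3 replicas per tier: {improved:.4%}")
```

In this model the unreplicated load balancer ends up capping the overall figure, which is how a single point of failure shows up in the arithmetic.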

Commonly encountered limitations and how to design for them (50 minutes)

  • Lecture: Capacity limitations, bottlenecks, and compromises; the boundaries of a system; how to decide when further scale is important
  • Group discussion: Designing for 10x scale (and why this is a good rule of thumb)

Wrap-up and Q&A (10 minutes)