Site Reliability Engineering

by Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
April 2016
Intermediate to advanced
552 pages
15h 44m
English
O'Reilly Media, Inc.

Chapter 12. Effective Troubleshooting

Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.

Brian Redman

Ways in which things go right are special cases of the ways in which things go wrong.

John Allspaw

Troubleshooting is a critical skill for anyone who operates distributed computing systems—especially SREs—but it’s often viewed as an innate skill that some people have and others don’t. One reason for this assumption is that, for those who troubleshoot often, it’s an ingrained process; explaining how to troubleshoot is difficult, much like explaining how to ride a bike. However, we believe that troubleshooting is both learnable and teachable.

Novices are often tripped up when troubleshooting because the exercise ideally depends upon two factors: an understanding of how to troubleshoot generically (i.e., without any particular system knowledge) and a solid knowledge of the system. While you can investigate a problem using only the generic process and derivation from first principles, we usually find this approach to be less efficient and less effective than understanding how things are supposed to work. Limited knowledge of the system is typically what constrains the effectiveness of an SRE new to it; there’s little substitute for learning how the system is designed and built.

Let’s look at a general model of the troubleshooting process. Readers with expertise in ...


ISBN: 9781491929117