Chapter 12. Effective Troubleshooting
Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work.
Ways in which things go right are special cases of the ways in which things go wrong.
Troubleshooting is a critical skill for anyone who operates distributed computing systems, especially SREs, but it's often viewed as an innate skill that some people have and others don't. One reason for this assumption is that, for those who troubleshoot often, it's an ingrained process; explaining how to troubleshoot is difficult, much like explaining how to ride a bike. However, we believe that troubleshooting is both learnable and teachable.
Novices are often tripped up when troubleshooting because the exercise ideally depends upon two factors: an understanding of how to troubleshoot generically (i.e., without any particular system knowledge) and a solid knowledge of the system. While you can investigate a problem using only the generic process and derivation from first principles,1 we usually find this approach to be less efficient and less effective than understanding how things are supposed to work. Limited knowledge of the system is typically what constrains the effectiveness of an SRE new to it; there's little substitute for learning how the system is designed and built.
Let's look at a general model of the troubleshooting process. Readers with ...