O'Reilly live online training

Network Troubleshooting: Basic Theory and Process

Topic: System Administration
Russ White

Troubleshooting is a fundamental skill for all network engineers, from the least to the most experienced. However, there is little material on correct and efficient troubleshooting techniques in a network engineering context, and no (apparent) live training in this area. A few book chapters exist (such as in Computer Networking Problems and Solutions, published in December 2017), along with some presentations at Cisco Live, but the level of coverage for this critical skill is far below what engineers working in the field need to develop solid troubleshooting skills.

This training focuses on the half-split method of troubleshooting, which is widely used in electronic and civil engineering. To form a complete theory of troubleshooting, the course adds three ideas to the half-split method: tracing the path of the signal, using models to put the system in context, and working through a simple troubleshooting “loop” that keeps the focus on asking how, what, and why. The course also covers the difference between permanent and temporary fixes and reviews how reliability is measured. The final third of the course works through several practical examples to help apply the theory from the first two segments to the real world.
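As a rough illustration of the half-split idea, the sketch below treats the signal path as an ordered list of check points and repeatedly tests the midpoint, discarding the half that cannot contain the fault. It is a minimal sketch under stated assumptions, not the course's own material: it assumes a single fault, so that every point before the fault passes and every point from the fault onward fails. The names `half_split`, `signal_ok`, and the device names are hypothetical.

```python
from typing import Callable, Sequence


def half_split(path: Sequence[str], signal_ok: Callable[[str], bool]) -> str:
    """Return the first point on `path` where the signal is no longer good.

    Assumption (hypothetical model): a single fault, so every point before
    the fault passes the check and every point from the fault onward fails.
    """
    if not path:
        raise ValueError("path must not be empty")
    lo, hi = 0, len(path) - 1
    if signal_ok(path[hi]):
        raise ValueError("signal is good at the end of the path")
    # Invariant: the first failing point lies somewhere in path[lo..hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if signal_ok(path[mid]):
            lo = mid + 1   # fault is after the midpoint
        else:
            hi = mid       # fault is at or before the midpoint
    return path[lo]


# Hypothetical path from a host to a server, with the fault at "core-rtr".
path = ["host-a", "access-sw", "dist-rtr", "core-rtr", "edge-rtr", "server"]
broken_at = path.index("core-rtr")
probed = []


def signal_ok(point: str) -> bool:
    # Stand-in for a real check (a ping, a counter, a route lookup).
    probed.append(point)
    return path.index(point) < broken_at


print(half_split(path, signal_ok))  # core-rtr
print(probed)  # ['server', 'dist-rtr', 'edge-rtr', 'core-rtr']
```

Each check halves the remaining search space, so the number of checks grows with the logarithm of the path length rather than with the path length itself. On a real network the check is rarely a clean pass or fail, which is where tracing the signal and using models come in.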

What you'll learn and how you can apply it

This course focuses on the theory of troubleshooting. By taking it, you will develop a strong mental model of efficient troubleshooting, helping you reduce MTTR, and even MTBM, in real-life deployments. It covers the half-split method, the use of models ranging from forwarding systems to protocol layers, and the general concepts of root cause analysis.

This training course is for you because...

  • You want to move from ad hoc styles of troubleshooting to more systematic styles
  • You want to have specific, actionable methods to use for troubleshooting network problems and to stage information to improve MTTR
  • You want to understand the relationship between redundancy and resilience better
  • You want to understand the relationship between technical debt, root causes, and problem repair better


Prerequisites

  • A basic understanding of network design and operation (perhaps at the network professional level)
  • A basic understanding of OSPF, IS-IS, BGP, and IP forwarding


About your instructor

  • Russ White began working with computers in the mid-1980s and with computer networks in 1990. He has co-authored forty-seven software patents, participated in the development of several Internet standards, helped develop the CCDE and the CCAr, and worked in Internet governance with the Internet Society. Russ is a co-host of the History of Networking and Hedge podcasts, serves on the Routing Area Directorate at the IETF, co-chairs the BABEL working group, and serves on the Technical Services Council and as a maintainer of the open source FRRouting project. Russ holds an MSIT from Capella University and an MACM from Shepherds Theological Seminary, and is a PhD candidate in philosophy at SEBTS.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Foundations (50 minutes)

  • Resiliency in terms of troubleshooting
  • Positive feedback loops
  • Automated processes and fragility
  • The troubleshooting process
  • Avoiding the narrows
  • Using models to dive deeper
  • Using abstraction to counter the combinatorial explosion
  • When abstractions leak
  • What, how, and why models
  • 10-minute break

Segment 2: Process (50 minutes)

  • The theory of half split, as seen from search trees
  • Putting it together: a simple troubleshooting loop and the half-split
  • Using manipulability theory to prove it
  • Observations on observations
  • 10-minute break

Segment 3: Examples (50 minutes)

  • The EIGRP case
  • The BGP case
  • IS-IS and BFD

Final question and answer period (10 minutes)