Live Online Training

Network Troubleshooting Using the Half Split and OODA

Russ White

Troubleshooting is a fundamental skill for all network engineers, from the least to the most experienced. However, there is little material on correct and efficient troubleshooting techniques in a network engineering context, and no (apparent) live training in this area. Some book chapters exist (such as in Computer Networking Problems and Solutions, published in December 2017), as do some presentations at Cisco Live, but the level of coverage for this critical skill is far below what engineers working in the field need to develop solid troubleshooting skills.

This training focuses on one process, the half-split, and one model, the Observe/Orient/Decide/Act (OODA) loop, to provide engineers with a solid set of mental tools to effectively troubleshoot problems. This training considers the difference between the root cause and the immediate cause, and the concept of technical debt in terms of break/fix. This training also considers some basic concepts of resilience, including the tradeoffs around redundancy, and how they impact the Mean Time to Repair (MTTR).

What you'll learn and how you can apply it

In this live training, you will learn two basic processes or action models useful for troubleshooting computer networks at any scale. The first of these, the half split, has been used in electronic and radio frequency engineering for decades; it is one of the most useful and productive troubleshooting techniques when dealing with complex systems in real life. The second, the OODA loop, is often applied to security, but it is applicable to troubleshooting (and preparing to troubleshoot) as well.

You can apply these techniques to real-world failures and outages, reducing the time required to find a solution, in turn reducing MTTR.

This training course is for you because...

  • You want to move from ad hoc styles of troubleshooting to more systematic styles
  • You want to have specific, actionable methods to use for troubleshooting network problems and to stage information to improve MTTR
  • You want to understand the relationship between redundancy and resilience better
  • You want to understand the relationship between technical debt, root causes, and problem repair better

Prerequisites

  • A basic understanding of network design and operation (perhaps at the network professional level)
  • A basic understanding of OSPF, IS-IS, BGP, and IP forwarding

Resources

Common Misunderstandings

  • Troubleshooting is best learned through experience alone; there are no processes or techniques that can help
  • Troubleshooting always leads to the root cause, and repairs always improve the overall stance of the system
  • Troubleshooting is almost always ad-hoc
  • Finding the problem quickly is most often just luck or instinct

About your instructor

  • Russ White began working with computers in the mid-1980s, and with computer networks in 1990. He has experience in designing, deploying, breaking, and troubleshooting large-scale networks, and is a strong communicator from the whiteboard to the boardroom. Across that time, he has co-authored more than forty software patents, participated in the development of several Internet standards, helped develop the CCDE and the CCAr, and worked in Internet governance with the Internet Society. Russ has a background covering a broad spectrum of topics, including radio frequency engineering and graphic design, and is an active student of philosophy and culture.

    Russ is a co-host at the Network Collective, serves on the Routing Area Directorate at the IETF, co-chairs the BABEL working group, serves on the Technical Services Council and as a maintainer of the open source FRRouting project, and serves on the Linux Foundation (Networking) board. His most recent works are Computer Networking Problems and Solutions, The Art of Network Architecture, Navigating Network Complexity, and the Intermediate System to Intermediate System LiveLesson.

    MSIT, Capella University; MACM, Shepherds Theological Seminary; PhD (in progress), Southeastern Baptist Theological Seminary; CCIE #2635, CCDE 2007::1, CCAr

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Understanding MTBF, MTBM, MTTR, and the redundant to resilient tradeoff (50 minutes)

  • Redundancy as the traditional mechanism to add resilience to a network system
  • Why this works from the perspective of MTBF calculations
  • Why this doesn’t work from the perspective of MTBM, MTTR (through complexity), and grey failures
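As a rough illustration of the MTBF argument in this segment (not taken from the course material), the sketch below uses the common steady-state approximation availability = MTBF / (MTBF + MTTR) and the textbook formula for parallel redundancy. The function names and the numbers are hypothetical, and the parallel formula assumes fully independent failures, which is exactly the assumption that added complexity and grey failures tend to undermine.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumes failures are independent and that any one working
    copy is enough to keep the service up.
    """
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figures: one link failing about every 999 hours,
# taking 1 hour to repair.
one_link = availability(mtbf_hours=999, mttr_hours=1)
two_links = parallel_availability(one_link, copies=2)

# Same MTBF, but the extra complexity makes each repair take 4 hours.
slow_repair = availability(mtbf_hours=999, mttr_hours=4)

print(f"one link:            {one_link:.6f}")
print(f"two links:           {two_links:.6f}")
print(f"one link, MTTR = 4h: {slow_repair:.6f}")
```

On paper, the second link improves availability sharply. The last line shows the other side of the tradeoff: if redundancy makes the network harder to reason about and repair times grow, some of that gain is given back through MTTR.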

10 Minute Break

Segment 2: Staging Troubleshooting: The OODA Loop (50 minutes)

  • An introduction to the OODA loop
  • How to improve observation for troubleshooting
  • How to improve orientation for troubleshooting
  • How to lay out premade decisions to counter failure

10 Minute Break

Segment 3: The Half Split Method (50 minutes)

  • Understanding the half split method
  • How the half split interacts with OODA
  • An example of half splitting to find a problem
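The idea behind the half split can be sketched as a binary search along an ordered path. This is a minimal illustration rather than the procedure taught in the course: the device names, the `half_split` function, and the `reachable` check are all hypothetical, and the sketch assumes a single fault, so that every point before the fault tests good and every point from the fault onward tests bad.

```python
from typing import Callable, Optional, Sequence, Tuple


def half_split(
    points: Sequence[str], is_good: Callable[[str], bool]
) -> Tuple[Optional[str], int]:
    """Return the first failing point on the path and the number of checks.

    Assumes a single fault: all points before it test good,
    and the faulty point and everything after it test bad.
    """
    lo, hi = 0, len(points)  # the first bad index lies in [lo, hi]
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2  # test the middle of the remaining span
        checks += 1
        if is_good(points[mid]):
            lo = mid + 1  # fault is somewhere after mid
        else:
            hi = mid  # fault is at mid or before it
    first_bad = points[lo] if lo < len(points) else None
    return first_bad, checks


# Hypothetical path between two hosts, listed in forwarding order.
path = ["host-a", "access-1", "dist-1", "core-1",
        "core-2", "dist-2", "access-2", "host-b"]
fault_index = path.index("core-2")


def reachable(point: str) -> bool:
    """Stand-in for a real test such as a ping from host-a."""
    return path.index(point) < fault_index


first_bad, checks = half_split(path, reachable)
print(first_bad, checks)
```

With eight points on the path, the fault is isolated in three checks, where testing hop by hop from one end could take up to eight. In practice the "check" is whatever observation tells you whether traffic or state is still correct at that point.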

10 minute final Question and Answer Period