Until I have talked about the various standard network topologies, it will
be difficult to have an in-depth discussion of failure modes. But I
can still talk about failure modes in general. Obviously, the worst
failure mode is a single point of failure for the entire network.
But, as the previous section showed, the overall stability of the
network may be governed by less obvious factors.
At the same time, that analysis shows that implementing redundancy
anywhere in a network drastically improves the stability of that
component. In theory, it would be nice to be able to do detailed
calculations like the earlier ones. Then you could
look for the points where the weighted failure rates are highest. But
in a large network this is often not practical. There may be
thousands of components to consider. So this is where the simpler
qualitative method described earlier is useful.
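
As a rough illustration of the quantitative approach, the sketch below ranks components by a weighted failure rate. Here the weight is taken to be the expected failures per hour (1/MTBF) multiplied by the number of users affected. That weighting is an assumption made for this example, not necessarily the exact formula from the earlier section, and the component names, MTBF values, and user counts are invented.

```python
def weighted_failure_rate(mtbf_hours, users_affected):
    """Expected user-impacting failures per hour (assumed weighting)."""
    return users_affected / mtbf_hours


def rank_components(components):
    """Sort (name, mtbf_hours, users_affected) tuples, worst first."""
    return sorted(
        components,
        key=lambda c: weighted_failure_rate(c[1], c[2]),
        reverse=True,
    )


# Hypothetical inventory: (name, MTBF in hours, users affected by a failure)
components = [
    ("core-switch", 200_000, 5_000),
    ("floor-hub", 100_000, 24),
    ("uplink-fiber", 1_000_000, 500),
]

for name, mtbf, users in rank_components(components):
    print(f"{name}: {weighted_failure_rate(mtbf, users):.5f}")
```

With thousands of real components, gathering the inputs for a table like this is usually the hard part, which is why the qualitative method remains useful.
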
What the quantitative analysis of the last section shows, though, is
that it is a serious problem every time you have a failure that can
affect a large number of users. Even worse, it showed that the
probability of failure grows quickly with each additional possible
point of failure. The qualitative analysis just finds the problem
spots; it doesn't make it clear what the consequences are.
Having one single point of failure in your network that affects a
large number of users is not always such a serious problem,
particularly if that failure never happens. But the more points like
this that you have, the more likely it is that these failures will
happen.
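
To see why the number of such points matters, assume, purely for illustration, that each point fails independently with the same probability p over some fixed period. The chance that at least one of n points fails in that period is then 1 - (1 - p)^n. The function name and the value of p below are hypothetical.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n independent points fails,
    given each fails with probability p in the period considered."""
    return 1.0 - (1.0 - p) ** n


# Assumed per-point failure probability of 1% over the period
for n in (1, 5, 10, 50):
    print(n, round(prob_any_failure(0.01, n), 4))
```

Under these assumptions the probability rises steadily with n, even though each individual point is quite reliable.
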
Suppose you have a network with 100,000 elements that can fail. This
may sound like a high number, but in practice it isn't out of
the ordinary for a large-scale LAN. Remember that the word
"element" includes every hub, switch, cable, fiber, card
in every network device, and even your patch panels.
If the average MTBF for these 100,000 elements is 100,000 hours (which
is probably a little low), then on average you can expect about one
element to break every hour, or roughly 24 per day. Even if there
is redundancy, the elements will still break and need to be replaced:
it just won't affect production traffic. Most of these failures
will affect very small numbers of users. But the point is that the
larger your network, the more you need to understand what can go
wrong, and the more you will need to design around these failure
modes.
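
The arithmetic behind this kind of estimate can be sketched as follows. It assumes that every element fails independently at a constant rate of 1/MTBF, so the expected number of failures per hour is simply the element count divided by the MTBF. The function name is illustrative.

```python
HOURS_PER_DAY = 24


def expected_failures_per_hour(element_count, mtbf_hours):
    """Expected failures per hour, assuming each element fails
    independently at a constant rate of 1 / mtbf_hours."""
    return element_count / mtbf_hours


per_hour = expected_failures_per_hour(100_000, 100_000)
per_day = per_hour * HOURS_PER_DAY
print(per_hour, per_day)  # 1.0 24.0
```

The estimate scales linearly with the element count, so doubling the size of the network doubles the expected number of replacements.
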