Chapter 11. Network Observability
One of the dirtiest secrets in systems engineering is just how many outages are never really fully explained or understood. Or how many can’t actually be explained or understood given existing telemetry.
Charity Majors
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
Leslie Lamport
Two distributed systems experts, one a theoretician and the other a practitioner, separated by a generation, make the same observation: distributed systems are hard to understand, hard to control, and always frustrating when things go wrong. Sandwiched between the endpoints is the network operator. “Is it the network?” is not too far down the list of universal questions such as “What is the meaning of life, the universe, and everything?” Sadly, network operators do not even have the humor of a Douglas Adams story to fall back on.
The modern data center, with its scale and the ever-increasing distributed nature of its applications, only makes it harder to answer the questions that network operators have faced since the dawn of distributed applications. Observability represents the operator’s latest attempt to answer those questions adequately. Along with automation, observability has become one of the central pillars of the cloud native data center.
The primary goal of this chapter is to leave you with an understanding of the importance of observability ...