One of the dirtiest secrets in systems engineering is just how many outages are never really fully explained or understood. Or how many _can’t_ actually be explained or understood given existing telemetry.
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
Two distributed systems experts, one a theoretician and the other a practitioner, separated by a generation, make the same observation: distributed systems are hard to understand, hard to control, and always frustrating when things go wrong. Sandwiched between the endpoints is the network operator. “Is it the network?” is not far down the list of universal questions, alongside “What is the meaning of life, the universe and everything?” Sadly, network operators do not even have the humor of a Douglas Adams story to fall back on.
The modern data center, with its scale and the ever-increasing distributed nature of its applications, only makes it harder to answer the questions that network operators have been dealing with since the dawn of distributed applications. Observability represents the operator’s latest attempt to respond adequately to these questions. Along with automation, observability has become one of the central pillars of the cloud-native data center.
The primary goal of this chapter is to leave the reader with an understanding of the importance of observability ...