Chapter 14. Monitoring and Observability Patterns
One of the core differences between client applications and distributed systems is that distributed systems generally implement services. Services are always on and always available to users around the world, in every time zone and way of working. Because these systems run 24/7, monitoring and observability become critical to building them reliably. To deliver reliability, you must notice a problem before the customer does, and to solve the problems you find, you need to understand how your system is operating. This chapter focuses on best practices for such monitoring and observability.
Monitoring and Observability Basics
Before we get into the details of implementing monitoring and observability, it is useful to ground ourselves in the core set of concepts that make up any monitoring and observability solution.
In any system, four key concepts make up a monitoring and observability solution:
- Logging
- Metrics
- Alerting
- Tracing
We’ll step through each of these in a little more detail.
It’s highly likely that anyone who has built even the smallest system has implemented logging, even if they don’t realize that they have. The simplest version of logging is the humble printf statement. Of course, there are many more sophisticated ways to do logging, but ultimately they all serve the same purpose as that print statement. Namely, they show us that a particular place in our code has executed, and they ...
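As a minimal sketch of the contrast between a bare print statement and a logging library, the following Python example emits the same message both ways. The names `process_order`, `build_logger`, and the logger name `orders` are hypothetical and exist only for illustration; Python's standard `logging` module stands in here for whatever logging library a real system might use.

```python
import logging
import sys


def build_logger(stream=sys.stderr) -> logging.Logger:
    """Create a logger that writes timestamped, leveled lines to `stream`.

    The logger name "orders" is an assumption made for this example.
    """
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep output confined to our handler
    return logger


def process_order(order_id: str, log: logging.Logger) -> str:
    """Hypothetical unit of work that records its execution two ways."""
    # The printf-style version: shows that this line of code executed.
    print(f"processing order {order_id}")

    # The logging-library version: the same message, with a timestamp,
    # a severity level, and the logger name attached by the formatter.
    log.info("processing order %s", order_id)
    return order_id.upper()


if __name__ == "__main__":
    process_order("a1", build_logger())
```

Both calls serve the same purpose of showing that a particular place in the code has run. The logger version simply attaches context that the bare print statement leaves out.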