Chapter 14. Monitoring and Observability Patterns

One of the core differences between client applications and distributed systems is that distributed systems generally implement services. Services are always on, available to users around the world in every time zone and way of working. Because these systems run 24/7, monitoring and observability are critical to building reliable systems. To deliver reliability, you must notice a problem before your customers do, and to solve the problems you find, you need to understand how your system is operating. This chapter focuses on best practices for such monitoring and observability.

Monitoring and Observability Basics

Before we get into the details of implementing monitoring and observability, it is useful to ground ourselves in the core set of concepts that make up any monitoring and observability solution.

In any system, four key concepts make up a monitoring and observability solution:

Logging
Metrics
Alerting
Tracing

We’ll step through each of these in a little more detail.
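As a rough illustration of how metrics, alerting, and tracing differ, here is a minimal in-process sketch in Python. It is a toy under stated assumptions: the `Counter`, `Span`, `span`, `should_alert`, and `handle_request` names are hypothetical, everything is kept in memory, and a real system would export these signals to a metrics store, a tracing backend, and an alerting system rather than holding them in Python objects.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass


# Metrics (hypothetical toy): a numeric value aggregated over time.
@dataclass
class Counter:
    name: str
    value: int = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


# Tracing (hypothetical toy): spans that share a trace ID, so a single
# request can be followed across the steps that handled it.
@dataclass
class Span:
    trace_id: str
    name: str
    start: float
    end: float = 0.0


finished_spans: list[Span] = []  # stands in for a tracing backend


@contextmanager
def span(trace_id: str, name: str):
    s = Span(trace_id, name, time.monotonic())
    try:
        yield s
    finally:
        # Record the span when the block exits, even on error.
        s.end = time.monotonic()
        finished_spans.append(s)


# Alerting (hypothetical toy): a rule evaluated against metrics that
# decides whether a human should be notified.
def should_alert(errors: Counter, requests: Counter, max_error_rate: float) -> bool:
    if requests.value == 0:
        return False
    return errors.value / requests.value > max_error_rate


requests = Counter("requests_total")
errors = Counter("errors_total")


def handle_request(fail: bool) -> None:
    # Every request gets its own trace ID; nested spans share it.
    trace_id = uuid.uuid4().hex
    requests.inc()
    with span(trace_id, "handle_request"):
        with span(trace_id, "query_backend"):
            if fail:
                errors.inc()
```

Calling `handle_request` ten times with two failures yields an error rate of 0.2, so `should_alert(errors, requests, 0.1)` returns `True`. The split is deliberate: the metric only counts, the alert rule only decides, and the spans record which steps a single request went through and how long each took.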

It’s highly likely that anyone who has built even the smallest system has implemented logging, even if they don’t realize that they have. The simplest version of logging is the humble printf statement. Of course, there are many more sophisticated ways to do logging, but ultimately they all serve the same purpose as that print statement. Namely, they show us that a particular place in our code has executed, and they ...
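To make the contrast with a bare print statement concrete, here is a minimal sketch of structured logging with Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper, and the `fields` key are hypothetical choices for illustration, not a prescribed format; writing one JSON object per line is simply a common convention that log-aggregation tools can parse.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical format)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Like a printf, this shows *where* the code executed.
            "location": f"{record.pathname}:{record.lineno}",
        }
        # Merge structured fields passed as extra={"fields": {...}}.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            entry.update(fields)
        return json.dumps(entry)


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # DEBUG records are dropped
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # don't also emit via the root logger
    return logger
```

With `log = make_logger("checkout")`, the call `log.info("order placed", extra={"fields": {"order_id": "A123"}})` emits a single JSON line carrying the timestamp, level, source location, message, and `order_id`, whereas `print("order placed")` would record only the message text.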

Designing Distributed Systems, 2nd Edition by Brendan Burns
