7 Cloud operations

This chapter covers

  • How to manage incidents
  • Health monitoring and alerts
  • Governance and usage tracking

So far in this book, we have focused on the happy path of our cloud-native setup, covering the signal sources, telemetry, and destinations to store, query, visualize, and interact with the signals to understand and influence the system. In this chapter, we will discuss an aspect of cloud-native solutions I call cloud operations, which spans several topics you will likely come across, especially in an operations role.

We start off with incidents: how to detect when something is not working the way that it should, react to abnormal behavior, and learn from previous mistakes. Then, we focus on alerts, or alarms (I’m using these ...

Get Cloud Observability in Action now with the O’Reilly learning platform.