Chapter 2. Monitoring in a Reliability Engineering World
Monitoring systems is an extensive topic that has been heavily shaped in the past few years by the seminal work in Site Reliability Engineering: How Google Runs Production Systems (O’Reilly) and its follow-up, The Site Reliability Workbook: Practical Ways to Implement SRE (O’Reilly). Since these two books came out, site reliability engineering (SRE) has become a popular trend in job listings. Some companies have gone as far as retitling existing staff as some flavor of “reliability engineering.”
Site reliability engineering has changed how teams think about operational work. It offers a set of principles that make it easier to answer questions like:
- Are we providing an acceptable customer experience?
- Should we focus on reliability and resilience work?
- How do we balance new features against toil?
This chapter assumes you already understand these principles. If you have not read either of the aforementioned books, we recommend these chapters from The Site Reliability Workbook as a crash course:
- Chapter 1 offers a deeper understanding of the philosophy behind moving toward service-level performance management in production.
- Chapter 2 covers how to implement service level objectives (SLOs).
- Chapter 5 covers alerting on SLOs.
Some may argue that SRE implementation isn’t strictly a part of high performance MySQL, but we disagree. In her book, Accelerate,