I’ve been fortunate to get hired into medium-sized operations teams at large technology companies. All ops teams (a customary term for operations teams) share two interesting characteristics: compared to other engineering departments, they work under more pressure, and they attract bad attention much more easily than good attention. Digital firefighting is the nature of the job. We might get noticed when things go awry and we fix them. If we don’t react fast enough, we definitely get noticed. If you know anyone in network operations, ask if that’s the way he or she feels about the job. I bet you’re going to get an answer along those lines.
Working in ops is all about effectiveness: there is no time for re-engineering. We must get things right the first time and we have to act fast. We go through a lot of reprioritizing and context-switching. There is relatively little room for creativity, at least the kind that doesn’t love constraints. All this makes operations a great place to learn and grow.
This book is based on experiences of working in ops. I was extremely lucky to work with some of the smartest people in the industry. I would like this book to be a tribute to all these invisible ops guys who struggle daily to maintain the highest standards of service availability.
In my career, I’ve stared at all sorts of timeseries plots, a lot of them. At one point it was my full-time job, no kidding. With time, I learned to extract meaning from data point fluctuations at a brief glance, without having to study their origin. It’s a funny kind of intuition that system engineers develop in the course of their jobs, and one that probably saves us a lot of time. Some of us are unaware of it, and it’s definitely not something we brag about. It is a very useful skill, nevertheless, and in this book I attempt to verbalize it in order to help you, dear Reader, absorb it in a more conscious way than I did, possibly saving you weeks or months of getting up to speed.
Some people on my team believed that putting the ideas described here into motion led to a visible paradigm shift. I must agree that in a relatively short period of time, the work caused by our alerting configuration went from mundane to effortless.
This book focuses on monitoring and alerting in the context of distributed information systems, but I’m hoping that the principles presented here will also be applicable to timeseries and datasets generated by all sorts of complex systems. The book does not focus on any particular software package. Rather, it attempts to extract and summarize regularities that system engineers come across in their daily work. You won’t find many long code listings here, but you’ll definitely find ideas: ones that I hope you’ll be able to relate to and apply either at work or in a research project.
The main audience of this book is system operators: those who fight the daily battle of delivering the best performance at the lowest cost, as well as those who use monitoring as a means and not an end. Read it if you work extensively with monitoring and plan alerting configurations. If keeping high availability and continuity of service is your job, read on. If monitoring and alerting bring up unpleasant associations, that’s an even more valid reason to read the book. If you’re trying to quantify the effectiveness of your alerting configurations, the book might have good answers.
Administrators who are setting up a monitoring or alerting configuration with a potential to grow big might also find the book useful. The ideas presented here have been tested on large alerting configurations with a high degree of success. By “large,” I mean thousands of monitors and hundreds of alarms. The book should help you replicate this setup in your environment.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
This book is here to help you get your job done. In general, if this book includes code examples, you may use the code in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Effective Monitoring and Alerting by Slawek Ligus (O’Reilly). Copyright 2013 Slawek Ligus, 978-1-449-33352-2.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at email@example.com.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/Monitoring_and_Alerting.
The author has set up a small blog for this book. It can be accessed at http://effectivemonitoring.info/.
To comment or ask technical questions about this book, send email to firstname.lastname@example.org.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
I’d like to start by saying thanks to my grandparents, Zuzanna and Marian Osiak, who in 1998 helped me buy my first O’Reilly book, the first edition of Linux in a Nutshell by Ellen Siever et al., when at 13 years of age I was on a very limited budget. Specifically, grandma Zuzia persuaded the shop clerk in Katowice, Poland, to drop the price by 50% despite the bookstore’s strict policy of not offering discounts in excess of 20%. Little did we suspect that a decade and a half later I would get to work with Ellen’s editor, who came up with the idea for that Linux book.
The person most helpful in the creation of the book was my wonderful partner, Natalia Czachowicz, who assisted me at all stages of the authoring process, from coming up with an idea and writing the proposal through to setting up the plan, executing it, and finalizing it. Natalia acted as my consultant, editor, reviewer, proofreader, marketer, and counsellor, and the amount of support she provided is ineffable; Nati, I’m indebted to you for life!
I want to offer my gratitude to Benoît “tsuna” Sigoure, my technical reviewer, whose critical remarks and suggestions greatly added to the value of this book. Special thanks go to Viktor “vic” Trnka, who kindly allowed me to instrument the network and systems of MS-Free.NET and to use the generated data points for illustrations. Last but certainly not least, I’d like to give credit to Andy Oram, who patiently edited our way to the completion of this work.
I’d also like to take this opportunity to say massive thanks to all my friends and family for their enormous support in bouncing ideas around, spreading the word on social networks, and proofreading, and for all the kind words I received in the process. Thank you all, it really meant a lot.