Chapter 1. Monitoring Anti-Patterns
Before we can start off on our journey to great monitoring, we have to identify and correct some bad habits you may have adopted or observed in your environment.
As with many habits, they start off well-meaning. After years of inadequate tools, the realities of keeping legacy applications running, and a general lack of knowledge about modern practices, these bad habits become "the way it's always been done" and are often taken with people when they leave one job for another. On the surface, they don't look that harmful. But rest assured: they are ultimately detrimental to a solid monitoring platform. For this reason, we'll refer to them as anti-patterns.
An anti-pattern is something that looks like a good idea, but which backfires badly when applied.
Jim Coplien
These anti-patterns can often be difficult to fix for various reasons: entrenched practices and culture, legacy infrastructure, or just plain old FUD (fear, uncertainty, and doubt). We'll work through all of those, too, of course.
Anti-Pattern #1: Tool Obsession
There's a great quote from Richard Bejtlich in his book The Practice of Network Security Monitoring (No Starch Press, 2013) that underscores the problem with an excessive focus on tools over capabilities:
Too many security organizations put tools before operations. They think "we need to buy a log management system" or "I will assign one analyst to antivirus duty, one to data leakage protection duty." And so on. A tool-driven team will not be as effective as a mission-driven team. When the mission is defined by running software, analysts become captive to the features and limitations of their tools. Analysts who think in terms of what they need in order to accomplish their mission will seek tools to meet those needs, and keep looking if their requirements aren't met. Sometimes they even decide to build their own tools.
Richard Bejtlich
Many monitoring efforts start out the same way. "We need better monitoring!" someone says. Someone else blames the current monitoring toolset for the troubles they're experiencing and suggests evaluating new ones. Fast-forward six months and the cycle repeats itself.
If you learn nothing else from this book, remember this: there are no silver bullets.
Anything worth solving takes a bit of effort, and monitoring a complex system is certainly no exception. Relatedly, there is no such thing as the single-pane-of-glass tool that will suddenly provide you with perfect visibility into your network, servers, and applications, all with little to no tuning or investment of staff. Many monitoring software vendors sell this idea, but it's a myth.
Monitoring isn't just a single, cut-and-dried problem: it's actually a huge problem set. Even limiting the topic to server monitoring, we're still talking about handling metrics and logs for server hardware (everything from the out-of-band controller to the RAID controller to the disks), the operating system, all of the various services running, and the complex interactions between all of them. If you are running a large infrastructure (like I suspect many of you are), then paying attention only to your servers won't get you very far: you'll need to monitor the network infrastructure and the applications too.
Hoping to find a single tool that will do all of that for you is simply delusional.
So what can you do about it?
Monitoring Is Multiple Complex Problems Under One Name
We've already established that monitoring isn't a single problem, so it stands to reason that it can't be solved with a single tool either. Just like a professional mechanic has an entire box of tools, some general and some specialized, so should you:
- If you're trying to profile and monitor your applications at the code level, you might look at APM tools, or instrumenting the application yourself (e.g., StatsD); a brief sketch of the latter appears just below this list.
- If you need to monitor performance of a cloud infrastructure, you might look at modern server monitoring solutions.
- If you need to monitor for spanning tree topology changes or routing updates, you might look at tools with a network focus.
In any mature environment, you'll fill your toolbox with a set of general and specialized tools.
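To make the first item above concrete, here is a minimal sketch of instrumenting an application yourself with StatsD. It assumes the Python statsd package and a StatsD daemon listening on localhost:8125; the metric names and the create_user() function are hypothetical examples, not part of any particular product.

# A minimal sketch of instrumenting application code with StatsD.
# Assumes the Python `statsd` package and a StatsD daemon on localhost:8125.
# Metric names and create_user() are hypothetical examples.
import statsd

stats = statsd.StatsClient("localhost", 8125, prefix="webapp")

def handle_signup(request):
    stats.incr("signup.attempts")          # count how often this path runs
    with stats.timer("signup.duration"):   # time the critical section
        user = create_user(request)        # hypothetical application function
    stats.incr("signup.success")
    return user

Counters and timers like these are cheap to emit and give you application-level visibility that no amount of external polling can provide.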
I've found it's common for people to be afraid of tool creep. That is, they are wary of bringing more tools into their environment for fear of increasing complexity. This is a good thing to be wary of, though I think it's less of a problem than most people imagine.
My advice is to choose tools wisely and consciously, but don't be afraid of adding new tools simply because it's yet another tool. It's a good thing that your network engineers are using tools specialized for their purpose. It's a good thing that your software engineers are using APM tools to dive deep into their code.
In essence, it's desirable that your teams use tools that solve their problems, instead of being forced into tools that are a poor fit for their needs in the name of "consolidating tools." If everyone is forced to use the same tools, it's unlikely that you're going to have a great outcome, simply due to a poor fit. Where you should rightfully worry is when your tools are unable to work together. If your systems team can't correlate latency on the network with poor application responsiveness, you should reevaluate your solutions.
What if you want to set some company standards on tools to prevent runaway adoption? Without standards, you might end up with dozens of tools all doing the same thing, and you miss out on the benefits that come with standardization: institutional expertise, easier implementation of monitoring, and lower expenses. How would you go about determining whether you're in that situation?
It's easy. Well, sort of: you have to start talking to people, and a lot of them. I find it helpful to start an informal conversation with managers of teams to find out what monitoring tools are being used and for what purpose. Make it clear right away that you're not setting out to change how they work; you're gathering information so you can help make their jobs easier later. Forcing change on people is a great way to derail any consolidation effort like this, so keep it light and informal for now. If you're unable to get clear answers, check with accounting: purchase orders and credit card purchases from the past year will reveal both monthly SaaS subscriptions and annual licenses. Make sure to confirm that what you find is actually in use, though; you may just find tools that are no longer in use and haven't been cancelled yet.
Avoid Cargo-Culting Tools
There is a story recounted in Richard Feynman's book Surely You're Joking, Mr. Feynman! about what Feynman dubbed cargo cult science:
In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas (he's the controller) and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.
Over the years, this observation from the science community has come to be applied to software engineering and system administration: adopting the tools and procedures of more successful teams and companies under the misguided notion that those tools and procedures are what made the teams successful, so they will make your own team successful in the same way. Sadly, the cause and effect are backward: the success those teams experienced led them to create the tools and procedures, not the other way around.
It's become commonplace for companies to publicly release tools and procedures they use for monitoring their infrastructure and applications. Many of these tools are quite slick and have influenced the development of other monitoring solutions widely used today (for example, Prometheus drew much inspiration from Google's internal monitoring system, Borgmon).
Here's the rub: what you don't see are the many years of effort that went into understanding why a tool or procedure works. Blindly adopting these won't necessarily lead to the same success the authors of those tools and procedures experienced. Tools are a manifestation of ways of working, of assumptions, of cultural and social norms. Those norms are unlikely to map directly to the norms of your own team.
I don't mean to discourage you from adopting tools published by other teams; by all means, do so! Some of them are truly amazing and will change the way you and your colleagues work for the better. Rather, don't adopt them simply because a well-known company uses them: evaluate and prototype candidates instead of choosing them because someone else uses them or because a team member used them in the past. Make sure the assumptions the tools make are assumptions you and your team are comfortable with and will work well within. Life is too short to suffer with crummy tools (or even great tools that don't fit your workflow), so be sure to really put them through their paces before integrating them into your environment. Choose your tools with care.
Sometimes, You Really Do Have to Build It
When I was growing up, I loved to go through my grandfather's toolbox. It had every tool imaginable, plus some that baffled me as to their use. One day, while I was helping my grandfather fix something, he suddenly stopped, looking a little perplexed, and began rummaging through the toolbox. Unsatisfied, he grabbed a wrench, a hammer, and a vise. A few minutes later he had created a new tool, built for his specific need. What was once a general-purpose wrench became a specialized tool for solving a problem he had never faced before. Sure, he could have spent many more hours solving the problem with the tools he had, but creating a new tool allowed him to solve a particular problem in a highly effective manner, in a fraction of the time he might have spent otherwise.
Creating your own specialized tool does have its advantages. For example, one of the first tools many teams build is something to allow the creation of AWS EC2 instances quickly and with all the standards of their company automatically applied. Another example, this one monitoring-related, is a tool I once created: while working with SNMP (which we'll be going into in Chapter 9), I needed a way to comb through a large amount of data and pull out specific pieces of information. No other tool on the market did what I needed, so with a bit of Python, I created a new tool suited for my purpose.
Note that I'm not suggesting you build a completely new monitoring platform. Most companies are not at the point where the ground-up creation of a new platform is a wise idea. Rather, I'm speaking to small, specialized tools.
The Single Pane of Glass Is a Myth
Every Network Operations Center (NOC) I've been in has had gargantuan monitors covering the wall, filled with graphs, tables, and other information. I once worked in a NOC (pronounced like "knock") myself that had six 42-inch monitors spanning the wall, with constant updates on the state of the servers, network infrastructure, and security stance. It's great eye candy for visitors.
However, I've noticed there can often be a misconception around what the single pane of glass approach to monitoring means. This approach to monitoring manifests as the desire to have one single place to go to look at the state of things. Note that I didn't say one tool or one dashboard; this is crucial to understanding the misconception.
There does not need to be a one-to-one mapping of tools to dashboards. You might use one tool to output multiple dashboards, or you might even have multiple tools feeding into one dashboard. More likely, you're going to have multiple tools feeding multiple dashboards. Given that monitoring is a complex series of problems, attempting to shoehorn everything into one tool or dashboard system is just going to hamper your ability to work effectively.
Anti-Pattern #2: Monitoring-as-a-Job
As companies grow, it's common for them to adopt specialized roles for team members. I once worked for a large enterprise organization that had specialized roles for everyone: there was the person who specialized in log collection, there was the person who specialized in managing Solaris servers, and another person whose job it was to create and maintain monitoring for all of it. Three guesses which one was me.
At first glance, it makes sense: create specialized roles so people can focus on doing that function perfectly, instead of being a generalist and doing a mediocre job on everything. However, when it comes to monitoring, there's a problem: how can you build monitoring for a thing you don't understand?
Thus, the anti-pattern: monitoring is not a job; it's a skill, and it's a skill everyone on your team should have to some degree. You wouldn't expect only one member of your team to be the sole person familiar with your config management tool, or with how to manage your database servers, so why would you expect that when it comes to monitoring? Monitoring can't be built in a vacuum, as it's a crucial component of the performance of your services.
As you move along your monitoring journey, insist that everyone be responsible for monitoring. One of the core tenets of the DevOps movement is that we're all responsible for production, not just the operations team. Network engineers know best what should be monitored in the network and where the hot spots are. Your software engineers know the applications better than anyone else, putting them in the perfect position to design great monitoring for the applications.
Strive to make monitoring a first-class citizen when it comes to building and managing services. Remember, it's not ready for production until it's monitored. The end result will be far more robust monitoring with a great signal-to-noise ratio, and likely far better signal than you've ever had before.
There is a distinction to be made here, of course: building self-service monitoring tools as a service you provide to other teams (commonly the job of an observability team) is a valid and common approach. In these situations, there is a team whose job is to create and cultivate the monitoring tools that the rest of the company relies on. However, this team is not responsible for instrumenting the applications, creating alerts, etc. The anti-pattern I want to caution you against isn't having a person or team responsible for building and providing self-service monitoring tools; rather, it's having your company shirk the responsibility of monitoring entirely by resting it solely on the shoulders of a single person.
Anti-Pattern #3: Checkbox Monitoring
When people tell me that their monitoring sucks, I find that this anti-pattern is usually at the center of it all.
Checkbox monitoring is when you have monitoring systems for the sole purpose of saying you have them. Perhaps someone higher up in the organization made it a requirement, or perhaps you suddenly had specific compliance regulations to meet, necessitating a quick monitoring deployment. Regardless of how you got here, the result is the same: your monitoring is ineffective, noisy, untrustworthy, and probably worse than having no monitoring at all.
How do you know if you've fallen victim to this anti-pattern? Here are some common signs:
- You are recording metrics like system load, CPU usage, and memory utilization, but the service still goes down without your knowing why.
- You find yourself consistently ignoring alerts, as they are false alarms more often than not.
- You are checking systems for metrics every five minutes or even less often.
- You aren't storing historical metric data (I'm looking at you, Nagios).
This anti-pattern is commonly found alongside the previous anti-pattern (monitoring-as-a-job). Since the people setting up monitoring don't completely understand how the system works, they often set up the simplest and easiest things and check the item off the to-do list.
There are a few things you can do to fix this anti-pattern.
What Does "Working" Actually Mean? Monitor That.
To fix this problem, you first need to understand what it is you're monitoring. What does "working" mean in this context? Talking to the service/app owner is a great place to start.
Are there high-level checks you can perform to verify it's working? For example, if we're talking about a webapp, the first check I would set up is an HTTP GET /. I would record the HTTP response code and the request latency, and expect an HTTP 200 OK response along with specific text on the page. This one check gives me a wealth of information about whether the webapp is actually working. When things go south, latency might increase while I continue to receive an HTTP 200 response, which tells me there might be a problem. In another scenario, I might get back the HTTP 200, but the text that should be on the page isn't found, which also tells me there might be a problem.
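As an illustration, here is a minimal sketch of that kind of check in Python, assuming the requests library; the URL, expected text, and latency threshold are placeholders, not recommendations.

# A minimal sketch of a high-level "is it working?" check.
# Assumes the `requests` library; the URL, expected text, and threshold
# below are placeholders for illustration.
import requests

URL = "https://www.example.com/"
EXPECTED_TEXT = "Welcome"     # text that should appear on a healthy page
LATENCY_THRESHOLD = 2.0       # seconds

def check_webapp():
    problems = []
    try:
        response = requests.get(URL, timeout=10)
    except requests.RequestException as exc:
        return [f"request failed entirely: {exc}"]
    latency = response.elapsed.total_seconds()
    if response.status_code != 200:
        problems.append(f"expected HTTP 200, got {response.status_code}")
    if EXPECTED_TEXT not in response.text:
        problems.append("expected page text not found")
    if latency > LATENCY_THRESHOLD:
        problems.append(f"latency {latency:.2f}s exceeds {LATENCY_THRESHOLD}s")
    return problems

if __name__ == "__main__":
    issues = check_webapp()
    print("OK" if not issues else "; ".join(issues))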
Every service and product your company has will have these sorts of high-level checks. They don't necessarily tell you what's wrong, but they're great leading indicators that something could be wrong. Over time, as you understand your service/app more, you can add more specific checks and alerts.
OS Metrics Aren't Very Useful (for Alerting)
Early in my career as a systems administrator, I went to my lead engineer and told him that the CPU usage on a particular server was quite high, and asked what we should do about it. His response was illuminating for me: "Is the server still doing what it's supposed to?" It was, I told him. "Then there's not really a problem, is there?"
Some services we run are resource-intensive by nature and that's OK. If MySQL is using all of the CPU consistently, but response times are acceptable, then you don't really have a problem. That's why it's far more beneficial to alert on what "working" means as opposed to low-level metrics such as CPU and memory usage.
That isn't to say these metrics aren't useful, of course. OS metrics are critical for diagnostics and performance analysis, as they allow you to spot blips and trends in underlying system behavior that might be impacting performance. But 99% of the time, they aren't worth waking someone up over. Unless you have a specific reason to alert on OS metrics, stop doing it.
Collect Your Metrics More Often
In a complex system (like the one you are running), a lot can happen in a few minutes, or even a few seconds. Let's consider an example: imagine latency between two services spikes every 30 seconds, for whatever reason. At a five-minute metric resolution, you would miss the event entirely; collecting metrics only every five minutes means you're effectively blind. Opt for collecting metrics at least every 60 seconds. If you have a high-traffic system, opt for more often, such as every 30 seconds or even every 10 seconds.
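To see why resolution matters, here is a toy simulation (made-up numbers, not a real workload) of a latency spike that lasts a few seconds out of every 30, polled at different intervals.

# A toy simulation (made-up numbers) of how collection interval hides events:
# latency spikes to 800 ms for 5 seconds out of every 30, and we poll it at
# different intervals over an hour.

def latency_at(t):
    """Simulated latency in ms: an 800 ms spike for 5 seconds out of every 30."""
    return 800 if 7 <= t % 30 < 12 else 50

def poll(interval_s, duration_s=3600):
    """Take one sample every interval_s seconds over duration_s seconds."""
    return [latency_at(t) for t in range(0, duration_s, interval_s)]

for interval in (300, 60, 10):
    print(f"{interval:>3}s polling: max observed latency = {max(poll(interval))} ms")

# In this toy case the 300-second (and even the 60-second) polls never land on
# a spike, while 10-second polling catches it every time. Real systems are
# messier, but the lesson holds: coarse collection intervals hide real events.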
Some people have argued that collecting metrics more often places too much load on the system, which I call baloney. Modern servers and network gear have very high performance and can easily handle the minuscule load more monitoring will place on them.
Of course, keeping high-granularity metrics around on disk for a long period of time can get expensive. You probably don't need to store a year of CPU metric data at 10-second granularity. Make sure you configure a roll-up period that makes sense for your metrics.1
The one caveat here is that many older network devices have very little processing power available to their management cards, causing them to fall over when hit with too many requests for monitoring data (I'm looking at you, Cisco). Be sure to test them in a lab before increasing the polling frequency on these devices.
Anti-Pattern #4: Using Monitoring as a Crutch
I once worked with a team that ran a legacy PHP app. This app had a large amount of poorly written and poorly understood code. As things tended to break, the team's usual response was to add more monitoring around whatever it was that broke. Unfortunately, while adding monitoring seems at first glance to be the correct response, it does little to solve the real problem: a poorly built app.
Avoid the tendency to lean on monitoring as a crutch. Monitoring is great for alerting you to problems, but don't forget the next step: fixing the problems. If you find yourself with a finicky service and you're constantly adding more monitoring to it, stop and invest your effort into making the service more stable and resilient instead. More monitoring doesn't fix a broken system, and it's not an improvement in your situation.
Anti-Pattern #5: Manual Configuration
I'm sure we all can agree that automation is awesome. That's why it's surprising to me how often monitoring configuration is manual. The question I never want to hear is "Can you add this to monitoring?"
Your monitoring should be 100% automated. Services should self-register instead of someone having to add them. Whether you're using a tool such as Sensu that allows for instant self-registration and deregistration of nodes, or using Nagios coupled with config management, monitoring ought to be automatic.
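As one sketch of what "automatic" can look like with Nagios plus config management, here is a small generator that renders service definitions from a host inventory. The inventory, role mapping, and check commands are illustrative assumptions; in practice your config management or inventory system would supply that data and write the files for you.

# A minimal sketch of generating Nagios service checks from an inventory
# instead of editing configuration by hand. The inventory, role mapping, and
# check commands below are illustrative assumptions.

INVENTORY = [
    {"host": "web01.example.com", "roles": ["web"]},
    {"host": "web02.example.com", "roles": ["web"]},
    {"host": "db01.example.com", "roles": ["database"]},
]

ROLE_CHECKS = {                       # checks every host with a given role should get
    "web": [("HTTP", "check_http")],
    "database": [("MySQL", "check_mysql")],
}

def render_services(inventory):
    blocks = []
    for node in inventory:
        for role in node["roles"]:
            for description, command in ROLE_CHECKS.get(role, []):
                blocks.append(
                    "define service {\n"
                    "    use                 generic-service\n"
                    f"    host_name           {node['host']}\n"
                    f"    service_description {description}\n"
                    f"    check_command       {command}\n"
                    "}\n"
                )
    return "\n".join(blocks)

if __name__ == "__main__":
    print(render_services(INVENTORY))   # write this into your Nagios objects directory

Add a host to the inventory (or better, have the inventory come from the same source your config management uses) and the checks appear without anyone asking "Can you add this to monitoring?"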
The difficulty of building a well-monitored infrastructure and app without automation cannot be overstated. I'm often called on to consult on monitoring implementations, and in most cases, the team spends more time on configuration than on monitoring. If you cannot quickly configure new checks or nodes, building better monitoring becomes frustrating. After a while, you'll just stop bothering. On the other hand, if it takes only a few minutes to add new checks for every web server in your fleet, you won't be so hesitant to do more of it.
Wrap-Up
We learned about five common anti-patterns in monitoring in this chapter:
- Tool obsession doesn't give you better monitoring.
- Monitoring is everyone's job, not a single role on the team or a department.
- Great monitoring is more than checking the box marked "Yep, we have monitoring."
- Monitoring doesn't fix broken things.
- Lack of automation is a great way to ensure you've missed something important.
Now that you know the monitoring anti-patterns to watch out for and how to fix them, you can build positive monitoring habits. If you were to do nothing but fix these five problems in your environment, you'd be in good shape. Of course, who wants to settle for good when they can be great? And for that, we'll need to talk about the inverse of the anti-pattern: the design pattern.
1 Consult the documentation for your metrics tool on roll-up configuration and best practices.