Chapter 8. Service Monitoring

Running a website without service monitoring is an exercise in flying blind. Nobody cares about monitoring when everything is going well, but as soon as something goes wrong, the additional information and warnings provided can be instrumental in quickly and correctly diagnosing the problem.

There are different types of service monitoring. Tools like Icinga and Nagios are designed to watch hosts and services, sending alerts when a service check falls outside an acceptable range. Other tools, such as Cacti and Munin, provide a graphical look at server and service information in order to give historical context to performance and usage statistics. Still other applications, such as Zabbix, aim to combine these two types of monitoring. Throughout this chapter, we’ll take a look at the various types of monitoring systems and give examples of how they can be used to ensure your site is stable and performing optimally.

The Importance of Monitoring Services

Imagine a situation where you are using an alternate cache backend for Drupal, such as Memcache (we’ll provide details on how to implement this in Chapter 16). Your cache items are now only accessible if the Memcached service is running. Now imagine that the service unexpectedly stops responding, causing your entire Drupal cache to disappear. This would put a great deal more load on your database server. Without service monitoring in place, it may take some time to figure out what the problem is; all you know for sure is that the site feels slower than normal. You might not realize what’s wrong until you start going through recent Drupal log entries.

If you had a monitoring server that was configured to run periodic tests to ensure the Memcached service was responding as expected, you would receive an email or text message alerting you as soon as the service failed to respond. You would know exactly what was happening before you even logged into the server. Being able to respond to issues immediately, or even be warned as services start to deteriorate, is extremely important—especially if things can be fixed before users realize something is wrong.
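As a rough illustration of such a periodic test, the sketch below connects to Memcached, sends the text-protocol `version` command, and maps the outcome to the conventional plug-in exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function name `check_memcached` and the default host, port, and timeout are assumptions for the example, not part of Icinga or of any shipped plug-in.

```python
import socket

# Conventional Nagios/Icinga plug-in exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_memcached(host="127.0.0.1", port=11211, timeout=2.0):
    """Return (exit_code, message) for a basic Memcached liveness check.

    Hypothetical helper: the defaults are illustrative assumptions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # The Memcached text protocol answers "version" with "VERSION x.y.z".
            sock.sendall(b"version\r\n")
            reply = sock.recv(128)
    except OSError as exc:
        # Connection refused, timeout, unreachable host, and so on.
        return CRITICAL, f"MEMCACHED CRITICAL - {exc}"
    if reply.startswith(b"VERSION"):
        return OK, "MEMCACHED OK - " + reply.decode("ascii", "replace").strip()
    return WARNING, "MEMCACHED WARNING - unexpected reply"
```

A wrapper script used as an actual plug-in would print the message and exit with the returned code, which is how the monitoring server decides whether to send an alert.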

Another important aspect of monitoring is collecting data over time. Having a strong grasp of the baseline usage of your servers and services makes it easy to see when something out of the ordinary is occurring. One example is tracking general things such as server load going up during times of increased traffic on the site. Going beyond that, monitoring could be used to track specific service information, like if and when your APC opcode cache (see Chapter 18) fills up. By setting up a thorough and reliable monitoring system, you will be able to stay ahead of problems with your site and give yourself an indispensable troubleshooting tool.

Monitoring Alerts with Icinga

Nagios is a very popular open source monitoring system that was initially released in 1999. It gained popularity in the following years, though there were complaints in the development community about bugs going unfixed and a general lack of transparency within the core of Nagios. In 2009 this led a core group of Nagios developers (who felt that their efforts to contribute to Nagios core were being ignored) to fork off a new project, which they named Icinga.

Since forking, the Icinga project has grown substantially in both developers and users; it includes many bug fixes and feature improvements that are not found in Nagios while remaining compatible with all external plug-ins. One of these improvements is a new web interface that was designed to be more modern and configurable than that provided by Nagios. If you don’t have a strong preference as to which system to use, then our opinion is that the new web interface alone should be enough to encourage you to adopt Icinga over Nagios, unless you are paying for the Enterprise version of Nagios.

There are other open source and commercial monitoring options as well; OpenNMS, Sensu, Zabbix, and ZenOSS are just a few of the open source ones. However, our past experiences have kept us coming back to Icinga when we need to set up a monitoring system. For that reason, we will be using Icinga-specific examples here, but we encourage you to review other options and pick the one that works best for you.

What to Monitor

Deciding what to monitor and with what failure conditions should be viewed as an iterative task, where you continually improve the monitoring configuration in response to false positives, or lack of alerts when expected. On the one hand, you want to monitor as many different aspects of your infrastructure as possible; but on the other hand, if too many alerts are being generated (or worse, false positives), they will be ignored and important alerts may go unseen. We recommend striving for the middle ground—monitoring as much as possible without becoming the monitoring system that cries wolf.

There are actually two things to consider when choosing what to monitor: first is which services and information to monitor, and second is how to set your thresholds. We’ll start with “what to monitor,” and then discuss how to select and refine threshold values.

It’s obvious that you would want to monitor all of your core servers and services. This generally means starting with simple checks such as ping or ssh checks against servers, and overall health checks against services—for example, checking that your website returns a 200 OK status code, or that the MySQL server accepts connections. Beyond those simple checks, there is a virtually unlimited number of things that you can monitor within each server or service. For example, server monitoring might include:

  • RAM and swap usage
  • Disk usage
  • CPU usage
  • Network connection count

Service monitoring might include things like the time the web server takes to serve a request, or a wealth of MySQL information, such as:

  • MySQL thread and connection counts
  • MySQL replication status
  • MySQL query activity
  • InnoDB buffer usage

It’s advisable to start with at least a set of “simple” checks for the various server resources (RAM, CPU, disk, network). Service-specific checks are more subjective and depend on what is important in your environment. You’ll definitely want some simple up/down checks for services such as web and MySQL services (plus any others running in your environment). Beyond that, review some of the common checks for MySQL to see what you feel is important. For example, if you know your InnoDB buffer pool is relatively full, it would be prudent to monitor its usage in order to have a warning before it completely fills up.
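To make the InnoDB example concrete, here is a minimal sketch of how buffer pool usage could be computed from two MySQL status counters, `Innodb_buffer_pool_pages_total` and `Innodb_buffer_pool_pages_free` (as reported by `SHOW GLOBAL STATUS`). The function name and the dictionary input are assumptions for illustration; fetching the values from the server is left out.

```python
def buffer_pool_usage_pct(status):
    """Return InnoDB buffer pool usage as a percentage.

    `status` is assumed to be a mapping of MySQL status variable names to
    values, for example rows from SHOW GLOBAL STATUS loaded into a dict.
    """
    total = int(status["Innodb_buffer_pool_pages_total"])
    free = int(status["Innodb_buffer_pool_pages_free"])
    if total <= 0:
        raise ValueError("buffer pool reports no pages")
    return 100.0 * (total - free) / total


# Example with made-up numbers: 8192 pages in total, 2048 still free.
sample = {
    "Innodb_buffer_pool_pages_total": "8192",
    "Innodb_buffer_pool_pages_free": "2048",
}
usage = buffer_pool_usage_pct(sample)  # 75.0
```

A check built on this value could warn once usage crosses whatever percentage makes sense for your environment.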

How to Tune Monitoring

Most prepackaged and third-party checks for Icinga will come with suggested threshold values when needed for warnings and errors. If you are creating your own checks, you will need to set those thresholds yourself. In that case, it’s important to set them low enough that you are sure they will trigger before or during a problem. A monitoring check does no good if it is configured to such a high value that the site can become unusably slow or go offline completely without the alert actually catching anything. For new sites and infrastructures, this can be a bit of a guessing game until you have established some baseline data. Remember, you can always increase the alert thresholds if you are receiving too many false positives.

Plug-ins generally include options for setting warning and critical thresholds. When defining a check in Icinga, these command-line flags are passed in either the service or the command configuration file as part of the check_command option. Icinga will automatically fill in the hostname for the check command based on the host_name value set in the service configuration file.

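A service definition for checking the HTTP response time of your website might look something like the following, which warns if the page takes longer than three seconds to respond and sends a critical alert if it takes longer than five seconds. The example assumes that the check_http command in your command configuration passes its two arguments to the plug-in's -w (warning) and -c (critical) options: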

 define service {
        host_name               www1
        service_description     WEB_RESPONSE_TIME
        check_command           check_http!3!5
 }
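The warning and critical logic behind a check like this can be sketched in a few lines. The following is only an illustration of how a measured value maps to a status, using the three- and five-second values from the example; the `classify` helper is hypothetical and is not how the check_http plug-in itself is implemented.

```python
# Conventional Nagios/Icinga plug-in exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Map a measured value to a status, checking the critical level first."""
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


# Response times in seconds, checked against warn=3 and crit=5.
fast = classify(1.2, warn=3, crit=5)       # OK
slow = classify(3.5, warn=3, crit=5)       # WARNING
very_slow = classify(6.0, warn=3, crit=5)  # CRITICAL
```

Checking the critical threshold first matters: any value above five seconds is also above three, so testing the warning level first would hide critical results.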

Graphing Monitoring Data

In addition to setting up an active monitoring system to send alerts, it’s also very useful to be able to view historic data for your servers and services. Building on the preceding example, it’s great to receive an alert when the website begins to load slowly, but in order to troubleshoot what might be causing the slowdown, it would be ideal to be able to view information like the number of Apache processes over the last hour, how loaded the database has been over the last day, etc. By implementing a monitoring tool to track resources over time, you can have this and other important graphs at your fingertips, whether for urgently debugging a problem or just for a periodic review of how services are performing.

The two systems that we use most for this capability are Cacti and Munin. Both are very capable applications, and the choice between them (or one of the many other options) can often boil down to personal preference. Both use RRDTool to graph their data and make it available from a web interface, and both have a plug-in system for monitoring various applications and server resources (generally, people find Munin plug-ins easier to implement). Figure 8-1 shows a sample Munin load graph.

Figure 8-1. Munin load graph

There are many plug-ins available for Munin, Cacti, and other monitoring systems. We generally try to graph as much data as possible, because even if something isn’t a problem today, it might end up being important at some point in the future. When considering what to monitor, we recommend looking through the default and popular plug-ins for whatever system you select. However, here is a list of things that we generally monitor:

  • System data: This includes disk I/O, disk usage, network traffic and errors for all network interfaces, network connections, email activity/queues, CPU load, memory usage, total swap usage, and swap activity.
  • Web server data: This can include the number of Apache processes; APC memory usage and evictions; Varnish requests, hit rates, and evictions; Memcache (or other external cache) memory usage, hit rates, and evictions.
  • Database data: MySQL has a ton of data, most of which is worth tracking: this includes slave lag; command types and counts; connections; InnoDB information such as buffer pool size, activity, and I/O; query cache information such as memory usage and hits/inserts/prunes; slow query counts; table locks; and temp table types and counts.

Internal Versus Remote Monitoring

Where your monitoring server(s) sit, internal or external to the rest of your server and network infrastructure, is very important. There is a case to be made for either location, and generally those who want to be very thorough in their monitoring will end up with both an internal and an external monitoring system in order to benefit from each. In fact, the two can and should be set up to complement each other instead of duplicating all monitoring in both locations.

An external monitoring system (hosted on a separate network, and generally in a different geographic location) is important for monitoring externally facing services. The reason for externally monitoring those services is quite easy to understand: you want to see the same thing your users are seeing. For example, a web page load test run over the local network won’t see any delays introduced on your outbound network, or, worse, notice if that network goes down. If you are testing from a separate network, however, the monitoring system will see those faults and delays. For this reason, we generally recommend setting up an external monitoring server to do at least basic ping and web page load checks.
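A basic web page load check of this kind can be approximated with Python's standard library. The sketch below fetches a URL and reports the HTTP status code together with the elapsed time; the `timed_fetch` name and the ten-second default timeout are assumptions for the example, and a real deployment would more likely rely on an existing plug-in such as check_http.

```python
import time
import urllib.error
import urllib.request


def timed_fetch(url, timeout=10.0):
    """Return (status_code, elapsed_seconds) for a single page load.

    status_code is None when no HTTP response was received at all
    (DNS failure, refused connection, timeout, and so on).
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include the body transfer in the timing
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status = exc.code
    except (urllib.error.URLError, OSError):
        status = None
    return status, time.monotonic() - start
```

Run from a host outside your own network, the elapsed time includes the same network path your visitors use, and a `None` status makes an outage visible even when the servers themselves are healthy.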

An internal monitoring system is much better suited to monitoring backend services (MySQL, Memcached, Solr, etc.) and server resources. Keeping that monitoring internal means you don’t need to worry about external network bandwidth or the security implications of allowing an external host to connect to your internal services. As you add more servers and service checks to your monitoring system, having it on a low-latency local connection can help improve monitoring performance.
