Chapter 1. Welcome! Read This First

There are few things you need to know before we begin.

First, what you’re holding in your hand is not the entire report. This print version only includes the 25 open source tools from the full taxonomy. Think of it as the teaser. The rest (62 tools in all) are online at http://github.com/librato/taxonomy.

Second, this report was not intended to be read cover-to-cover. This is a reference work. What I’m trying to do here is categorize your problem and then provide you some descriptions of monitoring tools that might help you solve your problem. I’m not trying to fully document every monitoring tool, I’m just trying to get you pointed in the right direction. See the following section for a description of how we’re going to do that.

Third, I didn’t choose the tools that were included in this report. They are self-selected based on their use in the real world by other engineers like you. See “How Did You Choose the Tools?” if you’re curious about why tool X was included but not tool Y.

How Does This Report Work?

This report is predicated on the assumption that you hold in your mind a loose set of requirements (or at least desires) for a monitoring system. Maybe you need an open source tool that can inspect SFlow traffic and alert you to a bit-torrent happening on your network, or maybe you need a commercial byte-code injection tool that can measure latency between your worker threads and your Kafka queue.

Maybe you don’t know exactly what you’re requirements are, but you’d like to weigh the options either way, or maybe you’re aware of one popular tool and want to learn about other tools that do the same thing.

This report tries to help you out in any of those cases by categorizing your problem and then presenting you with a group of tools that might work for you. We use a hierarchy in which you begin by choosing how you want to manage the tool (hosted? on-premises?), how you want to pay for it, and so on, filtering out the tools that don’t apply to you as you proceed.

In this document, I categorize and describe 25 different open source monitoring tools along a path of four taxa. Then, I provide a summary of each tool in the form of a small questionnaire to give you a sense of how the tool does what it does, how well it gets along with other tools, and how well it might fit into your environment/culture/stack.

Let’s Begin

There are four top-level categories, and each have a varying number of classification, which we define in the sections that follow.

Operations Burden

Who do you want to run your monitoring tool? The classifications are as folows:

Traditional

You download and install tools in this category on your network.

Hosted

Another party runs the tools in this category for you, off site.

Appliances and Sensors

Tools in this category encompass vendor-provided hardware that is either managed by the vendor or you, and other hybrid models that involve hardware.

Pay Model

How do you want to pay for your monitoring tool? Here are the classifications:

Free/open

Open source tools cost nothing to obtain and come with their source code.

Commercial

Commercially licensed software often costs money to obtain (legally) and is usually distributed in binary form. This category includes demo software that’s free to use for a limited time as well as tools whose free-to-use tiers are too limited on which to run a small startup.

Freemium

Freemium software is free to use but comes with a paid premium component or usage tier. You can find commercial tools that have a usable free tier. By usable we mean the free tier provides the base-line operability a resonable person would consider using for running a small startup.

Free/Closed

This is closed source software that doesn’t cost money to obtain. You can find shareware and nagware tools here.

Subscription hardware

Appliances in this category typically provide the hardware for free but charge a monthly or annual subscription fee for use.

Activity Model

Do you need a tool to collect measurements or process measurements, or both? Here are the classifications to look for:

Collectors

These are tools designed primarily to measure things and make observations. This includes monitoring agents, instrumentation libraries, and sensors.

Processors

These are tools designed primarily to accept and/or process data from other tools. This includes data visualization tools and time series databases as well as stream processing systems and glue projects.

Monoliths

This category of tools is designed to be all-in-one solutions. It’s probably possible to import/export data from these tools, but they were designed to consume the data that they collect themselves. Most traditional operations-oriented “monitoring” tools (e.g., openview, patrol, and Nagios) fit into this category.

Focus

What do you need your tool to actually monitor? Here are the classifications:

System availability

These tools seek to answer the question “Is it up?” You will find any tool that was primarily designed to check for system or service availability at a one-minute or greater resolution. Most classic operations-centric monitoring tools like Nagios fall into this category.

App/database performance

Application performance management (APM) tools insert themselves into popular databases and language interpretors for the purpose of analyzing their performance. This is usually done by patching the interpretor or other binary with instrumentation code. These tools can give very detailed performance data at a highly granular resolution on databases like MySQL or even on custom apps by, for example, instrumenting the java virtual machine (JVM). Examples include New Relic, Appneta, and Vivid Cortex.

Networks

This is a broad class of tools designed to monitor and analyze network availability, performance, and content. Packet taps, SNMP collectors, Netflow, and SFlow related tools can be found herein.

Data Processing

Tools in this category were designed to collect or accept ad hoc metrics and log data with the intention to do something useful with it such as visualizing it (drawing graphs), parsing it, transforming it, alerting on it, and possibly forwarding it to other tools. Examples include Splunk, Heka, Graphite, Riemann, and Librato.

Storage Analysis

These tools specialize in analyzing storage area networks (SAN) and other distributed storage technologies.

Platform-Specific

Tools in this category were created specifically to monitor a particular platform, or a class of platforms. AWS Cloudwatch, Heroku metrics, and Rackspace monitoring all fall into this category

Ok, I Think I Know What I’m Looking For, Now What?

Each chapter in this report (apart from this one) represents one taxonomic path. So, if you were looking for a traditional → free/open → monolithic → system_availability–type tool—which happens to be the most popular taxonomic path (Nagios and its myriad clones [12 tools in all] categorize there)—you would thumb to Chapter 5.

How Did You Choose the Tools?

The initial list of tools I’m documenting were taken from James Turnbull’s “2015 Monitoring Survey” results, which you can see for yourself at https://kartar.net/2015/08/monitoring-survey-2015---tools. I’ll be expanding on that, and I’m also happy to take pull requests for new tools to include or corrections to anything I’ve misrepresented.

Again, in this short document you’ll find only open source tools that have a minimum of two respondents in the Katar survey. You can find the rest of the tools (commercial, open source, and points in between—62 in all) at the gitHub repo for this project at http://github.com/librato/taxonomy.

Why Did You Write This?

I wrote this because monitoring is a mess.

What does it even mean!? Monitoring…I mean, what do you want to know? Do you want to know that the servers are up? Do you want to know the maximum of the p95 request latency on your frontend API? Do you want to know that someone left the back door open in your Albuquerque datacenter? These are very different problems. Each requires a different set of tools and expertise, and yet there’s no way to categorically research any of these tools independently of the others. They all somehow belong to the same category—monitoring. I find this confounding in the extreme—the magnitude of our hand-wavey imprecision in describing this discipline. Especially in the context of our wholesale obsession with definition. Every other engineering discipline in the world is endlessly subcategorizing itself into oblivion with increasingly incomprehensible (but extremely precise) jargon. But nope: all of these tools are just “monitoring tools” (he says, waving his hand in the general direction of a massive assortment of completely different tools).

When science has a vast and incomprehensible landscape to make sense of, it creates a taxonomy, and that’s what I’d like to begin here. I want to direct the attention of overwhelmed engineers toward the subset of “monitoring” tools that are likely to work for their problem domain in their corporate culture. Like a map; something to augment your already well-developed sense of direction and help you to avoid stepping in anything too smelly.

Get Monitoring Taxonomy now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.