Chapter 1. What Is Prometheus?
Prometheus does one thing and it does it well. It has a simple yet powerful data model and a query language that lets you analyse how your applications and infrastructure are performing. It does not try to solve problems outside of the metrics space, leaving those to other more appropriate tools.
Since its beginnings with no more than a handful of developers working in SoundCloud in 2012, a community and ecosystem have grown around Prometheus. Prometheus is primarily written in Go and licensed under the Apache 2.0 license. There are hundreds of people who have contributed to the project itself, which is not controlled by any one company. It is always hard to tell how many users an open source project has, but I estimate that as of 2018, tens of thousands of organisations are using Prometheus in production. In 2016 the Prometheus project became the second member1 of the Cloud Native Computing Foundation (CNCF).
For instrumenting your own code, there are client libraries in all the popular languages and runtimes, including Go, Java/JVM, C#/.Net, Python, Ruby, Node.js, Haskell, Erlang, and Rust. Software like Kubernetes and Docker is already instrumented with Prometheus client libraries. For third-party software that exposes metrics in a non-Prometheus format, there are hundreds of integrations available. These are called exporters, and include HAProxy, MySQL, PostgreSQL, Redis, JMX, SNMP, Consul, and Kafka. A friend of mine even added support for monitoring Minecraft servers, as he cares a lot about his frames per second.
A simple text format makes it easy to expose metrics to Prometheus. Other monitoring systems, both open source and commercial, have added support for this format. This allows all of these monitoring systems to focus more on core features, rather than each having to spend time duplicating effort to support every single piece of software a user like you may wish to monitor.
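To give a feel for how simple the text format is, here is a minimal sketch of rendering one metric by hand. The function name `render_metric` and the sample metric are made up for illustration, and the sketch assumes label values need no escaping; in practice a client library would produce this output for you.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric as text: HELP and TYPE comments, then one line per sample.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict.
    Simplification: label values are assumed to need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric with one label and two label values.
exposition = render_metric(
    "http_requests_total",
    "HTTP requests received.",
    "counter",
    [({"path": "/"}, 1027), ({"path": "/login"}, 3)],
)
```

Each sample ends up as a single line of the form `name{label="value"} number`, which is why other tools find the format easy to produce and to parse.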
The data model identifies each time series not just with a name, but also with an unordered set of key-value pairs called labels. The PromQL query language allows aggregation across any of these labels, so you can analyse not just per process but also per datacenter and per service or by any other labels that you have defined. These can be graphed in dashboard systems such as Grafana.
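As a rough illustration of aggregating across labels, the following sketch sums samples while keeping only the labels you ask for. It is plain Python rather than PromQL, and the helper `sum_by` and the sample data are invented for the example; the equivalent PromQL would look something like `sum by (datacenter) (some_metric)`.

```python
from collections import defaultdict


def sum_by(samples, keep_labels):
    """Sum sample values, grouping only by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical samples: one value per process (instance).
samples = [
    ({"datacenter": "eu", "service": "api", "instance": "a:80"}, 10.0),
    ({"datacenter": "eu", "service": "api", "instance": "b:80"}, 5.0),
    ({"datacenter": "us", "service": "api", "instance": "c:80"}, 7.0),
]

per_datacenter = sum_by(samples, ["datacenter"])
per_service = sum_by(samples, ["service"])
```

The same three samples answer both a per-datacenter and a per-service question, without either view having been planned in advance.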
Alerts can be defined using the exact same PromQL query language that you use for graphing. If you can graph it, you can alert on it. Labels make maintaining alerts easier, as you can create a single alert covering all possible label values. In some other monitoring systems you would have to individually create an alert per machine/application. Relatedly, service discovery can automatically determine what applications and machines should be scraped from sources such as Kubernetes, Consul, Amazon Elastic Compute Cloud (EC2), Azure, Google Compute Engine (GCE), and OpenStack.
For all these features and benefits, Prometheus is performant and simple to run. A single Prometheus server can ingest millions of samples per second. It is a single statically linked binary with a configuration file. All components of Prometheus can be run in containers, and they avoid doing anything fancy that would get in the way of configuration management tools. It is designed to be integrated into the infrastructure you already have and built on top of, not to be a management platform itself.
Now that you have an overview of what Prometheus is, let’s step back for a minute and look at what is meant by “monitoring” in order to provide some context. Following that I will look at what the main components of Prometheus are, and what Prometheus is not.
What Is Monitoring?
In secondary school one of my teachers told us that if you were to ask ten economists what economics means, you’d get eleven answers. Monitoring has a similar lack of consensus as to what exactly it means. When I tell others what I do, people think my job entails everything from keeping an eye on temperature in factories, to employee monitoring where I was the one to find out who was accessing Facebook during working hours, and even detecting intruders on networks.
Prometheus wasn’t built to do any of those things.2 It was built to aid software developers and administrators in the operation of production computer systems, such as the applications, tools, databases, and networks backing popular websites.
Knowing when things are going wrong is usually the most important thing that you want monitoring for. You want the monitoring system to call in a human to take a look.
Now that you have called in a human, they need to investigate to determine the root cause and ultimately resolve whatever the issue is.
Alerting and debugging usually happen on time scales on the order of minutes to hours. While less urgent, the ability to see how your systems are being used and changing over time is also useful. Trending can feed into design decisions and processes such as capacity planning.
When all you have is a hammer, everything starts to look like a nail. At the end of the day all monitoring systems are data processing pipelines. Sometimes it is more convenient to appropriate part of your monitoring system for another purpose, rather than building a bespoke solution. This is not strictly monitoring, but it is common in practice so I like to include it.
Depending on who you talk to and their background, they may consider only some of these to be monitoring. This leads to many discussions about monitoring going around in circles, leaving everyone frustrated. To help you understand where others are coming from, I’m going to look at a small bit of the history of monitoring.
A Brief and Incomplete History of Monitoring
When I say Nagios I am including any software within the same broad family, such as Icinga, Zmon, and Sensu. They work primarily by regularly executing scripts called checks. If a check fails by returning a nonzero exit code, an alert is generated. Nagios was initially started by Ethan Galstad in 1996, as an MS-DOS application used to perform pings. It was first released as NetSaint in 1999, and renamed Nagios in 2002.
To talk about the history of Graphite, I need to go back to 1994. Tobias Oetiker created a Perl script that became Multi Router Traffic Grapher, or MRTG 1.0, in 1995. As the name indicates, it was mainly used for network monitoring via the Simple Network Management Protocol (SNMP). It could also obtain metrics by executing scripts.3 The year 1997 brought big changes with a move of some code to C, and the creation of the Round Robin Database (RRD) which was used to store metric data. This brought notable performance improvements, and RRD was the basis for other tools including Smokeping and Graphite.
Started in 2006, Graphite uses Whisper for metrics storage, which has a similar design to RRD. Graphite does not collect data itself, rather it is sent in by collection tools such as collectd and Statsd, which were created in 2005 and 2010, respectively.
The key takeaway here is that graphing and alerting were once completely separate concerns performed by different tools. You could write a check script to evaluate a query in Graphite and generate alerts on that basis, but most checks tended to be on unexpected states such as a process not running.
Another holdover from this era is the relatively manual approach to administering computer services. Services were deployed on individual machines and lovingly cared for by systems administrators. Alerts that might potentially indicate a problem were jumped upon by devoted engineers. As cloud and cloud native technologies such as EC2, Docker, and Kubernetes have come to prominence, treating individual machines and services like pets with each getting individual attention does not scale. Rather, they should be looked at more as cattle and administered and monitored as a group. In the same way that the industry has moved from doing management by hand, to tools like Chef and Ansible, to now starting to use technologies like Kubernetes, monitoring also needs to make a similar transition from checks on individual processes on individual machines to monitoring based on service health as a whole.
You may have noticed that I didn’t mention logging. Historically logs have been used as something that you use tail, grep, and awk on by hand. You might have had an analysis tool such as AWStats to produce reports once an hour or day. In more recent years they have also been used as a significant part of monitoring, such as with the Elasticsearch, Logstash, and Kibana (ELK) stack.
Now that we have looked a bit at graphing and alerting, let’s look at how metrics and logs fit into things. Are there more categories of monitoring than those two?
Categories of Monitoring
Monitoring is ultimately about events. Events can be almost anything, including:
- Receiving a HTTP request
- Sending a HTTP 400 response
- Entering a function
- Leaving a function
- A user logging in
- Writing data to disk
- Reading data from the network
- Requesting more memory from the kernel
All events also have context. A HTTP request will have the IP address it is coming from and going to, the URL being requested, the cookies that are set, and the user who made the request. A HTTP response will have how long the response took, the HTTP status code, and the length of the response body. Events involving functions have the call stack of the functions above them, and whatever triggered this part of the stack such as a HTTP request.
Having all the context for all the events would be great for debugging and understanding how your systems are performing in both technical and business terms, but that amount of data is not practical to process and store. Thus there are what I would see as roughly four ways to approach reducing that volume of data to something workable, namely profiling, tracing, logging, and metrics.
Tcpdump is one example of a profiling tool. It allows you to record network traffic based on a specified filter. It’s an essential debugging tool, but you can’t really turn it on all the time as you will run out of disk space.
Debug builds of binaries that track profiling data are another example. They provide a plethora of useful information, but the performance impact of gathering all that information, such as timings of every function call, means that it is not generally practical to run it in production on an ongoing basis.
In the Linux kernel, enhanced Berkeley Packet Filters (eBPF) allow detailed profiling of kernel events from filesystem operations to network oddities. These provide access to a level of insight that was not generally available previously, and I’d recommend reading Brendan Gregg’s writings on the subject.
Profiling is largely for tactical debugging. If it is being used on a longer term basis, then the data volume must be cut down in order to fit into one of the other categories of monitoring.
Tracing doesn’t look at all events, rather it takes some proportion of events such as one in a hundred that pass through some functions of interest. Tracing will note the functions in the stack trace of the points of interest, and often also how long each of these functions took to execute. From this you can get an idea of where your program is spending time and which code paths are most contributing to latency.
Rather than doing snapshots of stack traces at points of interest, some tracing systems trace and record timings of every function call below the function of interest. For example, one in a hundred user HTTP requests might be sampled, and for those requests you could see how much time was spent talking to backends such as databases and caches. This allows you to see how timings differ based on factors like cache hits versus cache misses.
Distributed tracing takes this a step further. It makes tracing work across processes by attaching unique IDs to requests that are passed from one process to another in remote procedure calls (RPCs) in addition to whether this request is one that should be traced. The traces from different processes and machines can be stitched back together based on the request ID. This is a vital tool for debugging distributed microservices architectures. Technologies in this space include OpenZipkin and Jaeger.
For tracing, it is the sampling that keeps the data volumes and instrumentation performance impact within reason.
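The sampling decision and the request ID that ties spans together can be sketched as follows. This is a toy model under stated assumptions, not how OpenZipkin or Jaeger are implemented: the function names, the span fields, and the hardcoded timings are all hypothetical.

```python
import random
import uuid


def start_trace(sample_rate, rng=random):
    """Decide once, at the edge, whether this request is traced."""
    return {"request_id": uuid.uuid4().hex, "sampled": rng.random() < sample_rate}


def record_span(context, spans, name, seconds):
    """Record timing for one step, but only for sampled requests."""
    if context["sampled"]:
        spans.append(
            {"request_id": context["request_id"], "name": name, "seconds": seconds}
        )


def handle_request(sample_rate, spans):
    """Handle one request, recording spans for its backend calls."""
    context = start_trace(sample_rate)
    # In a real system the context would travel with each RPC to the backend.
    record_span(context, spans, "database", 0.012)
    record_span(context, spans, "cache", 0.001)
    return context
```

With a `sample_rate` of 0.01, roughly one request in a hundred produces spans, and every span from that request carries the same `request_id` so the pieces can be stitched back together later.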
Logging looks at a limited set of events and records some of the context for each of these events. For example, it may look at all incoming HTTP requests, or all outgoing database calls. To avoid consuming too many resources, as a rule of thumb you are limited to somewhere around a hundred fields per log entry. Beyond that, bandwidth and storage space tend to become a concern.
For example, for a server handling a thousand requests per second, a log entry with a hundred fields each taking ten bytes works out as a megabyte per second. That’s a nontrivial proportion of a 100 Mbit network card, and 84 GB of storage per day just for logging.
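The arithmetic behind those figures can be worked through in a few lines. The variable names are mine, and the inputs are the ones given in the example above.

```python
requests_per_second = 1000
fields_per_entry = 100
bytes_per_field = 10

# One log entry per request.
bytes_per_second = requests_per_second * fields_per_entry * bytes_per_field

# Network cost: 8 bits per byte, against a 100 Mbit card.
megabits_per_second = bytes_per_second * 8 / 1_000_000
share_of_100_mbit = megabits_per_second / 100

# Storage cost: 86,400 seconds in a day.
bytes_per_day = bytes_per_second * 86_400
decimal_gb_per_day = bytes_per_day / 1_000_000_000
```

This gives 1,000,000 bytes per second, which is 8 Mbit/s or 8% of a 100 Mbit card. Over a day that is 86,400 MB: about 86 GB in decimal units, or about 84 GB if you divide by 1,024 instead of 1,000.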
A big benefit of logging is that there is (usually) no sampling of events, so even though there is a limit on the number of fields, it is practical to determine how slow requests are affecting one particular user talking to one particular API endpoint.
Just as monitoring means different things to different people, logging also means different things depending on who you ask, which can cause confusion. Different types of logging have different uses, durability, and retention requirements. As I see it, there are four general and somewhat overlapping categories:
- Transaction logs
These are the critical business records that you must keep safe at all costs, likely forever. Anything touching on money or that is used for critical user-facing features tends to be in this category.
- Request logs
If you are tracking every HTTP request, or every database call, that’s a request log. They may be processed in order to implement user facing features, or just for internal optimisations. You don’t generally want to lose them, but it’s not the end of the world if some of them go missing.
- Application logs
Not all logs are about requests; some are about the process itself. Startup messages, background maintenance tasks, and other process-level log lines are typical. These logs are often read directly by a human, so you should try to avoid having more than a few per minute in normal operations.
- Debug logs
Debug logs tend to be very detailed and thus expensive to create and store. They are often only used in very narrow debugging situations, and are tending towards profiling due to their data volume. Reliability and retention requirements tend to be low, and debug logs may not even leave the machine they are generated on.
Treating the differing types of logs all in the same way can leave you with the worst of all worlds, where you have the data volume of debug logs combined with the extreme reliability requirements of transaction logs. Thus as your system grows you should plan on splitting out the debug logs so that they can be handled separately.
Examples of logging systems include the ELK stack and Graylog.
Metrics largely ignore context, instead tracking aggregations over time of different types of events. To keep resource usage sane, the amount of different numbers being tracked needs to be limited: ten thousand per process is a reasonable upper bound for you to keep in mind.
Examples of the sort of metrics you might have would be the number of times you received HTTP requests, how much time was spent handling requests, and how many requests are currently in progress. By excluding any information about context, the data volumes and processing required are kept reasonable.
That is not to say, though, that context is always ignored. For a HTTP request you could decide to have a metric for each URL path. But the ten thousand metric guideline has to be kept in mind, as each distinct path now counts as a metric. Using context such as a user’s email address would be unwise, as email addresses have unbounded cardinality.4
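To make the cardinality point concrete, this sketch counts how many distinct metrics result from a given choice of context. The helper `count_series` and the generated events are hypothetical, chosen only to contrast a bounded label with an unbounded one.

```python
def count_series(events, label_names):
    """Count distinct label-value combinations seen, one time series each."""
    return len({tuple(event[name] for name in label_names) for event in events})


# Hypothetical traffic: 1,000 users each hitting two URL paths.
events = [
    {"path": "/", "email": f"user{i}@example.com"} for i in range(1000)
] + [
    {"path": "/login", "email": f"user{i}@example.com"} for i in range(1000)
]

by_path = count_series(events, ["path"])            # bounded by the number of routes
by_email = count_series(events, ["path", "email"])  # grows with the user base
```

Tracking by path yields 2 metrics however many users you have, whereas adding the email address yields 2,000 here and keeps growing as users sign up.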
You can use metrics to track the latency and data volumes handled by each of the subsystems in your applications, making it easier to determine what exactly is causing a slowdown. Logs could not record that many fields, but once you know which subsystem is to blame, logs can help you figure out which exact user requests are involved.
This is where the tradeoff between logs and metrics becomes most apparent. Metrics allow you to collect information about events from all over your process, but with generally no more than one or two fields of context with bounded cardinality. Logs allow you to collect information about all of one type of event, but can only track a hundred fields of context with unbounded cardinality. This notion of cardinality and the limits it places on metrics is important to understand, and I will come back to it in later chapters.
As a metrics-based monitoring system, Prometheus is designed to track overall system health, behaviour, and performance rather than individual events. Put another way, Prometheus cares that there were 15 requests in the last minute that took 4 seconds to handle, resulted in 40 database calls, 17 cache hits, and 2 purchases by customers. The cost and code paths of the individual calls would be the concern of profiling or logging.
Now that you have an understanding of where Prometheus fits in the overall monitoring space, let’s look at the various components of Prometheus.
Figure 1-1 shows the overall architecture of Prometheus. Prometheus discovers targets to scrape from service discovery. These can be your own instrumented applications or third-party applications you can scrape via an exporter. The scraped data is stored, and you can use it in dashboards using PromQL or send alerts to the Alertmanager, which will convert them into pages, emails, and other notifications.
Metrics do not typically magically spring forth from applications; someone has to add the instrumentation that produces them. This is where client libraries come in. With usually only two or three lines of code, you can both define a metric and add your desired instrumentation inline in code you control. This is referred to as direct instrumentation.
Client libraries are available for all the major languages and runtimes. The Prometheus project provides official client libraries in Go, Python, Java/JVM, and Ruby. There are also a variety of third-party client libraries, such as for C#/.Net, Node.js, Haskell, Erlang, and Rust.
Client libraries take care of all the nitty-gritty details such as thread-safety, bookkeeping, and producing the Prometheus text exposition format in response to HTTP requests. As metrics-based monitoring does not track individual events, client library memory usage does not increase the more events you have. Rather, memory is related to the number of metrics you have.
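The reason memory does not grow with the number of events can be seen in a toy counter. This is a sketch of the idea only, not the API of any real Prometheus client library; the class `ToyCounter` and its methods are invented for illustration.

```python
import threading


class ToyCounter:
    """A counter that keeps one running total per label combination."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # makes inc() safe to call from many threads
        self._values = {}              # label values tuple -> running total

    def inc(self, labels=(), amount=1.0):
        """Add to the total for this label combination; no event is stored."""
        with self._lock:
            self._values[labels] = self._values.get(labels, 0.0) + amount

    def series(self):
        """Return a snapshot of all totals, as would be exposed on a scrape."""
        with self._lock:
            return dict(self._values)
```

Ten thousand calls to `inc` spread over two label values leave exactly two entries in memory, since each event only updates a number that already exists.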
If one of the library dependencies of your application has Prometheus instrumentation, it will automatically be picked up. Thus by instrumenting a key library such as your RPC client, you can get instrumentation for it in all of your applications.
Some metrics are typically provided out of the box by client libraries such as CPU usage and garbage collection statistics, depending on the library and runtime environment.
Client libraries are not restricted to outputting metrics in the Prometheus text format. Prometheus is an open ecosystem, and the same APIs used to feed the generation of the text format can be used to produce metrics in other formats or to feed into other instrumentation systems. Similarly, it is possible to take metrics from other instrumentation systems and plumb them into a Prometheus client library, if you haven’t quite converted everything to Prometheus instrumentation yet.
Not all code you run is code that you can control or even have access to, and thus adding direct instrumentation isn’t really an option. For example, it is unlikely that operating system kernels will start outputting Prometheus-formatted metrics over HTTP anytime soon.
Such software often has some interface through which you can access metrics. This might be an ad hoc format requiring custom parsing and handling, such as is required for many Linux metrics, or a well-established standard such as SNMP.
An exporter is a piece of software that you deploy right beside the application you want to obtain metrics from. It takes in requests from Prometheus, gathers the required data from the application, transforms them into the correct format, and finally returns them in a response to Prometheus. You can think of an exporter as a small one-to-one proxy, converting data between the metrics interface of an application and the Prometheus exposition format.
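The transformation step at the heart of an exporter might look like the sketch below. The `Key: value` status format, the `myapp` prefix, and the function `to_exposition` are assumptions made up for this example; a real exporter would run something similar each time Prometheus sends it a request, after fetching the status from the application.

```python
def to_exposition(status_text, prefix="myapp"):
    """Convert 'Key: value' status lines into 'metric_name value' lines.

    Lines without a numeric value are skipped, as they are not metrics.
    """
    lines = []
    for raw in status_text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # for example a version string
        name = f"{prefix}_{key.strip().lower().replace(' ', '_')}"
        lines.append(f"{name} {number}")
    return "\n".join(lines) + "\n"


# Hypothetical status page from the application being monitored.
status = "Open connections: 42\nVersion: 1.2.3-beta\nUptime seconds: 3.5\n"
converted = to_exposition(status)
```

Note that the exporter holds no state of its own here: it converts whatever the application reports at the moment of the request.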
Unlike the direct instrumentation you would use for code you control, exporters use a different style of instrumentation known as custom collectors or ConstMetrics.5
The good news is that given the size of the Prometheus community, the exporter you need probably already exists and can be used with little effort on your part. If the exporter is missing a metric you are interested in, you can always send a pull request to improve it, making it better for the next person to use it.
Once you have all your applications instrumented and your exporters running, Prometheus needs to know where they are. This is so Prometheus will know what it is meant to monitor, and be able to notice if something it is meant to be monitoring is not responding. With dynamic environments you cannot simply provide a list of applications and exporters once, as it will get out of date. This is where service discovery comes in.
You probably already have some database of your machines, applications, and what they do. It might be inside Chef’s database, an inventory file for Ansible, based on tags on your EC2 instance, in labels and annotations in Kubernetes, or maybe just sitting in your documentation wiki.
Prometheus has integrations with many common service discovery mechanisms, such as Kubernetes, EC2, and Consul. There is also a generic integration for those whose setup is a little off the beaten path (see “File”).
This still leaves a problem though. Just because Prometheus has a list of machines and services doesn’t mean we know how they fit into your architecture. For example, you might be using the EC2 Name tag6 to indicate what application runs on a machine, whereas others might use a differently named tag.
As every organisation does it slightly differently, Prometheus allows you to configure how metadata from service discovery is mapped to monitoring targets and their labels using relabelling.
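The idea of mapping metadata to labels can be sketched in Python as below. This only models the concept: real relabelling is configured in the Prometheus configuration file rather than written as code, and the rule fields (`source`, `regex`, `target`, `replacement`), the `team_service` tag, and the `application` label are all hypothetical. The `__meta_ec2_tag_` prefix follows how I understand Prometheus to expose EC2 tags, so treat it as an assumption too.

```python
import re


def relabel(metadata, rules):
    """Build target labels from service discovery metadata.

    Each rule copies one metadata value into a target label when the regex
    matches the whole value. Empty results produce no label.
    """
    labels = {}
    for rule in rules:
        value = metadata.get(rule["source"], "")
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if match is None:
            continue
        result = match.expand(rule.get("replacement", r"\1"))
        if result:
            labels[rule["target"]] = result
    return labels


# One organisation puts the application name in the Name tag.
org_one = relabel(
    {"__meta_ec2_tag_Name": "checkout"},
    [{"source": "__meta_ec2_tag_Name", "target": "application"}],
)

# Another encodes team and service together in a tag of its own.
org_two = relabel(
    {"__meta_ec2_tag_team_service": "payments-checkout"},
    [{
        "source": "__meta_ec2_tag_team_service",
        "regex": "([a-z]+)-([a-z]+)",
        "target": "application",
        "replacement": r"\2",
    }],
)
```

Both organisations end up with the same `application` label on their targets, even though their metadata conventions differ.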
Service discovery and relabelling give us a list of targets to be monitored. Now Prometheus needs to fetch the metrics. Prometheus does this by sending a HTTP request called a scrape. The response to the scrape is parsed and ingested into storage. Several useful metrics are also added in, such as if the scrape succeeded and how long it took. Scrapes happen regularly; usually you would configure it to happen every 10 to 60 seconds for each target.
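A scrape, including the extra metrics about the scrape itself, can be sketched like this. The `fetch` argument stands in for the HTTP request so that the example runs without a network, and the parser is deliberately simplified (no labels or timestamps). The names `up` and `scrape_duration_seconds` are, as far as I know, the ones Prometheus uses, but the rest is illustrative.

```python
import time


def parse_samples(body):
    """Parse 'name value' lines; comments and blank lines are skipped.

    Simplification: labels and timestamps are not handled.
    """
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        samples[name] = float(value)
    return samples


def scrape(fetch, target):
    """Fetch and parse one target, adding metrics about the scrape itself."""
    start = time.perf_counter()
    try:
        samples = parse_samples(fetch(target))
        up = 1.0
    except Exception:
        # A failed request or unparseable response yields no application samples.
        samples, up = {}, 0.0
    samples["up"] = up
    samples["scrape_duration_seconds"] = time.perf_counter() - start
    return samples
```

Because a failed scrape still produces `up` with a value of 0, a target that stops responding shows up in the data rather than silently disappearing.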
Prometheus stores data locally in a custom database. Distributed systems are challenging to make reliable, so Prometheus does not attempt to do any form of clustering. In addition to reliability, this makes Prometheus easier to run.
Over the years, storage has gone through a number of redesigns, with the storage system in Prometheus 2.0 being the third iteration. The storage system can handle ingesting millions of samples per second, making it possible to monitor thousands of machines with a single Prometheus server. The compression algorithm used can achieve 1.3 bytes per sample on real-world data. An SSD is recommended, but not strictly required.
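To get a sense of what those numbers mean for disk usage, here is a back-of-the-envelope calculation. The ingestion rate of one million samples per second and the 15-day retention period are assumptions picked for illustration; only the 1.3 bytes per sample comes from the text.

```python
samples_per_second = 1_000_000   # assumed ingestion rate, for illustration
bytes_per_sample = 1.3           # compression figure quoted in the text
retention_days = 15              # hypothetical retention period

bytes_per_day = samples_per_second * bytes_per_sample * 86_400
gb_per_day = bytes_per_day / 1e9
gb_for_retention = gb_per_day * retention_days
```

Under these assumptions a heavily loaded server would write about 112 GB per day, or roughly 1.7 TB over 15 days, which fits on a single modern disk.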
Prometheus has a number of HTTP APIs that allow you to both request raw data and evaluate PromQL queries. These can be used to produce graphs and dashboards. Out of the box, Prometheus provides the expression browser. It uses these APIs and is suitable for ad hoc querying and data exploration, but it is not a general dashboard system.
It is recommended that you use Grafana for dashboards. It has a wide variety of features, including official support for Prometheus as a data source. It can produce a wide variety of dashboards, such as the one in Figure 1-2. Grafana supports talking to multiple Prometheus servers, even within a single dashboard panel.
Recording Rules and Alerts
Although PromQL and the storage engine are powerful and efficient, aggregating metrics from thousands of machines on the fly every time you render a graph can get a little laggy. Recording rules allow PromQL expressions to be evaluated on a regular basis and their results ingested into the storage engine.
Alerting rules are another form of recording rules. They also evaluate PromQL expressions regularly, and any results from those expressions become alerts. Alerts are sent to the Alertmanager.
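The way results of an expression become alerts can be sketched as follows. This models an expression of the form "value above a threshold" in plain Python; the function `evaluate_alert_rule`, the metric values, and the alert name `DiskAlmostFull` are hypothetical.

```python
def evaluate_alert_rule(samples, threshold, alert_name):
    """Return one alert per time series whose value exceeds the threshold.

    Each alert carries the labels of the series that produced it.
    """
    return [
        {"alertname": alert_name, **labels, "value": value}
        for labels, value in samples
        if value > threshold
    ]


# Hypothetical disk usage ratios, one series per machine.
disk_usage = [
    ({"instance": "db1", "datacenter": "eu"}, 0.95),
    ({"instance": "db2", "datacenter": "eu"}, 0.40),
    ({"instance": "web1", "datacenter": "us"}, 0.91),
]

alerts = evaluate_alert_rule(disk_usage, 0.9, "DiskAlmostFull")
```

One rule covers every machine: adding a new machine adds a new series, and the same rule picks it up without any further configuration.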
The Alertmanager does more than blindly turn alerts into notifications on a one-to-one basis. Related alerts can be aggregated into one notification, throttled to reduce pager storms,7 and different routing and notification outputs can be configured for each of your different teams. Alerts can also be silenced, perhaps to snooze an issue you are already aware of in advance when you know maintenance is scheduled.
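Grouping and silencing can be sketched in a few lines. This is a simplified model rather than the Alertmanager's actual implementation: the function `group_alerts`, the shape of a silence (a dict of labels that must all match), and the sample alerts are assumptions for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by, silences=()):
    """Drop silenced alerts, then group the rest into one notification per key."""
    groups = defaultdict(list)
    for alert in alerts:
        silenced = any(
            all(alert.get(name) == value for name, value in silence.items())
            for silence in silences
        )
        if silenced:
            continue
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts arriving from Prometheus.
incoming = [
    {"alertname": "InstanceDown", "team": "web", "instance": "a"},
    {"alertname": "InstanceDown", "team": "web", "instance": "b"},
    {"alertname": "InstanceDown", "team": "db", "instance": "c"},
    {"alertname": "HighLatency", "team": "web", "instance": "a"},
]

notifications = group_alerts(
    incoming, ["alertname", "team"], silences=[{"team": "db"}]
)
```

Four alerts become two notifications for the web team: one listing both down instances and one for latency, while the database team's alert is suppressed by the silence.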
The Alertmanager’s role stops at sending notifications; to manage human responses to incidents you should use services such as PagerDuty and ticketing systems.
Alerts and their thresholds are configured in Prometheus, not in the Alertmanager.
Since Prometheus stores data only on the local machine, you are limited by how much disk space you can fit on that machine.8 While you usually care only about the most recent day or so worth of data, for long-term capacity planning a longer retention period is desirable.
Prometheus does not offer a clustered storage solution to store data across multiple machines, but there are remote read and write APIs that allow other systems to hook in and take on this role. These allow PromQL queries to be transparently run against both local and remote data.
What Prometheus Is Not
Now that you have an idea of where Prometheus fits in the broader monitoring landscape and what its major components are, let’s look at some use cases for which Prometheus is not a particularly good choice.
As a metrics-based system, Prometheus is not suitable for storing event logs or individual events. Nor is it the best choice for high cardinality data, such as email addresses or usernames.
Prometheus is designed for operational monitoring, where small inaccuracies and race conditions due to factors like kernel scheduling and failed scrapes are a fact of life. Prometheus makes tradeoffs and prefers giving you data that is 99.9% correct over your monitoring breaking while waiting for perfect data. Thus in applications involving money or billing, Prometheus should be used with caution.
In the next chapter I will show you how to run Prometheus and do some basic monitoring.
1 Kubernetes was the first member.
2 Temperature monitoring of machines and datacenters is actually not uncommon. There are even a few users using Prometheus to track the weather for fun.
3 I have fond memories of setting up MRTG in the early 2000s, writing scripts to report temperature and network usage on my home computers.
4 Email addresses also tend to be personally identifiable information (PII), which bring with them compliance and privacy concerns that are best avoided in monitoring.
5 The term ConstMetric is colloquial, and comes from the Go client library’s MustNewConstMetric function used to produce metrics by exporters written in Go.
6 The EC2 Name tag is the display name of an EC2 instance in the EC2 web console.
7 A page is a notification to an oncall engineer which they are expected to promptly investigate or deal with. While you may receive a page via a traditional radio pager, these days it more likely comes to your mobile phone in the form of an SMS, notification, or phone call. A pager storm is when you receive a string of pages in rapid succession.
8 However, modern machines can hold rather a lot of data locally, so a separate clustered storage system may not be necessary for you.