Chapter 4. Exposition

In Chapter 3 we mainly focused on adding instrumentation to your code. But all the instrumentation in the world isn’t much use if the metrics produced don’t end up in your monitoring system. The process of making metrics available to Prometheus is known as exposition.

Exposition to Prometheus is done over HTTP. Usually you expose metrics under the /metrics path, and the request is handled for you by a client library. Prometheus supports two human-readable text formats: the Prometheus text format and OpenMetrics. You have the option of producing the exposition format by hand, in which case it will be easier with the Prometheus text format, which is less strict. You may choose to do this if there is no suitable library for your language, but it is recommended you use a library as it’ll get all the little details like escaping correct. Most of the libraries will also provide the ability to produce metrics using both the OpenMetrics and Prometheus text format.
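If you do produce the format by hand, the moving parts are small. The following is a minimal sketch that uses only the Python standard library. The metric name my_counter_total, the REQUESTS dictionary, and the render_metrics and serve helpers are all made up for illustration, and a client library remains the better choice for real applications.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter; a real application would use a client library.
REQUESTS = {"count": 0}


def render_metrics():
    """Render the counter in the Prometheus text format (version 0.0.4)."""
    return (
        "# HELP my_counter_total An example counter\n"
        "# TYPE my_counter_total counter\n"
        "my_counter_total {}\n".format(REQUESTS["count"])
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type",
                         "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() from your main function would expose the counter on port 8000. Note that this sketch does no escaping and no locking, which are exactly the kinds of details a client library handles for you.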

Exposition is typically done either in your main function or another top-level function and only needs to be configured once per application.

Metrics are usually registered with the default registry when you define them. If one of the libraries you depend on has Prometheus instrumentation, its metrics will be in the default registry and you will gain the benefit of that additional instrumentation without having to do anything. Some users prefer to explicitly pass a registry all the way down from the main function, but then you have to rely on every library between your application’s main function and the Prometheus instrumentation being aware of the instrumentation. This presumes that every library in the dependency chain cares about instrumentation and agrees on the choice of instrumentation libraries.

This design allows for instrumentation for Prometheus metrics with no exposition at all.1 In that case, aside from still paying the (tiny) resource cost of instrumentation, there is no impact on your application. If you are the one writing a library, you can add instrumentation for your users using Prometheus without requiring extra effort for your users who don’t monitor. To better support this use case, the instrumentation parts of client libraries try to minimize their dependencies.

Let’s take a look at exposition in some of the popular client libraries. We are going to presume here that you know how to install the client libraries and any other required dependencies.

Python

You have already seen start_http_server in Chapter 3. It starts up a background thread with an HTTP server that only serves Prometheus metrics, as follows:

from prometheus_client import start_http_server

if __name__ == '__main__':
    start_http_server(8000)
    # Your code goes here.

start_http_server is very convenient to get up and running quickly. But it is likely that you already have an HTTP server in your application that you would like your metrics to be served from.

In Python there are various ways this can be done depending on which frameworks you are using.

WSGI

Web Server Gateway Interface (WSGI) is a Python standard for web applications. The Python client provides a WSGI app that you can use with your existing WSGI code. In Example 4-1, my_app delegates to metrics_app if the /metrics path is requested; otherwise, it performs its usual logic. By chaining WSGI applications, you can add middleware such as authentication, which client libraries do not offer out of the box.

Example 4-1. Exposition using WSGI in Python
from prometheus_client import make_wsgi_app
from wsgiref.simple_server import make_server

metrics_app = make_wsgi_app()

def my_app(environ, start_fn):
    if environ['PATH_INFO'] == '/metrics':
        return metrics_app(environ, start_fn)
    start_fn('200 OK', [])
    return [b'Hello World']

if __name__ == '__main__':
    httpd = make_server('', 8000, my_app)
    httpd.serve_forever()

Twisted

Twisted is a Python event-driven network engine. It supports WSGI so you can plug in make_wsgi_app, as shown in Example 4-2.

Example 4-2. Exposition using Twisted in Python
from prometheus_client import make_wsgi_app
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource
from twisted.web.resource import Resource
from twisted.internet import reactor

metrics_resource = WSGIResource(
        reactor, reactor.getThreadPool(), make_wsgi_app())

class HelloWorld(Resource):
      isLeaf = False
      def render_GET(self, request):
          return b"Hello World"

root = HelloWorld()
root.putChild(b'metrics', metrics_resource)

reactor.listenTCP(8000, Site(root))
reactor.run()

Multiprocess with Gunicorn

Prometheus assumes that the applications it is monitoring are long-lived and multithreaded. But this can fall apart a little with runtimes such as CPython.2 CPython is effectively limited to one processor core due to the Global Interpreter Lock (GIL). To work around this, some users spread the workload across multiple processes using a tool such as Gunicorn.

If you were to use the Python client library in the usual fashion, each worker would track its own metrics. Each time Prometheus went to scrape the application, it would randomly get the metrics from only one of the workers, which would be only a fraction of the information and would also have issues such as counters appearing to be going backward. Workers can also be relatively short-lived.

The solution to this problem offered by the Python client is to have each worker track its own metrics. At exposition time all the metrics of all the workers are combined in a way that provides the semantics you would get from a multithreaded application. There are some limitations to the approach used: the process_ metrics and custom collectors will not be exposed, and the Pushgateway cannot be used.3
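To see why scraping a single worker is misleading, consider this small simulation. It uses plain Python dictionaries rather than the client library, the worker names and values are invented, and the scrapes alternate between workers (instead of picking one at random) so that the result is reproducible.

```python
import itertools

# Hypothetical per-worker request counters, as each worker would track them.
worker_counters = {"worker_a": 10, "worker_b": 3}


def scrape_one(worker):
    """What Prometheus would see if a scrape landed on a single worker."""
    return worker_counters[worker]


def scrape_combined():
    """What combining every worker's counter at exposition time gives."""
    return sum(worker_counters.values())


# Successive scrapes hit different workers, so the counter appears to drop.
workers = itertools.cycle(["worker_a", "worker_b"])
single_worker_view = [scrape_one(next(workers)) for _ in range(4)]

# Summing across workers reports the application-wide total every time.
combined_view = [scrape_combined() for _ in range(4)]
```

The single-worker view jumps between 10 and 3, which PromQL would interpret as repeated counter resets, while the combined view stays at the true total.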

Using Gunicorn, you need to let the client library know when a worker process exits.4 This is done in a config file like the one in Example 4-3.

Example 4-3. Gunicorn config.py to handle worker processes exiting
from prometheus_client import multiprocess

def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)

You will also need an application to serve the metrics. Gunicorn uses WSGI, so you can use make_wsgi_app. You must create a custom registry containing only a MultiProcessCollector for exposition, so that it does not include both the multiprocess metrics and metrics from the local default registry (Example 4-4).

Example 4-4. Gunicorn application in app.py
from prometheus_client import multiprocess, make_wsgi_app, CollectorRegistry
from prometheus_client import Counter, Gauge

REQUESTS = Counter("http_requests_total", "HTTP requests")
IN_PROGRESS = Gauge("http_requests_inprogress", "Inprogress HTTP requests",
        multiprocess_mode='livesum')

@IN_PROGRESS.track_inprogress()
def app(environ, start_fn):
    REQUESTS.inc()
    if environ['PATH_INFO'] == '/metrics':
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        metrics_app = make_wsgi_app(registry)
        return metrics_app(environ, start_fn)
    start_fn('200 OK', [])
    return [b'Hello World']

As you can see in Example 4-4, counters work normally, as do summaries and histograms. For gauges there is additional optional configuration using multiprocess_mode. You can configure the gauge based on how you intended to use it, as follows:

all

The default, which returns a time series from each process, whether it is alive or dead. This allows you to aggregate the series as you wish in PromQL. They will be distinguished by a pid label.

liveall

Returns a time series from each alive process.

livesum

Returns a single time series that is the sum of the value from each alive process. You would use this for things like in-progress requests or resource usage across all processes. A process might have aborted with a nonzero value, so dead processes are excluded.

max

Returns a single time series that is the maximum of the value from each alive or dead process. This is useful if you want to track the last time something happened, such as a request being processed, which could have been in a process that is now dead.

min

Returns a single time series that is the minimum of the value from each alive or dead process.
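The differences between these modes are easiest to see side by side. The following sketch models them in plain Python; it illustrates the semantics described above and is not how the client library is implemented. The SAMPLES data and the aggregate_gauge function are hypothetical.

```python
# Hypothetical per-process gauge values: (pid, value, is_alive).
SAMPLES = [(9639, 2.0, True), (9640, 5.0, True), (9000, 7.0, False)]


def aggregate_gauge(samples, mode):
    """Return {labels: value} for one gauge under the given multiprocess_mode.

    Keys are tuples of (label, value) pairs; an empty tuple means no labels.
    """
    live = [(pid, v) for pid, v, alive in samples if alive]
    everything = [(pid, v) for pid, v, _ in samples]
    if mode == "all":
        # One series per process, alive or dead, told apart by a pid label.
        return {(("pid", str(pid)),): v for pid, v in everything}
    if mode == "liveall":
        # One series per process that is still alive.
        return {(("pid", str(pid)),): v for pid, v in live}
    if mode == "livesum":
        # A single series: the sum over live processes only.
        return {(): sum(v for _, v in live)}
    if mode == "max":
        # A single series: the maximum over live and dead processes.
        return {(): max(v for _, v in everything)}
    if mode == "min":
        # A single series: the minimum over live and dead processes.
        return {(): min(v for _, v in everything)}
    raise ValueError("unknown multiprocess_mode: {}".format(mode))
```

With this data, livesum ignores the dead process 9000 and yields 7.0 from the two live workers, whereas max also yields 7.0 but only because the dead process is included.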

There is a small bit of setup before you can run Gunicorn, as shown in Example 4-5. You must set an environment variable called prometheus_multiproc_dir. This points to an empty directory the client library uses for tracking metrics. Before starting the application, you should always wipe this directory to handle any potential changes to your instrumentation.

Example 4-5. Preparing the environment before starting Gunicorn with two workers
hostname $ export prometheus_multiproc_dir=$PWD/multiproc
hostname $ rm -rf $prometheus_multiproc_dir
hostname $ mkdir -p $prometheus_multiproc_dir
hostname $ gunicorn -w 2 -c config.py app:app
[2018-01-07 19:05:30 +0000] [9634] [INFO] Starting gunicorn 19.7.1
[2018-01-07 19:05:30 +0000] [9634] [INFO] Listening at: http://127.0.0.1:8000 (9634)
[2018-01-07 19:05:30 +0000] [9634] [INFO] Using worker: sync
[2018-01-07 19:05:30 +0000] [9639] [INFO] Booting worker with pid: 9639
[2018-01-07 19:05:30 +0000] [9640] [INFO] Booting worker with pid: 9640

When you look at the /metrics path, you will see the two defined metrics, but python_info and the process_ metrics will not be there.

Tip

Each process creates several files that must be read at exposition time in prometheus_multiproc_dir. If your workers stop and start a lot, this can make exposition slow when you have thousands of files.

It is not safe to delete individual files as that could cause counters to incorrectly go backward, but you can either try to reduce the churn (for example, by increasing or removing a limit on the number of requests workers handle before exiting5), or regularly restart the application and wipe the files.

These steps are for Gunicorn. The same approach also works with other Python multiprocess setups, such as using the multiprocessing module.

Go

In Go, http.Handler is the standard interface for providing HTTP handlers, and promhttp.Handler provides that interface for the Go client library. To demonstrate how this works, place the code in Example 4-6 in a file called example.go.

Example 4-6. A simple Go program demonstrating instrumentation and exposition
package main

import (
  "log"
  "net/http"

  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promauto"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
  requests = promauto.NewCounter(
    prometheus.CounterOpts{
      Name: "hello_worlds_total",
      Help: "Hello Worlds requested.",
    })
)

func handler(w http.ResponseWriter, r *http.Request) {
  requests.Inc()
  w.Write([]byte("Hello World"))
}

func main() {
  http.HandleFunc("/", handler)
  http.Handle("/metrics", promhttp.Handler())
  log.Fatal(http.ListenAndServe(":8000", nil))
}

You can fetch dependencies and run this code in the usual way:

hostname $ go get -d -u github.com/prometheus/client_golang/prometheus
hostname $ go run example.go

This example uses promauto, which will automatically register your metric with the default registry. If you do not wish to do so, you can use prometheus.NewCounter instead and then use MustRegister in an init function:

func init() {
  prometheus.MustRegister(requests)
}

This is a bit more fragile, as it is easy for you to create and use the metric but forget the MustRegister call.

Java

The Java client library is also known as the simpleclient. It replaced the original client, which was developed before many of the current practices and guidelines around how to write a client library were established. The Java client should be used for any instrumentation for languages running on a Java Virtual Machine (JVM).

HTTPServer

Similar to start_http_server in Python, the HTTPServer class in the Java client gives you an easy way to get up and running (Example 4-7).

Example 4-7. A simple Java program demonstrating instrumentation and exposition
import io.prometheus.client.Counter;
import io.prometheus.client.hotspot.DefaultExports;
import io.prometheus.client.exporter.HTTPServer;

public class Example {
  private static final Counter myCounter = Counter.build()
      .name("my_counter_total")
      .help("An example counter.").register();

  public static void main(String[] args) throws Exception {
    DefaultExports.initialize();
    HTTPServer server = new HTTPServer(8000);
    while (true) {
      myCounter.inc();
      Thread.sleep(1000);
    }
  }
}

You should generally have Java metrics as class static fields, so that they are only registered once.

The call to DefaultExports.initialize is needed for the various process and jvm metrics to work. You should generally call it once in all of your Java applications, such as in the main function. However, DefaultExports.initialize is idempotent and thread safe, so additional calls are harmless.

In order to run the code in Example 4-7, you will need the simpleclient dependencies. If you are using Maven, Example 4-8 is what the dependencies in your pom.xml should look like.

Example 4-8. pom.xml dependencies for Example 4-7
  <dependencies>
    <dependency>
      <groupId>io.prometheus</groupId>
      <artifactId>simpleclient</artifactId>
      <version>0.16.0</version>
    </dependency>
    <dependency>
      <groupId>io.prometheus</groupId>
      <artifactId>simpleclient_hotspot</artifactId>
      <version>0.16.0</version>
    </dependency>
    <dependency>
      <groupId>io.prometheus</groupId>
      <artifactId>simpleclient_httpserver</artifactId>
      <version>0.16.0</version>
    </dependency>
  </dependencies>

Servlet

Many Java and JVM frameworks support using subclasses of HttpServlet in their HTTP servers and middleware. Jetty is one such server, and you can see how to use the Java client’s MetricsServlet in Example 4-9.

Example 4-9. A Java program demonstrating exposition using MetricsServlet and Jetty
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.MetricsServlet;
import io.prometheus.client.hotspot.DefaultExports;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.ServletException;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;
import java.io.IOException;


public class Example {
  static class ExampleServlet extends HttpServlet {
    private static final Counter requests = Counter.build()
        .name("hello_worlds_total")
        .help("Hello Worlds requested.").register();

    @Override
    protected void doGet(final HttpServletRequest req,
        final HttpServletResponse resp)
        throws ServletException, IOException {
      requests.inc();
      resp.getWriter().println("Hello World");
    }
  }

  public static void main(String[] args) throws Exception {
      DefaultExports.initialize();

      Server server = new Server(8000);
      ServletContextHandler context = new ServletContextHandler();
      context.setContextPath("/");
      server.setHandler(context);
      context.addServlet(new ServletHolder(new ExampleServlet()), "/");
      context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");

      server.start();
      server.join();
  }
}

You will also need to specify the Java client as a dependency. If you are using Maven, this will look like Example 4-10.

Example 4-10. pom.xml dependencies for Example 4-9
  <dependencies>
    <dependency>
      <groupId>io.prometheus</groupId>
      <artifactId>simpleclient</artifactId>
      <version>0.16.0</version>
    </dependency>
    <dependency>
      <groupId>io.prometheus</groupId>
      <artifactId>simpleclient_hotspot</artifactId>
      <version>0.16.0</version>
    </dependency>
    <dependency>
      <groupId>io.prometheus</groupId>
      <artifactId>simpleclient_servlet</artifactId>
      <version>0.16.0</version>
    </dependency>
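    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-servlet</artifactId>
      <!-- Jetty 10.x uses the javax.servlet namespace imported in Example 4-9;
           Jetty 11.x moved to jakarta.servlet. -->
      <version>10.0.11</version>
    </dependency>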
  </dependencies>

Pushgateway

Batch jobs are typically run on a regular schedule, such as hourly or daily. They start up, do some work, and then exit. As they are not continuously running, Prometheus can’t exactly scrape them.6 This is where the Pushgateway comes in.

The Pushgateway7 is a metrics cache for service-level batch jobs. Its architecture is shown in Figure 4-1. It remembers only the last push that you make to it for each batch job. You use it by having your batch jobs push their metrics just before they exit. Prometheus scrapes these metrics from your Pushgateway and you can then alert and graph them. Usually you run a Pushgateway beside a Prometheus.

Pushgateway architecture diagram
Figure 4-1. The Pushgateway architecture

A service-level batch job is one where there isn’t really an instance label to apply to it. That is to say it applies to all of one of your services, rather than being innately tied to one machine or process instance.8 If you don’t particularly care where a batch job runs but do care that it happens (even if it happens to currently be set up to run via cron on one machine), it is a service-level batch job. Examples include a per-datacenter batch job to check for bad machines, or one that performs garbage collection across a whole service.

Note

The Pushgateway is not a way to convert Prometheus from pull to push. If, for example, there are several pushes between one Prometheus scrape and the next, the Pushgateway will only return the last push for that batch job. This is discussed further in “Networks and Authentication”.

You can download the Pushgateway from the Prometheus download page. It is an exporter that runs by default on port 9091, and Prometheus should be set up to scrape it. However, you should also provide the honor_labels: true setting in the scrape config, as shown in Example 4-11. This is because the metrics you push to the Pushgateway should not have an instance label, and you do not want the Pushgateway’s own instance target label to end up on the metrics when Prometheus scrapes them.9 honor_labels is discussed in “Label Clashes and honor_labels”.

Example 4-11. prometheus.yml scrape config for a local Pushgateway
scrape_configs:
 - job_name: pushgateway
   honor_labels: true
   static_configs:
    - targets:
      - localhost:9091

You can use client libraries to push to the Pushgateway. Example 4-12 shows the structure you would use for a Python batch job. A custom registry is created so that only the specific metrics you choose are pushed. The duration of the batch job is always pushed,10 and the time it ended is pushed only if the job is successful.

There are three different ways you can write to the Pushgateway. In Python these are the push_to_gateway, pushadd_to_gateway, and delete_from_gateway functions:

push

Any existing metrics for this job are removed and the pushed metrics added. This uses the PUT HTTP method under the covers.

pushadd

The pushed metrics override existing metrics with the same metric names for this job. Any metrics that previously existed with different metric names remain unchanged. This uses the POST HTTP method under the covers.

delete

The metrics for this job are removed. This uses the DELETE HTTP method under the covers.
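The three operations can be modeled with a dictionary. This is a toy stand-in for the Pushgateway's storage, simplified to key groups by job name only; the function names mirror the list above but are not the client library's functions, and the values are invented.

```python
# A toy stand-in for the Pushgateway's storage: {job: {metric_name: value}}.
groups = {}


def push(job, metrics):
    """PUT semantics: replace everything stored for this job."""
    groups[job] = dict(metrics)


def pushadd(job, metrics):
    """POST semantics: replace metrics with the same name, keep the rest."""
    groups.setdefault(job, {}).update(metrics)


def delete(job):
    """DELETE semantics: remove everything stored for this job."""
    groups.pop(job, None)


push("batch", {"my_job_duration_seconds": 12.0,
               "my_job_last_success_seconds": 1.5e9})

# A later failed run pushes only its duration; the last success time survives.
pushadd("batch", {"my_job_duration_seconds": 3.0})
after_pushadd = dict(groups["batch"])

# A plain push would instead have dropped the last success time.
push("batch", {"my_job_duration_seconds": 3.0})
after_push = dict(groups["batch"])

delete("batch")
```

This is why pushadd suits a job that reports some metrics only on success: a failed run overwrites the duration without erasing the record of the last successful run.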

As Example 4-12 is using pushadd_to_gateway, the value of my_job_duration_seconds will always get replaced. However, my_job_last_success_seconds will only get replaced if there are no exceptions; it is added to the registry and then pushed.

Example 4-12. Instrumenting a batch job and pushing its metrics to a Pushgateway
from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

registry = CollectorRegistry()
duration = Gauge('my_job_duration_seconds',
        'Duration of my batch job in seconds', registry=registry)
try:
    with duration.time():
        # Your code here. 
        pass

    # This only runs if there wasn't an exception. 
    g = Gauge('my_job_last_success_seconds',
            'Last time my batch job successfully finished', registry=registry)
    g.set_to_current_time()
finally:
    pushadd_to_gateway('localhost:9091', job='batch', registry=registry)

You can see pushed data on the status page, as Figure 4-2 shows. An additional metric push_time_seconds has been added by the Pushgateway because Prometheus will always use the time at which it scrapes as the timestamp of the Pushgateway metrics. push_time_seconds gives you a way to know the actual time the data was last pushed. Another metric, push_failure_time_seconds, has been introduced, which represents the last time when an update to this group in the Pushgateway failed.

Pushgateway Status page.
Figure 4-2. The Pushgateway status page showing metrics from a push

You might have noticed in Figure 4-2 that the push is referred to as a group. You can provide labels in addition to the job label when pushing, and all of these labels are known as the grouping key. In Python this can be provided with the grouping_key keyword argument. You would use this if a batch job was sharded or split up somehow. For example, if you have 30 database shards and each had its own batch job, you might distinguish them with a shard label.
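On the wire, the job and any additional grouping labels become path segments of the URL that is pushed to, in the form /metrics/job/<job>/<label>/<value>. The sketch below builds such a URL; pushgateway_url is a hypothetical helper, and it does not handle label values that are empty or contain a slash, which need special encoding in the real API.

```python
from urllib.parse import quote


def pushgateway_url(gateway, job, grouping_key=None):
    """Build the URL for a group from the job and extra grouping labels."""
    parts = ["http://{}/metrics/job/{}".format(gateway, quote(job, safe=""))]
    # Sort the labels so the same grouping key always gives the same URL.
    for name, value in sorted((grouping_key or {}).items()):
        parts.append("{}/{}".format(quote(name, safe=""),
                                    quote(str(value), safe="")))
    return "/".join(parts)


# One group per database shard, distinguished by a shard label.
shard_urls = [pushgateway_url("localhost:9091", "batch", {"shard": n})
              for n in range(3)]
```

Each of the three URLs identifies a separate group, so a push for one shard never overwrites the metrics of another.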

Tip

Once pushed, groups stay forever in the Pushgateway. You should avoid using grouping keys that vary from one batch job run to the next, as this will make the metrics difficult to work with and cause performance issues. When decommissioning a batch job, don’t forget to delete its metrics from the Pushgateway.

Bridges

Prometheus client libraries are not limited to outputting metrics in the Prometheus format. There is a separation of concerns between instrumentation and exposition so that you can process the metrics in any way you like.

For example, the Go, Python, and Java clients each include a Graphite bridge. A bridge takes metrics output from the client library registry and outputs it to something other than Prometheus. So the Graphite bridge will convert the metrics into a form that Graphite can understand11 and write them out to Graphite, as shown in Example 4-13.

Example 4-13. Using the Python GraphiteBridge to push to Graphite every 10 seconds
import time
from prometheus_client.bridge.graphite import GraphiteBridge

gb = GraphiteBridge(['graphite.your.org', 2003])
gb.start(10)
while True:
    time.sleep(1)

This works because the registry has a method that allows you to get a snapshot of all the current metrics. This is CollectorRegistry.collect in Python, CollectorRegistry.metricFamilySamples in Java, and Registry.Gather in Go. This is the method that HTTP exposition uses, and you can use it too. For example, you could use this method to feed data into another non-Prometheus instrumentation library.12
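To make the idea of a bridge concrete, here is a sketch that turns a snapshot of samples into lines of Graphite's plaintext protocol (path, value, timestamp). The snapshot is hardcoded rather than taken from a registry, and the way labels are flattened into the dotted path is an assumption made for this example, not necessarily what the bundled Graphite bridges do.

```python
import time


def to_graphite_lines(samples, timestamp=None):
    """Convert (name, labels, value) samples to Graphite plaintext lines.

    Labels are flattened into the dotted path as name.key.value, which is
    an assumed mapping chosen for this sketch.
    """
    ts = int(time.time() if timestamp is None else timestamp)
    lines = []
    for name, labels, value in samples:
        path = [name]
        for key, val in sorted(labels.items()):
            # Dots and spaces would break the Graphite path, so replace them.
            path.extend([key, str(val).replace(".", "_").replace(" ", "_")])
        lines.append("{} {} {}\n".format(".".join(path), float(value), ts))
    return "".join(lines)


# Hypothetical snapshot, shaped like the samples a registry would hand back.
snapshot = [("hello_worlds_total", {}, 14),
            ("http_requests_total", {"method": "GET"}, 3)]
```

A real bridge would call the registry's snapshot method on a timer and write the resulting lines to a socket; the conversion step in the middle is all that is specific to the target system.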

Tip

If you ever want to hook into direct instrumentation, you should instead use the metrics output by a registry. Wanting to know every time a counter is incremented does not make sense in terms of a metrics-based monitoring system. However, the count of increments is already provided for you by CollectorRegistry.collect and works for custom collectors.

Parsers

In addition to a client library’s registry allowing you to access metric output, the Go13 and Python clients also feature a parser for the Prometheus and OpenMetrics exposition formats. Example 4-14 only prints the samples, but you could feed Prometheus metrics into other monitoring systems or into your local tooling.

Example 4-14. Parsing the Prometheus text format with the Python client
from prometheus_client.parser import text_string_to_metric_families

for family in text_string_to_metric_families(u"counter_total 1.0\n"):
  for sample in family.samples:
    print("Name: {0} Labels: {1} Value: {2}".format(*sample))

DataDog, InfluxDB, Sensu, and Metricbeat14 are some of the monitoring systems that have components that can parse the text format. Using one of these monitoring systems, you could take advantage of the Prometheus ecosystem without ever running the Prometheus server. We believe that this is a good thing, as there is currently a lot of duplication of effort between the various monitoring systems. Each of them has to write similar code to support the myriad custom metric outputs provided by the most commonly used software.

Text Exposition Format

The Prometheus text exposition format is relatively easy to produce and parse. Although you should almost always rely on a client library to handle it for you, there are cases such as with the Node Exporter textfile collector (discussed in “Textfile Collector”) where you may have to produce it yourself.

We will be showing you version 0.0.4 of the text format, which has the content type header:

Content-Type: text/plain; version=0.0.4; charset=utf-8

In the simplest cases, the text format is just the name of the metric followed by a 64-bit floating-point number. Each line is terminated with a line-feed character (\n):

my_counter_total 14
a_small_gauge 8.3e-96

Metric Types

More complete Prometheus text format output would include the HELP and TYPE of the metrics, as shown in Example 4-15. HELP is a description of what the metric is, and should not generally change from scrape to scrape. TYPE is one of counter, gauge, summary, histogram, or untyped. untyped is used when you do not know the type of the metric, and is the default if no type is specified. It is invalid for you to have a duplicate metric, so make sure all the time series that belong to a metric are grouped together.

Example 4-15. Exposition format for a gauge, counter, summary, and histogram
# HELP example_gauge An example gauge
# TYPE example_gauge gauge
example_gauge -0.7
# HELP my_counter_total An example counter
# TYPE my_counter_total counter
my_counter_total 14
# HELP my_summary An example summary
# TYPE my_summary summary
my_summary_sum 0.6
my_summary_count 19
# HELP latency_seconds An example histogram
# TYPE latency_seconds histogram
latency_seconds_bucket{le="0.1"} 7
latency_seconds_bucket{le="0.2"} 18
latency_seconds_bucket{le="0.4"} 24
latency_seconds_bucket{le="0.8"} 28
latency_seconds_bucket{le="+Inf"} 29
latency_seconds_sum 0.6
latency_seconds_count 29

For histograms (the latency_seconds_bucket lines), the le labels have floating-point values and must be sorted. You should note how the histogram buckets are cumulative, as le stands for less than or equal to.

The _count (the latency_seconds_count line) must match the +Inf bucket, and the +Inf bucket must always be present. Buckets should not change from scrape to scrape, as this will cause problems for PromQL’s histogram_quantile function.
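If you generate histograms by hand, these rules are easy to check mechanically. The following sketch is one way to do so; check_histogram is a hypothetical helper, not part of any client library, and it takes the buckets as already-parsed pairs rather than reading exposition text.

```python
import math


def check_histogram(buckets, count):
    """Return a list of problems found in one histogram's buckets.

    buckets is a list of (le, cumulative_count) pairs in exposition order,
    where le is a string such as "0.1" or "+Inf".
    """
    problems = []
    bounds = [float(le) for le, _ in buckets]  # float("+Inf") is math.inf
    values = [v for _, v in buckets]
    if bounds != sorted(bounds):
        problems.append("le values are not sorted")
    # Cumulative buckets can never decrease as le grows.
    if any(a > b for a, b in zip(values, values[1:])):
        problems.append("buckets are not cumulative")
    if not bounds or bounds[-1] != math.inf:
        problems.append("+Inf bucket is missing")
    elif values[-1] != count:
        problems.append("_count does not match the +Inf bucket")
    return problems


# The latency_seconds buckets from the preceding example.
example = [("0.1", 7), ("0.2", 18), ("0.4", 24), ("0.8", 28), ("+Inf", 29)]
```

Running check_histogram(example, 29) reports no problems, while dropping the last bucket or passing a different count flags the corresponding rule.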

Labels

The histogram in the preceding example also shows how labels are represented. Multiple labels are separated by commas, and it is OK to have a trailing comma before the closing brace.

The ordering of labels does not matter, but it is a good idea to have the ordering consistent from scrape to scrape. This will make writing your unit tests easier, and consistent ordering ensures the best ingestion performance in Prometheus.

Here is an example of a summary in text format:

# HELP my_summary An example summary
# TYPE my_summary summary
my_summary_sum{foo="bar",baz="quu"} 1.8
my_summary_count{foo="bar",baz="quu"} 453
my_summary_sum{foo="blaa",baz=""} 0
my_summary_count{foo="blaa",baz=""} 0

It is possible to have a metric with no time series, if no children have been initialized, as discussed in “Child”:

# HELP a_counter_total An example counter
# TYPE a_counter_total counter

Escaping

The text exposition format is encoded in UTF-8, and both HELP and label values may contain full UTF-8.15 Thus you need to use backslashes to escape characters that would otherwise cause issues. For HELP these are line feeds and backslashes. For label values these are line feeds, backslashes, and double quotes.16 The format ignores extra whitespace.

Here is an example demonstrating escaping in the text exposition format:

# HELP escaping A newline \n and backslash \\ escaped
# TYPE escaping gauge
escaping{foo="newline \n backslash \\ double quote \" "} 1
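The escaping rules translate into a few string replacements. This sketch shows one way to apply them when writing the format yourself; the function names are hypothetical, and a client library does the equivalent for you. The order matters: backslashes must be escaped first, otherwise the backslashes introduced by the other replacements would be escaped again.

```python
def escape_help(text):
    """Escape a HELP string: backslashes first, then line feeds."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def escape_label_value(text):
    """Escape a label value: backslashes, line feeds, and double quotes."""
    return escape_help(text).replace('"', '\\"')


def format_sample(name, labels, value):
    """Render one sample line, with labels in sorted order for stability."""
    if not labels:
        return "{} {}\n".format(name, value)
    pairs = ",".join('{}="{}"'.format(k, escape_label_value(v))
                     for k, v in sorted(labels.items()))
    return "{}{{{}}} {}\n".format(name, pairs, value)
```

Sorting the labels in format_sample also gives you the consistent ordering from scrape to scrape that was recommended earlier.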

Timestamps

It is possible to specify a timestamp on a time series. It is an integer value in milliseconds since the Unix epoch,17 and it goes after the value. Timestamps in the exposition format should generally be avoided as they are only applicable in certain limited use cases (such as federation) and come with limitations. Timestamps for scrapes are usually applied automatically by Prometheus. It is not defined as to what happens if you specify multiple lines with the same name and labels but different timestamps.

This gauge has a timestamp:

# HELP foo I'm trapped in a client library
# TYPE foo gauge
foo 1 1510099200000

Warning

Timestamps are expressed in milliseconds since epoch in the Prometheus text format, while in OpenMetrics they are expressed in seconds since epoch.
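Because the two formats differ by a factor of 1,000, it is worth being explicit about the unit whenever you convert. The helpers below are a hypothetical sketch of that conversion, not functions from a client library.

```python
def to_text_format_timestamp(epoch_seconds):
    """Prometheus text format: integer milliseconds since the Unix epoch."""
    return int(round(epoch_seconds * 1000))


def to_openmetrics_timestamp(epoch_millis):
    """OpenMetrics: seconds since the Unix epoch, fractions allowed."""
    return epoch_millis / 1000


def sample_with_timestamp(name, value, epoch_seconds):
    """Render a text format sample line carrying an explicit timestamp."""
    return "{} {} {}\n".format(name, value,
                               to_text_format_timestamp(epoch_seconds))
```

For example, 1510099200 seconds since the epoch becomes 1510099200000 in the Prometheus text format, and converting that value back gives 1510099200.0 for OpenMetrics.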

check metrics

Prometheus 2.0 uses a custom parser for efficiency. So, just because a /metrics endpoint can be scraped doesn’t mean that the metrics are compliant with the format.

Promtool is a utility included with Prometheus that among other things can verify that your metric output is valid and perform lint checks:

curl http://localhost:8000/metrics | promtool check metrics

Common mistakes include forgetting the line feed on the last line, using carriage return and line feed rather than just line feed,18 and invalid metric or label names. As a brief reminder, metric and label names cannot contain hyphens, and cannot start with a number.
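Some of these mistakes can be caught with a few lines of code before the output ever reaches promtool. The following sketch checks line endings and metric names against the classic naming rules; quick_lint is a hypothetical helper and is far less thorough than promtool check metrics, which remains the tool to rely on.

```python
import re

# Classic naming rules: no hyphens, and no leading digit.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def quick_lint(payload):
    """Return a list of problems for a few of the common mistakes."""
    problems = []
    if "\r\n" in payload:
        problems.append("uses CRLF line endings")
    if payload and not payload.endswith("\n"):
        problems.append("last line has no line feed")
    for line in payload.splitlines():
        # Skip blank lines and HELP/TYPE/comment lines.
        if not line or line.startswith("#"):
            continue
        # Everything before the first '{' or space is the metric name.
        name = re.split(r"[{ ]", line, maxsplit=1)[0]
        if not METRIC_NAME.fullmatch(name):
            problems.append("invalid metric name: {}".format(name))
    return problems
```

LABEL_NAME is provided for checking label names in the same way, for instance with LABEL_NAME.fullmatch("le"); parsing the labels out of each line is left out to keep the sketch short.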

You now have a working knowledge of the text format. You can find the full specification in the official Prometheus documentation.

OpenMetrics

The OpenMetrics format is similar to the Prometheus text exposition format but contains several changes that are incompatible with it. Even though the two formats look similar, the output they generate for a given set of metrics would generally be different.

We will be showing you version 1.0.0 of the OpenMetrics format, which has the content type header:

Content-Type: application/openmetrics-text; version=1.0.0; charset=utf-8

In the simplest cases, the text format is just the name of the metric followed by a 64-bit floating-point number. Each line is terminated with a line-feed character (\n). The file is terminated by # EOF:

my_counter_total 14
a_small_gauge 8.3e-96
# EOF
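The terminator is easy to forget when assembling output by hand. As a small sketch, the hypothetical helpers below append it to a list of sample lines and check for it in an existing payload.

```python
def to_openmetrics(sample_lines):
    """Join sample lines and terminate the payload with # EOF."""
    body = "".join(line.rstrip("\n") + "\n" for line in sample_lines)
    return body + "# EOF\n"


def is_terminated(payload):
    """Check that the last line of an OpenMetrics payload is # EOF."""
    lines = payload.splitlines()
    return bool(lines) and lines[-1] == "# EOF"


payload = to_openmetrics(["my_counter_total 14", "a_small_gauge 8.3e-96"])
```

A payload without the terminator, such as output in the Prometheus text format, fails the is_terminated check.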

Metric Types

The metric types supported by the Prometheus text exposition format are also supported in OpenMetrics. In addition to counters, gauges, summaries, and histograms, specific types have been added: StateSet, GaugeHistograms, and Info.

StateSets represent a series of related boolean values, also called a bitset. A value of 1 means true and 0 means false.

GaugeHistograms measure current distributions. Unlike with histograms, the bucket values and sum can go both up and down.

Info metrics are used to expose textual information that does not change during process lifetime. An application’s version, revision control commit, and the version of a compiler are good candidates. The value of these metrics is always 1.

In addition to HELP and TYPE, metric families in OpenMetrics have an optional UNIT metadata that specifies a metric’s unit.

All the types are demonstrated in Example 4-16.

Example 4-16. Exposition format for different types of metrics
# HELP example_gauge An example gauge
# TYPE example_gauge gauge
example_gauge -0.7
# HELP my_counter An example counter
# TYPE my_counter counter
my_counter_total 14
my_counter_created 1.640991600123e+09
# HELP my_summary An example summary
# TYPE my_summary summary
my_summary_sum 0.6
my_summary_count 19
# HELP latency_seconds An example histogram
# TYPE latency_seconds histogram
# UNIT latency_seconds seconds
latency_seconds_bucket{le="0.1"} 7
latency_seconds_bucket{le="0.2"} 18
latency_seconds_bucket{le="0.4"} 24
latency_seconds_bucket{le="0.8"} 28
latency_seconds_bucket{le="+Inf"} 29
latency_seconds_sum 0.6
latency_seconds_count 29
# TYPE my_build info
my_build_info{branch="HEAD",version="0.16.0rc1"} 1.0
# TYPE my_stateset stateset
# HELP my_stateset An example stateset
my_stateset{my_stateset="a"} 1
my_stateset{my_stateset="b"} 0
# TYPE my_gaugehistogram gaugehistogram
# HELP my_gaugehistogram An example gaugehistogram
my_gaugehistogram_bucket{le="1.0"} 0
my_gaugehistogram_bucket{le="+Inf"} 3
my_gaugehistogram_gcount 3
my_gaugehistogram_gsum 2
# EOF

In OpenMetrics, as shown in Example 4-16, GaugeHistograms use distinct _gcount and _gsum suffixes for counts and sums, differentiating them from Histograms’ _count and _sum.
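
To make the suffix difference concrete, here is a minimal sketch that turns a list of raw observations into cumulative bucket lines plus the count and sum lines. The function name render_buckets and its gauge flag are hypothetical, and the sketch leaves out HELP, TYPE, and UNIT metadata as well as any additional labels.

```python
from bisect import bisect_left
from typing import List, Sequence


def render_buckets(name: str, bounds: Sequence[float],
                   observations: Sequence[float], gauge: bool = False) -> List[str]:
    """Render cumulative bucket, count, and sum lines for one (gauge)histogram."""
    sorted_bounds = sorted(bounds)
    counts = [0] * (len(sorted_bounds) + 1)  # the last slot is the +Inf bucket
    for value in observations:
        # bisect_left finds the first bound >= value, matching "le" semantics.
        counts[bisect_left(sorted_bounds, value)] += 1

    lines, running = [], 0
    for bound, count in zip(sorted_bounds + [float("inf")], counts):
        running += count  # buckets are cumulative
        le = "+Inf" if bound == float("inf") else repr(float(bound))
        lines.append(f'{name}_bucket{{le="{le}"}} {running}')

    # GaugeHistograms use _gcount/_gsum; Histograms use _count/_sum.
    count_suffix, sum_suffix = ("_gcount", "_gsum") if gauge else ("_count", "_sum")
    lines.append(f"{name}{count_suffix} {running}")
    lines.append(f"{name}{sum_suffix} {sum(observations)}")
    return lines
```

Calling render_buckets("my_gaugehistogram", [1.0], [0.5, 1.0, 2.5], gauge=True) yields a +Inf bucket of 3, a _gcount of 3, and a _gsum of 4.0.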

Labels

The histogram and GaugeHistogram in Example 4-16 also show how labels are represented. Multiple labels are separated by commas, as in the my_build_info sample. Unlike in the Prometheus text format, a trailing comma before the closing brace is not allowed in OpenMetrics.
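
A label set can be rendered along the following lines. This is a sketch rather than a client library's implementation: the helper names escape_label_value and format_labels are hypothetical, and label names are assumed to be valid already. Joining the pairs with a comma, instead of appending a comma after each one, is what keeps a trailing comma out of the output.

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and line feed inside a label value."""
    # Backslashes must be escaped first so later escapes are not doubled.
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def format_labels(labels: Mapping[str, str]) -> str:
    """Render labels as {k="v",...} with no trailing comma; no labels gives ""."""
    if not labels:
        return ""
    pairs = [f'{key}="{escape_label_value(val)}"' for key, val in labels.items()]
    return "{" + ",".join(pairs) + "}"


print("my_build_info" + format_labels({"branch": "HEAD", "version": "0.16.0rc1"}) + " 1")
```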

Timestamps

It is possible to specify a timestamp on a time series. The timestamp is a float value in seconds since the Unix epoch,19 and it goes after the value, as shown in this example:

# HELP foo I'm trapped in a client library
# TYPE foo gauge
foo 1 1.5100992e9

Warning

Timestamps are expressed in seconds since the epoch in OpenMetrics, while in the Prometheus text format they are expressed in milliseconds since the epoch.
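
If you translate samples between the two formats by hand, the unit conversion is easy to get wrong. The following sketch shows both directions; the function names are hypothetical, and rounding to whole milliseconds is an assumption of this example.

```python
def to_openmetrics_timestamp(prometheus_ms: int) -> float:
    """Convert a Prometheus text format timestamp (milliseconds) to seconds."""
    return prometheus_ms / 1000


def to_prometheus_timestamp(openmetrics_seconds: float) -> int:
    """Convert an OpenMetrics timestamp (seconds) to integer milliseconds."""
    return round(openmetrics_seconds * 1000)


# The same instant written for each format.
print(f"foo 1 {to_openmetrics_timestamp(1510099200000)!r}")  # OpenMetrics: seconds
print(f"foo 1 {to_prometheus_timestamp(1.5100992e9)}")       # text format: milliseconds
```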

You now have a working knowledge of the OpenMetrics format. You can find the full specification in the OpenMetrics GitHub repository.

We have mentioned labels a few times now. In the following chapter you’ll learn what they are in detail.

1 No exposition means that the metrics are not scraped by a Prometheus server.

2 CPython is the official name of the standard Python implementation. Do not confuse it with Cython, which can be used to write C extensions in Python.

3 The Pushgateway is not suitable for this use case, so this is not a problem in practice.

4 child_exit was added in Gunicorn version 19.7 released in March 2017.

5 Gunicorn’s --max-requests flag is one example of such a limit.

6 Though for batch jobs that take more than a few minutes to run, it may also make sense to scrape them normally over HTTP to help debug performance issues.

7 You may see it referenced as pgw in informal contexts.

8 For batch jobs such as database backups that are tied to a machine’s lifecycle, the Node Exporter textfile collector is a better choice. This is discussed in “Textfile Collector”.

9 The Pushgateway explicitly exports empty instance labels for metrics without an instance label. Combined with honor_labels: true, this results in Prometheus not applying an instance label to these metrics. Usually, empty labels and missing labels are the same thing in Prometheus, but this is the exception.

10 Just like summaries and histograms, gauges have a time function decorator and context manager. It is intended only for use in batch jobs.

11 The labels are flattened into the metric name. Tag (i.e., label) support for Graphite was only recently added in 1.1.0.

12 This works both ways. Other instrumentation libraries with an equivalent feature can have their metrics fed into a Prometheus client library. This is discussed in “Custom Collectors”.

13 The Go client’s parser is the reference implementation.

14 Part of the Elasticsearch stack.

15 The null byte is a valid UTF-8 character.

16 Yes, there are two different sets of escaping rules within the text format. In OpenMetrics, this has been unified to just one rule, as double quotes must be escaped in HELP as well.

17 Midnight January 1st 1970 UTC.

18 \r\n is the line ending on Windows, while on Unix, \n is used. Prometheus has a Unix heritage, so it uses \n.

19 Midnight January 1st 1970 UTC.
