Chapter 4. Observability

This chapter is dedicated to observing Node.js services that run on remote machines. Locally, tools like the debugger or console.log() make this a straightforward process. However, once a service is running in a faraway land, you’ll need to reach for a different set of tools.

When debugging locally, you’re usually concerned with a single request. You might ask yourself, “When I pass this value into a request, why do I get that value in the response?” By logging the inner workings of a function, you gain insight into why a function behaved in an unanticipated way. This chapter looks at technologies useful for debugging individual requests as well. “Logging with ELK” looks at log generation, which is a way to keep track of information on a per-request basis, much like you might print with console.log(). Later, “Distributed Request Tracing with Zipkin” looks at a tool for tracking requests as they’re passed around, associating related logs generated by different services.

You often need insight into situations that wouldn’t normally be considered a hard bug when dealing with production traffic. For example, you might have to ask, “Why are HTTP requests 100ms slower for users created before April 2020?” Such timing might not be worrying with a single request, but when such metrics are considered in aggregate over many requests, you’re able to spot trends of negative performance. “Metrics with Graphite, StatsD, and Grafana” covers this in more detail.

These tools mostly display information passively in a dashboard of some sort, which an engineer can later consult to determine the source of a problem. “Alerting with Cabot” covers how to send a warning to a developer when an application’s performance dips below a certain threshold, thus allowing the engineer to prevent an outage before it happens.

So far these concepts have been reactive, where a developer must look at data captured from an application. Other times it’s necessary to be more proactive. “Health Checks” covers how an application can determine if it’s healthy and able to serve requests or if it’s unhealthy and deserves to be terminated.

Environments

Environments are a concept for differentiating running instances of an application, as well as databases, from each other. They’re important for various reasons, including choosing which instances to route traffic to, keeping metrics and logs separate (which is particularly important in this chapter), segregating services for security, and gaining confidence that a checkout of application code is going to be stable in one environment before it is deployed to production.

Environments should remain segregated from one another. If you control your own hardware, this could mean running different environments on different physical servers. If you’re deploying your application to the cloud, this more likely means setting up different VPCs (Virtual Private Clouds)—a concept supported by both AWS and GCP.

At an absolute minimum, any application will need at least a single production environment. This is the environment responsible for handling requests made by public users. However, you’re going to want a few more environments than that, especially as your application grows in complexity.

As a convention, Node.js applications generally use the NODE_ENV environment variable to specify which environment an instance is running in. This value can be set in different ways. For testing, it can be set manually, like with the following example, but for production use, whatever tool you use for deploying will abstract this process away:

$ export NODE_ENV=production
$ node server.js
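
Inside the application, the value can then be read from process.env. The following is a minimal sketch, assuming a fallback of development when the variable isn't set:

// A minimal sketch of reading NODE_ENV, assuming a "development" fallback.
const NODE_ENV = process.env.NODE_ENV || 'development';

if (NODE_ENV !== 'production') {
  // e.g., log to stdout instead of shipping logs to a collector
  console.log('not running in production mode');
}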

Philosophies for choosing what code to deploy to different environments, which branching and merging strategies to use, and even which VCS (version control system) to choose are outside the scope of this book. But, ultimately, a particular snapshot of the codebase is chosen to be deployed to a particular environment.

Choosing which environments to support is also important, and also outside the scope of this book. Usually companies will have, at a minimum, the following environments:

Development

Used for local development. Perhaps other services know to ignore messages associated with this environment. Doesn’t need some of the backing stores required by production; for example, logs might be written to stdout instead of being transmitted to a collector.

Staging

Represents an exact copy of the production environment, such as machine specs and operating system versions. Perhaps an anonymized database snapshot from production is copied to a staging database via a nightly cron job.

Production

Where the real production traffic is processed. There may be more service instances here than in staging; for example, maybe staging runs two application instances (always run more than one) but production runs eight.

The environment string must remain consistent across all applications, both those written using Node.js and those on other platforms. This consistency will prevent many headaches. If one team refers to an environment as staging and the other as preprod, querying logs for related messages then becomes an error-prone process.

The environment value shouldn’t necessarily be used for configuration—for example, having a lookup map where environment name is associated with a hostname for a database. Ideally, any dynamic configuration should be provided via environment variables. Instead, the environment value is mostly used for things related to observability. For example, log messages should have the environment attached in order to help associate any logs with the given environment, which is especially important if a logging service does get shared across environments. “Application Configuration” takes a deeper look at configuration.
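
As a brief sketch of that distinction (the DB_HOST variable name is only an illustration):

// Avoid deriving configuration from the environment name:
//   const DB_HOST = ({staging: 'db.stage.local', production: 'db.prod.local'})[NODE_ENV];
// Instead, provide configuration values directly via dedicated variables:
const DB_HOST = process.env.DB_HOST || 'localhost';
// NODE_ENV is still read, but mostly to tag telemetry such as log messages:
const NODE_ENV = process.env.NODE_ENV;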

Logging with ELK

ELK, or more specifically, the ELK stack, is a reference to Elasticsearch, Logstash, and Kibana, three open source tools built by Elastic. When combined, these powerful tools are often the platform of choice for collecting logs on-prem. Individually, each of these tools serves a different purpose:

Elasticsearch

A database with a powerful query syntax, supporting features like natural text searching. It is useful in many more situations than what are covered in this book and is worth considering if you ever need to build a search engine. It exposes an HTTP API and has a default port of :9200.

Logstash

A service for ingesting and transforming logs from multiple sources. You’ll create an interface so that it can ingest logs via User Datagram Protocol (UDP). It doesn’t have a default port, so we’ll just use :7777.

Kibana

A web service for building dashboards that visualize data stored in Elasticsearch. It exposes an HTTP web service over the port :5601.

Figure 4-1 diagrams these services and their relationships, as well as how they’re encapsulated using Docker in the upcoming examples.

Administrator uses Kibana, Kibana talks to Elasticsearch, Applications log to Logstash, and Logstash stores in Elasticsearch
Figure 4-1. The ELK stack

Your application is expected to transmit well-formed JSON logs, typically an object that’s one or two levels deep. These objects contain generic metadata about the message being logged, such as the timestamp, hostname, and IP address, as well as information specific to the message itself, such as level/severity, environment, and a human-readable message. There are multiple ways to configure ELK to receive such messages, such as writing logs to a file and using Elastic’s Filebeat tool to collect them. The approach used in this section will configure Logstash to listen for incoming UDP messages.
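
As a rough sketch, a single log payload sent over UDP might look something like the following (the exact fields mirror what the logstash.js module builds later in this section):

{
  "@timestamp": "2020-04-01T12:00:00.000Z",
  "host": "web-api-01",
  "app": "web-api",
  "environment": "production",
  "severity": "info",
  "type": "request-incoming",
  "fields": { "path": "/", "method": "GET", "ip": "127.0.0.1" }
}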

Running ELK via Docker

In order to get your hands dirty, you’re going to run a single Docker container containing all three services. (Be sure to have Docker installed—see Appendix B for more information.) These examples won’t enable disk persistence. Within a larger organization, each of these services would perform better when installed on dedicated machines, and of course, persistence is vital.

In order to configure Logstash to listen for UDP messages, a configuration file must first be created. The content for this file is available in Example 4-1 and can be placed in a new directory at misc/elk/udp.conf. Once the file is created, you’ll make it available to the Logstash service running inside of the Docker container. This is done by using the -v volume flag, which allows a local filesystem path to be mounted inside of the container’s filesystem.

Example 4-1. misc/elk/udp.conf
input {
  udp {
    id => "nodejs_udp_logs"
    port => 7777
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    document_type => "nodelog"
    manage_template => false
    index => "nodejs-%{+YYYY.MM.dd}"
  }
}
Note

For brevity’s sake, these examples use UDP for sending messages. This approach doesn’t come with the same features as other approaches, such as delivery guarantees or back pressure support, but it does come with reduced overhead for the application. Be sure to research the best tool for your use case.

Once the file has been created, you’re ready to run the container using the commands in Example 4-2. If you’re running Docker on a Linux machine, you’ll need to run the sysctl command before the container will run properly, and you may omit the -e flag if you want. If you’re running Docker on macOS, skip the sysctl command, but you will need to go into the Docker Desktop preferences and allocate at least 4 GB of memory in the Resources → Advanced tab.

Example 4-2. Running ELK within Docker
$ sudo sysctl -w vm.max_map_count=262144 # Linux Only
$ docker run -p 5601:5601 -p 9200:9200 \
  -p 5044:5044 -p 7777:7777/udp \
  -v $PWD/misc/elk/udp.conf:/etc/logstash/conf.d/99-input-udp.conf \
  -e MAX_MAP_COUNT=262144 \
  -it --name distnode-elk sebp/elk:683

This command downloads the image from Docker Hub and configures the services, and it may take a few minutes to run. Once your console calms down a bit, visit http://localhost:5601 in your browser. If you see a success message, then the service is ready to receive messages.

Transmitting Logs from Node.js

For this example, you’re going to again start by modifying an existing application. Copy the web-api/consumer-http-basic.js file created in Example 1-7 to web-api/consumer-http-logs.js as a starting point. Next, modify the file to look like the code in Example 4-3.

Example 4-3. web-api/consumer-http-logs.js
#!/usr/bin/env node

// npm install fastify@3.2 node-fetch@2.6 middie@5.1
const server = require('fastify')();
const fetch = require('node-fetch');
const HOST = process.env.HOST || '127.0.0.1';
const PORT = process.env.PORT || 3000;
const TARGET = process.env.TARGET || 'localhost:4000';
const log = require('./logstash.js'); 1

(async () => {
  await server.register(require('middie')); 2
  server.use((req, res, next) => { 3
    log('info', 'request-incoming', { 4
      path: req.url, method: req.method, ip: req.ip,
      ua: req.headers['user-agent'] || null });
    next();
  });
  server.setErrorHandler(async (error, req) => {
    log('error', 'request-failure', {stack: error.stack, 5
      path: req.url, method: req.method, });
    return { error: error.message };
  });
  server.get('/', async () => {
    const url = `http://${TARGET}/recipes/42`;
    log('info', 'request-outgoing', {url, svc: 'recipe-api'}); 6
    const req = await fetch(url);
    const producer_data = await req.json();
    return { consumer_pid: process.pid, producer_data };
  });
  server.get('/error', async () => { throw new Error('oh no'); });
  server.listen(PORT, HOST, () => {
    log('verbose', 'listen', {host: HOST, port: PORT}); 7
  });
})();
1

The new logstash.js file is now being loaded.

2

The middie package allows Fastify to use generic middleware.

3

A middleware to log incoming requests.

4

A call to the logger that passes in request data.

5

A generic middleware for logging errors.

6

Information about outbound requests is logged.

7

Information about server starts is also logged.

This file logs some key pieces of information. The first thing logged is when the server starts. The second set of information is by way of a generic middleware handler. It logs data about any incoming request, including the path, the method, the IP address, and the user agent. This is similar to the access log for a traditional web server. Finally, the application tracks outbound requests to the recipe-api service.

The contents of the logstash.js file might be more interesting. There are many libraries available on npm for transmitting logs to Logstash (@log4js-node/logstashudp is one such package). These libraries support a few methods for transmission, UDP included. Since the mechanism for sending logs is so simple, you’re going to reproduce a version from scratch. This is great for educational purposes, but a full-featured package from npm is a better choice for a production application.

Create a new file called web-api/logstash.js. Unlike the other JavaScript files you’ve created so far, this one won’t be executed directly. Add the content from Example 4-4 to this file.

Example 4-4. web-api/logstash.js
const client = require('dgram').createSocket('udp4'); 1
const host = require('os').hostname();
const [LS_HOST, LS_PORT] = process.env.LOGSTASH.split(':'); 2
const NODE_ENV = process.env.NODE_ENV;

module.exports = function(severity, type, fields) {
  const payload = JSON.stringify({ 3
    '@timestamp': (new Date()).toISOString(),
    "@version": 1, app: 'web-api', environment: NODE_ENV,
    severity, type, fields, host
  });
  console.log(payload);
  client.send(payload, LS_PORT, LS_HOST);
};
1

The built-in dgram module sends UDP messages.

2

The Logstash location is stored in LOGSTASH.

3

Several fields are sent in the log message.

This basic Logstash module exports a function that application code calls to send a log. Many of the fields are automatically generated, like @timestamp, which represents the current time. The app field is the name of the running application and doesn’t need to be overridden by the caller. Other fields, like severity and type, are fields that the application is going to change all the time. The fields field represents additional key/value pairs the app might want to provide.

The severity field (often called the log level in other logging frameworks) refers to the importance of the log. Most logging packages support the following six values, originally made popular by the npm client: error, warn, info, verbose, debug, silly. It’s a common pattern with more “complete” logging packages to set a logging threshold via environment variable. For example, by setting the minimum severity to verbose, any messages with a lower severity (namely debug and silly) will get dropped. The overly simple logstash.js module doesn’t support this.
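
If you wanted to add such a threshold to the module built in this section, a minimal sketch might look like the following (the LOG_LEVEL variable name is an assumption, not something the examples rely on):

// A minimal sketch of severity filtering; LOG_LEVEL is a hypothetical variable.
const SEVERITIES = ['error', 'warn', 'info', 'verbose', 'debug', 'silly'];
const THRESHOLD = SEVERITIES.indexOf(process.env.LOG_LEVEL || 'silly');

module.exports = function(severity, type, fields) {
  if (SEVERITIES.indexOf(severity) > THRESHOLD) return; // drop low-severity logs
  // ...construct and transmit the payload as in Example 4-4
};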

Once the payload has been constructed, it’s converted into a JSON string and printed to the console to help show what’s going on. Finally, the process attempts to transmit the message to the Logstash server (there is no way for the application to know whether the message was delivered; this is a shortcoming of UDP).

With the two files created, it’s now time to test the application. Run the commands in Example 4-5. This will start an instance of the new web-api service and an instance of the previous recipe-api service, and it will also send a series of requests to the web-api. A log is sent immediately once the web-api has started, and two additional logs are sent for each incoming HTTP request. Note that each watch command repeatedly executes the command that follows it; these commands will need to be run in separate terminal windows.

Example 4-5. Running web-api and generating logs
$ NODE_ENV=development LOGSTASH=localhost:7777 \
  node web-api/consumer-http-logs.js
$ node recipe-api/producer-http-basic.js
$ brew install watch # required for macOS
$ watch -n5 curl http://localhost:3000
$ watch -n13 curl http://localhost:3000/error

Isn’t that exciting? Well, not quite yet. Now you’ll jump into Kibana and take a look at the logs being sent. Let the watch commands continue running in the background; they’ll keep the data fresh while you’re using Kibana.

Creating a Kibana Dashboard

Now that the application is sending data to Logstash and Logstash is storing the data in Elasticsearch, it’s time to open Kibana and explore this data. Open your browser and visit http://localhost:5601. At this point you should be greeted with the Kibana dashboard.

Within the dashboard, click the last tab on the left, titled Management. Next, locate the Kibana section of options and then click the Index Patterns option. Click Create index pattern. For Step 1, type in an Index pattern of nodejs-*. You should see a small Success! message below as Kibana correlates your query to a result. Click Next step. For Step 2, click the Time Filter drop-down menu and then click the @timestamp field. Finally, click Create index pattern. You’ve now created an index pattern named nodejs-* that will allow you to query those values.

Click the second tab on the left, titled Visualize. Next, click the Create new visualization button in the center of the screen. You’ll be given several different options for creating a visualization, including the ones shown in Figure 4-2, but for now just click the Vertical Bar graph option.

Kibana Icons: Gauge, Goal, Heat Map, Horizontal Bar, Line, Maps, Markdown, Metric
Figure 4-2. Kibana visualizations

Select the nodejs-* index that you just created. Once that’s done, you’ll be taken to a new screen to fine-tune the visualization. The default graph isn’t too interesting; it’s a single bar showing a count of all logs matching the nodejs-* index. But not for long.

The goal now is to create a graph that displays the rate at which incoming requests are received by the web-api service. So, first add a few filters to narrow down the results to only contain applicable entries. Click the Add a Filter link near the upper-left corner of the screen. For the Field drop-down menu, enter the value type. For the Operator field, set it to is. For the Value field, enter the value request-incoming and then click Save. Next, click Add a Filter again and do the same thing, but this time set Field to app, then set Operator to is again, and set Value to web-api.

For the Metrics section, leave it set to Count; the matching log messages correlate one to one with real requests, so the count represents the number of requests.

For the Buckets section, it should be changed to group by time. Click the Add buckets link and select X-Axis. For the Aggregation drop-down menu, select Date Histogram. Click on the blue button with a play symbol above the Metrics section (it has a title of Apply changes), and the graph will update. The default setting of grouping by @timestamp with an automatic interval is fine.

In the upper-right corner is a drop-down menu for changing the time range of the logs being queried. Click the drop-down menu and configure it to display logs from the last hour, and then click the large Refresh button to the right of the drop-down menu. If all goes to plan, your screen should look like Figure 4-3.

A Kibana graph of timestamps over time
Figure 4-3. Requests over time in Kibana

Once your graph is complete, click the Save link at the top of the Kibana screen. Name the visualization web-api incoming requests. Next, create a similar visualization but this time, set the type field to request-outgoing and name that visualization web-api outgoing requests. Finally, create a third visualization with a type field of listen and name it web-api server starts.

Next, you’ll create a dashboard for these three visualizations. Select the third option in the sidebar titled Dashboard. Then, click Create new dashboard. A modal window will appear with your three visualizations in it. Click each visualization, and it will be added to the dashboard. Once you’ve added each visualization, dismiss the modal. Click the Save link at the top of the screen and save the dashboard as web-api overview.

Congratulations! You’ve created a dashboard containing information extracted from your application.

Running Ad-Hoc Queries

Sometimes you’ll need to run arbitrary queries against the data that’s being logged without a correlating dashboard. This is helpful in one-off debugging situations. In this section, you’ll write arbitrary queries in order to extract errors about the application.

Click the first tab in the left sidebar, the one titled Discover. This is a convenient playground for running queries without needing to commit them to a dashboard. By default, a listing of all recently received messages is displayed. Click inside of the Search field at the top of the screen. Then, type the following query into the search field and press Enter:

app:"web-api" AND (severity:"error" OR severity:"warn")

The syntax of this query is written in the Kibana Query Language (KQL). Essentially, there are three clauses. It’s asking for logs belonging to the web-api application and whose severity levels are set to either error or warn (in other words, things that are very important).

Click the arrow symbol next to one of the log entries in the list that follows. This will expand the individual log entry and allow you to view the entire JSON payload associated with the log. The ability to view arbitrary log messages like this is what makes logging so powerful. With this tool you’re now able to find all the errors being logged from the service.

By logging more data, you’ll gain the ability to drill down into the details of specific error situations. For example, you might find that errors occur when a specific endpoint within an application is being hit under certain circumstances (like a user updating a recipe via PUT /recipe in a more full-featured application). With access to the stack trace, and enough contextual information about the requests, you’re then able to re-create the conditions locally, reproduce the bug, and come up with a fix.
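
As a sketch of such a drill-down query (the exact field names depend on how your log documents were indexed), narrowing the results to errors on a single path might look like this:

app:"web-api" AND severity:"error" AND fields.path:"/error"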

Warning

This section looks at transmitting logs from within an application, an inherently asynchronous operation. Unfortunately, logs generated when a process crashes might not be sent in time. Many deployment tools can read messages from stdout and transmit them on behalf of the application, which increases the likelihood of them being delivered.

This section looked at storing logs. Certainly, these logs can be used to display numeric information in graphs, but it isn’t necessarily the most efficient system for doing so since the logs store complex objects. The next section, “Metrics with Graphite, StatsD, and Grafana”, looks at storing more interesting numeric data using a different set of tools.

Metrics with Graphite, StatsD, and Grafana

“Logging with ELK” looked at transmitting logs from a running Node.js process. Such logs are formatted as JSON and are indexable and searchable on a per-log basis. This is perfect for reading messages related to a particular running process, such as reading variables and stack traces. However, sometimes you don’t necessarily care about individual pieces of numeric data, and instead you want to know about aggregations of data, usually as these values grow and shrink over time.

This section looks at sending metrics. A metric is numeric data associated with time. This can include things like request rates, the number of 2XX versus 5XX HTTP responses, latency between the application and a backing service, memory and disk use, and even business stats like dollar revenue or cancelled payments. Visualizing such information is important to understanding application health and system load.

Much like in the logging section, a stack of tools will be used instead of a single one. However, this stack doesn’t really have a catchy acronym like ELK, and it’s fairly common to swap out different components. The stack considered here is that of Graphite, StatsD, and Grafana:

Graphite

A combination of a service (Carbon) and time series database (Whisper). It also comes with a UI (Graphite Web), though the more powerful Grafana interface is often used.

StatsD

A daemon (built with Node.js) for collecting metrics. It can listen for stats over TCP or UDP before sending aggregations to a backend such as Graphite.

Grafana

A web service that queries time series backends (like Graphite) and displays information in configurable dashboards.

Figure 4-4 shows a diagram of these services and how they’re related. The Docker boundaries represent what the upcoming examples will use.

Administrator uses Grafana, Grafana talks to Graphite, Applications send metrics to StatsD, and StatsD stores in Graphite
Figure 4-4. Graphite, StatsD, and Grafana

Much like in the logging section, these examples will transmit data using UDP. Due to the nature of metrics being rapidly produced, using UDP will help keep the application from getting overwhelmed.

Running via Docker

Example 4-6 starts two separate Docker containers. The first one, graphiteapp/graphite-statsd, contains StatsD and Graphite. Two ports from this container are exposed. The Graphite UI/API is exposed via port :8080, while the StatsD UDP metrics collector is exposed as :8125. The second, grafana/grafana, contains Grafana. A single port for the web interface, :8000, is exposed for this container.

Example 4-6. Running StatsD + Graphite, and Grafana
$ docker run \
  -p 8080:80 \
  -p 8125:8125/udp \
  -it --name distnode-graphite graphiteapp/graphite-statsd:1.1.6-1
$ docker run \
  -p 8000:3000 \
  -it --name distnode-grafana grafana/grafana:6.5.2

Once the containers are up and running, open a web browser and visit the Grafana dashboard at http://localhost:8000/. You’ll be asked to log in at this point. The default login credentials are admin / admin. Once you successfully log in, you’ll then be prompted to change the password to something else. This password will be used to administer Grafana, though it won’t be used in code.

Once the password has been set, you’ll be taken to a wizard for configuring Grafana. The next step is to configure Grafana to communicate with the Graphite image. Click the Add Data Source button and then click the Graphite option. On the Graphite configuration screen, input the values displayed in Table 4-1.

Table 4-1. Configuring Grafana to use Graphite

Name     Dist Node Graphite
URL      http://<LOCAL_IP>:8080
Version  1.1.x

Note

Due to the way these Docker containers are being run, you won’t be able to use localhost for the <LOCAL_IP> placeholder. Instead, you’ll need to use your local IP address. If you’re on Linux, try running hostname -I, and if you’re on macOS, try running ipconfig getifaddr en0. If you’re running this on a laptop and your IP address changes, you’ll need to reconfigure the data source in Grafana to use the new IP address, or else you won’t get data.

Once you’ve entered the data, click Save & Test. If you see the message “Data source is working,” then Grafana was able to talk to Graphite and you can click the Back button. If you get HTTP Error Bad Gateway, make sure the Graphite container is running and that the settings have been entered correctly.

Now that Graphite and Grafana are talking to each other, it’s time to modify one of the Node.js services to start sending metrics.

Transmitting Metrics from Node.js

The protocol used by StatsD is extremely simple, arguably even simpler than the one used by Logstash UDP. An example message that increments a metric named foo.bar.baz looks like this:

foo.bar.baz:1|c

Such interactions could very easily be rebuilt using the dgram module, like in the previous section. However, this code sample will make use of an existing package. There are a few out there, but this example uses the statsd-client package.
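
As an illustration of how simple the protocol is, a from-scratch counter increment using dgram might look like the following sketch (this is not how the statsd-client package is implemented, just a demonstration of the wire format):

// A minimal sketch of sending a StatsD counter over UDP with dgram.
const client = require('dgram').createSocket('udp4');

function increment(name) {
  const message = Buffer.from(`${name}:1|c`); // "<metric>:<value>|c" marks a counter
  client.send(message, 8125, 'localhost');
}

increment('foo.bar.baz');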

Again, start by rebuilding a version of the consumer service. Copy the web-api/consumer-http-basic.js file created in Example 1-7 to web-api/consumer-http-metrics.js as a starting point. From there, modify the file to resemble Example 4-7. Be sure to run the npm install command to get the required package as well.

Example 4-7. web-api/consumer-http-metrics.js (first half)
#!/usr/bin/env node

// npm install fastify@3.2 node-fetch@2.6 statsd-client@0.4.4 middie@5.1
const server = require('fastify')();
const fetch = require('node-fetch');
const HOST = '127.0.0.1';
const PORT = process.env.PORT || 3000;
const TARGET = process.env.TARGET || 'localhost:4000';
const SDC = require('statsd-client');
const statsd = new SDC({host: 'localhost',
  port: 8125, prefix: 'web-api'}); 1

(async () => {
  await server.register(require('middie'));
  server.use(statsd.helpers.getExpressMiddleware('inbound', { 2
    timeByUrl: true}));
  server.get('/', async () => {
    const begin = new Date();
    const req = await fetch(`http://${TARGET}/recipes/42`);
    statsd.timing('outbound.recipe-api.request-time', begin); 3
    statsd.increment('outbound.recipe-api.request-count'); 4
    const producer_data = await req.json();

    return { consumer_pid: process.pid, producer_data };
  });
  server.get('/error', async () => { throw new Error('oh no'); });
  server.listen(PORT, HOST, () => {
    console.log(`Consumer running at http://${HOST}:${PORT}/`);
  });
})();
1

Metric names are prefixed with web-api.

2

A generic middleware that automatically tracks inbound requests.

3

This tracks the perceived timing to recipe-api.

4

The number of outbound requests is also tracked.

A few things are going on with this new set of changes. First, it requires the statsd-client package and configures a connection to the StatsD service listening at localhost:8125. It also configures the package to use a prefix value of web-api. This value represents the name of the service reporting the metrics (likewise, if you made similar changes to recipe-api, you’d set its prefix accordingly). Graphite works by using a hierarchy for naming metrics, so metrics sent from this service will all have the same prefix to differentiate them from metrics sent by another service.

The code makes use of a generic middleware provided by the statsd-client package. As the method name implies, it was originally designed for Express, but Fastify mostly supports the same middleware interface, so this application is able to reuse it. The first argument is another prefix name, and inbound implies that the metrics being sent here are associated with incoming requests.

Next, two values are manually tracked. The first is the amount of time the web-api perceives the recipe-api to have taken. Note that this time should always be longer than the time recipe-api believes the response took. This is due to the overhead of sending a request over the network. This timing value is written to a metric named outbound.recipe-api.request-time. The application also tracks how many requests are sent. This value is provided as outbound.recipe-api.request-count. You could even get more granular here. For example, for a production application, the status codes that the recipe-api responds with could also be tracked, which would allow an increased rate of failures to be visible.
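
As a sketch of that more granular tracking (this isn't part of Example 4-7), the status code of each upstream response could be folded into the metric name:

// A hedged sketch: track upstream status codes using the statsd instance
// from Example 4-7; fetch is assumed to be in scope.
async function fetchRecipe(url) {
  const res = await fetch(url);
  statsd.increment(`outbound.recipe-api.status_code.${res.status}`);
  return res.json();
}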

Next, run the following commands each in a separate terminal window. This will start your newly created service, run a copy of the producer, run Autocannon to get a stream of good requests, and also trigger some bad requests:

$ NODE_DEBUG=statsd-client node web-api/consumer-http-metrics.js
$ node recipe-api/producer-http-basic.js
$ autocannon -d 300 -R 5 -c 1 http://localhost:3000
$ watch -n1 curl http://localhost:3000/error

Those commands will generate a stream of data, which gets passed to StatsD before being sent to Graphite. Now that you have some data, you’re ready to create a dashboard to view it.

Creating a Grafana Dashboard

As the owner of the web-api service, there are (at least) three different sets of metrics that should be extracted so that you can measure its health. This includes the incoming requests and, importantly, differentiating 200 from 500. It also includes the amount of time that recipe-api, an upstream service, takes to reply. The final set of required information is the rate of requests to the recipe-api service. If you determine the web-api service is slow, you might use this information to discover that the recipe-api service is slowing it down.

Switch back to your web browser with the Grafana interface. There is a large plus symbol in the sidebar; click it to be taken to the New dashboard screen. On this screen you’ll see a New Panel rectangle. Inside of it is an Add Query button. Click that button to be taken to the query editor screen.

On this new screen, you’ll see an empty graph at the top and inputs to describe the graph below. The UI lets you describe the query using two fields. The first is called Series and is where you can input the hierarchical metric name. The second field is called Functions. Both of these fields provide autocomplete for matching metric names. First, start with the Series field. Click the “select metric” text next to the Series label and then click stats_count from the drop-down menu. Then click “select metric” again and select web-api. Continue this for the values inbound, response_code, and finally * (the * is a wildcard and will match any value). At this point, the graph has been updated and should show two sets of entries.

The graph labels aren’t too friendly just yet. They’re displaying the entire hierarchy name instead of just the easy-to-read values 200 and 500. A Function can be used to fix this. Click the plus sign next to the Functions label, then click Alias, and then click aliasByNode(). This will insert the function and also automatically provide a default argument of 4. This is because the asterisk sits at index 4 of the (zero-based) hierarchical metric name in the query. The graph labels have been updated to display just 200 and 500.

In the upper-right corner of the panel with the Series and Functions fields, there’s a pencil icon with a tooltip titled Toggle text edit mode. Click that, and the graphical entry will change into a text version. This is helpful for quickly writing a query. The value you should have looks like the following:

aliasByNode(stats_counts.web-api.inbound.response_code.*, 4)

In the left column, click the gear icon labeled General. On this screen you’re able to modify generic settings about this particular graph. Click the Title field, and input a value of Incoming Status Codes. Once that’s done, click the large arrow in the upper-left corner of the screen. This will take you from the panel editor screen and back to the dashboard edit screen. At this point, your dashboard will have a single panel.

Next, click the Add panel button in the upper-right corner of the screen and then click the Add query button again. This will allow you to add a second panel to the dashboard. This next panel will track the time it takes to query the recipe-api. Create the appropriate Series and Functions entries to reproduce the following:

aliasByNode(stats.timers.web-api.outbound.*.request-time.upper_90, 4)
Note

StatsD is generating some of these metric names for you. For example, stats.timers is a StatsD prefix, web-api.outbound.recipe-api.request-time is provided by the application, and the timing-related metric names under that (such as upper_90) are again calculated by StatsD. In this case, the query is looking at TP90 timing values.

Since this graph measures time and is not a generic counter, the units should be modified as well (this information is measured in milliseconds). Click the second tab on the left, with a tooltip of Visualization. Then, scroll down to the section labeled Axes, find the group titled Left Y, and then click the Unit drop-down menu. Click Time, then click milliseconds (ms). The graph will then be updated with proper units.

Click the General tab again and set the panel’s title to Outbound Service Timing. Click the back arrow again to return to the dashboard edit screen.

Finally, click the Add panel button again and go through creating a final panel. This panel will be titled Outbound Request Count, won’t need any special units, and will use the following query:

aliasByNode(stats_counts.web-api.outbound.*.request-count, 3)

Click the back button a final time to return to the dashboard editor screen. In the upper-right corner of the screen, click the Save dashboard icon, give the dashboard a name of Web API Overview, and save the dashboard. The dashboard is now saved and will have a URL associated with it. If you were using an instance of Grafana permanently installed for your organization, this URL would be a permalink that you could provide to others and would make a great addition to your project’s README.

Feel free to drag the panels around and resize them until you get something that is aesthetically pleasing. In the upper right corner of the screen, you can also change the time range. Set it to “Last 15 minutes,” since you likely don’t have data much older than that. Once you’re done, your dashboard should look something like Figure 4-5.

Grafana dashboard showing Incoming Status Codes, Outbound Service Timing, and Outbound Request Count
Figure 4-5. Completed Grafana dashboard

Node.js Health Indicators

There is some generic health information about a running Node.js process that is also worth collecting for the dashboard. Modify your web-api/consumer-http-metrics.js file by adding the code from Example 4-8 to the end of the file. Restart the service and keep an eye on the data that is being generated. These new metrics represent values that can increase or decrease over time and are better represented as Gauges.

Example 4-8. web-api/consumer-http-metrics.js (second half)
const v8 = require('v8');
const fs = require('fs');

setInterval(() => {
  statsd.gauge('server.conn', server.server._connections); 1

  const m = process.memoryUsage(); 2
  statsd.gauge('server.memory.used', m.heapUsed);
  statsd.gauge('server.memory.total', m.heapTotal);

  const h = v8.getHeapStatistics(); 3
  statsd.gauge('server.heap.size', h.used_heap_size);
  statsd.gauge('server.heap.limit', h.heap_size_limit);

  fs.readdir('/proc/self/fd', (err, list) => {
    if (err) return;
    statsd.gauge('server.descriptors', list.length); 4
  });

  const begin = new Date();
  setTimeout(() => { statsd.timing('eventlag', begin); }, 0); 5
}, 10_000);
1

Number of connections to server

2

Process heap utilization

3

V8 heap utilization

4

Open file descriptors, ironically using a file descriptor

5

Event loop lag

This code will poll the Node.js underbelly every 10 seconds for key information about the process. As an exercise of your newfound Grafana skills, create five new panels containing this newly captured data. In the metric namespace hierarchy, the gauge metrics begin with stats.gauges, while the timer metrics begin with stats.timers.

The first set of data, provided as server.conn, is the number of active connections to the web server. Most Node.js web frameworks expose this value in some manner; check out the documentation for your framework of choice.

Information about the process memory usage is also captured. This is being recorded as two values, server.memory.used and server.memory.total. When creating a graph for these values, their unit should be set to Data/Bytes, and Grafana is smart enough to display more specific units like MB. A very similar panel could then be made based on the V8 heap size and limit.

The event loop lag metric displays how long it takes the application to call a function that was scheduled to run as early as zero milliseconds from the time setTimeout() was called. This graph should display the value in milliseconds. A healthy event loop should have a number between zero and two. Overwhelmed services might start taking tens of milliseconds.

Finally, the number of open file descriptors can indicate a leak in a Node.js application. Sometimes files will be opened but will never be closed, and this can lead to consumption of server resources and result in a process crash.

Once you’ve added the new panels, your dashboard may then resemble Figure 4-6. Save the modified dashboard so that you don’t lose your changes.

Grafana dashboard showing server memory usage, V8 memory usage, and open file descriptors
Figure 4-6. Updated Grafana dashboard

This section only covers the basics of what can be done with the StatsD, Graphite, and Grafana stack. There are many query functions that haven’t been covered, as well as other forms of visualization, ways to manually color individual time series entries (like green for 2XX, yellow for 4XX, and red for 5XX), and so on.

Distributed Request Tracing with Zipkin

“Logging with ELK” looked at storing logs from a Node.js process. Such logs contain information about the internal operations of a process. Likewise, “Metrics with Graphite, StatsD, and Grafana” looked at storing numeric metrics. These metrics are useful for looking at numeric data in aggregate about an application, such as throughput and failure rates for an endpoint. However, neither of these tools allow for associating a specific external request with all the internal requests it may then generate.

Consider, for example, a slightly more complex version of the services covered so far. Instead of just a web-api and a recipe-api service, there’s an additional user-api and a user-store service. The web-api will still call the recipe-api service as before, but now the web-api will also call the user-api service, which will in turn call the user-store service. In this scenario, if any one of the services produces a 500 error, that error will bubble up and the overall request will fail with a 500. How would you find the cause of a specific error with the tools used so far?

Well, if you know that an error occurred on Tuesday at 1:37 P.M., you might be tempted to look through logs stored in ELK between the time of 1:36 P.M. and 1:38 P.M. Goodness knows I’ve done this myself. Unfortunately, if there is a high volume of logs, this could mean sifting through thousands of individual log entries. Worse, other errors happening at the same time can “muddy the water,” making it hard to know which logs are actually associated with the erroneous request.

At a very basic level, requests made deeper within an organization can be associated with a single incoming external request by passing around a request ID. This is a unique identifier that is generated when the first request is received, which is then somehow passed between upstream services. Then, any logs associated with this request will contain some sort of request_id field, which can then be filtered using Kibana. This approach solves the associated request conundrum but loses information about the hierarchy of related requests.
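
A minimal sketch of this approach, assuming Fastify and the log() helper from Example 4-4, might look like the following (the X-Request-ID header name is a common convention, not something the book's services require):

// A minimal sketch of request ID propagation; not part of the earlier examples.
const crypto = require('crypto');

server.addHook('onRequest', async (req) => {
  // reuse an upstream ID when present, otherwise generate a new one
  req.requestId = req.headers['x-request-id'] ||
    crypto.randomBytes(8).toString('hex');
  log('info', 'request-incoming', { request_id: req.requestId, path: req.url });
});

// Later, when calling an upstream service, forward the same identifier:
// await fetch(url, { headers: { 'X-Request-ID': req.requestId } });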

Zipkin, sometimes referred to as OpenZipkin, is a tool that was created to alleviate situations just like this one. Zipkin is a service that runs and exposes an HTTP API. This API accepts JSON payloads describing request metadata, as they are both sent by clients and received by servers. Zipkin also defines a set of headers that are passed from client to server. These headers allow processes to associate outgoing requests from a client with incoming requests to a server. Timing information is also sent, which then allows Zipkin to display a graphical timeline of a request hierarchy.

How Does Zipkin Work?

In the aforementioned scenario with the four services, the interaction between services spans four requests. When this happens, seven messages will be sent to the Zipkin service. Figure 4-7 contains a visualization of the service relationships, the passed messages, and the additional headers.

Dependency graph between four services
Figure 4-7. Example requests and Zipkin data

One concept that has been repeated a few times so far in this book is that a client will perceive one latency for a request, while a server will perceive another. A client will always perceive a request as taking longer than the server does. This is due to the time it takes a message to be sent over the network, plus other things that are hard to measure, such as a web server package automatically parsing a JSON request before user code can start measuring time.

Zipkin allows you to measure the difference in opinion between client and server. This is why the four requests in the example situation, marked as solid arrows in Figure 4-7, result in seven different messages being sent to Zipkin. The first message, terminating with S1, only contains a server message. In this case, the third-party client isn’t reporting its perceived time, so there’s just the server message. For the three requests terminating in S2, S3, and S4, there is a correlating client message, namely C2, C3, and C4.

The different client and server messages can be sent from the different instances, asynchronously, and can be received in any order. The Zipkin service will then stitch them each together and visualize the request hierarchy using the Zipkin web UI. The C2 message will look something like this:

[{
  "id":       "0000000000000111",
  "traceId":  "0000000000000100",
  "parentId": "0000000000000110",
  "timestamp": 1579221096510000,
  "name": "get_recipe", "duration": 80000, "kind": "CLIENT",
  "localEndpoint": {
    "serviceName": "web-api", "ipv4": "127.0.0.1", "port": 100
  },
  "remoteEndpoint": { "ipv4": "127.0.0.2", "port": 200 },
  "tags": {
    "http.method": "GET", "http.path": "/recipe/42", "diagram": "C2"
  }
}]

These messages can be queued up by an application and occasionally flushed in batches to the Zipkin service, which is why the root JSON entry is an array. In this example, only a single message is being transmitted.

The client message and server message pairs will end up containing the same id, traceId, and parentId identifiers. The timestamp field represents the time when the client or server first perceived the request to start, and the duration is how long the service thought the request lasted. Both of these fields are measured in microseconds. The Node.js wall clock, attainable via Date.now(), only has millisecond accuracy, so it’s common to multiply that value by 1,000. The kind field is set to either CLIENT or SERVER, depending on which side of the request is being logged. The name field represents a name for the endpoint and should have a finite set of values (in other words, don’t use an identifier).

The localEndpoint field represents the service sending the message (the server with a SERVER message or the client with a CLIENT message). The service provides its own name in here, the port it’s listening on, and its own IP address. The remoteEndpoint field contains information about the other service (a SERVER message probably won’t know the client’s port, and likely won’t even know the client’s name).

The tags field contains metadata about the request. In this example, information about the HTTP request is provided as http.method and http.path. With other protocols, different metadata would be attached, such as a gRPC service and method name.

The identifiers sent in the seven different messages have been re-created in Table 4-2.

Table 4-2. Values reported from Figure 4-7

Message  id   parentId  traceId  kind
S1       110  N/A       100      SERVER
C2       111  110       100      CLIENT
S2       111  110       100      SERVER
C3       121  110       100      CLIENT
S3       121  110       100      SERVER
C4       122  121       100      CLIENT
S4       122  121       100      SERVER

Apart from the messages sent to the server, the other important part of Zipkin is the metadata that is sent from client to server. Different protocols have different standards for sending this metadata. With HTTP, the metadata is sent via headers. These headers are provided by C2, C3, and C4 and are received by S2, S3, and S4. Each of these headers has a different meaning:

X-B3-TraceId

Zipkin refers to all related requests as a trace. This value is Zipkin’s concept of a request ID. This value is passed between all related requests, unchanged.

X-B3-SpanId

A span represents a single request, as seen from both a client and a server (like C3/S3). Both the client and server will send a message using the same span ID. There can be multiple spans in a trace, forming a tree structure.

X-B3-ParentSpanId

A parent span is used for associating a child span with a parent span. This value is missing for the originating external request but is present for deeper requests.

X-B3-Sampled

This is a mechanism used for determining if a particular trace should be reported to Zipkin. For example, an organization may choose to track only 1% of requests.

X-B3-Flags

This can be used to tell downstream services that this is a debug request. Services are encouraged to then increase their logging verbosity.

Essentially, each service creates a new span ID for each outgoing request. The current span ID is then provided as the parent ID in the outbound request. This is how the hierarchy of relationships is formed.
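
A sketch of that header bookkeeping, independent of any particular Zipkin client library, might look like the following (the zipkin-lite package used shortly handles this for you):

// A minimal sketch of building outbound B3 headers from an incoming request.
const crypto = require('crypto');
const newId = () => crypto.randomBytes(8).toString('hex');

function outboundHeaders(incomingHeaders) {
  return {
    'X-B3-TraceId': incomingHeaders['x-b3-traceid'] || newId(), // shared by the trace
    'X-B3-SpanId': newId(),                                     // new span for this request
    'X-B3-ParentSpanId': incomingHeaders['x-b3-spanid'],        // current span becomes parent
    'X-B3-Sampled': incomingHeaders['x-b3-sampled'] || '1'
  };
}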

Now that you understand the intricacies of Zipkin, it’s time to run a local copy of the Zipkin service and modify the applications to interact with it.

Running Zipkin via Docker

Again, Docker provides a convenient platform for running the service. Unlike the other tools covered in this chapter, Zipkin provides an API and a UI using the same port. Zipkin uses a default port of 9411 for this.

Run this command to download and start the Zipkin service:

$ docker run -p 9411:9411 \
  -it --name distnode-zipkin \
  openzipkin/zipkin-slim:2.19

Transmitting Traces from Node.js

For this example, you’re going to again start by modifying an existing application. Copy the web-api/consumer-http-basic.js file created in Example 1-7 to web-api/consumer-http-zipkin.js as a starting point. Modify the file to look like the code in Example 4-9.

Example 4-9. web-api/consumer-http-zipkin.js
#!/usr/bin/env node

// npm install fastify@3.2 node-fetch@2.6 zipkin-lite@0.1
const server = require('fastify')();
const fetch = require('node-fetch');
const HOST = process.env.HOST || '127.0.0.1';
const PORT = process.env.PORT || 3000;
const TARGET = process.env.TARGET || 'localhost:4000';
const ZIPKIN = process.env.ZIPKIN || 'localhost:9411';
const Zipkin = require('zipkin-lite');
const zipkin = new Zipkin({ 1
  zipkinHost: ZIPKIN,
  serviceName: 'web-api', servicePort: PORT, serviceIp: HOST,
  init: 'short' 2
});
server.addHook('onRequest', zipkin.onRequest()); 3
server.addHook('onResponse', zipkin.onResponse());

server.get('/', async (req) => {
  req.zipkin.setName('get_root'); 4

  const url = `http://${TARGET}/recipes/42`;
  const zreq = req.zipkin.prepare(); 5
  const recipe = await fetch(url, { headers: zreq.headers });
  zreq.complete('GET', url);
  const producer_data = await recipe.json();

  return {pid: process.pid, producer_data, trace: req.zipkin.trace};
});

server.listen(PORT, HOST, () => {
  console.log(`Consumer running at http://${HOST}:${PORT}/`);
});
1

The zipkin-lite package is required and instantiated.

2

web-api accepts outside requests and can generate trace IDs.

3

Hooks are called when requests start and finish.

4

Each endpoint will need to specify its name.

5

Outbound requests are manually instrumented.

Note

These examples use the zipkin-lite package. This package requires manual instrumentation, which is a fancy way of saying that you, the developer, must call different hooks to interact with the package. I chose it for this project to help demonstrate the different parts of the Zipkin reporting process. For a production app, the official Zipkin package, zipkin, would make for a better choice.

The consumer service represents the first service that an external client will communicate with. Because of this, the init configuration flag has been enabled. This will allow the service to generate a new trace ID. In theory, a reverse proxy can be configured to also generate initial identifier values. The serviceName, servicePort, and serviceIp fields are each used for reporting information about the running service to Zipkin.

The onRequest and onResponse hooks allow the zipkin-lite package to interpose on requests. The onRequest handler runs first. It records the time the request starts and injects a req.zipkin property that can be used throughout the life cycle of the request. Later, the onResponse handler is called. This then calculates the overall time the request took and sends a SERVER message to the Zipkin server.

Within a request handler, two things need to happen. The first is that the name of the endpoint has to be set. This is done by calling req.zipkin.setName(). The second is that for each outbound request that is sent, the appropriate headers need to be applied and the time the request took should be calculated. This is done by first calling req.zipkin.prepare(). When this is called, another time value is recorded and a new span ID is generated. This ID and the other necessary headers are provided in the returned value, which is assigned here to the variable zreq.

These headers are then provided to the request via zreq.headers. Once the request is complete, a call to zreq.complete() is made, passing in the request method and URL. Once this happens, the overall time taken is calculated, and the CLIENT message is then sent to the Zipkin server.

Next up, the producing service should also be modified. This is important because not only should the timing as perceived by the client be reported (web-api in this case), but the timing from the server’s point of view (recipe-api) should be reported as well. Copy the recipe-api/producer-http-basic.js file created in Example 1-6 to recipe-api/producer-http-zipkin.js as a starting point. Modify the file to look like the code in Example 4-10. Most of the file can be left as is, so only the required changes are displayed.

Example 4-10. recipe-api/producer-http-zipkin.js (truncated)
const PORT = process.env.PORT || 4000;
const ZIPKIN = process.env.ZIPKIN || 'localhost:9411';
const Zipkin = require('zipkin-lite');
const zipkin = new Zipkin({
  zipkinHost: ZIPKIN,
  serviceName: 'recipe-api', servicePort: PORT, serviceIp: HOST,
});
server.addHook('onRequest', zipkin.onRequest());
server.addHook('onResponse', zipkin.onResponse());

server.get('/recipes/:id', async (req, reply) => {
  req.zipkin.setName('get_recipe');
  const id = Number(req.params.id);

Example 4-10 doesn’t act as a root service, so the init configuration flag has been omitted. If it receives a request directly, it won’t generate a trace ID, unlike the web-api service. Also, note that the same req.zipkin.prepare() method is available in this new recipe-api service, even though the example isn’t using it. When implementing Zipkin within services you own, you’ll want to pass the Zipkin headers to as many upstream services as you can.

Be sure to run the npm install zipkin-lite@0.1 command in both project directories.

Once you’ve created the two new service files, run them and then generate a request to the web-api by running the following commands:

$ node recipe-api/producer-http-zipkin.js
$ node web-api/consumer-http-zipkin.js
$ curl http://localhost:3000/

A new field, named trace, should now be present in the output of the curl command. This is the trace ID for the series of requests that have been passed between the services. The value should be 16 hexadecimal characters, and in my case, I received the value e232bb26a7941aab.

Visualizing a Request Tree

Data about the requests has been sent to your Zipkin server instance. It’s now time to open the web interface and see how that data is visualized. Open the following URL in your browser:

http://localhost:9411/zipkin/

You should now be greeted with the Zipkin web interface. It’s not too exciting just yet. The left sidebar contains two links. The first one, which looks like a magnifying glass, is to the current Discover screen. The second link, resembling network nodes, links to the Dependencies screen. At the top of the screen is a plus sign, which can be used for specifying which requests to search for. With this tool you can specify criteria like the service name or tags. But for now you can ignore those. In the upper-right corner is a simple search button, one that will display recent requests. Click the magnifying glass icon, which will perform the search.

Figure 4-8 is an example of what the interface should look like after you’ve performed a search. Assuming you ran the curl command just once, you should see only a single entry.

Screenshot of a Zipkin Discover page
Figure 4-8. Zipkin discover interface

Click the entry to be taken to the timeline view page. This page displays content in two columns. The column on the left displays a timeline of requests. The horizontal axis represents time. The units on the top of the timeline display how much time has passed since the very first SERVER trace was made with the given trace ID. The vertical rows represent the depth of the request; as each subsequent service makes another request, a new row will be added.

For your timeline, you should see two rows. The first row was generated by the web-api and has a call named get_root. The second row was generated by the recipe-api and has a call named get_recipe. A more complex version of the timeline you’re seeing, based on the previously mentioned system with an additional user-api and user-store, is displayed in Figure 4-9.

Screenshot of a Zipkin Trace timeline
Figure 4-9. Example Zipkin trace timeline

Click the second row. The right column will be updated to display additional metadata about the request. The Annotations bar displays a timeline for the span you clicked. Depending on the speed of the request, you will see between two and four dots. The furthest left and furthest right dots represent the time that the client perceived the request to take. If the request was slow enough, you should see two inner dots, and those will represent the time the server perceived the request to take. Since these services are so fast, the dots might overlap and will be hidden by the Zipkin interface.

The Tags section displays the tags associated with the request. This can be used to debug which endpoints are taking the longest time to process and which service instances (by using the IP address and port) are to blame.

Visualizing Microservice Dependencies

The Zipkin interface can also be used to show aggregate information about the requests that it receives. Click the Dependencies link in the sidebar to be taken to the dependencies screen. The screen should be mostly blank, with a selector at the top to specify a time range and perform a search. The default values should be fine, so click the magnifying glass icon to perform a search.

The screen will then be updated to display two nodes. Zipkin has searched through the different spans it found that matched the time range. Using this information, it has determined how the services are related to each other. With the two example applications, the interface isn’t all that interesting. On the left, you should see a node representing the web-api (where requests originate), and on the right, you should see a node representing the recipe-api (the deepest service in the stack). Small dots move from the left of the screen to the right, showing the relative amount of traffic between the two nodes.

If you were using Zipkin with many different services within an organization, you would see a much more complex map of the relationships between services. Figure 4-10 is an example of what the relationships between the four services in the more complex example would look like.

Pseudo screenshot of a Zipkin Dependencies view
Figure 4-10. Example Zipkin dependency view

Assuming every service within an organization uses Zipkin, such a diagram would be a very powerful tool for understanding the interconnections between services.

Health Checks

“Load Balancing and Health Checks” looked at how HAProxy can be configured to automatically remove and re-add a running service instance to the pool of candidate instances for routing requests to. HAProxy does this by making an HTTP request to an endpoint of your choosing and checking the status code. Such an endpoint is also useful for checking the liveness of a service, a term meaning that a newly deployed service has finished its startup work (such as establishing a database connection) and is ready to receive requests. Kubernetes, which is covered in Chapter 7, can also make use of such a liveness check. It is generally useful for an application to know whether or not it’s healthy.
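
In its simplest form, such an endpoint only needs to respond with a 200 status code once the process has finished starting up. A minimal sketch using Fastify (the /live route name is an arbitrary choice) might look like this:

server.get('/live', async (req, reply) => {
  return 'OK'; // Fastify replies with a 200 status code by default
});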

An application can usually be considered healthy if it is able to respond to incoming requests with correct data without ill side effects. The specifics of how to measure this will change depending on the application. If an application needs to make a connection to a database, and such a connection is lost, then the application probably won’t be able to process the requests it receives. (Note that your application should attempt to reconnect to databases; this is covered in “Database Connection Resilience”.) In such a case, it would make sense to have the application declare itself unhealthy.

On the other hand, some features are a bit of a grey area. For example, if a service is unable to establish a connection to a caching service but is still able to connect to a database and serve requests, it is probably fine to declare itself healthy. The grey area in this case is with response time. If the service is no longer able to achieve its SLA, then it might be dangerous to run because it could cost your organization money. In this situation, it might make sense to declare the service degraded.

What would happen in this situation if the degraded service were to declare itself unhealthy? The service might be restarted by some sort of deployment management tool. However, if the problem is that the caching service is down, then perhaps every single service would be restarted. This can lead to situations where no service is available to serve requests. This scenario will be covered in “Alerting with Cabot”. For now, consider slow/degraded services healthy.

Health checks are usually run periodically. Sometimes they are triggered by a request from an external service, such as HAProxy making an HTTP request (an operation that defaults to every two seconds). Sometimes they are triggered internally, such as a setInterval() call that checks the application’s health before reporting to an external discovery service like Consul that it is healthy (a check that runs perhaps every 10 seconds). In any case, the overhead of running the health check should not be so high that the process is slowed down or the database is overwhelmed.
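
As an illustration of the internally triggered variety, a periodic self-check might look something like the following sketch, where checkHealth() and reportToConsul() are hypothetical helpers and the 10-second interval is an arbitrary choice:

// Sketch of an internally triggered health check. checkHealth() and
// reportToConsul() are hypothetical helpers; keep them cheap so the check
// itself doesn't add meaningful load.
const CHECK_INTERVAL_MS = 10 * 1000;

setInterval(async () => {
  try {
    const healthy = await checkHealth(); // e.g., run a lightweight query against each backing store
    await reportToConsul(healthy);       // e.g., update a TTL check in the discovery service
  } catch (err) {
    console.error('health check failed:', err.message);
  }
}, CHECK_INTERVAL_MS);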

Building a Health Check

In this section you will build a health check for a rather boring service. This application will have both a connection to a Postgres database, resembling a persistent data store, as well as a connection to Redis, which will represent a cache.

Before you start writing code, you’ll need to run the two backing services. Run the commands in Example 4-11 to get a copy of Postgres and Redis running. You’ll need to run each command in a new terminal window. Ctrl + C can be used to kill either service.

Example 4-11. Running Postgres and Redis
$ docker run \
  --rm \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=hunter2 \
  -e POSTGRES_USER=tmp \
  -e POSTGRES_DB=tmp \
  postgres:12.3
$ docker run \
  --rm \
  -p 6379:6379 \
  redis:6.0

Next, create a new file from scratch named basic-http-healthcheck.js. Insert the content from Example 4-12 into your newly created file.

Example 4-12. basic-http-healthcheck.js
#!/usr/bin/env node

// npm install fastify@3.2 ioredis@4.17 pg@8.3
const server = require('fastify')();
const HOST = '0.0.0.0';
const PORT = 3300;
const redis = new (require('ioredis'))({enableOfflineQueue: false}); 1
const pg = new (require('pg').Client)();
pg.connect(); // Note: Postgres will not reconnect on failure

server.get('/health', async (req, reply) => {
  try {
    const res = await pg.query('SELECT $1::text as status', ['ACK']);
    if (res.rows[0].status !== 'ACK') return reply.code(500).send('DOWN');
  } catch(e) {
    return reply.code(500).send('DOWN'); 2
  }
  // ... other down checks ...
  let status = 'OK';
  try {
    if (await redis.ping() !== 'PONG') status = 'DEGRADED';
  } catch(e) {
    status = 'DEGRADED'; 3
  }
  // ... other degraded checks ...
  reply.code(200).send(status);
});

server.listen(PORT, HOST, () => console.log(`http://${HOST}:${PORT}/`));
1. Redis requests will fail when offline.
2. Completely fail if Postgres cannot be reached.
3. Pass with a degraded state if Redis cannot be reached.

This file makes use of the ioredis package for connecting to Redis and issuing queries. It also makes use of the pg package for working with Postgres. When ioredis is instantiated, it defaults to connecting to a locally running service, which is why connection details aren’t necessary. The enableOfflineQueue flag specifies whether commands should be queued up when the Node.js process can’t connect to the Redis instance. It defaults to true, meaning requests can be queued up. Since Redis is being used as a caching service, not as a primary data store, the flag should be set to false. Otherwise, a queued-up request to access the cache could end up being slower than querying the real data store.

The pg package also defaults to connecting to a Postgres instance running locally, but it will still need some connection information. That will be provided using environment variables.

This health check endpoint is configured to first check for features that are critical to run. If any of those features are lacking, then the endpoint will immediately fail. In this case, only the Postgres check applies, but a real application might have more. After that, the checks that will result in a degraded service are run. Only the Redis check applies in this situation. Both of these checks work by querying the backing store and checking for a sane response.

Note that a degraded service still returns a 200 status code. HAProxy could, for example, be configured to continue directing requests to this service. If the service is degraded, an alert could be generated (see “Alerting with Cabot”). Figuring out why the cache isn’t working isn’t something the application itself should be concerned with; the issue might be that Redis has crashed or that there is a network problem.

Now that the service file is ready, run the following command to start the service:

$ PGUSER=tmp PGPASSWORD=hunter2 PGDATABASE=tmp \
  node basic-http-healthcheck.js

The Postgres connection settings are provided as environment variables, which the underlying pg package reads automatically. For production code it’s clearer to pass the connection settings explicitly; environment variables are used here for brevity.
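
For comparison, a sketch of configuring the connection explicitly (using the same values as the Postgres container from Example 4-11) might look like this:

// Sketch: configuring pg explicitly instead of via environment variables.
// The values match the container started in Example 4-11.
const { Client } = require('pg');
const pg = new Client({
  host: 'localhost',
  port: 5432,
  user: 'tmp',
  password: 'hunter2',
  database: 'tmp',
});
pg.connect();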

Now that your service is running, it’s time to try using the health checks.

Testing the Health Check

With the process running and connecting to the databases, it should be considered in a healthy state. Issue the following request to check the status of the application:

$ curl -v http://localhost:3300/health

The response should contain the message OK and have an associated 200 status code.

Now we can simulate a degraded situation. Switch focus to the Redis service and press Ctrl + C to kill the process. You should see some error messages printed from the Node.js process. They will appear quickly at first and then less frequently, because the ioredis module uses exponential backoff when attempting to reconnect to the Redis server: it retries rapidly and then waits longer between attempts.
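
If the default backoff doesn’t suit your needs, ioredis exposes a retryStrategy option that controls the delay between attempts; the following sketch caps it at five seconds (the numbers are arbitrary):

// Sketch: tuning ioredis reconnection backoff. retryStrategy returns the
// number of milliseconds to wait before the next connection attempt.
const Redis = require('ioredis');
const redis = new Redis({
  enableOfflineQueue: false,
  retryStrategy(times) {
    return Math.min(times * 100, 5000); // grow the delay, capped at 5 seconds
  },
});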

Now that the application is no longer connected to Redis, run the same curl command again. This time, the response body should contain the message DEGRADED, though it will still have a 200 status code.

Switch back to the terminal window you previously ran Redis with. Start the Redis service again, switch back to the terminal where you ran curl, and run the request again. Depending on your timing, you might still receive the DEGRADED message, but you will eventually get the OK message once ioredis is able to reestablish a connection.

Note that killing Postgres in this manner will cause the application to crash. The pg library doesn’t provide the same automatic reconnection feature that ioredis provides. Additional reconnection logic will need to be added to the application to get that working. “Database Connection Resilience” contains an example of this.
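
As a rough idea of what that reconnection logic involves, the following sketch creates a fresh client whenever the connection fails or ends. Treat it as illustrative only; “Database Connection Resilience” covers a more complete approach.

// Illustrative sketch of pg reconnection logic: discard the old client and
// create a new one whenever the connection fails or ends.
const { Client } = require('pg');
let pg;

function connect() {
  pg = new Client();
  pg.on('error', (err) => console.error('postgres error:', err.message)); // prevent an unhandled 'error' crash
  pg.connect((err) => {
    if (err) return setTimeout(connect, 1000);       // initial connection failed; retry shortly
    pg.once('end', () => setTimeout(connect, 1000)); // connection lost; create a fresh client
  });
}
connect();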

Alerting with Cabot

There are certain issues that simply cannot be resolved by automatically killing and restarting a process. Issues related to stateful services, like the downed Redis service mentioned in the previous section, are an example. Elevated 5XX error rates are another common example. In these situations it’s often necessary to alert a developer to find the root cause of an issue and correct it. If such errors can cause a loss of revenue, then it becomes necessary to wake developers up in the middle of the night.

In these situations a cellphone is usually the best medium for waking a developer, often by triggering an actual phone call. Other message formats, such as emails, chat room messages, and text messages, usually aren’t accompanied by an annoying ringing sound and often won’t suffice for alerting the developer.

In this section, you’ll set up an instance of Cabot, which is an open source tool for polling the health of an application and triggering alerts. Cabot supports multiple forms of health checks, such as querying Graphite and comparing reported values to a threshold, as well as pinging a host. Cabot also supports making an HTTP request, which is what is covered in this section.

In this section, you’ll also create a free Twilio trial account. Cabot can use this account to both send SMS messages and make phone calls. You can skip this part if you would prefer not to create a Twilio account. In that case, you’ll just see a dashboard changing colors from a happy green to an angry red.

The examples in this section will have you create a single user in Cabot, and that user will receive all the alerts. In practice, an organization will set up schedules, usually referred to as the on-call rotation, and the person who receives an alert depends on the schedule. For example, Alice might be on call during week one, Bob during week two, Carol during week three, and then back to Alice for week four.

Note

Another important feature in a real organization is something called a runbook. A runbook is usually a page in a wiki and is associated with a given alert. The runbook contains information on how to diagnose and fix an issue. That way, when an engineer gets a notification at 2 A.M. about the Database Latency alert, they can read about how to access the database and run a query. You won’t create a runbook for this example, but you must be diligent in doing so for real-world alerts.

Create a Twilio Trial Account

At this point, head over to https://twilio.com and create a trial account. When you create an account, you will get two pieces of data that you will need for configuring Cabot. The first piece of information is called an Account SID. This is a string that starts with AC and contains a bunch of hexadecimal characters. The second piece of information is the Auth Token. This value just looks like normal hexadecimal characters.

Using the interface, you’ll also need to configure a Trial Number. This is a virtual phone number that you can use with this project. The phone number begins with a plus sign followed by a country code and the rest of the number. You’ll need to use this number within your project, including the plus sign and country code. The number you receive might look like +15551234567.

Finally, you’ll need to configure your personal cellphone number as a Verified Number/Verified Caller ID in Twilio. This confirms that the phone number belongs to you and that you’re not just using Twilio to send spam texts to strangers, a restriction imposed by the Twilio trial account. After you verify your phone number, you’ll be able to configure Cabot to send an SMS message to it.

Running Cabot via Docker

Cabot is a little more complex than the other services covered in this chapter: it’s composed of several Docker images rather than a single one, so you’ll use Docker Compose to launch the required containers instead of starting a single container with Docker. Run the following commands to pull the git repository and check out a commit that is known to be compatible with this example:

$ git clone git@github.com:cabotapp/docker-cabot.git cabot
$ cd cabot
$ git checkout 1f846b96

Next, create a new file located at conf/production.env within this repository. Note that it’s not within the distributed-node directory that you’ve been creating all your other project files in. Add the content from Example 4-13 to this file.

Example 4-13. conf/production.env
TIME_ZONE=America/Los_Angeles 1
ADMIN_EMAIL=admin@example.org
CABOT_FROM_EMAIL=cabot@example.org
DJANGO_SECRET_KEY=abcd1234
WWW_HTTP_HOST=localhost:5000
WWW_SCHEME=http

# GRAPHITE_API=http://<YOUR-IP-ADDRESS>:8080/ 2

TWILIO_ACCOUNT_SID=<YOUR_TWILIO_ACCOUNT_SID> 3
TWILIO_AUTH_TOKEN=<YOUR_TWILIO_AUTH_TOKEN>
TWILIO_OUTGOING_NUMBER=<YOUR_TWILIO_NUMBER>
1. Set this value to your TZ Time Zone.
2. For extra credit, configure a Graphite source using your IP address.
3. Omit these lines if you’re not using Twilio. Be sure to prefix the phone number with a plus sign and country code.

Tip

If you’re feeling adventurous, configure the GRAPHITE_API line to use the same Graphite instance that you created in “Metrics with Graphite, StatsD, and Grafana”. Later, when using the Cabot interface, you can choose which metrics to create an alert on. This is useful for taking a metric, like request timing, and alerting once it surpasses a certain threshold, such as 200ms. However, for brevity, this section won’t cover how to set it up, and you can omit the line.

Once you’ve finished configuring Cabot, run the following command to start the Cabot service:

$ docker-compose up

This will cause several Docker containers to start running. In the terminal, you should see progress as each image is downloaded, followed by colored output associated with each container once it’s running. Once things have settled down, you’re ready to move on to the next step.

Creating a Health Check

For this example, use the same basic-http-healthcheck.js file from Example 4-12 that you made in the previous section. Execute that file and run the Postgres service as configured in Example 4-11. Once that is done, Cabot can be configured to make use of the /health endpoint the Node.js service exposes.

With the Node.js service now running, open the Cabot web service using your web browser by visiting http://localhost:5000.

You’ll first be prompted to create an administrative account. Use the default username admin. Next, put in your email address and a password and click Create. Then, you’ll be prompted to log in. Type admin for the username field, enter your password again, then click Log in. You’ll finally be taken to the services screen that will contain no entries.

On the empty services screen, click the large plus symbol to be taken to the New service screen. Then, input the information from Table 4-3 into the create service form.

Table 4-3. Fields for creating a service in Cabot

Name: Dist Node Service
Url: http://<LOCAL_IP>:3300/
Users to notify: admin
Alerts: Twilio SMS
Alerts enabled: checked

Again, you’ll need to replace <LOCAL_IP> with your IP address. Once you’ve entered the information, click the Submit button. This will take you to a screen where you can view the Dist Node Service overview.

On this screen, scroll down to the Http checks section and click the plus sign to be taken to the New check screen. On this screen, input the information from Table 4-4 into the “create check” form.

Table 4-4. Fields for creating an HTTP check in Cabot

Name: Dist Node HTTP Health
Endpoint: http://<LOCAL_IP>:3300/health
Status code: 200
Importance: Critical
Active: checked
Service set: Dist Node Service

Once you’ve entered that information, click the Submit button. This will take you back to the Dist Node Service overview screen.

Next, the admin account needs to be configured to receive alerts using Twilio SMS. In the upper-right corner of the screen, click the admin drop-down menu, then click Profile settings. On the left sidebar, click the Twilio Plugin link. This form will ask you for your phone number. Enter your phone number, beginning with a plus symbol and the country code. This number should match the verified number that you previously entered in your Twilio account. Once you’re done, click the Submit button.

Once you’re done setting your phone number, click the Checks link in the top navigation bar. This will take you to the Checks listing page, which should contain the one entry you’ve created. Click the single entry, Dist Node HTTP Health, to be taken to the health check history listing. At this point, you should only see one or two entries since they run once every five minutes. These entries should have a green “succeeded” label next to them. Click the circular arrow icon in the upper right to trigger another health check.

Now switch back to the terminal window where your Node.js service is running. Kill it with Ctrl + C. Then, switch back to Cabot and click the icon to run the test again. This time the test will fail, and you’ll get a new entry in the list with a red background and the word “failed.”

You should also get a text message containing information about the alert. The message I received is shown here:

Sent from your Twilio trial account - Service
Dist Node Service reporting CRITICAL status:
http://localhost:5000/service/1/

If Cabot were properly installed on a real server somewhere with a real hostname, the text message would contain a working link that could then be opened on your phone. However, since Cabot is probably running on your laptop, the URL doesn’t make a lot of sense in this context.

Click the Services link at the top of the screen, then click the Dist Node Service link again. On this screen, you’ll now see a graph displaying the status of the service, as well as a banner stating that the service is critical, like in Figure 4-11. Now click the Acknowledge alert button to pause the alerts for 20 minutes. This is useful for giving you time to work on the issue without being alerted over and over. It’s now time to fix the failing service.

A screenshot of Cabot showing the failing Dist Node Service, with a warning banner and an acknowledge button
Figure 4-11. Cabot service status screenshot

Switch back to the terminal where you ran the Node.js process and start it again. Then, switch back to the browser. Navigate back to the HTTP check you created. Click the icon to trigger the check again. This time the check should succeed, and it will switch back to a green “succeeded” message.

Cabot, as well as other alerting tools, offers the ability to assign different users to different services. This is important since different teams within an organization will own different services. When you created the HTTP check, it was also possible to provide a regex to be applied against the response body. This can be used to differentiate a degraded service from an unhealthy one. Cabot can then be configured to have an unhealthy service alert an engineer while a degraded service is merely highlighted in the UI.

At this point you’re done with the Cabot Docker containers. Switch to the window where you were running Cabot and press Ctrl + C to kill it. Then run the following command to remove the containers from your system:

$ docker rm cabot_postgres_1 cabot_rabbitmq_1 \
  cabot_worker_1 cabot_beat_1 cabot_web_1

1 Note that process.hrtime() is only useful for getting relative time and can’t be used to get the current time with microsecond accuracy.

2 This example doesn’t persist data to disk and isn’t appropriate for production use.
