Chapter 4. Observability
This chapter is dedicated to observing Node.js services that run on remote machines. Locally, tools like the debugger or console.log() make this a straightforward process. However, once a service is running in a faraway land, you'll need to reach for a different set of tools.

When debugging locally, you're usually concerned with a single request. You might ask yourself, "When I pass this value into a request, why do I get that value in the response?" By logging the inner workings of a function, you gain insight into why a function behaved in an unanticipated way. This chapter looks at technologies useful for debugging individual requests as well. "Logging with ELK" looks at log generation, which is a way to keep track of information on a per-request basis, much like you might print with console.log(). Later, "Distributed Request Tracing with Zipkin" looks at a tool for tracking requests as they're passed around, associating related logs generated by different services.
You often need insight into situations that wouldn't normally be considered a hard bug when dealing with production traffic. For example, you might have to ask, "Why are HTTP requests 100ms slower for users created before April 2020?" Such timing might not be worrying with a single request, but when such metrics are considered in aggregate over many requests, you're able to spot trends of negative performance. "Metrics with Graphite, StatsD, and Grafana" covers this in more detail.
These tools mostly display information passively in a dashboard of some sort, which an engineer can later consult to determine the source of a problem. "Alerting with Cabot" covers how to send a warning to a developer when an application's performance dips below a certain threshold, thus allowing the engineer to prevent an outage before it happens.
So far these concepts have been reactive, where a developer must look at data captured from an application. Other times it's necessary to be more proactive. "Health Checks" covers how an application can determine if it's healthy and able to serve requests or if it's unhealthy and deserves to be terminated.
Environments
Environments are a concept for differentiating running instances of an application, as well as databases, from each other. They're important for various reasons, including choosing which instances to route traffic to, keeping metrics and logs separate (which is particularly important in this chapter), segregating services for security, and gaining confidence that a checkout of application code is going to be stable in one environment before it is deployed to production.
Environments should remain segregated from one another. If you control your own hardware, this could mean running different environments on different physical servers. If you're deploying your application to the cloud, this more likely means setting up different VPCs (Virtual Private Clouds), a concept supported by both AWS and GCP.
At an absolute minimum, any application will need at least a single production environment. This is the environment responsible for handling requests made by public users. However, you're going to want a few more environments than that, especially as your application grows in complexity.
As a convention, Node.js applications generally use the NODE_ENV environment variable to specify which environment an instance is running in. This value can be set in different ways. For testing, it can be set manually, like in the following example; for production use, whatever tool you use for deploying will abstract this process away:
$ export NODE_ENV=production
$ node server.js
Philosophies for choosing what code to deploy to different environments, which branching and merging strategies to use, and even which VCS (version control system) to choose are outside the scope of this book. But, ultimately, a particular snapshot of the codebase is chosen to be deployed to a particular environment.
Choosing which environments to support is also important, and also outside the scope of this book. Usually companies will have, at a minimum, the following environments:
- Development: Used for local development. Perhaps other services know to ignore messages associated with this environment. It doesn't need some of the backing stores required by production; for example, logs might be written to stdout instead of being transmitted to a collector.
- Staging: Represents an exact copy of the production environment, such as machine specs and operating system versions. Perhaps an anonymized database snapshot from production is copied to a staging database via a nightly cron job.
- Production: Where the real production traffic is processed. There may be more service instances here than in staging; for example, maybe staging runs two application instances (always run more than one) but production runs eight.
The environment string must remain consistent across all applications, both those written using Node.js and those on other platforms. This consistency will prevent many headaches. If one team refers to an environment as staging and the other as preprod, querying logs for related messages then becomes an error-prone process.
The environment value shouldn't necessarily be used for configuration, such as having a lookup map where an environment name is associated with a database hostname. Ideally, any dynamic configuration should be provided via environment variables. Instead, the environment value is mostly used for things related to observability. For example, log messages should have the environment attached in order to help associate any logs with the given environment, which is especially important if a logging service does get shared across environments. "Application Configuration" takes a deeper look at configuration.
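As a quick illustration of the difference, here is a minimal sketch; the DB_HOST variable name and the hostnames are placeholders for this example, not values used elsewhere in the chapter:

// Avoid: deriving configuration from the environment name
const DB_HOSTS = { development: 'localhost', production: 'db.internal' };
const dbHostFromLookup = DB_HOSTS[process.env.NODE_ENV];

// Prefer: dedicated environment variables for dynamic configuration,
// leaving NODE_ENV for observability concerns like tagging logs
const dbHost = process.env.DB_HOST || 'localhost';
const environment = process.env.NODE_ENV;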
Logging with ELK
ELK, or more specifically, the ELK stack, is a reference to Elasticsearch, Logstash, and Kibana, three open source tools built by Elastic. When combined, these powerful tools are often the platform of choice for collecting logs on-prem. Individually, each of these tools serves a different purpose:
- Elasticsearch: A database with a powerful query syntax, supporting features like natural text searching. It is useful in many more situations than what are covered in this book and is worth considering if you ever need to build a search engine. It exposes an HTTP API and has a default port of :9200.
- Logstash: A service for ingesting and transforming logs from multiple sources. You'll create an interface so that it can ingest logs via User Datagram Protocol (UDP). It doesn't have a default port, so we'll just use :7777.
- Kibana: A web service for building dashboards that visualize data stored in Elasticsearch. It exposes an HTTP web service over port :5601.
Figure 4-1 diagrams these services and their relationships, as well as how they're encapsulated using Docker in the upcoming examples.
Your application is expected to transmit well-formed JSON logs, typically an object that's one or two levels deep. These objects contain generic metadata about the message being logged, such as the timestamp, host, and IP address, as well as information specific to the message itself, such as the level/severity, environment, and a human-readable message. There are multiple ways to configure ELK to receive such messages, such as writing logs to a file and using Elastic's Filebeat tool to collect them. The approach used in this section will configure Logstash to listen for incoming UDP messages.
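To make that shape concrete, a single transmitted log might look roughly like the following; the field names mirror the logstash.js module built later in this section, and the values are purely illustrative:

{
  "@timestamp": "2020-04-01T12:00:00.000Z",
  "app": "web-api",
  "host": "web-api-machine",
  "environment": "production",
  "severity": "info",
  "type": "request-incoming",
  "fields": { "path": "/", "method": "GET", "ip": "127.0.0.1" }
}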
Running ELK via Docker
In order to get your hands dirty, you're going to run a single Docker container containing all three services. (Be sure to have Docker installed; see Appendix B for more information.) These examples won't enable disk persistence. Within a larger organization, each of these services would perform better when installed on dedicated machines, and of course, persistence is vital.
In order to configure Logstash to listen for UDP messages, a configuration file must first be created. The content for this file is available in Example 4-1 and can be placed in a new directory at misc/elk/udp.conf. Once the file is created, you'll make it available to the Logstash service running inside of the Docker container. This is done by using the -v volume flag, which allows a local filesystem path to be mounted inside of the container's filesystem.
Example 4-1. misc/elk/udp.conf
input {
  udp {
    id => "nodejs_udp_logs"
    port => 7777
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    document_type => "nodelog"
    manage_template => false
    index => "nodejs-%{+YYYY.MM.dd}"
  }
}
Note
For brevity's sake, these examples use UDP for sending messages. This approach doesn't come with the same features as others, such as delivery guarantees or back pressure support, but it does come with reduced overhead for the application. Be sure to research the best tool for your use-case.
Once the file has been created you're ready to run the container using the commands in Example 4-2. If you're running Docker on a Linux machine, you'll need to run the sysctl command before the container will properly run, and you may omit the -e flag if you want. If you're running Docker on macOS, you can skip the sysctl command, but you will need to go into the Docker Desktop preferences and allocate at least 4 GB of memory in the Resources → Advanced tab.
Example 4-2. Running ELK within Docker
$ sudo sysctl -w vm.max_map_count=262144 # Linux Only
$ docker run -p 5601:5601 -p 9200:9200 \
  -p 5044:5044 -p 7777:7777/udp \
  -v $PWD/misc/elk/udp.conf:/etc/logstash/conf.d/99-input-udp.conf \
  -e MAX_MAP_COUNT=262144 \
  -it --name distnode-elk sebp/elk:683
This command downloads files from Docker Hub and configures the service, and it may take a few minutes to run. Once your console calms down a bit, visit http://localhost:5601 in your browser. If you see a successful message, then the service is now ready to receive messages.
Transmitting Logs from Node.js
For this example, you're going to again start by modifying an existing application. Copy the web-api/consumer-http-basic.js file created in Example 1-7 to web-api/consumer-http-logs.js as a starting point. Next, modify the file to look like the code in Example 4-3.
Example 4-3. web-api/consumer-http-logs.js
#!/usr/bin/env node
// npm install fastify@3.2 node-fetch@2.6 middie@5.1

const server = require('fastify')();
const fetch = require('node-fetch');
const HOST = process.env.HOST || '127.0.0.1';
const PORT = process.env.PORT || 3000;
const TARGET = process.env.TARGET || 'localhost:4000';
const log = require('./logstash.js');

(async () => {
  await server.register(require('middie'));

  server.use((req, res, next) => {
    log('info', 'request-incoming', {
      path: req.url, method: req.method, ip: req.ip,
      ua: req.headers['user-agent'] || null });
    next();
  });

  server.setErrorHandler(async (error, req) => {
    log('error', 'request-failure', { stack: error.stack,
      path: req.url, method: req.method });
    return { error: error.message };
  });

  server.get('/', async () => {
    const url = `http://${TARGET}/recipes/42`;
    log('info', 'request-outgoing', { url, svc: 'recipe-api' });
    const req = await fetch(url);
    const producer_data = await req.json();
    return { consumer_pid: process.pid, producer_data };
  });

  server.get('/error', async () => { throw new Error('oh no'); });

  server.listen(PORT, HOST, () => {
    log('verbose', 'listen', { host: HOST, port: PORT });
  });
})();
- The new logstash.js file is now being loaded.
- The middie package allows Fastify to use generic middleware.
- A middleware to log incoming requests.
- A call to the logger that passes in request data.
- A generic middleware for logging errors.
- Information about outbound requests is logged.
- Information about server starts is also logged.
This file logs some key pieces of information. The first thing logged is when the server starts. The second set of information is by way of a generic middleware handler. It logs data about any incoming request, including the path, the method, the IP address, and the user agent. This is similar to the access log for a traditional web server. Finally, the application tracks outbound requests to the recipe-api service.
The contents of the logstash.js file might be more interesting. There are many libraries available on npm for transmitting logs to Logstash (@log4js-node/logstashudp is one such package). These libraries support a few methods for transmission, UDP included. Since the mechanism for sending logs is so simple, you're going to reproduce a version from scratch. This is great for educational purposes, but a full-featured package from npm will make a better choice for a production application.
Create a new file called web-api/logstash.js. Unlike the other JavaScript files you've created so far, this one won't be executed directly. Add the content from Example 4-4 to this file.
Example 4-4. web-api/logstash.js
const client = require('dgram').createSocket('udp4');
const host = require('os').hostname();
const [LS_HOST, LS_PORT] = process.env.LOGSTASH.split(':');
const NODE_ENV = process.env.NODE_ENV;

module.exports = function(severity, type, fields) {
  const payload = JSON.stringify({
    '@timestamp': (new Date()).toISOString(),
    '@version': 1, app: 'web-api', environment: NODE_ENV,
    severity, type, fields, host
  });

  console.log(payload);
  client.send(payload, LS_PORT, LS_HOST);
};
This basic Logstash module exports a function that application code calls to send a log. Many of the fields are automatically generated, like @timestamp, which represents the current time. The app field is the name of the running application and doesn't need to be overridden by the caller. Other fields, like severity and type, are fields that the application is going to change all the time. The fields field represents additional key/value pairs the app might want to provide.
The severity field (often called the log level in other logging frameworks) refers to the importance of the log. Most logging packages support the following six values, originally made popular by the npm client: error, warn, info, verbose, debug, and silly. It's a common pattern with more "complete" logging packages to set a logging threshold via environment variable. For example, by setting the minimum severity to verbose, any messages with a lower severity (namely debug and silly) will get dropped. The overly simple logstash.js module doesn't support this.
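If you wanted to add such a threshold to this simple module yourself, the filter could look like the following sketch; the LOG_LEVEL variable name is an assumption, not something the chapter's examples rely on:

// Severities ordered from most to least important
const SEVERITY = ['error', 'warn', 'info', 'verbose', 'debug', 'silly'];
const THRESHOLD = SEVERITY.indexOf(process.env.LOG_LEVEL || 'silly');

function shouldLog(severity) {
  // Keep a message only if it is at least as important as the threshold
  return SEVERITY.indexOf(severity) <= THRESHOLD;
}

// With LOG_LEVEL=verbose, shouldLog('debug') and shouldLog('silly') are false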
Once the payload has been constructed, it's then converted into a JSON string and printed to the console to help tell what's going on. Finally, the process attempts to transmit the message to the Logstash server (there is no way for the application to know whether the message was delivered; this is a shortcoming of UDP).
With the two files created, it's now time to test the application. Run the commands in Example 4-5. This will start an instance of the new web-api service and an instance of the previous recipe-api service, and it will also send a series of requests to the web-api. A log will be sent immediately once the web-api has been started, and two additional logs will be sent for each incoming HTTP request. Note that the watch commands continuously execute the command that follows on the same line and will need to be run in separate terminal windows.
Example 4-5. Running web-api and generating logs
$ NODE_ENV=development LOGSTASH=localhost:7777 \
  node web-api/consumer-http-logs.js
$ node recipe-api/producer-http-basic.js
$ brew install watch # required for macOS
$ watch -n5 curl http://localhost:3000
$ watch -n13 curl http://localhost:3000/error
Isn't that exciting? Well, not quite yet. Now you'll jump into Kibana and take a look at the logs being sent. Let the watch commands continue running in the background; they'll keep the data fresh while you're using Kibana.
Creating a Kibana Dashboard
Now that the application is sending data to Logstash and Logstash is storing the data in Elasticsearch, it's time to open Kibana and explore this data. Open your browser and visit http://localhost:5601. At this point you should be greeted with the Kibana dashboard.
Within the dashboard, click the last tab on the left, titled Management. Next, locate the Kibana section of options and then click the Index Patterns option. Click Create index pattern. For Step 1, type in an Index pattern of nodejs-*. You should see a small Success! message below as Kibana correlates your query to a result. Click Next step. For Step 2, click the Time Filter drop-down menu and then click the @timestamp field. Finally, click Create index pattern. You've now created an index pattern named nodejs-* that will allow you to query those values.
Click the second tab on the left, titled Visualize. Next, click the Create new visualization button in the center of the screen. You'll be given several different options for creating a visualization, including the ones shown in Figure 4-2, but for now just click the Vertical Bar graph option.
Select the nodejs-* index that you just created. Once that's done, you'll be taken to a new screen to fine-tune the visualization. The default graph isn't too interesting; it's a single bar showing a count of all logs matching the nodejs-* index. But not for long.
The goal now is to create a graph that displays the rate at which incoming requests are received by the web-api service. So, first add a few filters to narrow down the results to only contain applicable entries. Click the Add a Filter link near the upper-left corner of the screen. For the Field drop-down menu, enter the value type. For the Operator field, set it to is. For the Value field, enter the value request-incoming and then click Save. Next, click Add a Filter again and do the same thing, but this time set Field to app, then set Operator to is again, and set Value to web-api.
For the Metrics section, leave it displaying the count; because the matching log messages correlate one to one with real requests, the count represents the number of incoming requests.
For the Buckets section, it should be changed to group by time. Click the Add buckets link and select X-Axis. For the Aggregation drop-down menu, select Date Histogram. Click the blue button with a play symbol above the Metrics section (it has a title of Apply changes), and the graph will update. The default setting of grouping by @timestamp with an automatic interval is fine.
In the upper-right corner is a drop-down menu for changing the time range of the logs being queried. Click the drop-down menu and configure it to display logs from the last hour, and then click the large Refresh button to the right of the drop-down menu. If all goes to plan, your screen should look like Figure 4-3.
Once your graph is complete, click the Save link at the top of the Kibana screen. Name the visualization web-api incoming requests. Next, create a similar visualization, but this time set the type field to request-outgoing and name that visualization web-api outgoing requests. Finally, create a third visualization with a type field of listen and name it web-api server starts.
Next, you'll create a dashboard for these three visualizations. Select the third option in the sidebar titled Dashboard. Then, click Create new dashboard. A modal window will appear with your three visualizations in it. Click each visualization, and it will be added to the dashboard. Once you've added each visualization, dismiss the modal. Click the Save link at the top of the screen and save the dashboard as web-api overview.
Congratulations! You've created a dashboard containing information extracted from your application.
Running Ad-Hoc Queries
Sometimes you'll need to run arbitrary queries against the data that's being logged without a correlating dashboard. This is helpful in one-off debugging situations. In this section, you'll write arbitrary queries in order to extract information about errors in the application.
Click the first tab in the left sidebar, the one titled Discover. This is a convenient playground for running queries without needing to commit them to a dashboard. By default, a listing of all recently received messages is displayed. Click inside of the Search field at the top of the screen. Then, type the following query into the search field and press Enter:
app:"web-api" AND (severity:"error" OR severity:"warn")
The syntax of this query is written in the Kibana Query Language (KQL). Essentially, there are three clauses. It's asking for logs belonging to the web-api application and whose severity levels are set to either error or warn (in other words, things that are very important).
Click the arrow symbol next to one of the log entries in the list that follows. This will expand the individual log entry and allow you to view the entire JSON payload associated with the log. The ability to view arbitrary log messages like this is what makes logging so powerful. With this tool you're now able to find all the errors being logged from the service.
By logging more data, you'll gain the ability to drill down into the details of specific error situations. For example, you might find that errors occur when a specific endpoint within an application is being hit under certain circumstances (like a user updating a recipe via PUT /recipe in a more full-featured application). With access to the stack trace, and enough contextual information about the requests, you're then able to re-create the conditions locally, reproduce the bug, and come up with a fix.
Warning
This section looks at transmitting logs from within an application, an inherently asynchronous operation. Unfortunately, logs generated when a process crashes might not be sent in time. Many deployment tools can read messages from stdout and transmit them on behalf of the application, which increases the likelihood of them being delivered.
This section looked at storing logs. Certainly, these logs can be used to display numeric information in graphs, but it isn't necessarily the most efficient system for doing so since the logs store complex objects. The next section, "Metrics with Graphite, StatsD, and Grafana", looks at storing more interesting numeric data using a different set of tools.
Metrics with Graphite, StatsD, and Grafana
"Logging with ELK" looked at transmitting logs from a running Node.js process. Such logs are formatted as JSON and are indexable and searchable on a per-log basis. This is perfect for reading messages related to a particular running process, such as reading variables and stack traces. However, sometimes you don't necessarily care about individual pieces of numeric data, and instead you want to know about aggregations of data, usually as these values grow and shrink over time.
This section looks at sending metrics. A metric is numeric data associated with time. This can include things like request rates, the number of 2XX versus 5XX HTTP responses, latency between the application and a backing service, memory and disk use, and even business stats like dollar revenue or cancelled payments. Visualizing such information is important to understanding application health and system load.
Much like in the logging section, a stack of tools will be used instead of a single one. However, this stack doesn't really have a catchy acronym like ELK, and it's fairly common to swap out different components. The stack considered here is that of Graphite, StatsD, and Grafana:
- Graphite: A combination of a service (Carbon) and a time series database (Whisper). It also comes with a UI (Graphite Web), though the more powerful Grafana interface is often used.
- StatsD: A daemon (built with Node.js) for collecting metrics. It can listen for stats over TCP or UDP before sending aggregations to a backend such as Graphite.
- Grafana: A web service that queries time series backends (like Graphite) and displays information in configurable dashboards.
Figure 4-4 shows a diagram of these services and how they're related. The Docker boundaries represent what the upcoming examples will use.
Much like in the logging section, these examples will transmit data using UDP. Due to the nature of metrics being rapidly produced, using UDP will help keep the application from getting overwhelmed.
Running via Docker
Example 4-6 starts two separate Docker containers. The first one, graphiteapp/graphite-statsd, contains StatsD and Graphite. Two ports from this container are exposed: the Graphite UI/API is exposed via port :8080, while the StatsD UDP metrics collector is exposed as :8125. The second, grafana/grafana, contains Grafana. A single port for the web interface, :8000, is exposed for this container.
Example 4-6. Running StatsD + Graphite, and Grafana
$ docker run \
  -p 8080:80 \
  -p 8125:8125/udp \
  -it --name distnode-graphite graphiteapp/graphite-statsd:1.1.6-1
$ docker run \
  -p 8000:3000 \
  -it --name distnode-grafana grafana/grafana:6.5.2
Once the containers are up and running, open a web browser and visit the Grafana dashboard at http://localhost:8000/. You'll be asked to log in at this point. The default login credentials are admin / admin. Once you successfully log in, you'll then be prompted to change the password to something else. This password will be used to administer Grafana, though it won't be used in code.
Once the password has been set, you'll be taken to a wizard for configuring Grafana. The next step is to configure Grafana to communicate with the Graphite image. Click the Add Data Source button and then click the Graphite option. On the Graphite configuration screen, input the values displayed in Table 4-1.
| Field   | Value                  |
|---------|------------------------|
| Name    | Dist Node Graphite     |
| URL     | http://<LOCAL_IP>:8080 |
| Version | 1.1.x                  |
Note
Due to the way these Docker containers are being run, you won't be able to use localhost for the <LOCAL_IP> placeholder. Instead, you'll need to use your local IP address. If you're on Linux, try running hostname -I, and if you're on macOS, try running ipconfig getifaddr en0. If you're running this on a laptop and your IP address changes, you'll need to reconfigure the data source in Grafana to use the new IP address, or else you won't get data.
Once you've entered the data, click Save & Test. If you see the message "Data source is working," then Grafana was able to talk to Graphite and you can click the Back button. If you get HTTP Error Bad Gateway, make sure the Graphite container is running and that the settings have been entered correctly.
Now that Graphite and Grafana are talking to each other, it's time to modify one of the Node.js services to start sending metrics.
Transmitting Metrics from Node.js
The protocol used by StatsD is extremely simple, arguably even simpler than the one used by Logstash UDP. An example message that increments a metric named foo.bar.baz looks like this:
foo.bar.baz:1|c
Such interactions could very easily be rebuilt using the dgram module, like in the previous section. However, this code sample will make use of an existing package. There are a few out there, but this example uses the statsd-client package.
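For reference, a bare-bones counter increment over UDP could look like this sketch, assuming StatsD is listening on localhost:8125; the statsd-client package used in Example 4-7 wraps this up for you, along with timers, gauges, and prefixes:

const dgram = require('dgram');
const socket = dgram.createSocket('udp4');

function increment(name) {
  // "<metric>:<value>|c" is the StatsD counter format shown above
  socket.send(`${name}:1|c`, 8125, 'localhost');
}

increment('foo.bar.baz');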
Again, start by rebuilding a version of the consumer service. Copy the web-api/consumer-http-basic.js file created in Example 1-7 to web-api/consumer-http-metrics.js as a starting point. From there, modify the file to resemble Example 4-7. Be sure to run the npm install command to get the required packages as well.
Example 4-7. web-api/consumer-http-metrics.js (first half)
#!/usr/bin/env node
// npm install fastify@3.2 node-fetch@2.6 statsd-client@0.4.4 middie@5.1

const server = require('fastify')();
const fetch = require('node-fetch');
const HOST = '127.0.0.1';
const PORT = process.env.PORT || 3000;
const TARGET = process.env.TARGET || 'localhost:4000';
const SDC = require('statsd-client');
const statsd = new SDC({ host: 'localhost', port: 8125,
  prefix: 'web-api' });

(async () => {
  await server.register(require('middie'));

  server.use(statsd.helpers.getExpressMiddleware('inbound', {
    timeByUrl: true }));

  server.get('/', async () => {
    const begin = new Date();
    const req = await fetch(`http://${TARGET}/recipes/42`);
    statsd.timing('outbound.recipe-api.request-time', begin);
    statsd.increment('outbound.recipe-api.request-count');
    const producer_data = await req.json();
    return { consumer_pid: process.pid, producer_data };
  });

  server.get('/error', async () => { throw new Error('oh no'); });

  server.listen(PORT, HOST, () => {
    console.log(`Consumer running at http://${HOST}:${PORT}/`);
  });
})();
A few things are going on with this new set of changes. First, it requires the statsd-client package and configures a connection to the StatsD service listening at localhost:8125. It also configures the package to use a prefix value of web-api. This value represents the name of the service reporting the metrics (likewise, if you made similar changes to recipe-api, you'd set its prefix accordingly). Graphite works by using a hierarchy for naming metrics, so metrics sent from this service will all have the same prefix to differentiate them from metrics sent by another service.
The code makes use of a generic middleware provided by the statsd-client package. As the method name implies, it was originally designed for Express, but Fastify mostly supports the same middleware interface, so this application is able to reuse it. The first argument is another prefix name, and inbound implies that the metrics being sent here are associated with incoming requests.
Next, two values are manually tracked. The first is the amount of time the web-api perceives the recipe-api to have taken. Note that this time should always be longer than the time recipe-api believes the response took. This is due to the overhead of sending a request over the network. This timing value is written to a metric named outbound.recipe-api.request-time. The application also tracks how many requests are sent. This value is provided as outbound.recipe-api.request-count. You could even get more granular here. For example, for a production application, the status codes that the recipe-api responds with could also be tracked, which would allow an increased rate of failures to be visible.
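As a sketch of that idea, the status code could be folded into the metric name before incrementing a counter; the naming convention here is just one possibility, not something Example 4-7 includes:

// `req` is the node-fetch response object from the route handler
// Produces metrics like web-api.outbound.recipe-api.status_code.200
statsd.increment(`outbound.recipe-api.status_code.${req.status}`);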
Next, run the following commands each in a separate terminal window. This will start your newly created service, run a copy of the producer, run Autocannon to get a stream of good requests, and also trigger some bad requests:
$ NODE_DEBUG=statsd-client node web-api/consumer-http-metrics.js
$ node recipe-api/producer-http-basic.js
$ autocannon -d 300 -R 5 -c 1 http://localhost:3000
$ watch -n1 curl http://localhost:3000/error
Those commands will generate a stream of data, which gets passed to StatsD before being sent to Graphite. Now that you have some data, you're ready to create a dashboard to view it.
Creating a Grafana Dashboard
As the owner of the web-api service, there are (at least) three different sets of metrics that should be extracted so that you can measure its health. This includes the incoming requests and, importantly, differentiating 200 from 500. It also includes the amount of time that recipe-api, an upstream service, takes to reply. The final set of required information is the rate of requests to the recipe-api service. If you determine the web-api service is slow, you might use this information to discover that the recipe-api service is slowing it down.
Switch back to your web browser with the Grafana interface. There is a large plus symbol in the sidebar; click it to be taken to the New dashboard screen. On this screen you'll see a New Panel rectangle. Inside of it is an Add Query button. Click that button to be taken to the query editor screen.
On this new screen, you'll see an empty graph at the top and inputs to describe the graph below. The UI lets you describe the query using two fields. The first is called Series and is where you can input the hierarchical metric name. The second field is called Functions. Both of these fields provide autocomplete for matching metric names. First, start with the Series field.
Click the "select metric" text next to the Series label and then click stats_counts from the drop-down menu. Then click "select metric" again and select web-api. Continue this for the values inbound, response_code, and finally * (the * is a wildcard and will match any value). At this point, the graph has been updated and should show two sets of entries.
The graph labels aren't too friendly just yet. They're displaying the entire hierarchy name instead of just the easy-to-read values 200 and 500. A Function can be used to fix this. Click the plus sign next to the Functions label, then click Alias, and then click aliasByNode(). This will insert the function and also automatically provide a default argument of 4. This is because the asterisk in the query is the 4th entry in the (zero-based) hierarchy metric name. The graph labels have been updated to display just 200 and 500.
In the upper-right corner of the panel with the Series and Functions fields, there's a pencil icon with a tooltip titled Toggle text edit mode. Click that, and the graphical entry will change into a text version. This is helpful for quickly writing a query. The value you should have looks like the following:
aliasByNode(stats_counts.web-api.inbound.response_code.*, 4)
In the left column, click the gear icon labeled General. On this screen you're able to modify generic settings about this particular graph. Click the Title field and input a value of Incoming Status Codes. Once that's done, click the large arrow in the upper-left corner of the screen. This will take you from the panel editor screen back to the dashboard edit screen. At this point, your dashboard will have a single panel.
Next, click the Add panel button in the upper-right corner of the screen and then click the Add query button again. This will allow you to add a second panel to the dashboard. This next panel will track the time it takes to query the recipe-api. Create the appropriate Series and Functions entries to reproduce the following:
aliasByNode(stats.timers.web-api.outbound.*.request-time.upper_90, 4)
Note
StatsD is generating some of these metric names for you. For example, stats.timers is a StatsD prefix, web-api.outbound.recipe-api.request-time is provided by the application, and the timing-related metric names under that (such as upper_90) are again calculated by StatsD. In this case, the query is looking at TP90 timing values.
Since this graph measures time and is not a generic counter, the units should be modified as well (this information is measured in milliseconds). Click the second tab on the left, with a tooltip of Visualization. Then, scroll down to the section labeled Axes, find the group titled Left Y, and click the Unit drop-down menu. Click Time, then click milliseconds (ms). The graph will then be updated with the proper units.
Click the third General tab again and set the panel's title to Outbound Service Timing. Click the back arrow again to return to the dashboard edit screen.
Finally, click the Add panel button again and go through creating a final panel. This panel will be titled Outbound Request Count, won't need any special units, and will use the following query:
aliasByNode(stats_counts.web-api.outbound.*.request-count, 3)
Click the back button a final time to return to the dashboard editor screen. In the upper-right corner of the screen, click the Save dashboard icon, give the dashboard a name of Web API Overview, and save the dashboard. The dashboard is now saved and will have a URL associated with it. If you were using an instance of Grafana permanently installed for your organization, this URL would be a permalink that you could provide to others and would make a great addition to your project's README.
Feel free to drag the panels around and resize them until you get something that is aesthetically pleasing. In the upper-right corner of the screen, you can also change the time range. Set it to "Last 15 minutes," since you likely don't have data much older than that. Once you're done, your dashboard should look something like Figure 4-5.
Node.js Health Indicators
There is some generic health information about a running Node.js process that is also worth collecting for the dashboard. Modify your web-api/consumer-http-metrics.js file by adding the code from Example 4-8 to the end of the file. Restart the service and keep an eye on the data that is being generated. These new metrics represent values that can increase or decrease over time and are better represented as Gauges.
Example 4-8. web-api/consumer-http-metrics.js (second half)
const v8 = require('v8');
const fs = require('fs');

setInterval(() => {
  statsd.gauge('server.conn', server.server._connections);

  const m = process.memoryUsage();
  statsd.gauge('server.memory.used', m.heapUsed);
  statsd.gauge('server.memory.total', m.heapTotal);

  const h = v8.getHeapStatistics();
  statsd.gauge('server.heap.size', h.used_heap_size);
  statsd.gauge('server.heap.limit', h.heap_size_limit);

  fs.readdir('/proc/self/fd', (err, list) => {
    if (err) return;
    statsd.gauge('server.descriptors', list.length);
  });

  const begin = new Date();
  setTimeout(() => { statsd.timing('eventlag', begin); }, 0);
}, 10_000);
This code will poll the Node.js underbelly every 10 seconds for key information about the process. As an exercise of your newfound Grafana skills, create five new panels containing this newly captured data. In the metric namespace hierarchy, the gauge metrics begin with stats.gauges, while the timer starts with stats.timers.
The first set of data, provided as server.conn, is the number of active connections to the web server. Most Node.js web frameworks expose this value in some manner; check out the documentation for your framework of choice.
Information about the process memory usage is also captured. This is being recorded as two values, server.memory.used and server.memory.total. When creating a graph for these values, their unit should be set to Data/Bytes, and Grafana is smart enough to display more specific units like MB. A very similar panel could then be made based on the V8 heap size and limit.
The event loop lag metric displays how long it takes the application to call a function that was scheduled to run as early as zero milliseconds from the time setTimeout() was called. This graph should display the value in milliseconds. A healthy event loop should have a number between zero and two. Overwhelmed services might start taking tens of milliseconds.
Finally, the number of open file descriptors can indicate a leak in a Node.js application. Sometimes files will be opened but will never be closed, and this can lead to consumption of server resources and result in a process crash.
Once you've added the new panels, your dashboard may then resemble Figure 4-6. Save the modified dashboard so that you don't lose your changes.
This section only covers the basics of what can be done with the StatsD, Graphite, and Grafana stack. There is much that hasn't been covered, including additional query functions, other forms of visualizations, how to manually color individual time series entries (like green for 2XX, yellow for 4XX, and red for 5XX), and so on.
Distributed Request Tracing with Zipkin
"Logging with ELK" looked at storing logs from a Node.js process. Such logs contain information about the internal operations of a process. Likewise, "Metrics with Graphite, StatsD, and Grafana" looked at storing numeric metrics. These metrics are useful for looking at numeric data in aggregate about an application, such as throughput and failure rates for an endpoint. However, neither of these tools allows for associating a specific external request with all the internal requests it may then generate.
Consider, for example, a slightly more complex version of the services covered so far. Instead of just a web-api and a recipe-api service, there's an additional user-api and a user-store service. The web-api will still call the recipe-api service as before, but now the web-api will also call the user-api service, which will in turn call the user-store service. In this scenario, if any one of the services produces a 500 error, that error will bubble up and the overall request will fail with a 500. How would you find the cause of a specific error with the tools used so far?
Well, if you know that an error occurred on Tuesday at 1:37 P.M., you might be tempted to look through logs stored in ELK between the time of 1:36 P.M. and 1:38 P.M. Goodness knows I've done this myself. Unfortunately, if there is a high volume of logs, this could mean sifting through thousands of individual log entries. Worse, other errors happening at the same time can "muddy the water," making it hard to know which logs are actually associated with the erroneous request.
At a very basic level, requests made deeper within an organization can be associated with a single incoming external request by passing around a request ID. This is a unique identifier that is generated when the first request is received, which is then somehow passed between upstream services. Then, any logs associated with this request will contain some sort of request_id field, which can then be filtered using Kibana. This approach solves the associated request conundrum but loses information about the hierarchy of related requests.
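A minimal sketch of that request ID approach might use a Fastify hook and the crypto module, like the following; the x-request-id header and request_id field names are common conventions rather than part of this chapter's examples:

const crypto = require('crypto');

server.addHook('onRequest', async (req) => {
  // Reuse an inbound ID when present, otherwise generate a new one
  req.request_id = req.headers['x-request-id'] ||
    crypto.randomBytes(8).toString('hex');
});

// Attach req.request_id to every log entry, and forward it on outbound
// requests via the x-request-id header so upstream services can log
// the same identifier.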
Zipkin, sometimes referred to as OpenZipkin, is a tool that was created to alleviate situations just like this one. Zipkin is a service that runs and exposes an HTTP API. This API accepts JSON payloads describing request metadata, as they are both sent by clients and received by servers. Zipkin also defines a set of headers that are passed from client to server. These headers allow processes to associate outgoing requests from a client with incoming requests to a server. Timing information is also sent, which then allows Zipkin to display a graphical timeline of a request hierarchy.
How Does Zipkin Work?
In the aforementioned scenario with the four services, the relationship between services transpires over four requests. When this happens, seven messages will be sent to the Zipkin service. Figure 4-7 contains a visualization of the service relationships, the passed messages, and the additional headers.
One concept that has been repeated a few times so far in this book is that a client will perceive one latency for a request, while the server will perceive another. A client will always determine that a request takes longer than the server does. This is due to the time it takes a message to be sent over the network, plus other things that are hard to measure, such as a web server package automatically parsing a JSON request before user code can start measuring time.
Zipkin allows you to measure the difference in opinion between client and server. This is why the four requests in the example situation, marked as solid arrows in Figure 4-7, result in seven different messages being sent to Zipkin. The first message, terminating with S1, only contains a server message. In this case, the third-party client isn't reporting its perceived time, so there's just the server message. For the three requests terminating in S2, S3, and S4, there is a correlating client message, namely C2, C3, and C4.
The different client and server messages can be sent from the different instances, asynchronously, and can be received in any order. The Zipkin service will then stitch them each together and visualize the request hierarchy using the Zipkin web UI. The C2 message will look something like this:
[{
  "id": "0000000000000111",
  "traceId": "0000000000000100",
  "parentId": "0000000000000110",
  "timestamp": 1579221096510000,
  "name": "get_recipe",
  "duration": 80000,
  "kind": "CLIENT",
  "localEndpoint": {
    "serviceName": "web-api",
    "ipv4": "127.0.0.1",
    "port": 100
  },
  "remoteEndpoint": {
    "ipv4": "127.0.0.2",
    "port": 200
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/recipe/42",
    "diagram": "C2"
  }
}]
These messages can be queued up by an application and occasionally flushed in batches to the Zipkin service, which is why the root JSON entry is an array. In this example, only a single message is being transmitted.
The client message and server message pairs will end up containing the same id, traceId, and parentId identifiers. The timestamp field represents the time when the client or server first perceived the request to start, and the duration is how long the service thought the request lasted. Both of these fields are measured in microseconds. The Node.js wall clock, attainable via Date.now(), only has millisecond accuracy, so it's common to multiply that value by 1,000. The kind field is set to either CLIENT or SERVER, depending on which side of the request is being logged. The name field represents a name for the endpoint and should have a finite set of values (in other words, don't use an identifier).
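For example, producing the microsecond timestamp and duration values from the millisecond wall clock might look like this small sketch:

const startedAt = Date.now();            // milliseconds
// ... perform the request being measured ...
const timestamp = startedAt * 1000;      // microseconds, as Zipkin expects
const duration = (Date.now() - startedAt) * 1000;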
The localEndpoint field represents the service sending the message (the server with a SERVER message or the client with a CLIENT message). The service provides its own name in here, the port it's listening on, and its own IP address. The remoteEndpoint field contains information about the other service (a SERVER message probably won't know the client's port, and likely won't even know the client's name).
The tags field contains metadata about the request. In this example, information about the HTTP request is provided as http.method and http.path. With other protocols, different metadata would be attached, such as a gRPC service and method name.
The identifiers sent in the seven different messages have been re-created in Table 4-2.
| Message | id  | parentId | traceId | kind   |
|---------|-----|----------|---------|--------|
| S1      | 110 | N/A      | 100     | SERVER |
| C2      | 111 | 110      | 100     | CLIENT |
| S2      | 111 | 110      | 100     | SERVER |
| C3      | 121 | 110      | 100     | CLIENT |
| S3      | 121 | 110      | 100     | SERVER |
| C4      | 122 | 121      | 100     | CLIENT |
| S4      | 122 | 121      | 100     | SERVER |
Apart from the messages sent to the server, the other important part of Zipkin is the metadata that is sent from client to server. Different protocols have different standards for sending this metadata. With HTTP, the metadata is sent via headers. These headers are provided by C2, C3, and C4 and are received by S2, S3, and S4. Each of these headers has a different meaning:
- X-B3-TraceId: Zipkin refers to all related requests as a trace. This value is Zipkin's concept of a request ID. This value is passed between all related requests, unchanged.
- X-B3-SpanId: A span represents a single request, as seen from both a client and a server (like C3/S3). Both the client and server will send a message using the same span ID. There can be multiple spans in a trace, forming a tree structure.
- X-B3-ParentSpanId: A parent span is used for associating a child span with a parent span. This value is missing for the originating external request but is present for deeper requests.
- X-B3-Sampled: This is a mechanism used for determining if a particular trace should be reported to Zipkin. For example, an organization may choose to track only 1% of requests.
- X-B3-Flags: This can be used to tell downstream services that this is a debug request. Services are encouraged to then increase their logging verbosity.
Essentially, each service creates a new span ID for each outgoing request. The current span ID is then provided as the parent ID in the outbound request. This is how the hierarchy of relationships is formed.
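To make that flow concrete, here is a rough sketch of building the outbound headers for a child request; the randomId() helper and the shape of the incoming object are assumptions for this sketch, not the actual zipkin-lite internals:

const { randomBytes } = require('crypto');
const randomId = () => randomBytes(8).toString('hex');

function prepareOutboundHeaders(incoming) {
  return {
    'X-B3-TraceId': incoming.traceId,      // unchanged for the whole trace
    'X-B3-SpanId': randomId(),             // a new span for this request
    'X-B3-ParentSpanId': incoming.spanId,  // current span becomes the parent
    'X-B3-Sampled': incoming.sampled       // propagate the sampling decision
  };
}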
Now that you understand the intricacies of Zipkin, it's time to run a local copy of the Zipkin service and modify the applications to interact with it.
Running Zipkin via Docker
Again, Docker provides a convenient platform for running the service. Unlike the other tools covered in this chapter, Zipkin provides an API and a UI using the same port. Zipkin uses a default port of 9411 for this.
Run this command to download and start the Zipkin service:
$ docker run -p 9411:9411 \
  -it --name distnode-zipkin \
  openzipkin/zipkin-slim:2.19
Transmitting Traces from Node.js
For this example, you're going to again start by modifying an existing application. Copy the web-api/consumer-http-basic.js file created in Example 1-7 to web-api/consumer-http-zipkin.js as a starting point. Modify the file to look like the code in Example 4-9.
Example 4-9. web-api/consumer-http-zipkin.js
#!/usr/bin/env node
// npm install fastify@3.2 node-fetch@2.6 zipkin-lite@0.1

const server = require('fastify')();
const fetch = require('node-fetch');
const HOST = process.env.HOST || '127.0.0.1';
const PORT = process.env.PORT || 3000;
const TARGET = process.env.TARGET || 'localhost:4000';
const ZIPKIN = process.env.ZIPKIN || 'localhost:9411';
const Zipkin = require('zipkin-lite');
const zipkin = new Zipkin({
  zipkinHost: ZIPKIN, serviceName: 'web-api',
  servicePort: PORT, serviceIp: HOST, init: 'short'
});

server.addHook('onRequest', zipkin.onRequest());
server.addHook('onResponse', zipkin.onResponse());

server.get('/', async (req) => {
  req.zipkin.setName('get_root');

  const url = `http://${TARGET}/recipes/42`;
  const zreq = req.zipkin.prepare();
  const recipe = await fetch(url, { headers: zreq.headers });
  zreq.complete('GET', url);
  const producer_data = await recipe.json();

  return {
    pid: process.pid,
    producer_data,
    trace: req.zipkin.trace
  };
});

server.listen(PORT, HOST, () => {
  console.log(`Consumer running at http://${HOST}:${PORT}/`);
});
Note
These examples use the zipkin-lite package. This package requires manual instrumentation, which is a fancy way of saying that you, the developer, must call different hooks to interact with the package. I chose it for this project to help demonstrate the different parts of the Zipkin reporting process. For a production app, the official Zipkin package, zipkin, would make for a better choice.
The consumer service represents the first service that an external client will communicate with. Because of this, the init configuration flag has been enabled. This will allow the service to generate a new trace ID. In theory, a reverse proxy can be configured to also generate initial identifier values. The serviceName, servicePort, and serviceIp fields are each used for reporting information about the running service to Zipkin.
The onRequest and onResponse hooks allow the zipkin-lite package to interpose on requests. The onRequest handler runs first. It records the time the request starts and injects a req.zipkin property that can be used throughout the life cycle of the request. Later, the onResponse handler is called. This then calculates the overall time the request took and sends a SERVER message to the Zipkin server.
Within a request handler, two things need to happen. The first is that the name of the endpoint has to be set. This is done by calling req.zipkin.setName(). The second is that for each outbound request that is sent, the appropriate headers need to be applied and the time the request took should be calculated. This is done by first calling req.zipkin.prepare(). When this is called, another time value is recorded and a new span ID is generated. This ID and the other necessary headers are provided in the returned value, which is assigned here to the variable zreq.
These headers are then provided to the request via zreq.headers. Once the request is complete, a call to zreq.complete() is made, passing in the request method and URL. Once this happens, the overall time taken is calculated, and the CLIENT message is then sent to the Zipkin server.
Next up, the producing service should also be modified. This is important because not only should the timing as perceived by the client be reported (web-api in this case), but the timing from the server's point of view (recipe-api) should be reported as well. Copy the recipe-api/producer-http-basic.js file created in Example 1-6 to recipe-api/producer-http-zipkin.js as a starting point. Modify the file to look like the code in Example 4-10. Most of the file can be left as is, so only the required changes are displayed.
Example 4-10. recipe-api/producer-http-zipkin.js (truncated)
const PORT = process.env.PORT || 4000;
const ZIPKIN = process.env.ZIPKIN || 'localhost:9411';

const Zipkin = require('zipkin-lite');
const zipkin = new Zipkin({
  zipkinHost: ZIPKIN, serviceName: 'recipe-api',
  servicePort: PORT, serviceIp: HOST,
});

server.addHook('onRequest', zipkin.onRequest());
server.addHook('onResponse', zipkin.onResponse());

server.get('/recipes/:id', async (req, reply) => {
  req.zipkin.setName('get_recipe');
  const id = Number(req.params.id);
Example 4-10 doesn't act as a root service, so the init configuration flag has been omitted. If it receives a request directly, it won't generate a trace ID, unlike the web-api service. Also, note that the same req.zipkin.prepare() method is available in this new recipe-api service, even though the example isn't using it. When implementing Zipkin within services you own, you'll want to pass the Zipkin headers to as many upstream services as you can.
Be sure to run the npm install zipkin-lite@0.1 command in both project directories.
Once you've created the two new service files, run them and then generate a request to the web-api by running the following commands:
$ node recipe-api/producer-http-zipkin.js
$ node web-api/consumer-http-zipkin.js
$ curl http://localhost:3000/
A new field, named trace, should now be present in the output of the curl command. This is the trace ID for the series of requests that have been passed between the services. The value should be 16 hexadecimal characters, and in my case, I received the value e232bb26a7941aab.
Visualizing a Request Tree
Data about the requests have been sent to your Zipkin server instance. It's now time to open the web interface and see how that data is visualized. Open the following URL in your browser:
http://localhost:9411/zipkin/
You should now be greeted with the Zipkin web interface. It's not too exciting just yet. The left sidebar contains two links. The first one, which looks like a magnifying glass, is to the current Discover screen. The second link, resembling network nodes, links to the Dependencies screen. At the top of the screen is a plus sign, which can be used for specifying which requests to search for. With this tool you can specify criteria like the service name or tags. But for now you can ignore those. In the upper-right corner is a simple search button, one that will display recent requests. Click the magnifying glass icon, which will perform the search.
Figure 4-8 is an example of what the interface should look like after you've performed a search. Assuming you ran the curl command just once, you should see only a single entry.
Click the entry to be taken to the timeline view page. This page displays content in two columns. The column on the left displays a timeline of requests. The horizontal axis represents time. The units on the top of the timeline display how much time has passed since the very first SERVER trace was made with the given trace ID. The vertical rows represent the depth of the request; as each subsequent service makes another request, a new row will be added.
For your timeline, you should see two rows. The first row was generated by the web-api and has a call named get_root. The second row was generated by the recipe-api and has a call named get_recipe. A more complex version of the timeline you're seeing, based on the previously mentioned system with an additional user-api and user-store, is displayed in Figure 4-9.
Click the second row. The right column will be updated to display additional metadata about the request. The Annotations bar displays a timeline for the span you clicked. Depending on the speed of the request, you will see between two and four dots. The furthest left and furthest right dots represent the time that the client perceived the request to take. If the request was slow enough, you should see two inner dots, and those will represent the time the server perceived the request to take. Since these services are so fast, the dots might overlap and will be hidden by the Zipkin interface.
The Tags section displays the tags associated with the request. This can be used to debug which endpoints are taking the longest time to process and which service instances (by using the IP address and port) are to blame.
Visualizing Microservice Dependencies
The Zipkin interface can also be used to show aggregate information about the requests that it receives. Click the Dependencies link in the sidebar to be taken to the dependencies screen. The screen should be mostly blank, with a selector at the top to specify a time range and perform a search. The default values should be fine, so click the magnifying glass icon to perform a search.
The screen will then be updated to display two nodes. Zipkin has searched through the different spans it found that matched the time range. Using this information, it has determined how the services are related to each other. With the two example applications, the interface isn't all that interesting. On the left, you should see a node representing the web-api (where requests originate), and on the right, you should see a node representing the recipe-api (the deepest service in the stack). Small dots move from the left of the screen to the right, showing the relative amount of traffic between the two nodes.
If you were using Zipkin with many different services within an organization, you would see a much more complex map of the relationships between services. Figure 4-10 is an example of what the relationships between the four services in the more complex example would look like.
Assuming every service within an organization uses Zipkin, such a diagram would be a very powerful tool for understanding the interconnections between services.
Health Checks
"Load Balancing and Health Checks" looked at how HAProxy can be configured to automatically remove and re-add a running service instance to the pool of candidate instances for routing requests to. HAProxy can do this by making an HTTP request to an endpoint of your choosing and checking the status code. Such an endpoint is also useful for checking the liveness of a service, a term meaning that a newly deployed service has finished its startup stage (such as establishing a database connection) and is ready to receive requests. Kubernetes, which is covered in Chapter 7, can also make use of such a liveness check. It is generally useful for an application to know whether it's healthy or not.
An application can usually be considered healthy if it is able to respond to incoming requests with correct data without ill side effects. The specifics of how to measure this will change depending on the application. If an application needs to make a connection to a database, and such a connection is lost, then the application probably won't be able to process the requests it receives. (Note that your application should attempt to reconnect to databases; this is covered in "Database Connection Resilience".) In such a case, it would make sense to have the application declare itself unhealthy.
On the other hand, some features are a bit of a grey area. For example, if a service is unable to establish a connection to a caching service but is still able to connect to a database and serve requests, it is probably fine to declare itself healthy. The grey area in this case is with response time. If the service is no longer able to achieve its SLA, then it might be dangerous to run because it could cost your organization money. In this situation, it might make sense to declare the service degraded.
What would happen in this situation if the degraded service were to declare itself unhealthy? The service might be restarted by some sort of deployment management tool. However, if the problem is that the caching service is down, then perhaps every single service would be restarted. This can lead to situations where no service is available to serve requests. This scenario will be covered in "Alerting with Cabot". For now, consider slow/degraded services healthy.
Health checks are usually run periodically. Sometimes they are triggered by a request from an external service, such as HAProxy making an HTTP request (an operation that defaults to every two seconds). Sometimes they are triggered internally, such as a setInterval() call that checks the application's health before reporting to an external discovery service like Consul that it is healthy (a check that runs perhaps every 10 seconds). In any case, the overhead of running the health check should not be so high that the process is slowed down or the database is overwhelmed.
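As a rough sketch of the internally triggered variety, the check below runs on a timer and keeps the query cheap. It borrows the pg client setup from Example 4-12; the reporting step is left as a comment because it depends on whatever discovery service you use.

// Sketch: an internally triggered health check that runs on a timer.
// The pg client is configured via environment variables, as in Example 4-12.
const pg = new (require('pg').Client)();
pg.connect();

let healthy = true;

setInterval(async () => {
  try {
    await pg.query('SELECT 1'); // a cheap query keeps the check inexpensive
    healthy = true;
  } catch (err) {
    healthy = false;
  }
  // At this point the result could be reported to a discovery service such as
  // Consul, or exposed through an HTTP endpoint for an external poller.
}, 10000); // every 10 seconds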
Building a Health Check
In this section you will build a health check for a rather boring service. This application will have both a connection to a Postgres database, resembling a persistent data store, as well as a connection to Redis, which will represent a cache.
Before you start writing code, you'll need to run the two backing services. Run the commands in Example 4-11 to get a copy of Postgres and Redis running. You'll need to run each command in a new terminal window. Ctrl + C can be used to kill either service.
Example 4-11. Running Postgres and Redis
$ docker run \
  --rm \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=hunter2 \
  -e POSTGRES_USER=tmp \
  -e POSTGRES_DB=tmp \
  postgres:12.3
$ docker run \
  --rm \
  -p 6379:6379 \
  redis:6.0
Next, create a new file from scratch named basic-http-healthcheck.js. Insert the content from Example 4-12 into your newly created file.
Example 4-12. basic-http-healthcheck.js
#!/usr/bin/env node

// npm install fastify@3.2 ioredis@4.17 pg@8.3
const server = require('fastify')();
const HOST = '0.0.0.0';
const PORT = 3300;
const redis = new (require("ioredis"))({ enableOfflineQueue: false });
const pg = new (require('pg').Client)();
pg.connect(); // Note: Postgres will not reconnect on failure

server.get('/health', async (req, reply) => {
  try {
    const res = await pg.query('SELECT $1::text as status', ['ACK']);
    if (res.rows[0].status !== 'ACK') return reply.code(500).send('DOWN');
  } catch (e) {
    return reply.code(500).send('DOWN');
  }
  // ... other down checks ...
  let status = 'OK';
  try {
    if (await redis.ping() !== 'PONG') status = 'DEGRADED';
  } catch (e) {
    status = 'DEGRADED';
  }
  // ... other degraded checks ...
  reply.code(200).send(status);
});

server.listen(PORT, HOST, () => console.log(`http://${HOST}:${PORT}/`));
This file makes use of the ioredis package for connecting to and issuing queries against Redis. It also makes use of the pg package for working with Postgres. When ioredis is instantiated, it defaults to connecting to a locally running service, which is why connection details aren't necessary. The enableOfflineQueue flag specifies whether commands should be queued up when the Node.js process can't connect to the Redis instance. It defaults to true, meaning requests can be queued up. Since Redis is being used as a caching service, not as a primary data store, the flag should be set to false. Otherwise, a queued-up request to access the cache could end up being slower than querying the real data store.
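As a quick illustration of that fail-fast behavior (a standalone sketch, not part of Example 4-12), a command issued while the client is disconnected rejects immediately when the offline queue is disabled:

// Sketch: with enableOfflineQueue set to false, commands issued while the
// client is disconnected fail immediately instead of waiting in a queue.
const Redis = require('ioredis');
const cache = new Redis({ enableOfflineQueue: false });

cache.ping()
  .then((pong) => console.log('cache reachable:', pong))
  .catch((err) => console.error('cache unavailable:', err.message));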
The pg package also defaults to connecting to a Postgres instance running locally, but it will still need some connection information. That will be provided using environment variables.
This health check endpoint is configured to first check for features that are critical to run. If any of those features are lacking, then the endpoint will immediately fail. In this case, only the Postgres check applies, but a real application might have more. After that, the checks that will result in a degraded service are run. Only the Redis check applies in this situation. Both of these checks work by querying the backing store and checking for a sane response.
Note that a degraded service will return a 200 status code. HAProxy could, for example, be configured to still direct requests to this service. If the service is degraded, then an alert could be generated (see "Alerting with Cabot"). Figuring out why the cache isn't working is something that our application shouldn't be concerned about. The issue might be that Redis itself has crashed or that there is a network issue.
Now that the service file is ready, run the following command to start the service:
$ PGUSER=tmp PGPASSWORD=hunter2 PGDATABASE=tmp \
  node basic-http-healthcheck.js
The Postgres connection variables have been provided as environment variables and are used by the underlying pg package. Explicitly passing the connection settings in code is a better approach for production code; the environment variables are used here only for brevity.
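If you do want to configure the connection explicitly, the pg client accepts the same settings as an options object. A minimal sketch, using the values from Example 4-11:

// Sketch: configuring pg explicitly instead of relying on the PGUSER,
// PGPASSWORD, and PGDATABASE environment variables.
const { Client } = require('pg');

const pgClient = new Client({
  host: 'localhost',
  port: 5432,
  user: 'tmp',
  password: 'hunter2',
  database: 'tmp',
});

pgClient.connect();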
Now that your service is running, it's time to try using the health checks.
Testing the Health Check
With the process running and connected to the databases, it should be considered in a healthy state. Issue the following request to check the status of the application:
$ curl -v http://localhost:3300/health
The response should contain the message OK and have an associated 200 status code.
Now we can simulate a degraded situation. Switch focus to the Redis service and press Ctrl + C to kill the process. You should see some error messages printed from the Node.js process. They will start off quickly and then slow down, because the ioredis module uses exponential backoff when attempting to reconnect to the Redis server: it retries rapidly at first and then waits longer between attempts.
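If you want control over that backoff, ioredis accepts a retryStrategy option. A minimal sketch:

// Sketch: customizing ioredis reconnection backoff. The function receives the
// number of reconnection attempts so far and returns the delay, in
// milliseconds, before the next attempt.
const Redis = require('ioredis');

const redis = new Redis({
  enableOfflineQueue: false,
  retryStrategy(times) {
    return Math.min(times * 50, 2000); // back off, capped at two seconds
  },
});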
Now that the application is no longer connected to Redis, run the same curl command again. This time, the response body should contain the message DEGRADED, though it will still have a 200 status code.
Switch back to the terminal window where you previously ran Redis and start the Redis service again. Then switch back to the terminal where you ran curl and run the request again. Depending on your timing, you might still receive the DEGRADED message, but you will eventually get the OK message once ioredis is able to reestablish a connection.
Note that killing Postgres in this manner will cause the application to crash. The pg library doesn't provide the same automatic reconnection feature that ioredis provides. Additional reconnection logic will need to be added to the application to get that working. "Database Connection Resilience" contains an example of this.
Alerting with Cabot
There are certain issues that simply cannot be resolved by automatically killing and restarting a process. Issues related to stateful services, like the downed Redis service mentioned in the previous section, are an example. Elevated 5XX error rates are another common example. In these situations it's often necessary to alert a developer to find the root cause of an issue and correct it. If such errors can cause a loss of revenue, then it becomes necessary to wake developers up in the middle of the night.
In these situations a cellphone is usually the best medium for waking a developer, often by triggering an actual phone call. Other message formats, such as emails, chat room messages, and text messages, usually aren't accompanied by an annoying ringing sound and often won't suffice for alerting the developer.
In this section, you'll set up an instance of Cabot, which is an open source tool for polling the health of an application and triggering alerts. Cabot supports multiple forms of health checks, such as querying Graphite and comparing reported values to a threshold, as well as pinging a host. Cabot also supports making an HTTP request, which is what is covered in this section.
In this section, you'll also create a free Twilio trial account. Cabot can use this account to both send SMS messages and make phone calls. You can skip this part if you would prefer not to create a Twilio account. In that case, you'll just see a dashboard changing colors from a happy green to an angry red.
The examples in this section will have you create a single user in Cabot, and that user will receive all the alerts. In practice, an organization will set up schedules, usually referred to as the on-call rotation. In these situations, the person who receives an alert depends on the schedule: for example, Alice might be on call during week one, Bob during week two, Carol during week three, and then Alice again during week four.
Note
Another important feature in a real organization is something called a runbook. A runbook is usually a page in a wiki and is associated with a given alert. The runbook contains information on how to diagnose and fix an issue. That way, when an engineer gets a notification at 2 A.M. about the Database Latency alert, they can read about how to access the database and run a query. You won't create a runbook for this example, but you must be diligent in doing so for real-world alerts.
Create a Twilio Trial Account
At this point, head over to https://twilio.com and create a trial account. When you create an account, you will get two pieces of data that you will need for configuring Cabot. The first piece of information is called an Account SID. This is a string that starts with AC and contains a bunch of hexadecimal characters. The second piece of information is the Auth Token. This value just looks like normal hexadecimal characters.
Using the interface, you'll also need to configure a Trial Number. This is a virtual phone number that you can use with this project. The phone number begins with a plus sign followed by a country code and the rest of the number. You'll need to use this number within your project, including the plus sign and country code. The number you receive might look like +15551234567.
Finally, you'll need to configure your personal cellphone's number as a Verified Number/Verified Caller ID in Twilio. This allows you to confirm with Twilio that the phone number belongs to you and that you're not just using Twilio to send spam texts to strangers; this verification step is a limitation of the Twilio trial account. After you verify your phone number, you'll be able to configure Cabot to send an SMS message to it.
Running Cabot via Docker
Cabot is a little more complex than the other services covered in this chapter. It requires several Docker images, not just a single one. For that reason you'll need to use Docker Compose to launch several containers, instead of launching a single one using Docker. Run the following commands to pull the git repository and check out a commit that is known to be compatible with this example:
$ git clone git@github.com:cabotapp/docker-cabot.git cabot
$ cd cabot
$ git checkout 1f846b96
Next, create a new file located at conf/production.env within this repository. Note that it's not within the distributed-node directory that you've been creating all your other project files in. Add the content from Example 4-13 to this file.
Example 4-13. conf/production.env
TIME_ZONE=America/Los_Angeles
ADMIN_EMAIL=admin@example.org
CABOT_FROM_EMAIL=cabot@example.org
DJANGO_SECRET_KEY=abcd1234
WWW_HTTP_HOST=localhost:5000
WWW_SCHEME=http
# GRAPHITE_API=http://<YOUR-IP-ADDRESS>:8080/
TWILIO_ACCOUNT_SID=<YOUR_TWILIO_ACCOUNT_SID>
TWILIO_AUTH_TOKEN=<YOUR_TWILIO_AUTH_TOKEN>
TWILIO_OUTGOING_NUMBER=<YOUR_TWILIO_NUMBER>
Set the TIME_ZONE value to your TZ time zone.
For extra credit, configure the GRAPHITE_API source using your IP address.
Omit the TWILIO_* lines if you're not using Twilio. Be sure to prefix the phone number with a plus sign and country code.
Tip
If you're feeling adventurous, configure the GRAPHITE_API line to use the same Graphite instance that you created in "Metrics with Graphite, StatsD, and Grafana". Later, when using the Cabot interface, you can choose which metrics to create an alert on. This is useful for taking a metric, like request timing, and alerting once it surpasses a certain threshold, such as 200ms. However, for brevity, this section won't cover how to set it up, and you can omit the line.
Once you've finished configuring Cabot, run the following command to start the Cabot service:
$ docker-compose up
This will cause several Docker containers to start running. In the terminal, you should see progress as each image is downloaded, followed by colored output associated with each container once it's running. Once things have settled down, you're ready to move on to the next step.
Creating a Health Check
For this example, use the same basic-http-healthcheck.js file from Example 4-12 that you made in the previous section. Execute that file and run the Postgres service as configured in Example 4-11. Once that is done, Cabot can be configured to make use of the /health endpoint the Node.js service exposes.
With the Node.js service now running, open the Cabot web service using your web browser by visiting http://localhost:5000.
You'll first be prompted to create an administrative account. Use the default username admin. Next, put in your email address and a password and click Create. Then you'll be prompted to log in. Type admin in the username field, enter your password again, then click Log in. You'll finally be taken to the services screen, which will contain no entries.
On the empty services screen, click the large plus symbol to be taken to the New service screen. Then, input the information from Table 4-3 into the create service form.
Name: Dist Node Service
Url: http://<LOCAL_IP>:3300/
Users to notify: admin
Alerts: Twilio SMS
Alerts enabled: checked
Again, you'll need to replace <LOCAL_IP> with your IP address. Once you've entered the information, click the Submit button. This will take you to a screen where you can view the Dist Node Service overview.
On this screen, scroll down to the Http checks section and click the plus sign to be taken to the New check screen. There, input the information from Table 4-4 into the "create check" form.
Name: Dist Node HTTP Health
Endpoint: http://<LOCAL_IP>:3300/health
Status code: 200
Importance: Critical
Active: checked
Service set: Dist Node Service
Once you've entered that information, click the Submit button. This will take you back to the Dist Node Service overview screen.
Next, the admin account needs to be configured to receive alerts using Twilio SMS. In the upper-right corner of the screen, click the admin drop-down menu, then click Profile settings. On the left sidebar, click the Twilio Plugin link. This form will ask you for your phone number. Enter your phone number, beginning with a plus symbol and the country code. This number should match the verified number that you previously entered in your Twilio account. Once you're done, click the Submit button.
Once you're done setting your phone number, click the Checks link in the top navigation bar. This will take you to the Checks listing page, which should contain the one entry you've created. Click the single entry, Dist Node HTTP Health, to be taken to the health check history listing. At this point, you should only see one or two entries, since checks run once every five minutes. These entries should have a green "succeeded" label next to them. Click the circular arrow icon in the upper right to trigger another health check.
Now switch back to the terminal window where your Node.js service is running. Kill it with Ctrl + C. Then switch back to Cabot and click the icon to run the test again. This time the test will fail, and you'll get a new entry in the list with a red background and the word "failed."
You should also get a text message containing information about the alert. The message I received is shown here:
Sent from your Twilio trial account - Service Dist Node Service reporting CRITICAL status: http://localhost:5000/service/1/
If Cabot were properly installed on a real server somewhere with a real hostname, the text message would contain a working link that could then be opened on your phone. However, since Cabot is probably running on your laptop, the URL doesn't make a lot of sense in this context.
Click the Services link at the top of the screen, then click the Dist Node Service link again. On this screen, you'll now see a graph displaying the status of the service, as well as a banner stating that the service is critical, like in Figure 4-11. Now click the Acknowledge alert button to pause the alerts for 20 minutes. This is useful for giving you time to work on the issue without being alerted over and over. It's now time to fix the failing service.
Switch back to the terminal where you ran the Node.js process and start it again. Then, switch back to the browser. Navigate back to the HTTP check you created. Click the icon to trigger the check again. This time the check should succeed, and it will switch back to a green "succeeded" message.
Cabot, as well as other alerting tools, offers the ability to assign different users to different services. This is important since different teams within an organization will own different services. When you created an HTTP alert, it was also possible to provide a regex to be applied against the body. This can be used to differentiate a degraded service from an unhealthy service. Cabot can then be configured to have an unhealthy service alert an engineer but have a degraded service merely be highlighted in the UI.
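For example (a sketch of the idea, not Cabot's configuration syntax), a body pattern that only matches a fully healthy response would treat the DEGRADED body from the earlier health check as a failure, while a looser pattern would accept both:

// Sketch: distinguishing a degraded body from a healthy one with a regex.
const fullyHealthy = /^OK$/;

console.log(fullyHealthy.test('OK'));       // true  -> treat as passing
console.log(fullyHealthy.test('DEGRADED')); // false -> treat as failing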
At this point you're done with the Cabot Docker containers. Switch to the window where you were running Cabot and press Ctrl + C to kill it. Then run the following command to remove the containers from your system:
$ docker rm cabot_postgres_1 cabot_rabbitmq_1 \
  cabot_worker_1 cabot_beat_1 cabot_web_1