Writing Code for Production

One of the challenges of writing a book is explaining things in the simplest way possible, which runs counter to showing the kind of robust, functional code you'd actually want to deploy. Although we should always strive for the simplest, most understandable code we can get, sometimes making code more robust or faster comes at the cost of simplicity. This section provides guidance on hardening the applications you deploy, which you can take with you as you explore upcoming chapters. It's about writing mature code that will keep your application running long into the future. The advice isn't exhaustive, but writing robust code up front will spare you many maintenance issues later. One of the trade-offs of Node's single-threaded approach is a tendency to be brittle, and these techniques help mitigate that risk.

Deploying a production application is not the same as running test programs on your laptop. Servers can have a wide variety of resource constraints, but they tend to have a lot more resources than the typical machine you would develop on. Typically, frontend servers have many more cores (CPUs) than laptop or desktop machines, but less hard drive space. They also have a lot of RAM. Node currently has some constraints, such as a maximum JavaScript heap size. This affects the way you deploy because you want to maximize the use of the CPUs and memory on the machine while using Node’s easy-to-program single-threaded approach.
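You can see how close a running process is to those limits from inside Node itself. The following snippet is a minimal sketch using the standard process.memoryUsage() call, which reports the resident set size along with V8 heap statistics:

var usage = process.memoryUsage();

// rss is the total memory held by the process; heapUsed/heapTotal
// describe the JavaScript heap that is subject to Node's size limit
console.log('rss: ' + usage.rss + ' bytes');
console.log('heap: ' + usage.heapUsed + ' of ' + usage.heapTotal + ' bytes');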

Error Handling

As we saw earlier in this chapter, I/O in Node happens outside the main flow of your code, and error handling is one of the things this affects. JavaScript includes try/catch functionality, but it's appropriate only for errors that happen inline. When you do nonblocking I/O in Node, you pass a callback to the function, and that callback runs when the event happens, outside of the try/catch block. We need error handling that works in these asynchronous situations. Consider the code in Example 3-9.

Example 3-9. Trying to catch an error in a callback and failing

var http = require('http')

var opts = {
  host: 'sfnsdkfjdsnk.com',
  port: 80,
  path: '/'
}

try {
  http.get(opts, function(res) {
    console.log('Will this get called?')
  })
}
catch (e) {
  console.log('Will we catch an error?')
}

When you call http.get(), what is actually happening? We pass some parameters specifying the I/O we want and a callback function. When the I/O completes, the callback will be fired. However, the http.get() call itself succeeds as soon as the request has been issued; the callback doesn't run until later, after the try/catch block has already exited. An error during the GET therefore cannot be caught by the try/catch block.

The disconnect from I/O errors is even more obvious in the Node REPL. Because the REPL shell prints out any return values that are not assigned, we can see that http.get() returns the http.ClientRequest object it creates. This means the try/catch did its job: the specified code returned without errors. However, because the hostname is nonsense, a problem will occur within the I/O request itself, and the callback can't complete successfully. A try/catch can't help here, because the error happened outside our JavaScript code, and by the time Node is ready to report it, we are not in that call stack anymore; we've moved on to dealing with other events.

We deal with this in Node by using the error event. This is a special event that is fired when an error occurs, letting a module engaging in I/O fire an alternative event, separate from the success callback, to report the problem. The error event allows us to deal with errors that occur in any of the callbacks in any modules we use. Let's write the previous example correctly, as shown in Example 3-10.

Example 3-10. Catching an I/O error with the error event

var http = require('http')

var opts = {
  host: 'dskjvnfskcsjsdkcds.net',
  port: 80,
  path: '/'
}

var req = http.get(opts, function(res) {
  console.log('This will never get called')
})

req.on('error', function(e) {
  console.log('Got that pesky error trapped')
})

By using the error event, we got to deal with the error (in this case by simply logging it), and, more important, our program survived. Like try/catch in JavaScript, the error event catches all kinds of exceptions. A good general approach is to set up conditionals that check for known error conditions and deal with them where possible. Catching any remaining errors, logging them, and keeping your server running is probably the best way to handle the rest.
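As a minimal sketch of that approach (the error codes shown are standard Node error codes, but the handling policy here is just an illustration), we can branch on the error's code property inside the error listener:

var req = http.get(opts, function(res) {
  console.log('Response received')
})

req.on('error', function(e) {
  if (e.code === 'ENOTFOUND') {
    // Known condition: the hostname didn't resolve
    console.log('Bad hostname: ' + opts.host)
  } else if (e.code === 'ECONNREFUSED') {
    // Known condition: the host is up but nothing is listening
    console.log('Connection refused by ' + opts.host)
  } else {
    // Anything else: log it and keep the server running
    console.log('Unexpected error: ' + e.message)
  }
})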

Using Multiple Processors

As we’ve mentioned, Node is single-threaded, so it uses only one processor core to do its work. However, most servers have multicore processors, and each multicore processor contains several cores; a server with two physical CPU sockets might expose 24 logical cores to the operating system. To make the best use of Node, we should use those too. So if we don’t have threads, how do we do that?

Node provides a module called cluster that allows you to delegate work to child processes. This means that Node creates a copy of its current program in another process. Each child process has some special abilities, such as the ability to share a socket with other children. This allows us to write Node programs that start many other Node programs and then delegate work to them.

It is important to understand that when you use cluster to share work between a number of copies of a Node program, the master process isn’t involved in every transaction. The master process manages the child processes, but when the children interact with I/O they do it directly, not through the master. This means that if you set up a web server using cluster, requests don’t go through your master process, but directly to the children. Hence, dispatching requests does not create a bottleneck in the system.

By using the cluster API, you can distribute work to a Node process on every available core of your server. This makes the best use of the resource. Let’s look at a simple cluster script in Example 3-11.

Example 3-11. Using cluster to distribute work

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork workers.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  // 'exit' fires in the master when a worker process dies
  cluster.on('exit', function(worker) {
    console.log('worker ' + worker.process.pid + ' died');
  });
} else {
  // Worker processes have a http server.
  http.Server(function(req, res) {
    res.writeHead(200);
    res.end("hello world\n");
  }).listen(8000);
}

In this example, we use a few parts of Node core to evenly distribute the work across all of the CPUs available: the cluster module, the http module, and the os module. From the latter, we simply get the number of CPUs on the system.

The way cluster works is that each Node process becomes either a “master” or a “worker” process. When a master process calls the cluster.fork() method, it creates a child process that is identical to the master, except for two attributes that each process can check to see which role it has. In the master process, the one in which the script was directly invoked with Node, cluster.isMaster is true and cluster.isWorker is false; in a child process, the values are reversed.

The example shows a master script that invokes a worker for each CPU. Each child starts an HTTP server, which demonstrates another unique aspect of cluster: when you listen() to a socket where cluster is in use, many processes can listen to the same socket. If you simply started several Node processes with node myscript.js, this wouldn’t be possible, because the second process to start would throw the EADDRINUSE exception. cluster provides a cross-platform way to invoke several processes that share a socket. And even when the children all share a connection to a port, if one of them is jammed, it doesn’t stop the other workers from getting connections.
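To see the contrast, here is a minimal sketch (not one of the book’s numbered examples) of what happens without cluster: the second plain Node process that tries to bind the port gets an EADDRINUSE error on the server’s error event:

var http = require('http')

var server = http.Server(function(req, res) {
  res.writeHead(200)
  res.end('hello world\n')
})

server.on('error', function(e) {
  if (e.code === 'EADDRINUSE') {
    // The second process to run this script lands here
    console.log('Port 8000 is already in use')
  }
})

server.listen(8000)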

We can do more with cluster than simply share sockets, because it is based on the child_process module. This gives us a number of attributes, and some of the most useful ones relate to the health of the child processes. In the previous example, when a child dies, the master process uses console.log() to print out a death notification. However, a more useful script would cluster.fork() a new child, as shown in Example 3-12.

Example 3-12. Forking a new worker when a death occurs

if (cluster.isMaster) {
  // Fork workers.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  // When a worker dies, log it and start a replacement
  cluster.on('exit', function(worker) {
    console.log('worker ' + worker.process.pid + ' died');
    cluster.fork();
  });
}

This simple change means that our master process can keep restarting dying processes to keep our server firing on all CPUs. However, this is just a basic check to keep processes running. We can do more sophisticated things as well. Because workers can pass messages to the master, we can have each worker report some stats, such as memory usage, to the master. This allows the master to determine when workers are becoming unruly, or to confirm that workers are not freezing or getting stuck in long-running events (see Example 3-13).

Example 3-13. Monitoring worker health using message passing

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

var rssWarn = (12 * 1024 * 1024)
  , heapWarn = (10 * 1024 * 1024)

if(cluster.isMaster) {
  for(var i=0; i<numCPUs; i++) {
    var worker = cluster.fork();
    worker.on('message', function(m) {
      if (m.memory) {
        if(m.memory.rss > rssWarn) {
          console.log('Worker ' + m.process + ' using too much memory.')
        }
      }
    })
  }
} else {
  //Server
  http.Server(function(req,res) {
    res.writeHead(200);
    res.end('hello world\n')
  }).listen(8000)
  //Report stats once a second
  setInterval(function report(){
    process.send({memory: process.memoryUsage(), process: process.pid});
  }, 1000)
}

In this example, workers report on their memory usage, and the master sends an alert to the log when a process uses too much memory. This replicates the functionality of many health reporting systems that operations teams already use. Unlike those systems, though, it puts control in the master Node process, which has some benefits: the message-passing interface works in both directions, so the master can send messages back to the workers too, and you can treat the master process as a lightly loaded admin interface to your workers.
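As a hedged sketch of that admin pattern (the 'shutdown' command name is made up for illustration), the master sends a message to a worker with worker.send(), and the worker receives it via process.on('message'):

// In the master, after forking:
var worker = cluster.fork()
worker.send({cmd: 'shutdown'})  // hypothetical admin command

// In the worker:
process.on('message', function(m) {
  if (m.cmd === 'shutdown') {
    // Stop accepting new work, then exit cleanly
    process.exit(0)
  }
})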

There are other things we can do with message passing that we can’t do from outside of Node. Because Node relies on an event loop to do its work, there is the danger that the callback of an event in the loop could run for a long time, preventing other users of the process from getting their requests met until it concludes. The master process has a connection to each worker, so we can tell each worker to send an “all OK” notification periodically. This lets us validate that the event loop is turning over at an appropriate rate and hasn’t become stuck on one callback. Sadly, identifying a long-running callback doesn’t give us a way to interrupt it: any notification we could send to the process gets added to the event queue, so it would have to wait for the long-running callback to finish. Consequently, although the master process lets us identify zombie workers, our only remedy is to kill the worker and lose all the tasks it was doing.

Some preparation can give you the capability to kill an individual worker that threatens to take over its processor; see Example 3-14.

Example 3-14. Killing zombie workers

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

var rssWarn = (50 * 1024 * 1024)
  , heapWarn = (50 * 1024 * 1024)

var workers = {}

if(cluster.isMaster) {
  for(var i=0; i<numCPUs; i++) {
    createWorker()
  }

  setInterval(function() {
    var time = new Date().getTime()
    for (var pid in workers) {
      if(workers.hasOwnProperty(pid) &&
         workers[pid].lastCb + 5000 < time) {

        console.log('Long running worker ' + pid + ' killed')
        workers[pid].worker.kill()
        delete workers[pid]
        createWorker()
      }
    }
  }, 1000)
} else {
  //Server
  http.Server(function(req,res) {
    //mess up 1 in 200 reqs
    if (Math.floor(Math.random() * 200) === 4) {
      console.log('Stopped ' + process.pid + ' from ever finishing')
      while(true) { continue }
    }
    res.writeHead(200);
    res.end('hello world from '  + process.pid + '\n')
  }).listen(8000)
  //Report stats once a second
  setInterval(function report(){
    process.send({cmd: "reportMem", memory: process.memoryUsage(), process: process.pid})
  }, 1000)
}

function createWorker() {
  var worker = cluster.fork()
  console.log('Created worker: ' + worker.process.pid)
  // Seed lastCb in the past so a booting worker isn't killed
  // before it has a chance to send its first report
  workers[worker.process.pid] = {worker: worker, lastCb: new Date().getTime() - 1000}
  worker.on('message', function(m) {
    if (m.cmd === "reportMem") {
      workers[m.process].lastCb = new Date().getTime()
      if (m.memory.rss > rssWarn) {
        console.log('Worker ' + m.process + ' using too much memory.')
      }
    }
  })
}

In this script, we’ve added an interval to the master as well as the workers. Now whenever a worker sends a report to the master process, the master stores the time of the report. Every second or so, the master process looks at all its workers to check whether any of them haven’t responded in longer than 5 seconds (using > 5000 because timeouts are in milliseconds). If that is the case, it kills the stuck worker and restarts it. To make this process effective, we moved the creation of workers into a small function. This allows us to do the various pieces of setup in a single place, regardless of whether we are creating a new worker or restarting a dead one.

We also made a small change to the HTTP server to give each request a 1 in 200 chance of getting stuck in an infinite loop, so you can run the script and watch workers fail and be replaced. If you issue a bunch of parallel requests from several sources, you’ll see how it works. These are all entirely separate Node programs that interact via message passing, so no matter what happens to a worker, the master, being a small program that won’t get jammed, can always check on the others.
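If you want to generate that parallel load from Node itself, a minimal sketch (the request count of 500 is arbitrary) could look like this:

var http = require('http')

// Fire 500 requests at the clustered server
for (var i = 0; i < 500; i++) {
  http.get({host: 'localhost', port: 8000, path: '/'}, function(res) {
    res.on('data', function(chunk) {
      process.stdout.write(chunk)
    })
  }).on('error', function(e) {
    // Requests in flight to a killed worker will fail
    console.log('Request failed: ' + e.message)
  })
}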
