I/O

I/O is one of the core pieces that makes Node different from other frameworks. This section explores the APIs that provide nonblocking I/O in Node.

Streams

Many components in Node provide continuous output or can process continuous input. To make these components act in a consistent way, the stream API provides an abstract interface for them. This API provides common methods and properties that are available in specific implementations of streams. Streams can be readable, writable, or both. All streams are EventEmitter instances, allowing them to emit events.

Readable streams

The readable stream API is a set of methods and events that provides access to chunks of data as they are sent by an underlying data source. Fundamentally, readable streams are about emitting data events. These events represent the stream of data as a stream of events. To make this manageable, streams have a number of features that allow you to configure how much data you get and how fast.

The basic stream in Example 4-16 simply reads data from a file in chunks. Every time a new chunk is made available, it is exposed to a callback in the variable called data. In this example, we simply log the data to the console. However, in real use cases, you might either stream the data somewhere else or spool it into bigger pieces before you work on it. In essence, the data event simply provides access to the data, and you have to figure out what to do with each chunk.

Example 4-16. Creating a readable file stream

var fs = require('fs');
var filehandle = fs.readFile('data.txt', function(err, data) {
  console.log(data)
});

Let’s look in more detail at one of the common patterns used in dealing with streams. The spooling pattern is used when we need an entire resource available before we deal with it. We know it’s important not to block the event loop for Node to perform well, so even though we don’t want to perform the next action on this data until we’ve received all of it, we don’t want to block the event loop. In this scenario (Example 4-17), we use a stream to get the data, but use the data only when enough is available. Typically this means when the stream ends, but it could be another event or condition.

Example 4-17. Using the spooling pattern to read a complete stream

          //abstract stream
var spool = "";
stream.on('data', function(data) {
  spool += data;
});
stream.on('end', function() {
  console.log(spool);
});

Filesystem

The filesystem module is obviously very helpful because you need it in order to access files on disk. It closely mimics the POSIX style of file I/O. It is a somewhat unique module in that all of the methods have both asynchronous and synchronous versions. However, we strongly recommend that you use the asynchronous methods, unless you are building command-line scripts with Node. Even then, it is often much better to use the async versions, even though doing so adds a little extra code, so that you can access multiple files in parallel and reduce the running time of your script.

The main issue that people face while dealing with asynchronous calls is ordering, and this is especially true with file I/O. It’s common to want to do a number of moves, renames, copies, reads, or writes at one time. However, if one of the operations depends on another, this can create issues because return order is not guaranteed. This means that the first operation in the code could happen after the second operation in the code. Patterns exist to make ordering easy. We talked about them in detail in Chapter 3, but we’ll provide a recap here.

Consider the case of reading and then deleting a file (Example 4-18). If the delete (unlink) happens before the read, it will be impossible to read the contents of the file.

Example 4-18. Reading and deleting a file asynchronously—but all wrong

var fs = require('fs');

fs.readFile('warandpeace.txt', function(e, data) {
  console.log('War and Peace: ' + data);
});

fs.unlink('warandpeace.txt');

Notice that we are using the asynchronous methods, and although we have created callbacks, we haven’t written any code that defines in which order they get called. This often becomes a problem for programmers who are not used to programming in event loops. This code looks OK on the surface and sometimes it will work at runtime, but sometimes it won’t. Instead, we need to use a pattern in which we specify the ordering we want for the calls. There are a few approaches. One common approach is to use nested callbacks. In Example 4-19, the asynchronous call to delete the file is nested within the callback to the asynchronous function that reads the file.

Example 4-19. Reading and deleting a file asynchronously using nested callbacks

var fs = require('fs');

fs.readFile('warandpeace.txt', function(e, data) {
  console.log('War and Peace: ' + data);
  fs.unlink('warandpeace.txt');
});

This approach is often very effective for discrete sets of operations. In our example with just two operations, it’s easy to read and understand, but this pattern can potentially get out of control.

Buffers

Although Node is JavaScript, it is JavaScript out of its usual environment. For instance, the browser requires JavaScript to perform many functions, but manipulating binary data is rarely one of them. Although JavaScript does support bitwise operations, it doesn’t have a native representation of binary data. This is especially troublesome when you also consider the limitations of the number type system in JavaScript, which might otherwise lend itself to binary representation. Node introduces the Buffer class to make up for this shortfall when you’re working with binary data.

Buffers are an extension to the V8 engine, which means that they have their own set of pitfalls. Buffers are actually a direct allocation of memory, which may mean a little or a lot, depending on your experience with lower-level computer languages. Unlike the data types in JavaScript, which abstract some of the ugliness of storing data, Buffer provides direct memory access, warts and all. Once a Buffer is created, it is a fixed size. If you want to add more data, you must clone the Buffer into a larger Buffer. Although some of these features may seem frustrating, they allow Buffer to perform at the speed necessary for many data operations on the server. It was a conscious design choice to trade off some programmer convenience for performance.

A quick primer on binary

We thought it was important to include this quick primer on working with binary data for those who haven’t done much of it, or as a refresher for those of us who haven’t in a long time (which was true for us when we started working with Node). Computers, as almost everyone knows, work by manipulating states of “on” and “off.” We call this a binary state because there are only two possibilities. Everything in computers is built on top of this, which means that working directly with binary can often be the fastest method on the computer. To do more complex things, we collect “bits” (each representing a single binary state) into groups of eights, often called an octet or, more commonly, a byte.[9] This allows us to represent bigger numbers than just 0 or 1.

By creating sets of 8 bits, we are able to represent any number from 0 to 255. The rightmost bit represents 1, but then we double the value of the number represented by each bit as we move left. To find out what number it represents, we simply sum the numbers in column headers (Example 4-20).

Example 4-20. Representing 0 through 255 in a byte

128 64 32 16 8 4 2 1
--- -- -- -- - - - -
0   0  0  0  0 0 0 0 = 0

128 64 32 16 8 4 2 1
--- -- -- -- - - - -
1   1  1  1  1 1 1 1 = 255

128 64 32 16 8 4 2 1
--- -- -- -- - - - -
1   0  0  1  0 1 0 1 = 149

You’ll also see the use of hexadecimal notation, or “hex,” a lot. Because bytes need to be easily described and a string of eight 0s and 1s isn’t very convenient, hex notation has become popular. Binary notation is base 2, in that there are only two possible states per digit (0 or 1). Hex uses base 16, and each digit in hex can have a value from 0 to F, where the letters A through F (or their lowercase equivalents) stand for 10 through 15, respectively. What’s very convenient about hex is that with two digits we can represent a whole byte. The right digit represents 1s, and the left digit represents 16s. If we wanted to represent decimal 149, it is (16 x 9) + (5 x 1), or the hex value 95.

Example 4-21. Representing 0 through 255 with hex notation

Hex to Decimal:

0 1 2 3 4 5 6 7 8 9 A  B  C  D  E  F
- - - - - - - - - - -- -- -- -- -- --
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15


Counting in hex:

16 1
-- -
0  0 = 0

16 1
-- -
F  F = 255

16 1
-- -
9  5 = 149

In JavaScript, you can create a number from a hex value using the notation 0x in front of the hex value. For instance, 0x95 is decimal 149. In Node, you’ll commonly see Buffers represented by hex values in console.log() output or Node REPL. Example 4-22 shows how you could store 3-octet values (such as an RGB color value) as a Buffer.

Example 4-22. Creating a 3-byte Buffer from an array of octets

> new Buffer([255,0,149]);
<Buffer ff 00 95>
>

So how does binary relate to other kinds of data? Well, we’ve seen how binary can represent numbers. In network protocols, it’s common to specify a certain number of bytes to convey some information, using particular bits in fixed places to indicate specific things. For example, in a DNS request, the first two bytes are used as a number for a transaction ID, whereas the next byte is treated as individual bits, each used to indicate whether a specific feature of DNS is being used in this request.

The other extremely common use of binary is to represent strings. The two most common “encoding” formats for strings are ASCII and UTF (typically UTF-8). These encodings define how the bits should be converted into characters. We’re not going to go into too much of the gory detail, but essentially, encodings work by having a lookup table that maps the character to a specific number represented in bytes. To convert the encoding, the computer has to simply convert from the number to the character by looking it up in a conversion table.

ASCII characters (some of which are nonvisible “control characters,” such as Return) are always exactly 7 bits each, so they can be represented by values from 0 to 127. The eighth bit in a byte is often used to extend the character set to represent various choices of international characters (such as ȳ or ȱ).

UTF is a little more complex. Its character set has a lot more characters, including many international ones. Each character in UTF-8 is represented by at least 1 byte, but sometimes up to 4. Essentially, the first 128 values are good old ASCII, whereas the others are pushed further down in the map and represented by higher numbers. When a less common character is referenced, the first byte uses a number that tells the computer to check out the next byte to find the real address of the character starting on the second sheet of its map. If the character isn’t on the second sheet of the map, the second byte tells the computer to look at the third, and so on. This means that in UTF-8, the length of a string measured in characters isn’t necessarily the same as its length in bytes, as is always true with ASCII.

Binary and strings

It is important to remember is that once you copy things to a Buffer, they will be stored as their binary representations. You can always convert the binary representation in the buffer back into other things, such as strings, later. So a Buffer is defined only by its size, not by the encoding or any other indication of its meaning.

Given that Buffer is opaque, how big does it need to be in order to store a particular string of input? As we’ve said, a UTF character can occupy up to 4 bytes, so to be safe, you should define a Buffer to be four times the size of the largest input you can accept, measured in UTF characters. There may be ways you can reduce this burden; for instance, if you limit your input to European languages, you’ll know there will be at most 2 bytes per character.

Using Buffers

Buffers can be created using three possible parameters: the length of the Buffer in bytes, an array of bytes to copy into the Buffer, or a string to copy into the Buffer. The first and last methods are by far the most common. There aren’t too many instances where you are likely to have a JavaScript array of bytes.[10]

Creating a Buffer of a particular size is a very common scenario and easy to deal with. Simply put, you specify the number of bytes as your argument when creating the Buffer (Example 4-23).

Example 4-23. Creating a Buffer using byte length

> new Buffer(10);
<Buffer e1 43 17 05 01 00 00 00 41 90>
>

As you can see from the previous example, when we create a Buffer we get a matching number of bytes. However, because the Buffer is just getting an allocation of memory directly, it is uninitialized and the contents are left over from whatever happened to occupy them before. This is unlike all the native JavaScript types, which initialize all memory so that when you create a new primitive or object, it doesn’t assign whatever was already in the memory space to the primitive or object you just created. Here is a good way to think about it. If you go to a busy cafe and you want a table, the fastest way to get one is to sit down as soon as some other people vacate one. However, although it’s fast, you are left with all their dirty dishes and the detritus from their meals. You might prefer to wait for one of the staff to clear the table and wipe it down before you sit. This is a lot like Buffers versus native types. Buffers do very little to make things easy for you, but they do give you direct and fast access to memory. If you want to have a nicely zeroed set of bits, you’ll need to do it yourself (or find a helper library).

Creating a Buffer using byte length is most common when you are working with things such as network transport protocols that have very specifically defined structures. When you know exactly how big the data is going to be (or you know exactly how big it could be) and you want to allocate and reuse a Buffer for performance reasons, this is the way to go.

Probably the most common way to use a Buffer is to create it with a string of either ASCII or UTF-8 characters. Although a Buffer can hold any data, it is particularly useful for I/O with character data because the constraints we’ve already seen on Buffer can make their operations much faster than operations on regular strings. So when you are building really highly scalable apps, it’s often worth using Buffers to hold strings. This is especially true if you are just shunting the strings around the application without modifying them. Therefore, even though strings exist as primitives in JavaScript, it’s still very common to keep strings in Buffers in Node.

When we create a Buffer with a string, as shown in Example 4-24, it defaults to UTF-8. That is, if you don’t specify an encoding, it will be considered a UTF-8 string. That is not to say that Buffer pads the string to fit any Unicode character (blindly allocating 4 bytes per character), but rather that it will not truncate characters. In this example, we can see that when taking a string with just lowercase alpha characters, the Buffer uses the same byte structure, whatever the encoding, because they all fall in the same range. However, when we have an “é,” it’s encoded as 2 bytes in the default UTF-8 case or when we specify UTF-8 explicitly. If we specify ASCII, the character is truncated to a single byte.

Example 4-24. Creating Buffers using strings

> new Buffer('foobarbaz');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('foobarbaz', 'ascii');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('foobarbaz', 'utf8');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('é');
<Buffer c3 a9>
> new Buffer('é', 'utf8');
<Buffer c3 a9>
> new Buffer('é', 'ascii');
<Buffer e9>
>

Working with strings

Node offers a number of operations to simplify working with strings and Buffers. First, you don’t need to compute the length of a string before creating a Buffer to hold it; just assign the string as the argument when creating the Buffer. Alternatively, you can use the Buffer.byteLength() method. This method takes a string and an encoding and returns the string’s length in bytes, rather than in characters as String.length does.

You can also write a string to an existing Buffer. The Buffer.write() method writes a string to a specific index of a Buffer. If there is room in the Buffer starting from the specified offset, the entire string will be written. Otherwise, characters are truncated from the end of the string to fit the Buffer. In either case, Buffer.write() will return the number of bytes that were written. In the case of UTF-8 strings, if a whole character can’t be written to the Buffer, none of the bytes for that character will be written. In Example 4-25, because the Buffer is too small for even one non-ASCII character, it ends up empty.

Example 4-25. Buffer.write( ) and partial characters

> var b = new Buffer(1);
> b
<Buffer 00>
> b.write('a');
1
> b
<Buffer 61>
> b.write('é');
0
> b
<Buffer 61>
>

In a single-byte Buffer, it’s possible to write an “a” character, and doing so returns 1, indicating that 1 byte was written. However, trying to write a “é” character fails because it requires 2 bytes, and the method returns 0 because nothing was written.

There is a little more complexity to Buffer.write(), though. If possible, when writing UTF-8, Buffer.write() will terminate the character string with a NUL character.[11] This is much more significant when writing into the middle of a larger Buffer.

In Example 4-26, after creating a Buffer that is 5 bytes long (which could have been done directly using the string), we write the character f to the entire Buffer. f is the character code 0x66 (102 in decimal). This makes it easy to see what happens when we write the characters “ab” to the Buffer starting with an offset of 1. The zeroth character is left as f. At positions 1 and 2, the characters themselves are written, 61 followed by 62. Then Buffer.write() inserts a terminator, in this case a null character of 0x00.

Example 4-26. Writing a string into a Buffer including a terminator

> var b = new Buffer(5);
> b.write('fffff');
5
> b
<Buffer 66 66 66 66 66>
> b.write('ab', 1);
2
> b
<Buffer 66 61 62 00 66>
>

console.log

Borrowed from the Firebug debugger in Firefox, the simple console.log command allows you to easily output to stdout without using any modules (Example 4-27). It also offers some pretty-printing functionality to help enumerate objects.

Example 4-27. Outputting with console.log

> foo = {};
{}
> foo.bar = function() {1+1};
[Function]
> console.log(foo);
{ bar: [Function] }
>


[9] There is no “standard” size of byte, but the de facto size that virtually everyone uses nowadays is 8 bits. Therefore, octets and bytes are equivalent, and we’ll be using the more common term byte to mean specifically an octet.

[10] It’s very memory-inefficient, for one thing. If you store each byte as a number, for instance, you are using a 64-bit memory space to represent 8 bits.

[11] This generally just means a binary 0.

Get Node: Up and Running now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.