I/O is one of the core pieces that makes Node different from other frameworks. This section explores the APIs that provide nonblocking I/O in Node.
Many components in Node provide continuous output or can process continuous input. To make these components act in a consistent way, the stream API provides an abstract interface for them. This API provides common methods and properties that are available in specific implementations of streams. Streams can be readable, writable, or both. All streams are EventEmitter instances, allowing them to emit events.
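As a small illustration of that shared interface, the sketch below uses the built-in PassThrough stream, which is both readable and writable. The variable names are illustrative only, and it assumes a reasonably recent Node release.

```javascript
const { EventEmitter } = require('events');
const { PassThrough } = require('stream');

// PassThrough is a built-in stream that is both readable and writable.
const duplex = new PassThrough();

// Every stream is also an EventEmitter, so listeners can be attached to it.
const isEmitter = duplex instanceof EventEmitter;
console.log(isEmitter);

// The same flags are exposed by every stream implementation.
console.log(duplex.readable, duplex.writable);
```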
The readable stream API is a set of methods and events that provides
access to chunks of data as they are sent by an underlying data
source. Fundamentally, readable streams are about emitting data
events. These events represent the
stream of data as a stream of events. To make this manageable, streams
have a number of features that allow you to configure how much data
you get and how fast.
The basic stream in Example 4-16 simply reads data from a file in chunks. Every time a new chunk is made available, it is exposed to a callback in the variable called data. In this example, we simply log the data to the console. However, in real use cases, you might either stream the data somewhere else or spool it into bigger pieces before you work on it. In essence, the data event simply provides access to the data, and you have to figure out what to do with each chunk.
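A minimal sketch of such a chunked read follows. It is not the original listing: the sample file, its path, and the variable names are assumptions made so that the snippet can run on its own.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical sample file, created here so the sketch is self-contained.
const sampleText = 'hello from a readable stream\n';
const samplePath = path.join(os.tmpdir(), 'stream-sketch-' + process.pid + '.txt');
fs.writeFileSync(samplePath, sampleText);

let bytesSeen = 0;
const reader = fs.createReadStream(samplePath);

reader.on('data', function (data) {
  // Each chunk arrives as a Buffer because no encoding has been set.
  bytesSeen += data.length;
  console.log(data);
});

reader.on('end', function () {
  console.log('read ' + bytesSeen + ' bytes');
  fs.unlinkSync(samplePath); // clean up the sample file
});
```

A file this small arrives in a single chunk; larger files trigger the data callback repeatedly.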
Let’s look in more detail at one of the common patterns used in dealing with streams. The spooling pattern is used when we need an entire resource available before we deal with it. We know it’s important not to block the event loop for Node to perform well, so even though we don’t want to perform the next action on this data until we’ve received all of it, we don’t want to block the event loop. In this scenario (Example 4-17), we use a stream to get the data, but use the data only when enough is available. Typically this means when the stream ends, but it could be another event or condition.
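The spooling pattern can be sketched as follows. The spool() helper is a hypothetical name, and Readable.from() (available in recent Node releases) stands in for a file or network stream so that the snippet is self-contained.

```javascript
const { Readable } = require('stream');

// Collect every chunk from a readable stream, then hand the complete
// payload to the callback once the stream ends.
function spool(readable, callback) {
  const chunks = [];
  readable.on('data', function (chunk) {
    chunks.push(Buffer.from(chunk));
  });
  readable.on('error', callback);
  readable.on('end', function () {
    callback(null, Buffer.concat(chunks));
  });
}

// A stand-in source that emits two chunks.
const source = Readable.from([Buffer.from('spool '), Buffer.from('me')]);
spool(source, function (err, whole) {
  if (err) throw err;
  // The data is used only here, after all of it has arrived.
  console.log(whole.toString());
});
```

Nothing in this sketch blocks the event loop: the chunks accumulate as they arrive, and the work on the whole payload is deferred to the end event.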
The filesystem module is obviously very helpful because you need it in order to access files on disk. It closely mimics the POSIX style of file I/O. It is a somewhat unique module in that all of the methods have both asynchronous and synchronous versions. However, we strongly recommend that you use the asynchronous methods, unless you are building command-line scripts with Node. Even then, it is often much better to use the async versions, even though doing so adds a little extra code, so that you can access multiple files in parallel and reduce the running time of your script.
The main issue that people face while dealing with asynchronous calls is ordering, and this is especially true with file I/O. It’s common to want to do a number of moves, renames, copies, reads, or writes at one time. However, if one of the operations depends on another, this can create issues because return order is not guaranteed. This means that the first operation in the code could happen after the second operation in the code. Patterns exist to make ordering easy. We talked about them in detail in Chapter 3, but we’ll provide a recap here.
Consider the case of reading and then deleting a file (Example 4-18). If the delete (unlink) happens before the read, it will be impossible to read the contents of the file.
Notice that we are using the asynchronous methods, and although we have created callbacks, we haven’t written any code that defines in which order they get called. This often becomes a problem for programmers who are not used to programming in event loops. This code looks OK on the surface and sometimes it will work at runtime, but sometimes it won’t. Instead, we need to use a pattern in which we specify the ordering we want for the calls. There are a few approaches. One common approach is to use nested callbacks. In Example 4-19, the asynchronous call to delete the file is nested within the callback to the asynchronous function that reads the file.
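The nested-callback ordering can be sketched like this. It is an illustration rather than the original listing: readThenDelete() and the temporary file are hypothetical names introduced for the example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Read a file and delete it only after the read has completed.
function readThenDelete(filePath, callback) {
  fs.readFile(filePath, 'utf8', function (readErr, contents) {
    if (readErr) return callback(readErr);
    // unlink is issued from inside the read callback, so it cannot run first.
    fs.unlink(filePath, function (unlinkErr) {
      if (unlinkErr) return callback(unlinkErr);
      callback(null, contents);
    });
  });
}

// Hypothetical sample file so the sketch can run on its own.
const target = path.join(os.tmpdir(), 'nested-sketch-' + process.pid + '.txt');
fs.writeFileSync(target, 'read me first');

readThenDelete(target, function (err, contents) {
  if (err) throw err;
  console.log(contents);
});
```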
This approach is often very effective for discrete sets of operations. In our example with just two operations, it’s easy to read and understand, but this pattern can potentially get out of control.
Although Node is JavaScript, it is JavaScript out of its usual environment. For instance, the browser requires JavaScript to perform many functions, but manipulating binary data is rarely one of them. Although JavaScript does support bitwise operations, it doesn’t have a native representation of binary data. This is especially troublesome when you also consider the limitations of the number type system in JavaScript, which might otherwise lend itself to binary representation. Node introduces the Buffer class to make up for this shortfall when you’re working with binary data.
Buffers are an extension to the V8 engine, which means that they have their own set of pitfalls. Buffers are actually a direct allocation of memory, which may mean a little or a lot, depending on your experience with lower-level computer languages. Unlike the data types in JavaScript, which abstract some of the ugliness of storing data, Buffer provides direct memory access, warts and all. Once a Buffer is created, it is a fixed size. If you want to add more data, you must clone the Buffer into a larger Buffer. Although some of these features may seem frustrating, they allow Buffer to perform at the speed necessary for many data operations on the server. It was a conscious design choice to trade off some programmer convenience for performance.
We thought it was important to include this quick primer on working with binary data for those who haven’t done much of it, or as a refresher for those of us who haven’t in a long time (which was true for us when we started working with Node). Computers, as almost everyone knows, work by manipulating states of “on” and “off.” We call this a binary state because there are only two possibilities. Everything in computers is built on top of this, which means that working directly with binary can often be the fastest method on the computer. To do more complex things, we collect “bits” (each representing a single binary state) into groups of eight, often called an octet or, more commonly, a byte.[9] This allows us to represent bigger numbers than just 0 or 1.
By creating sets of 8 bits, we are able to represent any number from 0 to 255. The rightmost bit represents 1, and the value represented by each bit doubles as we move left (2, 4, 8, and so on up to 128). To find out what number a byte represents, we simply sum the column-header values for every bit that is set to 1 (Example 4-20).
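The summing rule can be checked with a few lines of JavaScript. This is a sketch only; placeValues and byteValue() are names made up for the illustration.

```javascript
// Place values for the 8 bits of a byte, leftmost bit first.
const placeValues = [128, 64, 32, 16, 8, 4, 2, 1];

// Sum the place values of every bit that is set to 1.
function byteValue(bits) {
  return bits.split('').reduce(function (sum, bit, i) {
    return bit === '1' ? sum + placeValues[i] : sum;
  }, 0);
}

console.log(byteValue('10010101')); // 128 + 16 + 4 + 1 = 149
console.log(parseInt('10010101', 2)); // the built-in conversion agrees: 149
console.log(byteValue('11111111')); // 255, the largest value a byte can hold
```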
You’ll also see the use of hexadecimal notation, or “hex,” a lot. Because bytes need to be easily described and a string of eight 0s and 1s isn’t very convenient, hex notation has become popular. Binary notation is base 2, in that there are only two possible states per digit (0 or 1). Hex uses base 16, and each digit in hex can have a value from 0 to F, where the letters A through F (or their lowercase equivalents) stand for 10 through 15, respectively. What’s very convenient about hex is that with two digits we can represent a whole byte. The right digit represents 1s, and the left digit represents 16s. If we wanted to represent decimal 149, it is (16 x 9) + (5 x 1), or the hex value 95.
In JavaScript, you can create a number from a hex value using the notation 0x in front of the hex value. For instance, 0x95 is decimal 149. In Node, you’ll commonly see Buffers represented by hex values in console.log() output or in the Node REPL. Example 4-22 shows how you could store 3-octet values (such as an RGB color value) as a Buffer.
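A short sketch of both ideas follows: the hex arithmetic and a 3-octet color held in a Buffer. The color value is arbitrary, and Buffer.from() is the factory method that current Node releases provide in place of the deprecated new Buffer() constructor.

```javascript
// 0x95 is (9 * 16) + (5 * 1).
console.log(0x95 === 149); // true
console.log((149).toString(16)); // '95'

// Three octets holding a hypothetical RGB color value.
const color = Buffer.from([0xff, 0x00, 0x95]);
console.log(color); // <Buffer ff 00 95>
console.log(color[2]); // 149, the blue component as a decimal number
console.log(color.toString('hex')); // 'ff0095'
```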
So how does binary relate to other kinds of data? Well, we’ve seen how binary can represent numbers. In network protocols, it’s common to specify a certain number of bytes to convey some information, using particular bits in fixed places to indicate specific things. For example, in a DNS request, the first two bytes are used as a number for a transaction ID, whereas the next byte is treated as individual bits, each used to indicate whether a specific feature of DNS is being used in this request.
The other extremely common use of binary is to represent strings. The two most common “encoding” formats for strings are ASCII and UTF (typically UTF-8). These encodings define how the bits should be converted into characters. We’re not going to go into too much of the gory detail, but essentially, encodings work by having a lookup table that maps the character to a specific number represented in bytes. To convert the encoding, the computer has to simply convert from the number to the character by looking it up in a conversion table.
ASCII characters (some of which are nonvisible “control characters,” such as Return) are always exactly 7 bits each, so they can be represented by values from 0 to 127. The eighth bit in a byte is often used to extend the character set to represent various choices of international characters (such as ȳ or ȱ).
UTF is a little more complex. Its character set has a lot more characters, including many international ones. Each character in UTF-8 is represented by at least 1 byte, but sometimes up to 4. Essentially, the first 128 values are good old ASCII, whereas the others are pushed further down in the map and represented by higher numbers. When a less common character is referenced, the first byte uses a number that tells the computer to check out the next byte to find the real address of the character starting on the second sheet of its map. If the character isn’t on the second sheet of the map, the second byte tells the computer to look at the third, and so on. This means that in UTF-8, the length of a string measured in characters isn’t necessarily the same as its length in bytes, whereas with ASCII the two are always equal.
The important thing to remember is that once you copy things to a Buffer, they will be stored as their binary representations. You can always convert the binary representation in the buffer back into other things, such as strings, later. So a Buffer is defined only by its size, not by the encoding or any other indication of its meaning.
Given that Buffer is opaque, how big does it need to be in order to store a particular string of input? As we’ve said, a UTF character can occupy up to 4 bytes, so to be safe, you should define a Buffer to be four times the size of the largest input you can accept, measured in UTF characters. There may be ways you can reduce this burden; for instance, if you limit your input to European languages, you’ll know there will be at most 2 bytes per character.
Buffers can be created using three possible parameters: the length of the Buffer in bytes, an array of bytes to copy into the Buffer, or a string to copy into the Buffer. The first and last methods are by far the most common. There aren’t too many instances where you are likely to have a JavaScript array of bytes.[10]
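The three forms can be sketched as follows. One caveat: current Node releases deprecate calling new Buffer() directly, so this sketch assumes the factory methods Buffer.alloc() and Buffer.from() instead. The variable names are illustrative.

```javascript
// 1. By length: an 8-byte Buffer (Buffer.alloc zero-fills it).
const bySize = Buffer.alloc(8);

// 2. From an array of byte values.
const fromBytes = Buffer.from([0x66, 0x6f, 0x6f]);

// 3. From a string (UTF-8 unless another encoding is given).
const fromString = Buffer.from('foo');

console.log(bySize.length); // 8
console.log(fromBytes.equals(fromString)); // true: both hold 66 6f 6f
```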
Creating a Buffer of a particular size is a very common scenario and easy to deal with. Simply put, you specify the number of bytes as your argument when creating the Buffer (Example 4-23).
As you can see from the previous example, when we create a Buffer we get a matching number of bytes. However, because the Buffer is just getting an allocation of memory directly, it is uninitialized and the contents are left over from whatever happened to occupy them before. This is unlike all the native JavaScript types, which initialize all memory so that when you create a new primitive or object, it doesn’t assign whatever was already in the memory space to the primitive or object you just created. Here is a good way to think about it. If you go to a busy cafe and you want a table, the fastest way to get one is to sit down as soon as some other people vacate one. However, although it’s fast, you are left with all their dirty dishes and the detritus from their meals. You might prefer to wait for one of the staff to clear the table and wipe it down before you sit. This is a lot like Buffers versus native types. Buffers do very little to make things easy for you, but they do give you direct and fast access to memory. If you want to have a nicely zeroed set of bits, you’ll need to do it yourself (or find a helper library).
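In current Node releases, the trade-off between the two tables in the cafe is exposed as two separate factory methods. The sketch below assumes those methods (Buffer.allocUnsafe() and Buffer.alloc()) rather than the new Buffer(size) form used in this chapter.

```javascript
// allocUnsafe() skips initialization, so the bytes hold whatever was
// previously in that memory. This is the "uncleared table."
const fast = Buffer.allocUnsafe(16);

// Zero it by hand before relying on its contents.
fast.fill(0);

// alloc() hands back memory that has already been zero-filled.
const clean = Buffer.alloc(16);

console.log(fast.length); // 16
console.log(fast.equals(clean)); // true once fast has been filled with zeros
```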
Creating a Buffer using byte length is most common when you are working with things such as network transport protocols that have very specifically defined structures. When you know exactly how big the data is going to be (or you know exactly how big it could be) and you want to allocate and reuse a Buffer for performance reasons, this is the way to go.
Probably the most common way to use a Buffer is to create it with a string of either ASCII or UTF-8 characters. Although a Buffer can hold any data, it is particularly useful for I/O with character data because the constraints we’ve already seen on Buffer can make their operations much faster than operations on regular strings. So when you are building really highly scalable apps, it’s often worth using Buffers to hold strings. This is especially true if you are just shunting the strings around the application without modifying them. Therefore, even though strings exist as primitives in JavaScript, it’s still very common to keep strings in Buffers in Node.
When we create a Buffer with a string, as shown in Example 4-24, it defaults to UTF-8. That is, if you don’t specify an encoding, it will be considered a UTF-8 string. That is not to say that Buffer pads the string to fit any Unicode character (blindly allocating 4 bytes per character), but rather that it will not truncate characters. In this example, we can see that when taking a string with just lowercase alpha characters, the Buffer uses the same byte structure, whatever the encoding, because they all fall in the same range. However, when we have an “é,” it’s encoded as 2 bytes in the default UTF-8 case or when we specify UTF-8 explicitly. If we specify ASCII, the character is truncated to a single byte.
Example 4-24. Creating Buffers using strings
> new Buffer('foobarbaz');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('foobarbaz', 'ascii');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('foobarbaz', 'utf8');
<Buffer 66 6f 6f 62 61 72 62 61 7a>
> new Buffer('é');
<Buffer c3 a9>
> new Buffer('é', 'utf8');
<Buffer c3 a9>
> new Buffer('é', 'ascii');
<Buffer e9>
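The same comparison can be run as a script. This sketch assumes Buffer.from(), which current Node releases provide in place of the deprecated new Buffer() constructor, and writes “é” as the escape \u00e9 so that the source file’s own encoding cannot affect the result.

```javascript
const plain = 'foobarbaz';
const accented = '\u00e9'; // the single character "é"

// Plain lowercase letters produce the same bytes in either encoding.
console.log(Buffer.from(plain).equals(Buffer.from(plain, 'ascii'))); // true

// "é" takes 2 bytes in UTF-8, which is the default encoding.
const asUtf8 = Buffer.from(accented);
console.log(asUtf8); // <Buffer c3 a9>

// With 'ascii' the character is squeezed into a single byte.
const asAscii = Buffer.from(accented, 'ascii');
console.log(asAscii); // <Buffer e9>
```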
Node offers a number of operations to simplify working with strings and Buffers. First, you don’t need to compute the length of a string before creating a Buffer to hold it; just pass the string as the argument when creating the Buffer. Alternatively, you can use the Buffer.byteLength() method. This method takes a string and an encoding and returns the string’s length in bytes, rather than in characters as String.length does.
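A brief sketch of the difference between the two lengths, using an arbitrary sample string that contains one non-ASCII character:

```javascript
const text = 'caf\u00e9'; // "café": four characters, one of them non-ASCII

console.log(text.length); // 4 characters
console.log(Buffer.byteLength(text, 'utf8')); // 5 bytes, because "é" needs 2

// A Buffer created from the string is sized in bytes, not characters.
const holder = Buffer.from(text);
console.log(holder.length); // 5
```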
You can also write a string to an existing Buffer. The Buffer.write() method writes a string to a specific index of a Buffer. If there is room in the Buffer starting from the specified offset, the entire string will be written. Otherwise, characters are truncated from the end of the string to fit the Buffer. In either case, Buffer.write() will return the number of bytes that were written. In the case of UTF-8 strings, if a whole character can’t be written to the Buffer, none of the bytes for that character will be written. In Example 4-25, because the Buffer is too small for even one non-ASCII character, it ends up empty.
In a single-byte Buffer, it’s possible to write an “a” character, and doing so returns 1, indicating that 1 byte was written. However, trying to write an “é” character fails because it requires 2 bytes, and the method returns 0 because nothing was written.
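The behavior just described can be sketched as follows. The snippet assumes Buffer.alloc() from current Node releases, so each Buffer starts zero-filled and it is easy to see that nothing was written.

```javascript
// One byte is enough for "a".
const tiny = Buffer.alloc(1);
const wroteA = tiny.write('a');
console.log(wroteA, tiny); // 1 <Buffer 61>

// "é" needs 2 bytes in UTF-8, so none of it is written.
const tooSmall = Buffer.alloc(1);
const wroteE = tooSmall.write('\u00e9');
console.log(wroteE, tooSmall); // 0 <Buffer 00>
```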
There is a little more complexity to Buffer.write(), though. If possible, when writing UTF-8, Buffer.write() will terminate the character string with a NUL character.[11] This is much more significant when writing into the middle of a larger Buffer.
In Example 4-26, after creating a Buffer that is 5 bytes long (which could have been done directly using the string), we write the character f to the entire Buffer. f is the character code 0x66 (102 in decimal). This makes it easy to see what happens when we write the characters “ab” to the Buffer starting with an offset of 1. The zeroth character is left as f. At positions 1 and 2, the characters themselves are written, 61 followed by 62. Then Buffer.write() inserts a terminator, in this case a null character of 0x00.
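A sketch of the offset write follows. The terminator behavior described above comes from the Node version this chapter was written against; current releases, as far as we are aware, leave the bytes after the written string untouched. The snippet therefore inspects only the bytes that the write is known to set, and assumes Buffer.alloc() and fill() from current releases.

```javascript
// A 5-byte Buffer filled with "f" (0x66) so that changes are easy to spot.
const target = Buffer.alloc(5);
target.fill('f');

// Write "ab" starting at offset 1.
const written = target.write('ab', 1);
console.log(written); // 2 bytes written

// Byte 0 is still "f"; bytes 1 and 2 now hold "a" (61) and "b" (62).
const firstThree = target.subarray(0, 3).toString('hex');
console.log(firstThree); // '666162'

// Print the whole Buffer to see what your Node version does with byte 3.
console.log(target);
```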
Borrowed from the Firebug debugger in Firefox, the simple console.log command allows you to easily output to stdout without using any modules (Example 4-27). It also offers some pretty-printing functionality to help enumerate objects.
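A short sketch of both uses follows. The stats object is made-up sample data, and util.format() is included only to show the string that the format-style call produces.

```javascript
const util = require('util');

// Made-up sample data for the illustration.
const stats = { name: 'readme.txt', size: 1024, tags: ['docs', 'text'] };

// Passing an object pretty-prints its properties.
console.log(stats);

// printf-style placeholders are also supported.
console.log('%s is %d bytes', stats.name, stats.size);

// util.format() applies the same substitution and returns the string.
const line = util.format('%s is %d bytes', stats.name, stats.size);
console.log(line);
```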
[9] There is no “standard” size of byte, but the de facto size that virtually everyone uses nowadays is 8 bits. Therefore, octets and bytes are equivalent, and we’ll be using the more common term byte to mean specifically an octet.
[10] It’s very memory-inefficient, for one thing. If you store each byte as a number, for instance, you are using a 64-bit memory space to represent 8 bits.
[11] This generally just means a binary 0.