Handling Requests: The Detailed View

The sbull driver as described earlier works very well. In simple situations (as with sbull), the macros from <linux/blk.h> can be used to easily set up a request function and get a working driver. As has already been mentioned, however, block drivers are often a performance-critical part of the kernel. Drivers based on the simple code shown earlier will likely not perform very well in many situations, and can also be a drag on the system as a whole. In this section we get into the details of how the I/O request queue works with an eye toward writing a faster, more efficient driver.

The I/O Request Queue

Each block driver works with at least one I/O request queue. This queue contains, at any given time, all of the I/O operations that the kernel would like to see done on the driver’s devices. The management of this queue is complicated; the performance of the system depends on how it is done.

The queue is designed with physical disk drives in mind. With disks, the amount of time required to transfer a block of data is typically quite small. The amount of time required to position the head (seek) to do that transfer, however, can be very large. Thus the Linux kernel works to minimize the number and extent of the seeks performed by the device.

Two things are done to achieve those goals. One is the clustering of requests to adjacent sectors on the disk. Most modern filesystems will attempt to lay out files in consecutive sectors; as a result, requests to adjoining parts of the disk are common. The kernel also applies an “elevator” algorithm to the requests. An elevator in a skyscraper is either going up or down; it will continue to move in those directions until all of its “requests” (people wanting on or off) have been satisfied. In the same way, the kernel tries to keep the disk head moving in the same direction for as long as possible; this approach tends to minimize seek times while ensuring that all requests get satisfied eventually.

A Linux I/O request queue is represented by a structure of type request_queue, declared in <linux/blkdev.h>. The request_queue structure looks somewhat like file_operations and other such objects, in that it contains pointers to a number of functions that operate on the queue—for example, the driver’s request function is stored there. There is also a queue head (using the functions from <linux/list.h> described in Section 10.5 in Chapter 10), which points to the list of outstanding requests to the device.

These requests are, of course, of type struct request; we have already looked at some of the fields in this structure. The reality of the request structure is a little more complicated, however; understanding it requires a brief digression into the structure of the Linux buffer cache.

The request structure and the buffer cache

The design of the request structure is driven by the Linux memory management scheme. Like most Unix-like systems, Linux maintains a buffer cache, a region of memory that is used to hold copies of blocks stored on disk. A great many “disk” operations performed at higher levels of the kernel—such as in the filesystem code—act only on the buffer cache and do not generate any actual I/O operations. Through aggressive caching the kernel can avoid many read operations altogether, and multiple writes can often be merged into a single physical write to disk.

One unavoidable aspect of the buffer cache, however, is that blocks that are adjacent on disk are almost certainly not adjacent in memory. The buffer cache is a dynamic thing, and blocks end up being scattered widely. In order to keep track of everything, the kernel manages the buffer cache through buffer_head structures. One buffer_head is associated with each data buffer. This structure contains a great many fields, most of which do not concern a driver writer. There are a few that are important, however, including the following:

char *b_data;

The actual data block associated with this buffer head.

unsigned long b_size;

The size of the block pointed to by b_data.

kdev_t b_rdev;

The device holding the block represented by this buffer head.

unsigned long b_rsector;

The sector number where this block lives on disk.

struct buffer_head *b_reqnext;

A pointer to a linked list of buffer head structures in the request queue.

void (*b_end_io)(struct buffer_head *bh, int uptodate);

A pointer to a function to be called when I/O on this buffer completes. bh is the buffer head itself, and uptodate is nonzero if the I/O was successful.

Every block passed to a driver’s request function either lives in the buffer cache, or, on rare occasion, lives elsewhere but has been made to look as if it lived in the buffer cache.[49] As a result, every request passed to the driver deals with one or more buffer_head structures. The request structure contains a member (called simply bh) that points to a linked list of these structures; satisfying the request requires performing the indicated I/O operation on each buffer in the list. Figure 12-2 shows how the request queue and buffer_head structures fit together.

Buffers in the I/O Request Queue

Figure 12-2. Buffers in the I/O Request Queue

Requests are not made of random lists of buffers; instead, all of the buffer heads attached to a single request will belong to a series of adjacent blocks on the disk. Thus a request is, in a sense, a single operation referring to a (perhaps long) group of blocks on the disk. This grouping of blocks is called clustering, and we will look at it in detail after completing our discussion of how the request list works.

Request queue manipulation

The header <linux/blkdev.h> defines a small number of functions that manipulate the request queue, most of which are implemented as preprocessor macros. Not all drivers will need to work with the queue at this level, but a familiarity with how it all works can be helpful. Most request queue functions will be introduced as we need them, but a few are worth mentioning here.

struct request *blkdev_entry_next_request(struct list_head *head);

Returns the next entry in the request list. Usually the head argument is the queue_head member of the request_queue structure; in this case the function returns the first entry in the queue. The function uses the list_entry macro to look in the list.

struct request *blkdev_next_request(struct request *req); , struct request *blkdev_prev_request(struct request *req);

Given a request structure, return the next or previous structure in the request queue.

blkdev_dequeue_request(struct request *req);

Removes a request from its request queue.

blkdev_release_request(struct request *req);

Releases a request structure back to the kernel when it has been completely executed. Each request queue maintains its own free list of request structures (two, actually: one for reads and one for writes); this function places a structure back on the proper free list. blkdev_release_request will also wake up any processes that are waiting on a free request structure.

All of these functions require that the io_request_lock be held, which we will discuss next.

The I/O request lock

The I/O request queue is a complex data structure that is accessed in many places in the kernel. It is entirely possible that the kernel needs to add more requests to the queue at the same time that your driver is taking requests off. The queue is thus subject to the usual sort of race conditions, and must be protected accordingly.

In Linux 2.2 and 2.4, all request queues are protected with a single global spinlock called io_request_lock. Any code that manipulates a request queue must hold that lock and disable interrupts, with one small exception: the very first entry in the request queue is (by default) considered to be owned by the driver. Failure to acquire the io_request_lock prior to working with the request queue can cause the queue to be corrupted, with a system crash following shortly thereafter.

The simple request function shown earlier did not need to worry about this lock because the kernel always calls the request function with the io_request_lock held. A driver is thus protected against corrupting the request queue; it is also protected against reentrant calls to the request function. This scheme was designed to enable drivers that are not SMP aware to function on multiprocessor systems.

Note, however, that the io_request_lock is an expensive resource to hold. As long as your driver holds this lock, no other requests may be queued to any block driver in the system, and no other request functions may be called. A driver that holds this lock for a long time may well slow down the system as a whole.

Thus, well-written block drivers often drop this lock as soon as possible. We will see an example of how this can be done shortly. Block drivers that drop the io_request_lock must be written with a couple of important things in mind, however. First is that the request function must always reacquire this lock before returning, since the calling code expects it to still be held. The other concern is that, as soon as the io_request_lock is dropped, the possibility of reentrant calls to the request function is very real; the function must be written to handle that eventuality.

A variant of this latter case can also occur if your request function returns while an I/O request is still active. Many drivers for real hardware will start an I/O operation, then return; the work is completed in the driver’s interrupt handler. We will look at interrupt-driven block I/O in detail later in this chapter; for now it is worth mentioning, however, that the request function can be called while these operations are still in progress.

Some drivers handle request function reentrancy by maintaining an internal request queue. The request function simply removes any new requests from the I/O request queue and adds them to the internal queue, which is then processed through a combination of tasklets and interrupt handlers.

How the blk.h macros and functions work

In our simple request function earlier, we were not concerned with buffer_head structures or linked lists. The macros and functions in <linux/blk.h> hide the structure of the I/O request queue in order to make the task of writing a block driver simpler. In many cases, however, getting reasonable performance requires a deeper understanding of how the queue works. In this section we look at the actual steps involved in manipulating the request queue; subsequent sections show some more advanced techniques for writing block request functions.

The fields of the request structure that we looked at earlier—sector, current_nr_sectors, and buffer—are really just copies of the analogous information stored in the first buffer_head structure on the list. Thus, a request function that uses this information from the CURRENT pointer is just processing the first of what might be many buffers within the request. The task of splitting up a multibuffer request into (seemingly) independent, single-buffer requests is handled by two important definitions in <linux/blk.h>: the INIT_REQUEST macro and the end_request function.

Of the two, INIT_REQUEST is the simpler; all it really does is make a couple of consistency checks on the request queue and cause a return from the request function if the queue is empty. It is simply making sure that there is still work to do.

The bulk of the queue management work is done by end_request. This function, remember, is called when the driver has processed a single “request” (actually one buffer); it has several tasks to perform:

  1. Complete the I/O processing on the current buffer; this involves calling the b_end_io function with the status of the operation, thus waking any process that may be sleeping on the buffer.

  2. Remove the buffer from the request’s linked list. If there are further buffers to be processed, the sector, current_nr_sectors, and buffer fields in the request structure are updated to reflect the contents of the next buffer_head structure in the list. In this case (there are still buffers to be transferred), end_request is finished for this iteration and steps 3 to 5 are not executed.

  3. Call add_blkdev_randomness to update the entropy pool, unless DEVICE_NO_RANDOM has been defined (as is done in the sbull driver).

  4. Remove the finished request from the request queue by calling blkdev_dequeue_request. This step modifies the request queue, and thus must be performed with the io_request_lock held.

  5. Release the finished request back to the system; io_request_lock is required here too.

The kernel defines a couple of helper functions that are used by end_request to do most of this work. The first one is called end_that_request_first, which handles the first two steps just described. Its prototype is

int end_that_request_first(struct request *req, int status, char *name);

status is the status of the request as passed to end_request; the name parameter is the device name, to be used when printing error messages. The return value is nonzero if there are more buffers to be processed in the current request; in that case the work is done. Otherwise, the request is dequeued and released with end_that_request_last:

void end_that_request_last(struct request *req);

In end_request this step is handled with this code:

struct request *req = CURRENT;

That is all there is to it.

Clustered Requests

The time has come to look at how to apply all of that background material to the task of writing better block drivers. We’ll start with a look at the handling of clustered requests. Clustering, as mentioned earlier, is simply the practice of joining together requests that operate on adjacent blocks on the disk. There are two advantages to doing things this way. First, clustering speeds up the transfer; clustering can also save some memory in the kernel by avoiding allocation of redundant request structures.

As we have seen, block drivers need not be aware of clustering at all; <linux/blk.h> transparently splits each clustered request into its component pieces. In many cases, however, a driver can do better by explicitly acting on clustering. It is often possible to set up the I/O for several consecutive blocks at the same time, with an improvement in throughput. For example, the Linux floppy driver attempts to write an entire track to the diskette in a single operation. Most high-performance disk controllers can do “scatter/gather” I/O as well, leading to large performance gains.

To take advantage of clustering, a block driver must look directly at the list of buffer_head structures attached to the request. This list is pointed to by CURRENT->bh; subsequent buffers can be found by following the b_reqnext pointers in each buffer_head structure. A driver performing clustered I/O should follow roughly this sequence of operations with each buffer in the cluster:

  1. Arrange to transfer the data block at address bh->b_data, of size bh->b_size bytes. The direction of the data transfer is CURRENT->cmd (i.e., either READ or WRITE).

  2. Retrieve the next buffer head in the list: bh->b_reqnext. Then detach the buffer just transferred from the list, by zeroing its b_reqnext—the pointer to the new buffer you just retrieved.

  3. Update the request structure to reflect the I/O done with the buffer that has just been removed. Both CURRENT->hard_nr_sectors and CURRENT->nr_sectors should be decremented by the number of sectors (not blocks) transferred from the buffer. The sector numbers CURRENT->hard_sector and CURRENT->sector should be incremented by the same amount. Performing these operations keeps the request structure consistent.

  4. Loop back to the beginning to transfer the next adjacent block.

When the I/O on each buffer completes, your driver should notify the kernel by calling the buffer’s I/O completion routine:

bh->b_end_io(bh, status);

status is nonzero if the operation was successful. You also, of course, need to remove the request structure for the completed operations from the queue. The processing steps just described can be done without holding the io_request_lock, but that lock must be reacquired before changing the queue itself.

Your driver can still use end_request (as opposed to manipulating the queue directly) at the completion of the I/O operation, as long as it takes care to set the CURRENT->bh pointer properly. This pointer should either be NULL or it should point to the last buffer_head structure that was transferred. In the latter case, the b_end_io function should not have been called on that last buffer, since end_request will make that call.

A full-featured implementation of clustering appears in drivers/block/floppy.c, while a summary of the operations required appears in end_request, in blk.h. Neither floppy.c nor blk.h are easy to understand, but the latter is a better place to start.

The active queue head

One other detail regarding the behavior of the I/O request queue is relevant for block drivers that are dealing with clustering. It has to do with the queue head—the first request on the queue. For historical compatibility reasons, the kernel (almost) always assumes that a block driver is processing the first entry in the request queue. To avoid corruption resulting from conflicting activity, the kernel will never modify a request once it gets to the head of the queue. No further clustering will happen on that request, and the elevator code will not put other requests in front of it.

Many block drivers remove requests from the queue entirely before beginning to process them. If your driver works this way, the request at the head of the queue should be fair game for the kernel. In this case, your driver should inform the kernel that the head of the queue is not active by calling blk_queue_headactive:

blk_queue_headactive(request_queue_t *queue, int active);

If active is 0, the kernel will be able to make changes to the head of the request queue.

Multiqueue Block Drivers

As we have seen, the kernel, by default, maintains a single I/O request queue for each major number. The single queue works well for devices like sbull, but it is not always optimal for real-world situations.

Consider a driver that is handling real disk devices. Each disk is capable of operating independently; the performance of the system is sure to be better if the drives could be kept busy in parallel. A simple driver based on a single queue will not achieve that—it will perform operations on a single device at a time.

It would not be all that hard for a driver to walk through the request queue and pick out requests for independent drives. But the 2.4 kernel makes life easier by allowing the driver to set up independent queues for each device. Most high-performance drivers take advantage of this multiqueue capability. Doing so is not difficult, but it does require moving beyond the simple <linux/blk.h> definitions.

The sbull driver, when compiled with the SBULL_MULTIQUEUE symbol defined, operates in a multiqueue mode. It works without the <linux/blk.h> macros, and demonstrates a number of the features that have been described in this section.

To operate in a multiqueue mode, a block driver must define its own request queues. sbull does this by adding a queue member to the Sbull_Dev structure:

request_queue_t queue;
int busy;

The busy flag is used to protect against request function reentrancy, as we will see.

Request queues must be initialized, of course. sbull initializes its device-specific queues in this manner:

for (i = 0; i < sbull_devs; i++) {
    blk_init_queue(&sbull_devices[i].queue, sbull_request);
    blk_queue_headactive(&sbull_devices[i].queue, 0);
blk_dev[major].queue = sbull_find_queue;

The call to blk_init_queue is as we have seen before, only now we pass in the device-specific queues instead of the default queue for our major device number. This code also marks the queues as not having active heads.

You might be wondering how the kernel manages to find the request queues, which are buried in a device-specific, private structure. The key is the last line just shown, which sets the queue member in the global blk_dev structure. This member points to a function that has the job of finding the proper request queue for a given device number. Devices using the default queue have no such function, but multiqueue devices must implement it. sbull’s queue function looks like this:

request_queue_t *sbull_find_queue(kdev_t device)
    int devno = DEVICE_NR(device);

    if (devno >= sbull_devs) {
        static int count = 0;
        if (count++ < 5) /* print the message at most five times */
            printk(KERN_WARNING "sbull: request for unknown device\n");
        return NULL;
    return &sbull_devices[devno].queue;

Like the request function, sbull_find_queue must be atomic (no sleeping allowed).

Each queue has its own request function, though usually a driver will use the same function for all of its queues. The kernel passes the actual request queue into the request function as a parameter, so the function can always figure out which device is being operated on. The multiqueue request function used in sbull looks a little different from the ones we have seen so far because it manipulates the request queue directly. It also drops the io_request_lock while performing transfers to allow the kernel to execute other block operations. Finally, the code must take care to avoid two separate perils: multiple calls of the request function and conflicting access to the device itself.

void sbull_request(request_queue_t *q)
    Sbull_Dev *device;
    struct request *req;
    int status;

    /* Find our device */
    device = sbull_locate_device (blkdev_entry_next_request(&q->queue_head));
    if (device->busy) /* no race here - io_request_lock held */
    device->busy = 1;

    /* Process requests in the queue */
    while(! list_empty(&q->queue_head)) {

    /* Pull the next request off the list. */
        req = blkdev_entry_next_request(&q->queue_head);
        spin_unlock_irq (&io_request_lock);

    /* Process all of the buffers in this (possibly clustered) request. */
        do {
            status = sbull_transfer(device, req);
        } while (end_that_request_first(req, status, DEVICE_NAME));
        spin_lock_irq (&io_request_lock);
    device->busy = 0;

Instead of using INIT_REQUEST, this function tests its specific request queue with the list function list_empty. As long as requests exist, it removes each one in turn from the queue with blkdev_dequeue_request. Only then, once the removal is complete, is it able to drop io_request_lock and obtain the device-specific lock. The actual transfer is done using sbull_transfer, which we have already seen.

Each call to sbull_transfer handles exactly one buffer_head structure attached to the request. The function then calls end_that_request_first to dispose of that buffer, and, if the request is complete, goes on to end_that_request_last to clean up the request as a whole.

The management of concurrency here is worth a quick look. The busy flag is used to prevent multiple invocations of sbull_request. Since sbull_request is always called with the io_request_lock held, it is safe to test and set the busy flag with no additional protection. (Otherwise, an atomic_t could have been used). The io_request_lock is dropped before the device-specific lock is acquired. It is possible to acquire multiple locks without risking deadlock, but it is harder; when the constraints allow, it is better to release one lock before obtaining another.

end_that_request_first is called without the io_request_lock held. Since this function operates only on the given request structure, calling it this way is safe—as long as the request is not on the queue. The call to end_that_request_last, however, requires that the lock be held, since it returns the request to the request queue’s free list. The function also always exits from the outer loop (and the function as a whole) with the io_request_lock held and the device lock released.

Multiqueue drivers must, of course, clean up all of their queues at module removal time:

for (i = 0; i < sbull_devs; i++)
blk_dev[major].queue = NULL;

It is worth noting, briefly, that this code could be made more efficient. It allocates a whole set of request queues at initialization time, even though some of them may never be used. A request queue is a large structure, since many (perhaps thousands) of request structures are allocated when the queue is initialized. A more clever implementation would allocate a request queue when needed in either the open method or the queue function. We chose a simpler implementation for sbull in order to avoid complicating the code.

That covers the mechanics of multiqueue drivers. Drivers handling real hardware may have other issues to deal with, of course, such as serializing access to a controller. But the basic structure of multiqueue drivers is as we have seen here.

Doing Without the Request Queue

Much of the discussion to this point has centered around the manipulation of the I/O request queue. The purpose of the request queue is to improve performance by allowing the driver to act asynchronously and, crucially, by allowing the merging of contiguous (on the disk) operations. For normal disk devices, operations on contiguous blocks are common, and this optimization is necessary.

Not all block devices benefit from the request queue, however. sbull, for example, processes requests synchronously and has no problems with seek times. For sbull, the request queue actually ends up slowing things down. Other types of block devices also can be better off without a request queue. For example, RAID devices, which are made up of multiple disks, often spread “contiguous” blocks across multiple physical devices. Block devices implemented by the logical volume manager (LVM) capability (which first appeared in 2.4) also have an implementation that is more complex than the block interface that is presented to the rest of the kernel.

In the 2.4 kernel, block I/O requests are placed on the queue by the function __make_request, which is also responsible for invoking the driver’s request function. Block drivers that need more control over request queueing, however, can replace that function with their own “make request” function. The RAID and LVM drivers do so, providing their own variant that, eventually, requeues each I/O request (with different block numbers) to the appropriate low-level device (or devices) that make up the higher-level device. A RAM-disk driver, instead, can execute the I/O operation directly.

sbull, when loaded with the noqueue=1 option on 2.4 systems, will provide its own “make request” function and operate without a request queue. The first step in this scenario is to replace __make_request. The “make request” function pointer is stored in the request queue, and can be changed with blk_queue_make_request:

void blk_queue_make_request(request_queue_t *queue,
make_request_fn *func);

The make_request_fn type, in turn, is defined as follows:

typedef int (make_request_fn) (request_queue_t *q, int rw, 
             struct buffer_head *bh);

The “make request” function must arrange to transfer the given block, and see to it that the b_end_io function is called when the transfer is done. The kernel does not hold the io_request_lock lock when calling the make_request_fn function, so the function must acquire the lock itself if it will be manipulating the request queue. If the transfer has been set up (not necessarily completed), the function should return 0.

The phrase “arrange to transfer” was chosen carefully; often a driver-specific make request function will not actually transfer the data. Consider a RAID device. What the function really needs to do is to map the I/O operation onto one of its constituent devices, then invoke that device’s driver to actually do the work. This mapping is done by setting the b_rdev member of the buffer_head structure to the number of the “real” device that will do the transfer, then signaling that the block still needs to be written by returning a nonzero value.

When the kernel sees a nonzero return value from the make request function, it concludes that the job is not done and will try again. But first it will look up the make request function for the device indicated in the b_rdev field. Thus, in the RAID case, the RAID driver’s “make request” function will not be called again; instead, the kernel will pass the block to the appropriate function for the underlying device.

sbull, at initialization time, sets up its make request function as follows:

if (noqueue)
    blk_queue_make_request(BLK_DEFAULT_QUEUE(major), sbull_make_request);

It does not call blk_init_queue when operating in this mode, because the request queue will not be used.

When the kernel generates a request for an sbull device, it will call sbull_make_request, which is as follows:

int sbull_make_request(request_queue_t *queue, int rw, 
                       struct buffer_head *bh)
    u8 *ptr;

    /* Figure out what we are doing */
    Sbull_Dev *device = sbull_devices + MINOR(bh->b_rdev);
    ptr = device->data + bh->b_rsector * sbull_hardsect;

    /* Paranoid check; this apparently can really happen */
    if (ptr + bh->b_size > device->data + sbull_blksize*sbull_size) {
        static int count = 0;
        if (count++ < 5)
            printk(KERN_WARNING "sbull: request past end of device\n");
        bh->b_end_io(bh, 0);
        return 0;

    /* This could be a high-memory buffer; shift it down */
    bh = create_bounce(rw, bh);

    /* Do the transfer */
    switch(rw) {
    case READ:
    case READA:  /* Read ahead */
        memcpy(bh->b_data, ptr, bh->b_size); /* from sbull to buffer */
        bh->b_end_io(bh, 1);
    case WRITE:
        memcpy(ptr, bh->b_data, bh->b_size); /* from buffer to sbull */
        mark_buffer_uptodate(bh, 1);
        bh->b_end_io(bh, 1);
        /* can't happen */
        bh->b_end_io(bh, 0);

    /* Nonzero return means we're done */
    return 0;

For the most part, this code should look familiar. It contains the usual calculations to determine where the block lives within the sbull device and uses memcpy to perform the operation. Because the operation completes immediately, it is able to call bh->b_end_io to indicate the completion of the operation, and it returns 0 to the kernel.

There is, however, one detail that the “make request” function must take care of. The buffer to be transferred could be resident in high memory, which is not directly accessible by the kernel. High memory is covered in detail in Chapter 13. We won’t repeat the discussion here; suffice it to say that one way to deal with the problem is to replace a high-memory buffer with one that is in accessible memory. The function create_bounce will do so, in a way that is transparent to the driver. The kernel normally uses create_bounce before placing buffers in the driver’s request queue; if the driver implements its own make_request_fn, however, it must take care of this task itself.

[49] The RAM-disk driver, for example, makes its memory look as if it were in the buffer cache. Since the “disk” buffer is already in system RAM, there’s no need to keep a copy in the buffer cache. Our sample code is thus much less efficient than a properly implemented RAM disk, not being concerned with RAM-disk-specific performance issues.

Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.