Chapter 12. Loading Block Drivers

Our discussion thus far has been limited to char drivers. As we have already mentioned, however, char drivers are not the only type of driver used in Linux systems. Here we turn our attention to block drivers. Block drivers provide access to block-oriented devices—those that transfer data in randomly accessible, fixed-size blocks. The classic block device is a disk drive, though others exist as well.

The char driver interface is relatively clean and easy to use; the block interface, unfortunately, is a little messier. Kernel developers like to complain about it. There are two reasons for this state of affairs. The first is simple history—the block interface has been at the core of every version of Linux since the first, and it has proved hard to change. The other reason is performance. A slow char driver is an undesirable thing, but a slow block driver is a drag on the entire system. As a result, the design of the block interface has often been influenced by the need for speed.

The block driver interface has evolved significantly over time. As with the rest of the book, we cover the 2.4 interface in this chapter, with a discussion of the changes at the end. The example drivers work on all kernels between 2.0 and 2.4, however.

This chapter explores the creation of block drivers with two new example drivers. The first, sbull (Simple Block Utility for Loading Localities) implements a block device using system memory—a RAM-disk driver, essentially. Later on, we’ll introduce a variant called spull as a way of showing how to deal with partition tables.

As always, these example drivers gloss over many of the issues found in real block drivers; their purpose is to demonstrate the interface that such drivers must work with. Real drivers will have to deal with hardware, so the material covered in Chapter 8 and Chapter 9 will be useful as well.

One quick note on terminology: the word block as used in this book refers to a block of data as determined by the kernel. The size of blocks can be different in different disks, though they are always a power of two. A sector is a fixed-size unit of data as determined by the underlying hardware. Sectors are almost always 512 bytes long.

Registering the Driver

Like char drivers, block drivers in the kernel are identified by major numbers. Block major numbers are entirely distinct from char major numbers, however. A block device with major number 32 can coexist with a char device using the same major number since the two ranges are separate.

The functions for registering and unregistering block devices look similar to those for char devices:

#include <linux/fs.h>
int register_blkdev(unsigned int major, const char *name, 
    struct block_device_operations *bdops);
int unregister_blkdev(unsigned int major, const char *name);

The arguments have the same general meaning as for char devices, and major numbers can be assigned dynamically in the same way. So the sbull device registers itself in almost exactly the same way as scull did:

result = register_blkdev(sbull_major, "sbull", &sbull_bdops);
if (result < 0) {
    printk(KERN_WARNING "sbull: can't get major %d\n",sbull_major);
    return result;
}
if (sbull_major == 0) sbull_major = result; /* dynamic */
major = sbull_major; /* Use `major' later on to save typing */

The similarity stops here, however. One difference is already evident: register_chrdev took a pointer to a file_operations structure, but register_blkdev uses a structure of type block_device_operations instead—as it has since kernel version 2.3.38. The structure is still sometimes referred to by the name fops in block drivers; we’ll call it bdops to be more faithful to what the structure is and to follow the suggested naming. The definition of this structure is as follows:

struct block_device_operations {
    int (*open) (struct inode *inode, struct file *filp);
    int (*release) (struct inode *inode, struct file *filp);
    int (*ioctl) (struct inode *inode, struct file *filp,
                    unsigned command, unsigned long argument);
    int (*check_media_change) (kdev_t dev);
    int (*revalidate) (kdev_t dev);
};

The open, release, and ioctl methods listed here are exactly the same as their char device counterparts. The other two methods are specific to block devices and are discussed later in this chapter. Note that there is no owner field in this structure; block drivers must still maintain their usage count manually, even in the 2.4 kernel.

The bdops structure used in sbull is as follows:

struct block_device_operations sbull_bdops = {
    open:               sbull_open,
    release:            sbull_release,
    ioctl:              sbull_ioctl,
    check_media_change: sbull_check_change,
    revalidate:         sbull_revalidate,
};

Note that there are no read or write operations provided in the block_device_operations structure. All I/O to block devices is normally buffered by the system (the only exception is with “raw” devices, which we cover in the next chapter); user processes do not perform direct I/O to these devices. User-mode access to block devices usually is implicit in filesystem operations they perform, and those operations clearly benefit from I/O buffering. However, even “direct” I/O to a block device, such as when a filesystem is created, goes through the Linux buffer cache.[47] As a result, the kernel provides a single set of read and write functions for block devices, and drivers do not need to worry about them.

Clearly, a block driver must eventually provide some mechanism for actually doing block I/O to a device. In Linux, the method used for these I/O operations is called request; it is the equivalent of the “strategy” function found on many Unix systems. The request method handles both read and write operations and can be somewhat complex. We will get into the details of request shortly.

For the purposes of block device registration, however, we must tell the kernel where our request method is. This method is not kept in the block_device_operations structure, for both historical and performance reasons; instead, it is associated with the queue of pending I/O operations for the device. By default, there is one such queue for each major number. A block driver must initialize that queue with blk_init_queue. Queue initialization and cleanup is defined as follows:

#include <linux/blkdev.h>
blk_init_queue(request_queue_t *queue, request_fn_proc *request);
blk_cleanup_queue(request_queue_t *queue);

The init function sets up the queue, and associates the driver’s request function (passed as the second parameter) with the queue. It is necessary to call blk_cleanup_queue at module cleanup time. The sbull driver initializes its queue with this line of code:

blk_init_queue(BLK_DEFAULT_QUEUE(major), sbull_request);

Each device has a request queue that it uses by default; the macro BLK_DEFAULT_QUEUE(major) is used to indicate that queue when needed. This macro looks into a global array of blk_dev_struct structures called blk_dev, which is maintained by the kernel and indexed by major number. The structure looks like this:

struct blk_dev_struct {
    request_queue_t	request_queue;
    queue_proc		*queue;
    void		*data;
};

The request_queue member contains the I/O request queue that we have just initialized. We will look at the queue member shortly. The data field may be used by the driver for its own data—but few drivers do so.

Figure 12-1 visualizes the main steps a driver module performs to register with the kernel proper and deregister. If you compare this figure with Figure 2-1, similarities and differences should be clear.

Registering a Block Device Driver

Figure 12-1. Registering a Block Device Driver

In addition to blk_dev, several other global arrays hold information about block drivers. These arrays are indexed by the major number, and sometimes also the minor number. They are declared and described in drivers/block/ll_rw_block.c.

int blk_size[][];

This array is indexed by the major and minor numbers. It describes the size of each device, in kilobytes. If blk_size[major] is NULL, no checking is performed on the size of the device (i.e., the kernel might request data transfers past end-of-device).

int blksize_size[][];

The size of the block used by each device, in bytes. Like the previous one, this bidimensional array is indexed by both major and minor numbers. If blksize_size[major] is a null pointer, a block size of BLOCK_SIZE (currently 1 KB) is assumed. The block size for the device must be a power of two, because the kernel uses bit-shift operators to convert offsets to block numbers.

int hardsect_size[][];

Like the others, this data structure is indexed by the major and minor numbers. The default value for the hardware sector size is 512 bytes. With the 2.2 and 2.4 kernels, different sector sizes are supported, but they must always be a power of two greater than or equal to 512 bytes.

int read_ahead[]; , int max_readahead[][];

These arrays define the number of sectors to be read in advance by the kernel when a file is being read sequentially. read_ahead applies to all devices of a given type and is indexed by major number; max_readahead applies to individual devices and is indexed by both the major and minor numbers.

Reading data before a process asks for it helps system performance and overall throughput. A slower device should specify a bigger read-ahead value, while fast devices will be happy even with a smaller value. The bigger the read-ahead value, the more memory the buffer cache uses.

The primary difference between the two arrays is this: read_ahead is applied at the block I/O level and controls how many blocks may be read sequentially from the disk ahead of the current request. max_readahead works at the filesystem level and refers to blocks in the file, which may not be sequential on disk. Kernel development is moving toward doing read ahead at the filesystem level, rather than at the block I/O level. In the 2.4 kernel, however, read ahead is still done at both levels, so both of these arrays are used.

There is one read_ahead[] value for each major number, and it applies to all its minor numbers. max_readahead, instead, has a value for every device. The values can be changed via the driver’s ioctl method; hard-disk drivers usually set read_ahead to 8 sectors, which corresponds to 4 KB. The max_readahead value, on the other hand, is rarely set by the drivers; it defaults to MAX_READAHEAD, currently 31 pages.

int max_sectors[][];

This array limits the maximum size of a single request. It should normally be set to the largest transfer that your hardware can handle.

int max_segments[];

This array controlled the number of individual segments that could appear in a clustered request; it was removed just before the release of the 2.4 kernel, however. (See Section 12.4.2 later in this chapter for information on clustered requests).

The sbull device allows you to set these values at load time, and they apply to all the minor numbers of the sample driver. The variable names and their default values in sbull are as follows:

size=2048 (kilobytes)

Each RAM disk created by sbull takes two megabytes of RAM.

blksize=1024 (bytes)

The software “block” used by the module is one kilobyte, like the system default.

hardsect=512 (bytes)

The sbull sector size is the usual half-kilobyte value.

rahead=2 (sectors)

Because the RAM disk is a fast device, the default read-ahead value is small.

The sbull device also allows you to choose the number of devices to install. devs, the number of devices, defaults to 2, resulting in a default memory usage of four megabytes—two disks at two megabytes each.

The initialization of these arrays in sbull is done as follows:

read_ahead[major] = sbull_rahead;
result = -ENOMEM; /* for the possible errors */

sbull_sizes = kmalloc(sbull_devs * sizeof(int), GFP_KERNEL);
if (!sbull_sizes)
    goto fail_malloc;
for (i=0; i < sbull_devs; i++) /* all the same size */
    sbull_sizes[i] = sbull_size;
blk_size[major]=sbull_sizes;

sbull_blksizes = kmalloc(sbull_devs * sizeof(int), GFP_KERNEL);
if (!sbull_blksizes)
    goto fail_malloc;
for (i=0; i < sbull_devs; i++) /* all the same blocksize */
    sbull_blksizes[i] = sbull_blksize;
blksize_size[major]=sbull_blksizes;

sbull_hardsects = kmalloc(sbull_devs * sizeof(int), GFP_KERNEL);
if (!sbull_hardsects)
    goto fail_malloc;
for (i=0; i < sbull_devs; i++) /* all the same hardsect */
    sbull_hardsects[i] = sbull_hardsect;
hardsect_size[major]=sbull_hardsects;

For brevity, the error handling code (the target of the fail_malloc goto) has been omitted; it simply frees anything that was successfully allocated, unregisters the device, and returns a failure status.

One last thing that must be done is to register every “disk” device provided by the driver. sbull calls the necessary function (register_disk) as follows:

for (i = 0; i < sbull_devs; i++)
        register_disk(NULL, MKDEV(major, i), 1, &sbull_bdops,
                        sbull_size << 1);

In the 2.4.0 kernel, register_disk does nothing when invoked in this manner. The real purpose of register_disk is to set up the partition table, which is not supported by sbull. All block drivers, however, make this call whether or not they support partitions, indicating that it may become necessary for all block devices in the future. A block driver without partitions will work without this call in 2.4.0, but it is safer to include it. We revisit register_disk in detail later in this chapter, when we cover partitions.

The cleanup function used by sbull looks like this:

for (i=0; i<sbull_devs; i++)
    fsync_dev(MKDEV(sbull_major, i)); /* flush the devices */
unregister_blkdev(major, "sbull");
/*
 * Fix up the request queue(s)
 */
blk_cleanup_queue(BLK_DEFAULT_QUEUE(major));

/* Clean up the global arrays */
read_ahead[major] = 0;
kfree(blk_size[major]);
blk_size[major] = NULL;
kfree(blksize_size[major]);
blksize_size[major] = NULL;
kfree(hardsect_size[major]);
hardsect_size[major] = NULL;

Here, the call to fsync_dev is needed to free all references to the device that the kernel keeps in various caches. fsync_dev is the implementation of block_fsync, which is the fsync “method” for block devices.



[47] Actually, the 2.3 development series added the raw I/O capability, allowing user processes to write to block devices without involving the buffer cache. Block drivers, however, are entirely unaware of raw I/O, so we defer the discussion of that facility to the next chapter.

Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.