Partitionable Devices

Most block devices are not used in one large chunk. Instead, the system administrator expects to be able to partition the device—to split it into several independent pseudodevices. If you try to create partitions on an sbull device with fdisk, you’ll run into problems. The fdisk program calls the partitions /dev/sbull01, /dev/sbull02, and so on, but those names don’t exist on the filesystem. More to the point, there is no mechanism in place for binding those names to partitions in the sbull device. Something more must be done before a block device can be partitioned.

To demonstrate how partitions are supported, we introduce a new device called spull, a “Simple Partitionable Utility.” It is far simpler than sbull, lacking the request queue management and some flexibility (like the ability to change the hard-sector size). The device resides in the spull directory and is completely detached from sbull, even though they share some code.

To be able to support partitions on a device, we must assign several minor numbers to each physical device. One number is used to access the whole device (for example, /dev/hda), and the others are used to access the various partitions (such as /dev/hda1). Since fdisk creates partition names by adding a numerical suffix to the whole-disk device name, we’ll follow the same naming convention in the spull driver.

The device nodes implemented by spull are called pd, for “partitionable disk.” The four whole devices (also called units) are thus named /dev/pda through /dev/pdd; each device supports at most 15 partitions. Minor numbers have the following meaning: the least significant four bits represent the partition number (where 0 is the whole device), and the most significant four bits represent the unit number. This convention is expressed in the source file by the following macros:

#define MAJOR_NR spull_major /* force definitions on in blk.h */
int spull_major; /* must be declared before including blk.h */

#define SPULL_SHIFT 4                         /* max 16 partitions  */
#define SPULL_MAXNRDEV 4                      /* max 4 device units */
#define DEVICE_NR(device) (MINOR(device)>>SPULL_SHIFT)
#define DEVICE_NAME "pd"                      /* name for messaging */

The spull driver also hardwires the value of the hard-sector size in order to simplify the code:

#define SPULL_HARDSECT 512  /* 512-byte hardware sectors */

The Generic Hard Disk

Every partitionable device needs to know how it is partitioned. The information is available in the partition table, and part of the initialization process consists of decoding the partition table and updating the internal data structures to reflect the partition information.

This decoding isn’t easy, but fortunately the kernel offers “generic hard disk” support usable by all block drivers. Such support considerably reduces the amount of code needed in the driver for handling partitions. Another advantage of the generic support is that the driver writer doesn’t need to understand how the partitioning is done, and new partitioning schemes can be supported in the kernel without requiring changes to driver code.

A block driver that supports partitions must include <linux/genhd.h> and should declare a struct gendisk structure. This structure describes the layout of the disk(s) provided by the driver; the kernel maintains a global list of such structures, which may be queried to see what disks and partitions are available on the system.

Before we go further, let’s look at some of the fields in struct gendisk. You’ll need to understand them in order to exploit generic device support.

int major

The major number for the device that the structure refers to.

const char *major_name

The base name for devices belonging to this major number. Each device name is derived from this name by adding a letter for each unit and a number for each partition. For example, “hd” is the base name that is used to build /dev/hda1 and /dev/hdb3. In modern kernels, the full length of the disk name can be up to 32 characters; the 2.0 kernel, however, was more restricted. Drivers wishing to be backward portable to 2.0 should limit the major_name field to five characters. The name for spull is pd (“partitionable disk”).

int minor_shift

The number of bit shifts needed to extract the drive number from the device minor number. In spull the number is 4. The value in this field should be consistent with the definition of the macro DEVICE_NR(device) (see Section 12.2). The macro in spull expands to device>>4.

int max_p

The maximum number of partitions. In our example, max_p is 16, or more generally, 1 << minor_shift.

struct hd_struct *part

The decoded partition table for the device. The driver uses this item to determine what range of the disk’s sectors is accessible through each minor number. The driver is responsible for allocation and deallocation of this array, which most drivers implement as a static array of max_nr << minor_shift structures. The driver should initialize the array to zeros before the kernel decodes the partition table.

int *sizes

An array of integers with the same information as the global blk_size array. In fact, they are usually the same array. The driver is responsible for allocating and deallocating the sizes array. Note that the partition check for the device copies this pointer to blk_size, so a driver handling partitionable devices doesn’t need to allocate the latter array.

int nr_real

The number of real devices (units) that exist.

void *real_devices

A private area that may be used by the driver to keep any additional required information.

void struct gendisk *next

A pointer used to implement the linked list of generic hard-disk structures.

struct block_device_operations *fops;

A pointer to the block operations structure for this device.

Many of the fields in the gendisk structure are set up at initialization time, so the compile-time setup is relatively simple:

struct gendisk spull_gendisk = {
    major:              0,            /* Major number assigned later */
    major_name:         "pd",         /* Name of the major device */
    minor_shift:        SPULL_SHIFT,  /* Shift to get device number */
    max_p:              1 << SPULL_SHIFT, /* Number of partitions */
    fops:               &spull_bdops, /* Block dev operations */
/* everything else is dynamic */

Partition Detection

When a module initializes itself, it must set things up properly for partition detection. Thus, spull starts by setting up the spull_sizes array for the gendisk structure (which also gets stored in blk_size[MAJOR_NR] and in the sizes field of the gendisk structure) and the spull_partitions array, which holds the actual partition information (and gets stored in the part member of the gendisk structure). Both of these arrays are initialized to zeros at this time. The code looks like this:

spull_sizes = kmalloc( (spull_devs << SPULL_SHIFT) * sizeof(int),
if (!spull_sizes)
    goto fail_malloc;

/* Start with zero-sized partitions, and correctly sized units */
memset(spull_sizes, 0, (spull_devs << SPULL_SHIFT) * sizeof(int));
for (i=0; i< spull_devs; i++)
    spull_sizes[i<<SPULL_SHIFT] = spull_size;
blk_size[MAJOR_NR] = spull_gendisk.sizes = spull_sizes;

/* Allocate the partitions array. */
spull_partitions = kmalloc( (spull_devs << SPULL_SHIFT) *
                           sizeof(struct hd_struct), GFP_KERNEL);
if (!spull_partitions)
    goto fail_malloc;

memset(spull_partitions, 0, (spull_devs << SPULL_SHIFT) *
       sizeof(struct hd_struct));
/* fill in whole-disk entries */
for (i=0; i < spull_devs; i++) 
    spull_partitions[i << SPULL_SHIFT].nr_sects =
spull_gendisk.part = spull_partitions;
spull_gendisk.nr_real = spull_devs;

The driver should also include its gendisk structure on the global list. There is no kernel-supplied function for adding gendisk structures; it must be done by hand: = gendisk_head;
gendisk_head = &spull_gendisk;

In practice, the only thing the system does with this list is to implement /proc/partitions.

The register_disk function, which we have already seen briefly, handles the job of reading the disk’s partition table.

register_disk(struct gendisk *gd, int drive, unsigned minors,
              struct block_device_operations *ops, long size);

Here, gd is the gendisk structure that we built earlier, drive is the device number, minors is the number of partitions supported, ops is the block_device_operations structure for the driver, and size is the size of the device in sectors.

Fixed disks might read the partition table only at module initialization time and when BLKRRPART is invoked. Drivers for removable drives will also need to make this call in the revalidate method. Either way, it is important to remember that register_disk will call your driver’s request function to read the partition table, so the driver must be sufficiently initialized at that point to handle requests. You should also not have any locks held that will conflict with locks acquired in the request function. register_disk must be called for each disk actually present on the system.

spull sets up partitions in the revalidate method:

int spull_revalidate(kdev_t i_rdev)
  /* first partition, # of partitions */
  int part1 = (DEVICE_NR(i_rdev) << SPULL_SHIFT) + 1;
  int npart = (1 << SPULL_SHIFT) -1;

  /* first clear old partition information */
  memset(spull_gendisk.sizes+part1, 0, npart*sizeof(int));
  memset(spull_gendisk.part +part1, 0, npart*sizeof(struct hd_struct));
  spull_gendisk.part[DEVICE_NR(i_rdev) << SPULL_SHIFT].nr_sects =
          spull_size << 1;

  /* then fill new info */
  printk(KERN_INFO "Spull partition check: (%d) ", DEVICE_NR(i_rdev));
  register_disk(&spull_gendisk, i_rdev, SPULL_MAXNRDEV, &spull_bdops,
                  spull_size << 1);
  return 0;

It’s interesting to note that register_disk prints partition information by repeatedly calling

printk(" %s", disk_name(hd, minor, buf));

That’s why spull prints a leading string. It’s meant to add some context to the information that gets stuffed into the system log.

When a partitionable module is unloaded, the driver should arrange for all the partitions to be flushed, by calling fsync_dev for every supported major/minor pair. All of the relevant memory should be freed as well, of course. The cleanup function for spull is as follows:

for (i = 0; i < (spull_devs << SPULL_SHIFT); i++)
    fsync_dev(MKDEV(spull_major, i)); /* flush the devices */
read_ahead[major] = 0;
kfree(blk_size[major]); /* which is gendisk->sizes as well */
blk_size[major] = NULL;
blksize_size[major] = NULL;

It is also necessary to remove the gendisk structure from the global list. There is no function provided to do this work, so it’s done by hand:

for (gdp = &gendisk_head; *gdp; gdp = &((*gdp)->next))
    if (*gdp == &spull_gendisk) {
        *gdp = (*gdp)->next;

Note that there is no unregister_disk to complement the register_disk function. Everything done by register_disk is stored in the driver’s own arrays, so there is no additional cleanup required at unload time.

Partition Detection Using initrd

If you want to mount your root filesystem from a device whose driver is available only in modularized form, you must use the initrd facility offered by modern Linux kernels. We won’t introduce initrd here; this subsection is aimed at readers who know about initrd and wonder how it affects block drivers. More information on initrd can be found in Documentation/initrd.txt in the kernel source.

When you boot a kernel with initrd, it establishes a temporary running environment before it mounts the real root filesystem. Modules are usually loaded from within the RAM disk being used as the temporary root file system.

Because the initrd process is run after all boot-time initialization is complete (but before the real root filesystem has been mounted), there’s no difference between loading a normal module and loading one living in the initrd RAM disk. If a driver can be correctly loaded and used as a module, all Linux distributions that have initrd available can include the driver on their installation disks without requiring you to hack in the kernel source.

The Device Methods for spull

We have seen how to initialize partitionable devices, but not yet how to access data within the partitions. To do that, we need to make use of the partition information stored in the gendisk->part array by register_disk. This array is made up of hd_struct structures, and is indexed by the minor number. The hd_struct has two fields of interest: start_sect tells where a given partition starts on the disk, and nr_sects gives the size of that partition.

Here we will show how spull makes use of that information. The following code includes only those parts of spull that differ from sbull, because most of the code is exactly the same.

First of all, open and close must keep track of the usage count for each device. Because the usage count refers to the physical device (unit), the following declaration and assignment is used for the dev variable:

Spull_Dev *dev = spull_devices + DEVICE_NR(inode->i_rdev);

The DEVICE_NR macro used here is the one that must be declared before <linux/blk.h> is included; it yields the physical device number without taking into account which partition is being used.

Although almost every device method works with the physical device as a whole, ioctl should access specific information for each partition. For example, when mkfs calls ioctl to retrieve the size of the device on which it will build a filesystem, it should be told the size of the partition of interest, not the size of the whole device. Here is how the BLKGETSIZE ioctl command is affected by the change from one minor number per device to multiple minor numbers per device. As you might expect, spull_gendisk->part is used as the source of the partition size.

   /* Return the device size, expressed in sectors */
   err = ! access_ok (VERIFY_WRITE, arg, sizeof(long));
   if (err) return -EFAULT;
   size = spull_gendisk.part[MINOR(inode->i_rdev)].nr_sects;
   if (copy_to_user((long *) arg, &size, sizeof (long)))
   return -EFAULT;
   return 0;

The other ioctl command that is different for partitionable devices is BLKRRPART. Rereading the partition table makes sense for partitionable devices and is equivalent to revalidating a disk after a disk change:

 case BLKRRPART: /* re-read partition table */
   return spull_revalidate(inode->i_rdev);

But the major difference between sbull and spull is in the request function. In spull, the request function needs to use the partition information in order to correctly transfer data for the different minor numbers. Locating the transfer is done by simply adding the starting sector to that provided in the request; the partition size information is then used to be sure the request fits within the partition. Once that is done, the implementation is the same as for sbull.

Here are the relevant lines in spull_request:

ptr = device->data +
      (spull_partitions[minor].start_sect + req->sector)*SPULL_HARDSECT;
size = req->current_nr_sectors*SPULL_HARDSECT;
 * Make sure that the transfer fits within the device.
if (req->sector + req->current_nr_sectors >
                spull_partitions[minor].nr_sects) {
    static int count = 0;
    if (count++ < 5)
      printk(KERN_WARNING "spull: request past end of partition\n");
    return 0;

The number of sectors is multiplied by the hardware sector size (which, remember, is hardwired in spull) to get the size of the partition in bytes.

Get Linux Device Drivers, Second Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.