Most block devices are not used in one large chunk. Instead, the system administrator expects to be able to partition the device, that is, to split it into several independent pseudodevices. If you try to create partitions on an sbull device with fdisk, you’ll run into problems. The fdisk program calls the partitions /dev/sbull01, /dev/sbull02, and so on, but those names don’t exist on the filesystem. More to the point, there is no mechanism in place for binding those names to partitions in the sbull device. Something more must be done before a block device can be partitioned.
To demonstrate how partitions are supported, we introduce a new device called spull, a “Simple Partitionable Utility.” It is far simpler than sbull, lacking the request queue management and some flexibility (like the ability to change the hard-sector size). The device resides in the spull directory and is completely detached from sbull, even though they share some code.
To be able to support partitions on a device, we must assign several minor numbers to each physical device. One number is used to access the whole device (for example, /dev/hda), and the others are used to access the various partitions (such as /dev/hda1). Since fdisk creates partition names by adding a numerical suffix to the whole-disk device name, we’ll follow the same naming convention in the spull driver.
The device nodes implemented by spull are called pd, for “partitionable disk.” The four whole devices (also called units) are thus named /dev/pda through /dev/pdd; each device supports at most 15 partitions. Minor numbers have the following meaning: the least significant four bits represent the partition number (where 0 is the whole device), and the most significant four bits represent the unit number. This convention is expressed in the source file by the following macros:
#define MAJOR_NR spull_major /* force definitions on in blk.h */
int spull_major; /* must be declared before including blk.h */

#define SPULL_SHIFT 4                         /* max 16 partitions  */
#define SPULL_MAXNRDEV 4                      /* max 4 device units */
#define DEVICE_NR(device) (MINOR(device)>>SPULL_SHIFT)
#define DEVICE_NAME "pd"                      /* name for messaging */
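The bit layout can be tried out in user space. The following is a minimal sketch, assuming the four-bit split described above; the helpers spull_unit, spull_partition, and spull_minor are hypothetical names introduced for illustration and are not part of the driver.

```c
#include <assert.h>

#define SPULL_SHIFT     4   /* low four bits hold the partition number */
#define SPULL_MAXNRDEV  4   /* four units, pda through pdd */

/* Hypothetical helper: unit number, the same shift DEVICE_NR applies. */
static unsigned spull_unit(unsigned minor)
{
    return minor >> SPULL_SHIFT;
}

/* Hypothetical helper: partition number; 0 means the whole device. */
static unsigned spull_partition(unsigned minor)
{
    return minor & ((1u << SPULL_SHIFT) - 1);
}

/* Hypothetical helper: build a minor number from unit and partition. */
static unsigned spull_minor(unsigned unit, unsigned partition)
{
    assert(unit < SPULL_MAXNRDEV);
    assert(partition < (1u << SPULL_SHIFT));
    return (unit << SPULL_SHIFT) | partition;
}
```

Under this layout, partition 3 of the second unit (/dev/pdb3) gets minor number 19, and minor numbers 0, 16, 32, and 48 refer to the four whole devices.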
The spull driver also hardwires the value of the hard-sector size in order to simplify the code:
#define SPULL_HARDSECT 512 /* 512-byte hardware sectors */
Every partitionable device needs to know how it is partitioned. The information is available in the partition table, and part of the initialization process consists of decoding the partition table and updating the internal data structures to reflect the partition information.
This decoding isn’t easy, but fortunately the kernel offers “generic hard disk” support usable by all block drivers. Such support considerably reduces the amount of code needed in the driver for handling partitions. Another advantage of the generic support is that the driver writer doesn’t need to understand how the partitioning is done, and new partitioning schemes can be supported in the kernel without requiring changes to driver code.
A block driver that supports partitions must include <linux/genhd.h> and should declare a struct gendisk structure. This structure describes the layout of the disk(s) provided by the driver; the kernel maintains a global list of such structures, which may be queried to see what disks and partitions are available on the system.
Before we go further, let’s look at some of the fields in struct gendisk. You’ll need to understand them in order to exploit generic device support.
- int major
  The major number for the device that the structure refers to.
- const char *major_name
  The base name for devices belonging to this major number. Each device name is derived from this name by adding a letter for each unit and a number for each partition. For example, “hd” is the base name that is used to build /dev/hda1 and /dev/hdb3. In modern kernels, the full length of the disk name can be up to 32 characters; the 2.0 kernel, however, was more restricted. Drivers wishing to be backward portable to 2.0 should limit the major_name field to five characters. The name for spull is pd (“partitionable disk”).
- int minor_shift
  The number of bit shifts needed to extract the drive number from the device minor number. In spull the number is 4. The value in this field should be consistent with the definition of the macro DEVICE_NR(device) (see Section 12.2). The macro in spull expands to MINOR(device)>>4.
- int max_p
  The maximum number of partitions. In our example, max_p is 16, or more generally, 1 << minor_shift.
- struct hd_struct *part
  The decoded partition table for the device. The driver uses this item to determine what range of the disk’s sectors is accessible through each minor number. The driver is responsible for allocation and deallocation of this array, which most drivers implement as a static array of max_nr << minor_shift structures. The driver should initialize the array to zeros before the kernel decodes the partition table.
- int *sizes
  An array of integers with the same information as the global blk_size array. In fact, they are usually the same array. The driver is responsible for allocating and deallocating the sizes array. Note that the partition check for the device copies this pointer to blk_size, so a driver handling partitionable devices doesn’t need to allocate the latter array.
- int nr_real
  The number of real devices (units) that exist.
- void *real_devices
  A private area that may be used by the driver to keep any additional required information.
- struct gendisk *next
  A pointer used to implement the linked list of generic hard-disk structures.
- struct block_device_operations *fops
  A pointer to the block operations structure for this device.
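The way major_name and minor_shift combine into a device name can be illustrated outside the kernel. The sketch below is a simplified user-space model of the naming rule described for major_name (base name, one letter per unit, a number per partition); model_disk_name is a hypothetical function, not the kernel’s own disk_name, which may handle cases this model ignores.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical model of the naming rule: base name, then a letter for
 * the unit, then the partition number (omitted for the whole device).
 */
static char *model_disk_name(const char *major_name, int minor_shift,
                             unsigned minor, char *buf, size_t len)
{
    unsigned unit = minor >> minor_shift;
    unsigned part = minor & ((1u << minor_shift) - 1);

    assert(buf != NULL && len > 0);
    assert(unit < 26);              /* the model only covers 'a'..'z' */

    if (part == 0)
        snprintf(buf, len, "%s%c", major_name, (int)('a' + unit));
    else
        snprintf(buf, len, "%s%c%u", major_name, (int)('a' + unit), part);
    return buf;
}
```

With major_name set to "pd" and minor_shift set to 4, minor number 19 maps to the name pdb3, and minor number 0 maps to pda.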
Many of the fields in the gendisk structure are set up at initialization time, so the compile-time setup is relatively simple:
struct gendisk spull_gendisk = {
    major:       0,                /* Major number assigned later */
    major_name:  "pd",             /* Name of the major device */
    minor_shift: SPULL_SHIFT,      /* Shift to get device number */
    max_p:       1 << SPULL_SHIFT, /* Number of partitions */
    fops:        &spull_bdops,     /* Block dev operations */
    /* everything else is dynamic */
};
When a module initializes itself, it must set things up properly for partition detection. Thus, spull starts by setting up the spull_sizes array for the gendisk structure (which also gets stored in blk_size[MAJOR_NR] and in the sizes field of the gendisk structure) and the spull_partitions array, which holds the actual partition information (and gets stored in the part member of the gendisk structure). Both of these arrays are initialized to zeros at this time. The code looks like this:
spull_sizes = kmalloc( (spull_devs << SPULL_SHIFT) * sizeof(int),
                      GFP_KERNEL);
if (!spull_sizes)
    goto fail_malloc;

/* Start with zero-sized partitions, and correctly sized units */
memset(spull_sizes, 0, (spull_devs << SPULL_SHIFT) * sizeof(int));
for (i=0; i< spull_devs; i++)
    spull_sizes[i<<SPULL_SHIFT] = spull_size;
blk_size[MAJOR_NR] = spull_gendisk.sizes = spull_sizes;

/* Allocate the partitions array. */
spull_partitions = kmalloc( (spull_devs << SPULL_SHIFT) *
                           sizeof(struct hd_struct), GFP_KERNEL);
if (!spull_partitions)
    goto fail_malloc;

memset(spull_partitions, 0, (spull_devs << SPULL_SHIFT) *
       sizeof(struct hd_struct));
/* fill in whole-disk entries */
for (i=0; i < spull_devs; i++)
    spull_partitions[i << SPULL_SHIFT].nr_sects =
        spull_size*(blksize/SPULL_HARDSECT);
spull_gendisk.part = spull_partitions;
spull_gendisk.nr_real = spull_devs;
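The resulting array layout is easy to check in user space. The sketch below mirrors the initialization above with calloc in place of kmalloc and memset. The names model_part, model_disk, and model_disk_init are hypothetical stand-ins, and the size and block-size values used to exercise it are chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define SPULL_SHIFT    4
#define SPULL_HARDSECT 512

/* Hypothetical stand-in for the two hd_struct fields of interest. */
struct model_part { long start_sect; long nr_sects; };

/* Hypothetical container for the two per-minor arrays. */
struct model_disk {
    int *sizes;                 /* one entry per minor number */
    struct model_part *part;    /* one entry per minor number */
};

/* Zero every slot, then fill in only the whole-disk slot of each unit. */
static int model_disk_init(struct model_disk *d, int devs, int size,
                           int blksize)
{
    int i, minors = devs << SPULL_SHIFT;

    assert(d != NULL && devs > 0);
    d->sizes = calloc((size_t)minors, sizeof(*d->sizes));
    d->part = calloc((size_t)minors, sizeof(*d->part));
    if (!d->sizes || !d->part) {
        free(d->sizes);
        free(d->part);
        d->sizes = NULL;
        d->part = NULL;
        return -1;
    }
    for (i = 0; i < devs; i++) {
        d->sizes[i << SPULL_SHIFT] = size;
        d->part[i << SPULL_SHIFT].nr_sects =
            (long)size * (blksize / SPULL_HARDSECT);
    }
    return 0;
}
```

With two units, a size of 2048, and a block size of 1024, slots 0 and 16 describe the whole devices (4096 sectors each), while slots 1 through 15 and 17 through 31 stay zeroed until the partition table is decoded.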
The driver should also include its gendisk structure on the global list. There is no kernel-supplied function for adding gendisk structures; it must be done by hand:
spull_gendisk.next = gendisk_head;
gendisk_head = &spull_gendisk;
In practice, the only thing the system does with this list is to implement /proc/partitions.
The register_disk function, which we have already seen briefly, handles the job of reading the disk’s partition table.
register_disk(struct gendisk *gd, int drive, unsigned minors, struct block_device_operations *ops, long size);
Here, gd is the gendisk structure that we built earlier, drive is the device number, minors is the number of partitions supported, ops is the block_device_operations structure for the driver, and size is the size of the device in sectors.
Fixed disks might read the partition table only at module initialization time and when BLKRRPART is invoked. Drivers for removable drives will also need to make this call in the revalidate method. Either way, it is important to remember that register_disk will call your driver’s request function to read the partition table, so the driver must be sufficiently initialized at that point to handle requests. You should also not have any locks held that will conflict with locks acquired in the request function.
register_disk must be called for each disk actually present on the system.
spull sets up partitions in the revalidate method:
int spull_revalidate(kdev_t i_rdev)
{
    /* first partition, # of partitions */
    int part1 = (DEVICE_NR(i_rdev) << SPULL_SHIFT) + 1;
    int npart = (1 << SPULL_SHIFT) -1;

    /* first clear old partition information */
    memset(spull_gendisk.sizes+part1, 0, npart*sizeof(int));
    memset(spull_gendisk.part +part1, 0, npart*sizeof(struct hd_struct));
    spull_gendisk.part[DEVICE_NR(i_rdev) << SPULL_SHIFT].nr_sects =
            spull_size << 1;

    /* then fill new info */
    printk(KERN_INFO "Spull partition check: (%d) ", DEVICE_NR(i_rdev));
    register_disk(&spull_gendisk, i_rdev, SPULL_MAXNRDEV, &spull_bdops,
                    spull_size << 1);
    return 0;
}
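The index arithmetic in the first half of spull_revalidate can be isolated in a small user-space sketch. The function model_clear_partitions is a hypothetical name. It applies the same part1 and npart computation to a plain array of integers, to show that only the 15 partition slots of one unit are cleared.

```c
#include <assert.h>
#include <string.h>

#define SPULL_SHIFT 4

/*
 * Hypothetical model: zero the partition slots of one unit, leaving the
 * unit's whole-disk slot and every other unit's slots untouched.
 */
static void model_clear_partitions(int *sizes, unsigned unit)
{
    unsigned part1 = (unit << SPULL_SHIFT) + 1;   /* first partition slot */
    unsigned npart = (1u << SPULL_SHIFT) - 1;     /* 15 partition slots */

    assert(sizes != NULL);
    memset(sizes + part1, 0, npart * sizeof(int));
}
```

For unit 1, the slots with indexes 17 through 31 are cleared, while slot 16 (the whole device) and slots 0 through 15 (the first unit) keep their values.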
It’s interesting to note that register_disk prints partition information by repeatedly calling
printk(" %s", disk_name(hd, minor, buf));
That’s why spull prints a leading string. It’s meant to add some context to the information that gets stuffed into the system log.
When a partitionable module is unloaded, the driver should arrange for all the partitions to be flushed, by calling fsync_dev for every supported major/minor pair. All of the relevant memory should be freed as well, of course. The cleanup function for spull is as follows:
for (i = 0; i < (spull_devs << SPULL_SHIFT); i++)
    fsync_dev(MKDEV(spull_major, i)); /* flush the devices */
blk_cleanup_queue(BLK_DEFAULT_QUEUE(major));
read_ahead[major] = 0;
kfree(blk_size[major]); /* which is gendisk->sizes as well */
blk_size[major] = NULL;
kfree(spull_gendisk.part);
kfree(blksize_size[major]);
blksize_size[major] = NULL;
It is also necessary to remove the gendisk structure from the global list. There is no function provided to do this work, so it’s done by hand:
for (gdp = &gendisk_head; *gdp; gdp = &((*gdp)->next))
    if (*gdp == &spull_gendisk) {
        *gdp = (*gdp)->next;
        break;
    }
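The removal loop walks the list through a pointer to the link being examined rather than a pointer to the node, so unlinking the head needs no special case. The following user-space sketch shows the same idiom on a stand-in structure. The names node, list_add_head, and list_remove are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct gendisk; only the link matters here. */
struct node {
    int major;
    struct node *next;
};

/* Insert at the head, as the driver does with gendisk_head. */
static void list_add_head(struct node **head, struct node *n)
{
    assert(head != NULL && n != NULL);
    n->next = *head;
    *head = n;
}

/* Unlink victim if present; returns 1 if it was found, 0 otherwise. */
static int list_remove(struct node **head, struct node *victim)
{
    struct node **pp;

    assert(head != NULL && victim != NULL);
    for (pp = head; *pp; pp = &(*pp)->next)
        if (*pp == victim) {
            *pp = victim->next;     /* bypass the node */
            return 1;
        }
    return 0;
}
```

Because pp always points at the link that leads to the current node, assigning through it rewrites either the head pointer or the previous node’s next field, whichever applies.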
Note that there is no unregister_disk to complement the register_disk function. Everything done by register_disk is stored in the driver’s own arrays, so there is no additional cleanup required at unload time.
If you want to mount your root filesystem from a device whose driver is available only in modularized form, you must use the initrd facility offered by modern Linux kernels. We won’t introduce initrd here; this subsection is aimed at readers who know about initrd and wonder how it affects block drivers. More information on initrd can be found in Documentation/initrd.txt in the kernel source.
When you boot a kernel with initrd, it establishes a temporary running environment before it mounts the real root filesystem. Modules are usually loaded from within the RAM disk being used as the temporary root filesystem.
Because the initrd process is run after all boot-time initialization is complete (but before the real root filesystem has been mounted), there’s no difference between loading a normal module and loading one living in the initrd RAM disk. If a driver can be correctly loaded and used as a module, all Linux distributions that have initrd available can include the driver on their installation disks without requiring you to hack in the kernel source.
We have seen how to initialize partitionable devices, but not yet how to access data within the partitions. To do that, we need to make use of the partition information stored in the gendisk->part array by register_disk. This array is made up of hd_struct structures, and is indexed by the minor number. The hd_struct has two fields of interest: start_sect tells where a given partition starts on the disk, and nr_sects gives the size of that partition.
Here we will show how spull makes use of that information. The following code includes only those parts of spull that differ from sbull, because most of the code is exactly the same.
First of all, open and close must keep track of the usage count for each device. Because the usage count refers to the physical device (unit), the following declaration and assignment is used for the dev variable:
Spull_Dev *dev = spull_devices + DEVICE_NR(inode->i_rdev);
The DEVICE_NR macro used here is the one that must be declared before <linux/blk.h> is included; it yields the physical device number without taking into account which partition is being used.
Although almost every device method works with the physical device as a whole, ioctl should access specific information for each partition. For example, when mkfs calls ioctl to retrieve the size of the device on which it will build a filesystem, it should be told the size of the partition of interest, not the size of the whole device. Here is how the BLKGETSIZE ioctl command is affected by the change from one minor number per device to multiple minor numbers per device. As you might expect, spull_gendisk.part is used as the source of the partition size.
case BLKGETSIZE:
    /* Return the device size, expressed in sectors */
    err = ! access_ok (VERIFY_WRITE, arg, sizeof(long));
    if (err) return -EFAULT;
    size = spull_gendisk.part[MINOR(inode->i_rdev)].nr_sects;
    if (copy_to_user((long *) arg, &size, sizeof (long)))
        return -EFAULT;
    return 0;
The other ioctl command that is different for partitionable devices is BLKRRPART. Rereading the partition table makes sense for partitionable devices and is equivalent to revalidating a disk after a disk change:
case BLKRRPART: /* re-read partition table */
    return spull_revalidate(inode->i_rdev);
But the major difference between sbull and spull is in the request function. In spull, the request function needs to use the partition information in order to correctly transfer data for the different minor numbers. Locating the transfer is done by simply adding the starting sector to that provided in the request; the partition size information is then used to be sure the request fits within the partition. Once that is done, the implementation is the same as for sbull.
Here are the relevant lines in spull_request:
ptr = device->data +
      (spull_partitions[minor].start_sect + req->sector)*SPULL_HARDSECT;
size = req->current_nr_sectors*SPULL_HARDSECT;
/*
 * Make sure that the transfer fits within the device.
 */
if (req->sector + req->current_nr_sectors >
                spull_partitions[minor].nr_sects) {
    static int count = 0;
    if (count++ < 5)
        printk(KERN_WARNING "spull: request past end of partition\n");
    return 0;
}
The number of sectors is multiplied by the hardware sector size (which, remember, is hardwired in spull) to get the size of the transfer in bytes.
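The sector arithmetic can be exercised in user space. The sketch below is a simplified model of the mapping, assuming a 512-byte hard sector as in spull. The names model_part and model_locate are hypothetical, and unlike the driver code, the model performs the bounds check before computing the offset.

```c
#include <assert.h>
#include <stddef.h>

#define SPULL_HARDSECT 512

/* Hypothetical stand-in for the two hd_struct fields of interest. */
struct model_part {
    unsigned long start_sect;   /* where the partition starts on the unit */
    unsigned long nr_sects;     /* size of the partition in sectors */
};

/*
 * Hypothetical model: map a partition-relative request to a byte offset
 * into the unit's data area. Returns -1 when the request would run past
 * the end of the partition.
 */
static long model_locate(const struct model_part *p, unsigned long sector,
                         unsigned long nr_sectors)
{
    assert(p != NULL);
    if (sector + nr_sectors > p->nr_sects)
        return -1;
    return (long)((p->start_sect + sector) * SPULL_HARDSECT);
}
```

For a partition that starts at sector 100 and is 50 sectors long, a request for 8 sectors at partition-relative sector 42 lands at byte offset 72704 (sector 142 of the unit), while the same request at sector 43 is rejected because it would end one sector past the partition.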