read and write

The read and write methods perform a similar task, that is, copying data from and to application code. Therefore, their prototypes are pretty similar and it’s worth introducing them at the same time:

 ssize_t read(struct file *filp, char *buff,
     size_t count, loff_t *offp);
 ssize_t write(struct file *filp, const char *buff,
     size_t count, loff_t *offp);

For both methods, filp is the file pointer and count is the size of the requested data transfer. The buff argument points to the user buffer holding the data to be written or the empty buffer where the newly read data should be placed. Finally, offp is a pointer to a “long offset type” object that indicates the file position the user is accessing. The return value is a “signed size type;” its use is discussed later.

As far as data transfer is concerned, the main issue associated with the two device methods is the need to transfer data between the kernel address space and the user address space. The operation cannot be carried out through pointers in the usual way, or through memcpy. User-space addresses cannot be used directly in kernel space, for a number of reasons.

One big difference between kernel-space addresses and user-space addresses is that memory in user-space can be swapped out. When the kernel accesses a user-space pointer, the associated page may not be present in memory, and a page fault is generated. The functions we introduce in this section and in Section 5.1.4 in Chapter 5 use some hidden magic to deal with page faults in the proper way even when the CPU is executing in kernel space.

Also, it’s interesting to note that the x86 port of Linux 2.0 used a completely different memory map for user space and kernel space. Thus, user-space pointers couldn’t be dereferenced at all from kernel space.

If the target device is an expansion board instead of RAM, the same problem arises, because the driver must nonetheless copy data between user buffers and kernel space (and possibly between kernel space and I/O memory).

Cross-space copies are performed in Linux by special functions, defined in <asm/uaccess.h>. Such a copy is either performed by a generic (memcpy-like) function or by functions optimized for a specific data size (char, short, int, long); most of them are introduced in Section 5.1.4 in Chapter 5.

The code for read and write in scull needs to copy a whole segment of data to or from the user address space. This capability is offered by the following kernel functions, which copy an arbitrary array of bytes and sit at the heart of every read and write implementation:

 unsigned long copy_to_user(void *to, const void *from, 
        unsigned long count);
 unsigned long copy_from_user(void *to, const void *from, 
        unsigned long count);

Although these functions behave like normal memcpy functions, a little extra care must be used when accessing user space from kernel code. The user pages being addressed might not be currently present in memory, and the page-fault handler can put the process to sleep while the page is being transferred into place. This happens, for example, when the page must be retrieved from swap space. The net result for the driver writer is that any function that accesses user space must be reentrant and must be able to execute concurrently with other driver functions (see also Section 5.2.3 in Chapter 5). That’s why we use semaphores to control concurrent access.

The role of the two functions is not limited to copying data to and from user-space: they also check whether the user space pointer is valid. If the pointer is invalid, no copy is performed; if an invalid address is encountered during the copy, on the other hand, only part of the data is copied. In both cases, the return value is the amount of memory still to be copied. The scull code looks for this error return, and returns -EFAULT to the user if it’s not 0.

The topic of user-space access and invalid user space pointers is somewhat advanced, and is discussed in Section 5.1.4 in Chapter 5. However, it’s worth suggesting that if you don’t need to check the user-space pointer you can invoke __copy_to_user and __copy_from_user instead. This is useful, for example, if you know you already checked the argument.

As far as the actual device methods are concerned, the task of the read method is to copy data from the device to user space (using copy_to_user), while the write method must copy data from user space to the device (using copy_from_user). Each read or write system call requests transfer of a specific number of bytes, but the driver is free to transfer less data—the exact rules are slightly different for reading and writing and are described later in this chapter.

Whatever the amount of data the methods transfer, they should in general update the file position at *offp to represent the current file position after successful completion of the system call. Most of the time the offp argument is just a pointer to filp->f_pos, but a different pointer is used in order to support the pread and pwrite system calls, which perform the equivalent of lseek and read or write in a single, atomic operation.

Figure 3-2 represents how a typical read implementation uses its arguments.

The arguments to read

Figure 3-2. The arguments to read

Both the read and write methods return a negative value if an error occurs. A return value greater than or equal to 0 tells the calling program how many bytes have been successfully transferred. If some data is transferred correctly and then an error happens, the return value must be the count of bytes successfully transferred, and the error does not get reported until the next time the function is called.

Although kernel functions return a negative number to signal an error, and the value of the number indicates the kind of error that occurred (as introduced in Chapter 2 in Section 2.4.1), programs that run in user space always see -1 as the error return value. They need to access the errno variable to find out what happened. The difference in behavior is dictated by the POSIX calling standard for system calls and the advantage of not dealing with errno in the kernel.

The read Method

The return value for read is interpreted by the calling application program as follows:

  • If the value equals the count argument passed to the read system call, the requested number of bytes has been transferred. This is the optimal case.

  • If the value is positive, but smaller than count, only part of the data has been transferred. This may happen for a number of reasons, depending on the device. Most often, the application program will retry the read. For instance, if you read using the fread function, the library function reissues the system call till completion of the requested data transfer.

  • If the value is 0, end-of-file was reached.

  • A negative value means there was an error. The value specifies what the error was, according to <linux/errno.h>. These errors look like -EINTR (interrupted system call) or -EFAULT (bad address).

What is missing from the preceding list is the case of “there is no data, but it may arrive later.” In this case, the read system call should block. We won’t deal with blocking input until Section 5.2 in Chapter 5.

The scull code takes advantage of these rules. In particular, it takes advantage of the partial-read rule. Each invocation of scull_read deals only with a single data quantum, without implementing a loop to gather all the data; this makes the code shorter and easier to read. If the reading program really wants more data, it reiterates the call. If the standard I/O library (i.e., fread and friends) is used to read the device, the application won’t even notice the quantization of the data transfer.

If the current read position is greater than the device size, the read method of scull returns 0 to signal that there’s no data available (in other words, we’re at end-of-file). This situation can happen if process A is reading the device while process B opens it for writing, thus truncating the device to a length of zero. Process A suddenly finds itself past end-of-file, and the next read call returns 0.

Here is the code for read:

ssize_t scull_read(struct file *filp, char *buf, size_t count,
    loff_t *f_pos)
{
 Scull_Dev *dev = filp->private_data; /* the first list item */
 Scull_Dev *dptr;
 int quantum = dev->quantum;
 int qset = dev->qset;
 int itemsize = quantum * qset; /* how many bytes in the list item */
 int item, s_pos, q_pos, rest;
 ssize_t ret = 0;

 if (down_interruptible(&dev->sem))
   return -ERESTARTSYS;
 if (*f_pos >= dev->size)
  goto out;
 if (*f_pos + count > dev->size)
  count = dev->size - *f_pos;
 /* find list item, qset index, and offset in the quantum */
 item = (long)*f_pos / itemsize;
 rest = (long)*f_pos % itemsize;
 s_pos = rest / quantum; q_pos = rest % quantum;

 /* follow the list up to the right position (defined elsewhere) */
 dptr = scull_follow(dev, item);

 if (!dptr->data)
  goto out; /* don't fill holes */
 if (!dptr->data[s_pos])
  goto out;
 /* read only up to the end of this quantum */
 if (count > quantum - q_pos)
  count = quantum - q_pos;


 if (copy_to_user(buf, dptr->data[s_pos]+q_pos, count)) {
  ret = -EFAULT;
	goto out;
 }
 *f_pos += count;
 ret = count;

 out:
 up(&dev->sem);
 return ret;
}

The write Method

write, like read, can transfer less data than was requested, according to the following rules for the return value:

  • If the value equals count, the requested number of bytes has been transferred.

  • If the value is positive, but smaller than count, only part of the data has been transferred. The program will most likely retry writing the rest of the data.

  • If the value is 0, nothing was written. This result is not an error, and there is no reason to return an error code. Once again, the standard library retries the call to write. We’ll examine the exact meaning of this case in Section 5.2 in Chapter 5, where blocking write is introduced.

  • A negative value means an error occurred; like for read, valid error values are those defined in <linux/errno.h>.

Unfortunately, there may be misbehaving programs that issue an error message and abort when a partial transfer is performed. This happens because some programmers are accustomed to seeing write calls that either fail or succeed completely, which is actually what happens most of the time and should be supported by devices as well. This limitation in the scull implementation could be fixed, but we didn’t want to complicate the code more than necessary.

The scull code for write deals with a single quantum at a time, like the read method does:

ssize_t scull_write(struct file *filp, const char *buf, size_t count,
    loff_t *f_pos)
{
 Scull_Dev *dev = filp->private_data;
 Scull_Dev *dptr;
 int quantum = dev->quantum;
 int qset = dev->qset;
 int itemsize = quantum * qset;
 int item, s_pos, q_pos, rest;
 ssize_t ret = -ENOMEM; /* value used in "goto out" statements */

 if (down_interruptible(&dev->sem))
   return -ERESTARTSYS;

 /* find list item, qset index and offset in the quantum */
 item = (long)*f_pos / itemsize;
 rest = (long)*f_pos % itemsize;
 s_pos = rest / quantum; q_pos = rest % quantum;

 /* follow the list up to the right position */
 dptr = scull_follow(dev, item);
 if (!dptr->data) {
  dptr->data = kmalloc(qset * sizeof(char *), GFP_KERNEL);
  if (!dptr->data)
   goto out;
  memset(dptr->data, 0, qset * sizeof(char *));
 }
 if (!dptr->data[s_pos]) {
  dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
  if (!dptr->data[s_pos])
   goto out;
 }
 /* write only up to the end of this quantum */
 if (count > quantum - q_pos)
  count = quantum - q_pos;

 if (copy_from_user(dptr->data[s_pos]+q_pos, buf, count)) {
  ret = -EFAULT;
	goto out;
 }
 *f_pos += count;
 ret = count;

 /* update the size */
 if (dev->size < *f_pos)
  dev-> size = *f_pos;

 out:
 up(&dev->sem);
 return ret;
}

readv and writev

Unix systems have long supported two alternative system calls named readv and writev. These “vector” versions take an array of structures, each of which contains a pointer to a buffer and a length value. A readv call would then be expected to read the indicated amount into each buffer in turn. writev, instead, would gather together the contents of each buffer and put them out as a single write operation.

Until version 2.3.44 of the kernel, however, Linux always emulated readv and writev with multiple calls to read and write. If your driver does not supply methods to handle the vector operations, they will still be implemented that way. In many situations, however, greater efficiency is achieved by implementing readv and writev directly in the driver.

The prototypes for the vector operations are as follows:

 ssize_t (*readv) (struct file *filp, const struct iovec *iov, 
      unsigned long count, loff_t *ppos);
 ssize_t (*writev) (struct file *filp, const struct iovec *iov, 
      unsigned long count, loff_t *ppos);

Here, the filp and ppos arguments are the same as for read and write. The iovec structure, defined in <linux/uio.h>, looks like this:

 struct iovec
 {
  void *iov_base;
  _&thinsp;_kernel_size_t iov_len;
 };

Each iovec describes one chunk of data to be transferred; it starts at iov_base (in user space) and is iov_len bytes long. The count parameter to the method tells how many iovec structures there are. These structures are created by the application, but the kernel copies them into kernel space before calling the driver.

The simplest implementation of the vectored operations would be a simple loop that just passes the address and length out of each iovec to the driver’s read or write function. Often, however, efficient and correct behavior requires that the driver do something smarter. For example, a writev on a tape drive should write the contents of all the iovec structures as a single record on the tape.

Many drivers, though, will gain no benefit from implementing these methods themselves. Thus, scull omits them. The kernel will emulate them with read and write, and the end result is the same.

Get Linux Device Drivers, Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.