
Linux System Programming: Talking Directly to the Kernel and C Library

By Robert Love



Table of Contents

Chapter 1: Introduction and Essential Concepts
This book is about system programming, which is the art of writing system software. System software lives at a low level, interfacing directly with the kernel and core system libraries. System software includes your shell and your text editor, your compiler and your debugger, your core utilities and system daemons. These components are entirely system software, based on the kernel and the C library. Much other software (such as high-level GUI applications) lives mostly in the higher levels, delving into the low level only on occasion, if at all. Some programmers spend all day every day writing system software; others spend only part of their time on this task. There is no programmer, however, who does not benefit from some understanding of system programming. Whether it is the programmer's raison d'être, or merely a foundation for higher-level concepts, system programming is at the heart of all software that we write.
In particular, this book is about system programming on Linux. Linux is a modern Unix-like system, written from scratch by Linus Torvalds, and a loose-knit community of hackers around the globe. Although Linux shares the goals and ideology of Unix, Linux is not Unix. Instead, Linux follows its own course, diverging where desired, and converging only where practical. Generally, the core of Linux system programming is the same as on any other Unix system. Beyond the basics, however, Linux does well to differentiate itself—in comparison with traditional Unix systems, Linux is rife with additional system calls, different behavior, and new features.
Traditionally speaking, all Unix programming is system-level programming. Historically, Unix systems did not include many higher-level abstractions. Even programming in a development environment such as the X Window System exposed in full view the core Unix system API. Consequently, it can be said that this book is a book on Linux programming in general. But note that this book does not cover the Linux programming environment—there is no tutorial on make in these pages. What is covered is the system programming API exposed on a modern Linux machine.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
System Programming
System programming is most commonly contrasted with application programming. System-level and application-level programming differ in some aspects, but not in others. System programming is distinct in that system programmers must have a strong awareness of the hardware and operating system on which they are working. Of course, there are also differences between the libraries used and calls made. Depending on the "level" of the stack at which an application is written, the two may not actually be very interchangeable, but, generally speaking, moving from application programming to system programming (or vice versa) is not hard. Even when the application lives very high up the stack, far from the lowest levels of the system, knowledge of system programming is important. And the same good practices are employed in all forms of programming.
The last several years have witnessed a trend in application programming away from system-level programming and toward very high-level development, either through web software (such as JavaScript or PHP), or through managed code (such as C# or Java). This development, however, does not foretell the death of system programming. Indeed, someone still has to write the JavaScript interpreter and the C# runtime, which is itself system programming. Furthermore, the developers writing PHP or Java can still benefit from knowledge of system programming, as an understanding of the core internals allows for better code no matter where in the stack the code is written.
Despite this trend in application programming, the majority of Unix and Linux code is still written at the system level. Much of it is C, and subsists primarily on interfaces provided by the C library and the kernel. This is traditional system programming—Apache,
APIs and ABIs
Programmers are naturally interested in ensuring their programs run on all of the systems that they have promised to support, now and in the future. They want to feel secure that programs they write on their Linux distributions will run on other Linux distributions, as well as on other supported Linux architectures and newer (or earlier) Linux versions.
At the system level, there are two separate sets of definitions and descriptions that impact portability. One is the application programming interface (API), and the other is the application binary interface (ABI). Both define and describe the interfaces between different pieces of computer software.
An API defines the interfaces by which one piece of software communicates with another at the source level. It provides abstraction by providing a standard set of interfaces—usually functions—that one piece of software (typically, although not necessarily, a higher-level piece) can invoke from another piece of software (usually a lower-level piece). For example, an API might abstract the concept of drawing text on the screen through a family of functions that provide everything needed to draw the text. The API merely defines the interface; the piece of software that actually provides the API is known as the implementation of the API.
It is common to call an API a "contract." This is not correct, at least in the legal sense of the term, as an API is not a two-way agreement. The API user (generally, the higher-level software) has zero input into the API and its implementation. It may use the API as-is, or not use it at all: take it or leave it! The API acts only to ensure that if both pieces of software follow the API, they are source compatible; that is, that the user of the API will successfully compile against the implementation of the API.
A real-world example is the API defined by the C standard and implemented by the standard C library. This API defines a family of basic and essential functions, such as string-manipulation routines.
Throughout this book, we will rely on the existence of various APIs, such as the standard I/O library, which is discussed in a later chapter. The most important APIs in Linux system programming are discussed later in this chapter.
Standards
Unix system programming is an old art. The basics of Unix programming have existed untouched for decades. Unix systems, however, are dynamic beasts. Behavior changes and features are added. To help bring order to chaos, standards groups codify system interfaces into official standards. Numerous such standards exist, but technically speaking, Linux does not officially comply with any of them. Instead, Linux aims toward compliance with two of the most important and prevalent standards: POSIX and the Single UNIX Specification (SUS).
POSIX and SUS document, among other things, the C API for a Unix-like operating system interface. Effectively, they define system programming, or at least a common subset thereof, for compliant Unix systems.
In the mid-1980s, the Institute of Electrical and Electronics Engineers (IEEE) spearheaded an effort to standardize system-level interfaces on Unix systems. Richard Stallman, founder of the Free Software movement, suggested the standard be named POSIX (pronounced pahz-icks), which now stands for Portable Operating System Interface.
The first result of this effort, issued in 1988, was IEEE Std 1003.1-1988 (POSIX 1988, for short). In 1990, the IEEE revised the POSIX standard with IEEE Std 1003.1-1990 (POSIX 1990). Optional real-time and threading support were documented in, respectively, IEEE Std 1003.1b-1993 (POSIX 1993 or POSIX.1b), and IEEE Std 1003.1c-1995 (POSIX 1995 or POSIX.1c). In 2001, the optional standards were rolled together with the base POSIX 1990, creating a single standard: IEEE Std 1003.1-2001 (POSIX 2001). The latest revision, released in April 2004, is IEEE Std 1003.1-2004. All of the core POSIX standards are abbreviated POSIX.1, with the 2004 revision being the latest.
In the late 1980s and early 1990s, Unix system vendors were engaged in the "Unix Wars," with each struggling to define its Unix variant as the Unix operating system. Several major Unix vendors rallied around The Open Group, an industry consortium formed from the merging of the Open Software Foundation (OSF) and X/Open. The Open Group provides certification, white papers, and compliance testing. In the early 1990s, with the Unix Wars raging, The Open Group released the Single UNIX Specification. SUS rapidly grew in popularity, in large part due to its cost (free) versus the high cost of the POSIX standard. Today, SUS incorporates the latest POSIX standard.
Concepts of Linux Programming
This section presents a concise overview of the services provided by a Linux system. All Unix systems, Linux included, provide a mutual set of abstractions and interfaces. Indeed, this commonality defines Unix. Abstractions such as the file and the process, interfaces to manage pipes and sockets, and so on, are at the core of what is Unix.
This overview assumes that you are familiar with the Linux environment: I presume that you can get around in a shell, use basic commands, and compile a simple C program. This is not an overview of Linux, or its programming environment, but rather of the "stuff" that forms the basis of Linux system programming.
The file is the most basic and fundamental abstraction in Linux. Linux follows the everything-is-a-file philosophy (although not as strictly as some other systems, such as Plan 9). Consequently, much interaction transpires via reading of and writing to files, even when the object in question is not what you would consider your everyday file.
In order to be accessed, a file must first be opened. Files can be opened for reading, writing, or both. An open file is referenced via a unique descriptor, a mapping from the metadata associated with the open file back to the specific file itself. Inside the Linux kernel, this descriptor is handled by an integer (of the C type int) called the file descriptor, abbreviated fd. File descriptors are shared with user space, and are used directly by user programs to access files. A large part of Linux system programming consists of opening, manipulating, closing, and otherwise using file descriptors.

Regular files

What most of us call "files" are what Linux labels regular files. A regular file contains bytes of data, organized into a linear array called a byte stream. In Linux, no further organization or formatting is specified for a file. The bytes may have any values, and they may be organized within the file in any way. At the system level, Linux does not enforce a structure upon files beyond the byte stream. Some operating systems, such as VMS, provide highly structured files, supporting concepts such as
Getting Started with System Programming
This chapter looked at the fundamentals of Linux system programming and provided a programmer's overview of the Linux system. The next chapter discusses basic file I/O. This includes, of course, reading from and writing to files; however, because Linux implements many interfaces as files, file I/O is crucial to a lot more than just, well, files.
With the preliminaries behind us, growing smaller on the horizon, it's time to dive into actual system programming. Let's go!
Chapter 2: File I/O
This chapter covers the basics of reading from and writing to files. Such operations form the core of a Unix system. The next chapter covers standard I/O from the standard C library, and later chapters continue the coverage with a treatment of the more advanced and specialized file I/O interfaces and round out the discussion by addressing the topic of file and directory manipulation.
Before a file can be read from or written to, it must be opened. The kernel maintains a per-process list of open files, called the file table. This table is indexed via nonnegative integers known as file descriptors (often abbreviated fds). Each entry in the list contains information about an open file, including a pointer to an in-memory copy of the file's backing inode and associated metadata, such as the file position and access modes. Both user space and kernel space use file descriptors as unique per-process cookies. Opening a file returns a file descriptor, while subsequent operations (reading, writing, and so on) take the file descriptor as their primary argument.
By default, a child process receives a copy of its parent's file table. The list of open files and their access modes, current file positions, and so on, are the same, but a change in one process (say, the child closing a file) does not affect the other process's file table. However, as you'll see later in the book, it is possible for the child and parent to share the parent's file table (as threads do).
File descriptors are represented by the C int type. Not using a special type—an fd_t, say—is often considered odd, but is, historically, the Unix way. Each Linux process has a maximum number of files that it may open. File descriptors start at 0, and go up to one less than this maximum value. By default, the maximum is 1,024, but it can be configured as high as 1,048,576. Because negative values are not legal file descriptors, −1 is often used to indicate an error from a function that would otherwise return a valid file descriptor.
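The text does not say how a program can discover its own limit. One standard way, shown here as a minimal sketch, is the POSIX getrlimit( ) call with the RLIMIT_NOFILE resource. The helper names get_fd_limits( ) and print_fd_limits( ) are made up for illustration:

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Fetch the soft and hard limits on open file descriptors for the
 * calling process. Returns 0 on success, -1 on error (errno is set).
 * The function name is illustrative, not a standard interface.
 */
int get_fd_limits(rlim_t *soft, rlim_t *hard)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
                return -1;

        *soft = rl.rlim_cur;    /* limit currently enforced */
        *hard = rl.rlim_max;    /* ceiling to which the soft limit may be raised */
        return 0;
}

/* Print both limits; valid descriptors run from 0 to soft - 1. */
void print_fd_limits(void)
{
        rlim_t soft, hard;

        if (get_fd_limits(&soft, &hard) == -1) {
                perror("getrlimit");
                return;
        }

        printf("soft=%llu hard=%llu\n",
               (unsigned long long) soft, (unsigned long long) hard);
}
```

The soft limit is the value that open( ) actually enforces; the values printed will depend on how the system and the shell are configured.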
Unless the process explicitly closes them, every process by convention has at least three file descriptors open: 0, 1, and 2. File descriptor 0 is
Opening Files
The most basic method of accessing a file is via the read( ) and write( ) system calls. Before a file can be accessed, however, it must be opened via an open( ) or creat( ) system call. Once done using the file, it should be closed using the system call close( ).
A file is opened, and a file descriptor is obtained with the open( ) system call:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int open (const char *name, int flags);
int open (const char *name, int flags, mode_t mode);
The open( ) system call maps the file given by the pathname name to a file descriptor, which it returns on success. The file position is set to zero, and the file is opened for access according to the flags given by flags.

Flags for open( )

The flags argument must be one of O_RDONLY, O_WRONLY, or O_RDWR. Respectively, these arguments request that the file be opened only for reading, only for writing, or for both reading and writing.
For example, the following code opens /home/kidd/madagascar for reading:
int fd;

fd = open ("/home/kidd/madagascar", O_RDONLY);
if (fd == -1)
        /* error */
A file opened only for writing cannot also be read, and vice versa. The process issuing the open( ) system call must have sufficient permissions to obtain the access requested.
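The snippet above leaves the error branch empty. As a rough sketch of what that branch might do, the following helper reports why the open failed by consulting errno. The name open_for_reading( ) is hypothetical, and printing to stderr is just one possible policy:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Open 'path' read-only. Returns the new file descriptor, or -1 on
 * failure after printing the reason. errno is preserved for the caller.
 */
int open_for_reading(const char *path)
{
        int fd;

        fd = open(path, O_RDONLY);
        if (fd == -1) {
                int saved = errno;      /* fprintf() may overwrite errno */

                /* Typical causes: ENOENT (no such file), EACCES (no permission). */
                fprintf(stderr, "open %s: %s\n", path, strerror(saved));
                errno = saved;
                return -1;
        }

        return fd;
}
```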
The flags argument can be bitwise-ORed with one or more of the following values, modifying the behavior of the open request:
O_APPEND
The file will be opened in append mode. That is, before each write, the file position will be updated to point to the end of the file. This occurs even if another process has written to the file after the issuing process's last write, thereby changing the file position. (Append mode is covered in more detail later in this chapter.)
O_ASYNC
A signal (SIGIO by default) will be generated when the specified file becomes readable or writable. This flag is available only for terminals and sockets, not for regular files.
O_CREAT
If the file denoted by name does not exist, the kernel will create it. If the file already exists, this flag has no effect unless O_EXCL is also given.
O_DIRECT
The file will be opened for direct I/O (see "Direct I/O" later in this chapter).
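To illustrate how an access mode is combined with the modifier flags, here is a small sketch that appends a line to a file, creating the file if necessary. Because O_CREAT is given, the three-argument form of open( ) is used. The function name append_line( ) and the permission bits 0644 are choices made for this example only:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append 'line' to the file at 'path', creating the file if it does
 * not exist. Returns 0 on success, -1 on error.
 */
int append_line(const char *path, const char *line)
{
        size_t len = strlen(line);
        ssize_t nr;
        int fd;

        /* One access mode (O_WRONLY) ORed with two modifiers. */
        fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd == -1)
                return -1;

        /* With O_APPEND, each write lands at the current end of the file. */
        nr = write(fd, line, len);

        if (close(fd) == -1)
                return -1;

        return (nr == (ssize_t) len) ? 0 : -1;
}
```

Calling append_line( ) twice on the same path leaves both lines in the file in order, since neither call truncates or seeks.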
Reading via read( )
Now that you know how to open a file, let's look at how to read it. In the following section, we will examine writing.
The most basic—and common—mechanism used for reading is the read( ) system call, defined in POSIX.1:
#include <unistd.h>

ssize_t read (int fd, void *buf, size_t len);
Each call reads up to len bytes into buf from the current file offset of the file referenced by fd. On success, the number of bytes written into buf is returned. On error, the call returns −1, and errno is set. The file position is advanced by the number of bytes read from fd. If the object represented by fd is not capable of seeking (for example, a character device file), the read always occurs from the "current" position.
Basic usage is simple. This example reads from the file descriptor fd into word. The number of bytes read is equal to the size of the unsigned long type, which is four bytes on 32-bit Linux systems, and eight bytes on 64-bit systems. On return, nr contains the number of bytes read, or −1 on error:
unsigned long word;
ssize_t nr;

/* read a couple bytes into 'word' from 'fd' */
nr = read (fd, &word, sizeof (unsigned long));
if (nr == -1)
        /* error */
There are two problems with this naïve implementation: the call might return without reading all len bytes, and it could produce certain errors that this code does not check for and handle. Code such as this, unfortunately, is very common. Let's see how to improve it.
It is legal for read( ) to return a positive nonzero value less than len. This can happen for a number of reasons: fewer than len bytes may have been available, the system call may have been interrupted by a signal, the pipe may have broken (if fd is a pipe), and so on.
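One common way to cope with partial reads is to call read( ) in a loop until the requested number of bytes has arrived. The following is a sketch of that idea, not the book's own listing. It assumes a blocking file descriptor (so EAGAIN is not handled), retries on EINTR, and stops early if read( ) returns 0 for end-of-file. The name read_full( ) is hypothetical:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read until 'len' bytes have arrived, EOF is reached, or an error
 * occurs. Returns the number of bytes read (possibly fewer than 'len'
 * if EOF came first), or -1 on error with errno set.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
        char *p = buf;
        size_t left = len;

        while (left > 0) {
                ssize_t ret = read(fd, p, left);

                if (ret == -1) {
                        if (errno == EINTR)
                                continue;       /* interrupted before any data: retry */
                        return -1;
                }
                if (ret == 0)
                        break;                  /* end-of-file */

                p += ret;                       /* partial read: advance and go again */
                left -= (size_t) ret;
        }

        return (ssize_t) (len - left);
}
```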
The possibility of a return value of 0 is another consideration when using read( ). The read( ) system call returns 0 to indicate end-of-file (EOF); in this case, of course, no bytes were read. EOF is not considered an error (and hence is not accompanied by a −1 return value); it simply indicates that the file position has advanced past the last valid offset in the file, and thus there is nothing else to read. If, however, a call is made for
Writing with write( )
The most basic and common system call used for writing is write( ). write( ) is the counterpart of read( ) and is also defined in POSIX.1:
#include <unistd.h>

ssize_t write (int fd, const void *buf, size_t count);
A call to write( ) writes up to count bytes starting at buf to the current file position of the file referenced by the file descriptor fd. Files backed by objects that do not support seeking (for example, character devices) always write starting at the "head."
On success, the number of bytes written is returned, and the file position is updated in kind. On error, −1 is returned, and errno is set appropriately. A call to write( ) can return 0, but this return value does not have any special meaning; it simply implies that zero bytes were written.
As with read( ), the most basic usage is simple:
const char *buf = "My ship is solid!";
ssize_t nr;

/* write the string in 'buf' to 'fd' */
nr = write (fd, buf, strlen (buf));
if (nr == -1)
        /* error */
But again, as with read( ), this usage is not quite right. Callers also need to check for the possible occurrence of a partial write:
unsigned long word = 1720;
size_t count;
ssize_t nr;

count = sizeof (word);
nr = write (fd, &word, count);
if (nr == -1)
        /* error, check errno */
else if (nr != count)
        /* possible error, but 'errno' not set */
A write( ) system call is less likely to return a partial write than a read( ) system call is to return a partial read. Also, there is no EOF condition for a write( ) system call. For regular files, write( ) is guaranteed to perform the entire requested write, unless an error occurs.
Consequently, for regular files, you do not need to perform writes in a loop. However, for other file types—say, sockets—a loop may be required to guarantee that you really write out all of the requested bytes. Another benefit of using a loop is that a second call to write( ) may return an error revealing what caused the first call to perform only a partial write (although, again, this situation is not very common). Here's an example:
ssize_t ret, nr;

while (len != 0 && (ret = write (fd, buf, len)) != 0) {
        if (ret == -1) {
                if (errno == EINTR)
                        continue;
                perror ("write");
                break;
        }

        len -= ret;
        buf += ret;
}
Synchronized I/O
Although synchronizing I/O is an important topic, the issues associated with delayed writes should not be feared. Buffering writes provides a huge performance improvement, and consequently, any operating system even halfway deserving the mark "modern" implements delayed writes via buffers. Nonetheless, there are times when applications want to control when data hits the disk. For those uses, the Linux kernel provides a handful of options that allow performance to be traded for synchronized operations.
The simplest method of ensuring that data has reached the disk is via the fsync( ) system call, defined by POSIX.1b:
#include <unistd.h>

int fsync (int fd);
A call to fsync( ) ensures that all dirty data associated with the file mapped by the file descriptor fd is written back to disk. The file descriptor fd must be open for writing. The call writes back both data and metadata, such as modification timestamps, and other attributes contained in the inode. It will not return until the hard drive says that the data and metadata are on the disk.
In the case of write caches on hard disks, it is not possible for fsync( ) to know whether the data is physically on the disk. The hard drive can report that the data was written, but the data may in fact reside in the drive's write cache. Fortunately, data in a hard disk's cache should be committed to the disk in short order.
Linux also provides the system call fdatasync( ):
#include <unistd.h>

int fdatasync (int fd);
This system call does the same thing as fsync( ), except that it only flushes data. The call does not guarantee that metadata is synchronized to disk, and is therefore potentially faster. Often this is sufficient.
Both functions are used the same way, which is very simple:
int ret;

ret = fsync (fd);
if (ret == -1)
        /* error */
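Putting the pieces together, a write followed by a flush might look like the sketch below. It writes a buffer to a file and does not report success until fsync( ) or fdatasync( ) has returned. The function name save_durably( ), its data_only parameter, and the decision to treat a partial write as a failure are all assumptions of this example:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Replace the contents of 'path' with 'len' bytes from 'buf' and flush
 * them to disk before returning. If 'data_only' is nonzero, use
 * fdatasync() (data only); otherwise use fsync() (data and metadata).
 * Returns 0 on success, -1 on error.
 */
int save_durably(const char *path, const void *buf, size_t len, int data_only)
{
        ssize_t nr;
        int fd, ret;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1)
                return -1;

        nr = write(fd, buf, len);
        if (nr == -1 || (size_t) nr != len) {
                close(fd);      /* partial writes count as failure in this sketch */
                return -1;
        }

        ret = data_only ? fdatasync(fd) : fsync(fd);
        if (ret == -1) {
                close(fd);
                return -1;
        }

        return close(fd);
}
```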
Neither function guarantees that any updated directory entries containing the file are synchronized to disk. This implies that if a file's link has recently been updated, the file's data may successfully reach the disk, but not the associated directory entry, rendering the file unreachable. To ensure that any updates to the directory entry are also committed to disk,
Direct I/O
The Linux kernel, like any modern operating system kernel, implements a complex layer of caching, buffering, and I/O management between devices and applications (discussed at the end of this chapter). A high-performance application may wish to bypass this layer of complexity and perform its own I/O management. Rolling your own I/O system is usually not worth the effort, though, and in fact the tools available at the operating-system level are likely to achieve much better performance than those available at the application level. Still, database systems often prefer to perform their own caching, and want to minimize the presence of the operating system as much as feasible.
Providing the O_DIRECT flag to open( ) instructs the kernel to minimize the presence of I/O management. When this flag is provided, I/O will initiate directly from user-space buffers to the device, bypassing the page cache. All I/O will be synchronous; operations will not return until completed.
When performing direct I/O, the request length, buffer alignment, and file offsets must all be integer multiples of the underlying device's sector size—generally, this is 512 bytes. Before the 2.6 Linux kernel, this requirement was stricter: in 2.4, everything must be aligned on the filesystem's logical block size (often 4 KB). To remain compatible, applications should align to the larger (and potentially less convenient) logical block size.
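The alignment rules are easy to get wrong, so it can help to centralize them. The sketch below shows one way to obtain a suitably aligned buffer with the POSIX posix_memalign( ) call, round a length up to a multiple of the alignment, and check a request before issuing it. The helper names are hypothetical, and the alignment itself (512, 4096, or something else) is left to the caller because it depends on the device and filesystem. Opening the file with O_DIRECT is not shown:

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Round 'n' up to the next multiple of 'align' ('align' must be nonzero). */
size_t round_up(size_t n, size_t align)
{
        return ((n + align - 1) / align) * align;
}

/*
 * Return nonzero if the buffer address, the request length, and the
 * file offset are all integer multiples of 'align'.
 */
int direct_io_ok(const void *buf, size_t len, off_t offset, size_t align)
{
        return (uintptr_t) buf % align == 0 &&
               len % align == 0 &&
               offset % (off_t) align == 0;
}

/*
 * Allocate 'len' bytes whose address is a multiple of 'align'.
 * posix_memalign() requires 'align' to be a power of two and a
 * multiple of sizeof(void *). Returns NULL on failure; release the
 * buffer with free().
 */
void *alloc_aligned(size_t align, size_t len)
{
        void *buf = NULL;

        if (posix_memalign(&buf, align, len) != 0)
                return NULL;

        return buf;
}
```

For example, a 5,000-byte request with a 4,096-byte alignment would be rounded up to 8,192 bytes and read into a buffer obtained from alloc_aligned(4096, 8192).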
Closing Files
After a program has finished working with a file descriptor, it can unmap the file descriptor from the associated file via the close( ) system call:
#include <unistd.h>

int close (int fd);
A call to close( ) unmaps the open file descriptor fd, and disassociates the process from the file. The given file descriptor is then no longer valid, and the kernel is free to reuse it as the return value to a subsequent open( ) or creat( ) call. A call to close( ) returns 0 on success. On error, it returns −1, and sets errno appropriately. Usage is simple:
if (close (fd) == -1)
        perror ("close");
Note that closing a file has no bearing on when the file is flushed to disk. If an application wants to ensure that the file is committed to disk before closing it, it needs to make use of one of the synchronization options discussed earlier in "Synchronized I/O."
Closing a file does have some side effects, though. When the last open file descriptor referring to a file is closed, the data structure representing the file inside the kernel is freed. When this data structure is freed, it unpins the in-memory copy of the inode associated with the file. If nothing else is pinning the inode, it too may be freed from memory (it may stick around because the kernel caches inodes for performance reasons, but it need not). If a file has been unlinked from the disk, but was kept open before it was unlinked, it is not physically removed until it is closed and its inode is removed from memory. Therefore, calling close( ) may also result in an unlinked file finally being physically removed from the disk.
It is a common mistake to not check the return value of close( ). This can result in missing a crucial error condition because errors associated with deferred operations may not manifest until later, and close( ) can report them.
There are a handful of possible errno values on failure. Other than EBADF (the given file descriptor was invalid), the most important error value is EIO, indicating a low-level I/O error probably unrelated to the actual close. Regardless of any reported error, the file descriptor, if valid, is always closed, and the associated data structures are freed.
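A small wrapper can make it harder to forget the check. The sketch below reports the failure and hands the errno value back to the caller. Because the descriptor is closed regardless of any reported error, the wrapper deliberately does not retry. The name close_checked( ) is made up for this example:

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Close 'fd'. Returns 0 on success, or the errno value (for example,
 * EBADF or EIO) on failure after printing a diagnostic. The close is
 * never retried.
 */
int close_checked(int fd)
{
        if (close(fd) == -1) {
                int err = errno;        /* perror() may overwrite errno */

                perror("close");
                return err;
        }

        return 0;
}
```

Closing the same descriptor a second time is a simple way to see EBADF: after the first close( ) the descriptor is no longer valid.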
Seeking with lseek( )
Normally, I/O occurs linearly through a file, and the implicit updates to the file position caused by reads and writes are all the seeking that is needed. Some applications, however, need to jump around in the file. The lseek( ) system call is provided to set the file position of a file descriptor to a given value. Other than updating the file position, it performs no other action, and initiates no I/O whatsoever:
#include <sys/types.h>
#include <unistd.h>

off_t lseek (int fd, off_t pos, int origin);
The behavior of lseek( ) depends on the origin argument, which can be one of the following:
SEEK_CUR
The current file position of fd is set to its current value plus pos, which can be negative, zero, or positive. A pos of zero returns the current file position value.
SEEK_END
The current file position of fd is set to the current length of the file plus pos, which can be negative, zero, or positive. A pos of zero sets the offset to the end of the file.
SEEK_SET
The current file position of fd is set to pos. A pos of zero sets the offset to the beginning of the file.
The call returns the new file position on success. On error, it returns −1 and errno is set as appropriate.
For example, to set the file position of fd to 1825:
off_t ret;

ret = lseek (fd, (off_t) 1825, SEEK_SET);
if (ret == (off_t) -1)
        /* error */
Alternatively, to set the file position of fd to the end of the file:
off_t ret;

ret = lseek (fd, 0, SEEK_END);
if (ret == (off_t) -1)
        /* error */
As lseek( ) returns the updated file position, it can be used to find the current file position via a SEEK_CUR seek with a pos of zero:
off_t pos;

pos = lseek (fd, 0, SEEK_CUR);
if (pos == (off_t) -1)
        /* error */
else
        /* 'pos' is the current position of fd */
By far, the most common uses of lseek( ) are seeking to the beginning, seeking to the end, or determining the current file position of a file descriptor.
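These three uses combine naturally. As an illustrative sketch, the helper below finds the length of an open file by remembering the current position, seeking to the end, and then restoring the position. The names file_size( ) and size_of_path( ) are hypothetical, and the approach assumes no other thread moves the file position in between:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the length of the file referenced by 'fd', leaving the file
 * position where it was. Returns (off_t) -1 on error, for example if
 * 'fd' is not seekable.
 */
off_t file_size(int fd)
{
        off_t saved, end;

        saved = lseek(fd, 0, SEEK_CUR);         /* current position */
        if (saved == (off_t) -1)
                return (off_t) -1;

        end = lseek(fd, 0, SEEK_END);           /* new position equals the length */
        if (end == (off_t) -1)
                return (off_t) -1;

        if (lseek(fd, saved, SEEK_SET) == (off_t) -1)   /* restore */
                return (off_t) -1;

        return end;
}

/* Convenience wrapper: open 'path' read-only, measure it, close it. */
off_t size_of_path(const char *path)
{
        off_t size;
        int fd;

        fd = open(path, O_RDONLY);
        if (fd == -1)
                return (off_t) -1;

        size = file_size(fd);
        close(fd);
        return size;
}
```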
It is possible to instruct lseek( ) to advance the file pointer past the end of a file. For example, this code seeks to 1,688 bytes beyond the end of the file mapped by fd:
off_t ret;

ret = lseek (fd, (off_t) 1688, SEEK_END);
if (ret == (off_t) -1)
        /* error */
Positional Reads and Writes
In lieu of using lseek( ), Linux provides two variants of the read( ) and write( ) system calls that each take as a parameter the file position from which to read or write. Upon completion, they do not update the file position.
The read form is called pread( ):
#define _XOPEN_SOURCE 500

#include <unistd.h>

ssize_t pread (int fd, void *buf, size_t count, off_t pos);
This call reads up to count bytes into buf from the file descriptor fd at file position pos.
The write form is called pwrite( ):
#define _XOPEN_SOURCE 500

#include <unistd.h>

ssize_t pwrite (int fd, const void *buf, size_t count, off_t pos);
This call writes up to count bytes from buf to the file descriptor fd at file position pos.
These calls are almost identical in behavior to their non-p brethren, except that they completely ignore the current file position; instead of using the current position, they use the value provided by pos. Also, when done, they do not update the file position. In other words, any intermixed read( ) and write( ) calls could potentially corrupt the work done by the positional calls.
Both positional calls can be used only on seekable file descriptors. They provide semantics similar to preceding a read( ) or write( ) call with a call to lseek( ), with three differences. First, these calls are easier to use, especially when doing a tricky operation such as moving through a file backward or randomly. Second, they do not update the file pointer upon completion. Finally, and most importantly, they avoid any potential races that might occur when using lseek( ). As threads share file descriptors, it would be possible for a different thread in the same program to update the file position after the first thread's call to lseek( ), but before its read or write operation executed. Such race conditions can be avoided by using the pread( ) and pwrite( ) system calls.
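As an illustration of these semantics, here is a minimal sketch that treats a file as an array of fixed-size records. The helper names read_record( ) and write_record( ) are assumptions made for this example. Because the offset is passed with each call, no lseek( ) is needed and the descriptor's own file position is left alone.

```c
#define _XOPEN_SOURCE 500

#include <assert.h>
#include <fcntl.h>      /* open() flags, for callers of these helpers */
#include <sys/types.h>
#include <unistd.h>

/*
 * read_record - hypothetical helper: reads record number n, of len
 * bytes, from fd into buf without touching fd's file position.
 * Returns the number of bytes read (possibly fewer than len),
 * 0 at EOF, or -1 on error.
 */
ssize_t read_record (int fd, void *buf, size_t len, unsigned int n)
{
        assert (buf != NULL);
        return pread (fd, buf, len, (off_t) n * (off_t) len);
}

/*
 * write_record - hypothetical helper: writes record number n, of len
 * bytes, from buf to fd without touching fd's file position.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t write_record (int fd, const void *buf, size_t len, unsigned int n)
{
        assert (buf != NULL);
        return pwrite (fd, buf, len, (off_t) n * (off_t) len);
}
```

Two threads can call these helpers on the same descriptor at the same time without interfering with each other's offsets, which is not true of an lseek( ) followed by read( ).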
On success, both calls return the number of bytes read or written. A return value of 0 from pread( ) indicates EOF; from pwrite( ), a return value of 0 indicates that the call did not write anything. On error, both calls return −1 and set errno appropriately.
Truncating Files
Linux provides two system calls for truncating the length of a file, both of which are defined and required (to varying degrees) by various POSIX standards. They are:
#include <unistd.h>
#include <sys/types.h>

int ftruncate (int fd, off_t len);
and:
#include <unistd.h>
#include <sys/types.h>

int truncate (const char *path, off_t len);
Both system calls truncate the given file to the length given by len. The ftruncate( ) system call operates on the file descriptor given by fd, which must be open for writing. The truncate( ) system call operates on the filename given by path, which must be writable. Both return 0 on success. On error, they return −1, and set errno as appropriate.
The most common use of these system calls is to truncate a file to a size smaller than its current length. Upon successful return, the file's length is len. The data previously existing between len and the old length is discarded, and no longer accessible via a read request.
The functions can also be used to "truncate" a file to a larger size, similar to the seek plus write combination described earlier in this chapter. The extended bytes are filled with zeros.
Neither operation updates the current file position.
For example, consider the file pirate.txt of length 74 bytes with the following contents:
Edward Teach was a notorious English pirate.
He was nicknamed Blackbeard.
From the same directory, running the following program:
#include <unistd.h>
#include <stdio.h>

int main (void)
{
        int ret;

        ret = truncate ("./pirate.txt", 45);
        if (ret == -1) {
                perror ("truncate");
                return -1;
        }

        return 0;
}
results in a file of length 45 bytes with the contents:
Edward Teach was a notorious English pirate.
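The descriptor-based call works the same way. The sketch below is an illustration under stated assumptions rather than code from the book: set_length( ) is a hypothetical helper that opens a file for writing, applies ftruncate( ), and reports the resulting size from fstat( ). It can be used both to shrink a file and to grow one.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * set_length - hypothetical helper: opens path for writing, sets its
 * length to len with ftruncate(), and returns the resulting size as
 * reported by fstat(), or (off_t) -1 on error.
 */
off_t set_length (const char *path, off_t len)
{
        struct stat sb;
        off_t size = (off_t) -1;
        int fd;

        assert (path != NULL);

        /* ftruncate() requires a descriptor that is open for writing. */
        fd = open (path, O_WRONLY);
        if (fd == -1)
                return (off_t) -1;

        if (ftruncate (fd, len) == 0 && fstat (fd, &sb) == 0)
                size = sb.st_size;

        close (fd);
        return size;
}
```

Calling set_length( ) with a length greater than the current size extends the file, and the new bytes read back as zeros. A file position held by any other open descriptor on the same file is not changed.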
Multiplexed I/O
Applications often need to block on more than one file descriptor, juggling I/O between keyboard input (stdin), interprocess communication, and a handful of files. Modern event-driven graphical user interface (GUI) applications may contend with literally hundreds of pending events via their main loops.
Without the aid of threads—essentially servicing each file descriptor separately—a single process cannot reasonably block on more than one file descriptor at the same time. Working with multiple file descriptors is fine, so long as they are always ready to be read from or written to. But as soon as one file descriptor that is not yet ready is encountered—say, if a read( ) system call is issued, and there is not yet any data—the process will block, no longer able to service the other file descriptors. It might block for just a few seconds, making the application inefficient and annoying the user. However, if no data becomes available on the file descriptor, it could block forever. Because file descriptors' I/O is often interrelated—think pipes—it is quite possible for one file descriptor not to become ready until another is serviced. Particularly with network applications, which may have many sockets open simultaneously, this is potentially quite a problem.
Imagine blocking on a file descriptor related to interprocess communication while stdin has data pending. The application won't know that keyboard input is pending until the blocked IPC file descriptor ultimately returns data—but what if the blocked operation never returns?
Earlier in this chapter, we looked at nonblocking I/O as a solution to this problem. With nonblocking I/O, applications can issue I/O requests that return a special error condition instead of blocking. However, this solution is inefficient, for two reasons. First, the process needs to continually issue I/O operations in some arbitrary order, waiting for one of its open file descriptors to be ready for I/O. This is poor program design. Second, it would be much more efficient if the program could sleep, freeing the processor for other tasks, to be woken up only when one or more file descriptors were ready to perform I/O.
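To make the idea of sleeping until a descriptor is ready concrete, here is a minimal sketch built on poll( ), one of the standard multiplexing interfaces. The helper name wait_readable( ) and its two-descriptor shape are assumptions chosen for brevity; a real program would usually watch a whole array of descriptors.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * wait_readable - hypothetical helper: sleeps until fd_a or fd_b has
 * data to read, or until timeout_ms milliseconds pass (a negative
 * timeout waits indefinitely). Returns the ready descriptor, or -1 on
 * timeout or error. If both are ready, fd_a is reported first.
 */
int wait_readable (int fd_a, int fd_b, int timeout_ms)
{
        struct pollfd fds[2];
        int ret;

        assert (fd_a >= 0 && fd_b >= 0);

        fds[0].fd = fd_a;
        fds[0].events = POLLIN;
        fds[0].revents = 0;
        fds[1].fd = fd_b;
        fds[1].events = POLLIN;
        fds[1].revents = 0;

        /* The process sleeps here; no CPU time is spent retrying reads. */
        ret = poll (fds, 2, timeout_ms);
        if (ret <= 0)
                return -1;      /* 0 is a timeout; -1 is an error */

        if (fds[0].revents & POLLIN)
                return fd_a;
        if (fds[1].revents & POLLIN)
                return fd_b;

        return -1;              /* only POLLHUP, POLLERR, or similar */
}
```

A caller would then issue read( ) only on the descriptor that was returned, knowing that the call will not block.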
Kernel Internals
This section looks at how the Linux kernel implements I/O, focusing on three primary subsystems of the kernel: the virtual filesystem (VFS), the page cache, and page writeback. Together, these subsystems help make I/O seamless, efficient, and optimal.
In a later chapter, we will look at a fourth subsystem, the I/O scheduler.
The virtual filesystem, occasionally also called a virtual file switch, is a mechanism of abstraction that allows the Linux kernel to call filesystem functions and manipulate filesystem data without knowing—or even caring about—the specific type of filesystem being used.
The VFS accomplishes this abstraction by providing a common file model, which is the basis for all filesystems in Linux. Via function pointers and various object-oriented practices, the common file model provides a framework to which filesystems in the Linux kernel must adhere. This allows the VFS to generically make requests of the filesystem. The framework provides hooks to support reading, creating links, synchronizing, and so on. Each filesystem then registers functions to handle the operations of which it is capable.
This approach forces a certain amount of commonality between filesystems. For example, the VFS talks in terms of inodes, superblocks, and directory entries. A filesystem not of Unix origins, possibly devoid of Unix-like concepts such as inodes, simply has to cope. Indeed, cope they do: Linux supports filesystems such as FAT and NTFS without issue.
The benefits of the VFS are enormous. A single system call can read from any filesystem on any medium; a single utility can copy from any one filesystem to any other. All filesystems support the same concepts, the same interfaces, and the same calls. Everything just works—and works well.
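The function-pointer technique behind the common file model can be sketched in ordinary user-space C. The following is a toy model only: every name in it (toy_file, toy_file_ops, zerofs, hellofs, toy_read) is invented for illustration and none of it is the kernel's actual code or layout. Each "filesystem" registers its own read function, and the generic layer dispatches through the pointer without knowing which one it is calling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: these names are invented and are NOT the kernel's. */
struct toy_file;

struct toy_file_ops {
        long (*read) (struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
        const struct toy_file_ops *ops; /* supplied by the "filesystem" */
        size_t pos;                     /* current position in the file */
};

/* "Filesystem" A: every byte reads as zero, and the file never ends. */
static long zerofs_read (struct toy_file *f, char *buf, size_t len)
{
        memset (buf, 0, len);
        f->pos += len;
        return (long) len;
}

/* "Filesystem" B: a single file with fixed contents. */
static const char hellofs_data[] = "hello";

static long hellofs_read (struct toy_file *f, char *buf, size_t len)
{
        size_t size = sizeof (hellofs_data) - 1;
        size_t left = f->pos < size ? size - f->pos : 0;

        if (len > left)
                len = left;
        memcpy (buf, hellofs_data + f->pos, len);
        f->pos += len;
        return (long) len;
}

const struct toy_file_ops zerofs_ops = { .read = zerofs_read };
const struct toy_file_ops hellofs_ops = { .read = hellofs_read };

/* Generic layer: dispatches without knowing what lies behind f. */
long toy_read (struct toy_file *f, char *buf, size_t len)
{
        assert (f != NULL && f->ops != NULL);

        if (!f->ops->read)
                return -1;      /* this "filesystem" cannot read */
        return f->ops->read (f, buf, len);
}
```

The caller of toy_read( ) uses one interface for both file types, which is the property the text attributes to the VFS.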
When an application issues a read( ) system call, it takes an interesting journey. The C library provides definitions of the system call that are converted to the appropriate trap statements at compile-time. Once a user-space process is trapped into the kernel, passed through the system call handler, and handed to the read( ) system call, the kernel figures out what object backs the given file descriptor.
Conclusion
This chapter discussed the basics of Linux system programming: file I/O. On a system such as Linux, which strives to represent as much as possible as a file, it's very important to know how to open, read, write, and close files. All of these operations are classic Unix, and are represented in many standards.
The next chapter tackles buffered I/O, and the standard C library's standard I/O interfaces. The standard C library is not just a convenience; buffering I/O in user space provides crucial performance improvements.
Chapter 3: Buffered I/O
Recall that the block, a filesystem abstraction, is the lingua franca of I/O—all disk operations occur in terms of blocks. Consequently, I/O performance is optimal when requests are issued on block-aligned boundaries in integer multiples of the block size.
Performance degradation is exacerbated by the increased number of system calls required to read, say, a single byte 1,024 times rather than 1,024 bytes all at once. Even a series of operations performed in a size larger than a block can be suboptimal if the size is not an integer multiple of the block size. For example, if the block size is one kilobyte, operations in chunks of 1,130 bytes may still be slower than 1,024 byte operations.
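The arithmetic behind the 1,130 byte example can be checked with a short calculation. The function below is a simplified model, not a measurement: count_straddling( ) is a hypothetical name, and the model ignores caching and everything else the kernel does. It only counts how many consecutive requests of a given size touch more than one block.

```c
#include <assert.h>
#include <stddef.h>

/*
 * count_straddling - hypothetical model: splits total bytes into
 * consecutive requests of chunk bytes, starting at offset 0, and
 * counts the requests that touch more than one block.
 */
size_t count_straddling (size_t total, size_t chunk, size_t block)
{
        size_t off, n = 0;

        assert (chunk > 0 && block > 0);

        for (off = 0; off < total; off += chunk) {
                /* The final request may be shorter than chunk. */
                size_t len = total - off < chunk ? total - off : chunk;
                size_t first = off / block;
                size_t last = (off + len - 1) / block;

                if (first != last)
                        n++;
        }

        return n;
}
```

With a block size of 1,024 bytes, copying 4,096 bytes in 1,024 byte requests gives a count of zero, while copying the same data in 1,130 byte requests gives a count of three out of four requests.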
User-Buffered I/O
Programs that have to issue many small I/O requests to regular files often perform user-buffered I/O. This refers to buffering done in user space, either manually by the application, or transparently in a library, not to buffering done by the kernel. As discussed in the previous chapter, for reasons of performance, the kernel buffers data internally by delaying writes, coalescing adjacent I/O requests, and reading ahead. Through different means, user buffering also aims to improve performance.
Consider an example using the user-space program dd:
dd bs=1 count=2097152 if=/dev/zero of=pirate
Because of the bs=1 argument, this command will copy two megabytes from the device /dev/zero (a virtual device providing an endless stream of zeros) to the file pirate in 2,097,152 one byte chunks. That is, it will copy the data via about two million read and write operations—one byte at a time.
Now consider the same two megabyte copy, but using 1,024 byte blocks:
dd bs=1024 count=2048 if=/dev/zero of=pirate
This operation copies the same two megabytes to the same file, yet issues 1,024 times fewer read and write operations. The performance improvement is huge, as you can see in the table that follows. Here, I've recorded the time taken (using three different measures) by four dd commands that differed only in block size. Real time is the total elapsed wall clock time, user time is the time spent executing the program's code in user space, and system time is the time spent executing system calls in kernel space on the process's behalf.
Table: Effects of block size on performance
Block size     Real time        User time       System time
1 byte         18.707 seconds   1.118 seconds   17.549 seconds
1,024 bytes    0.025 seconds    0.002 seconds   0.023 seconds
1,130 bytes    0.035 seconds    0.002 seconds   0.027 seconds
Using 1,024 byte chunks results in an enormous performance improvement compared to the single byte chunk. However, the table also demonstrates that using a larger block size—which implies even fewer system calls—can result in performance degradation if the operations are not performed in multiples of the disk's block size. Despite requiring fewer calls, the 1,130 byte requests end up generating unaligned requests, and are therefore less efficient than the 1,024 byte requests.
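Buffering done manually by the application can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not how the standard I/O library is implemented: the ubuf structure, its functions, and the 4,096 byte buffer size are all invented for this example. Small writes accumulate in user space, and write( ) is issued only when the buffer fills or is flushed explicitly.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define UBUF_SIZE 4096  /* arbitrary; ideally a multiple of the block size */

/* Hypothetical user-space write buffer. */
struct ubuf {
        int fd;                 /* descriptor the data is destined for */
        size_t used;            /* bytes currently held in data[] */
        char data[UBUF_SIZE];
};

void ubuf_init (struct ubuf *b, int fd)
{
        b->fd = fd;
        b->used = 0;
}

/*
 * Writes out everything buffered so far. Returns 0 on success or -1 on
 * error. This sketch treats any failed write(), including EINTR, as
 * fatal and makes no attempt to recover.
 */
int ubuf_flush (struct ubuf *b)
{
        size_t done = 0;

        while (done < b->used) {
                ssize_t ret = write (b->fd, b->data + done, b->used - done);

                if (ret == -1)
                        return -1;
                done += (size_t) ret;
        }

        b->used = 0;
        return 0;
}

/* Buffers len bytes, issuing write() only when the buffer fills. */
int ubuf_write (struct ubuf *b, const void *src, size_t len)
{
        const char *p = src;

        /* Invariant: the buffer is never left full between calls. */
        assert (b->used < UBUF_SIZE);

        while (len > 0) {
                size_t room = UBUF_SIZE - b->used;
                size_t n = len < room ? len : room;

                memcpy (b->data + b->used, p, n);
                b->used += n;
                p += n;
                len -= n;

                if (b->used == UBUF_SIZE && ubuf_flush (b) == -1)
                        return -1;
        }

        return 0;
}
```

With this buffer, writing one byte 4,096 times costs a single write( ) system call instead of 4,096. The caller must remember to call ubuf_flush( ) before closing the descriptor, or the final partial buffer is lost.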
Standard I/O
The standard C library provides the standard I/O library (often simply called stdio), which in turn provides a platform-independent, user-buffering solution. The standard I/O library is simple to use, yet powerful.
Unlike programming languages such as FORTRAN, the C language does not include any built-in support or keywords providing any functionality more advanced than flow control, arithmetic, and so on—there's certainly no inherent support for I/O. As the C programming language progressed, users developed standard sets of routines to provide core functionality, such as string manipulation, mathematical routines, time and date functionality, and I/O. Over time, these routines matured, and with the ratification of the ANSI C standard in 1989 (C89) they were eventually formalized as the standard C library. Although both C95 and C99 added several new interfaces, the standard I/O library has remained relatively untouched since its creation in 1989.
The remainder of this chapter discusses user-buffered I/O as it pertains to file I/O, and is implemented in the standard C library—that is, opening, closing, reading, and writing files via the standard C library. Whether an application will use standard I/O, a home-rolled user-buffering solution, or straight system calls is a decision that developers must make carefully after weighing the application's needs and behavior.
The C standards always leave some details up to each implementation, and implementations often add additional features. This chapter, just like the rest of the book, documents the interfaces and behavior as they are implemented in glibc on a modern Linux system. Where Linux deviates from the basic standard, this is noted.
Standard I/O routines do not operate directly on file descriptors. Instead, they use their own unique identifier, known as the file pointer. Inside the C library, the file pointer maps to a file descriptor. The file pointer is represented by a pointer to the FILE typedef, which is defined in <stdio.h>.
In standard I/O parlance, an open file is called a stream. Streams may be opened for reading (input streams), writing (output streams), or both (input/output streams).
Opening Files
Files are opened for reading or writing via fopen( ):
#include <stdio.h>

FILE * fopen (const char *path, const char *mode);
This function opens the file path according to the given mode, and associates a new stream with it.
The mode argument describes how to open the given file. It is one of the following strings:
r
Open the file for reading. The stream is positioned at the start of the file.
r+
Open the file for both reading and writing. The stream is positioned at the start of the file.
w
Open the file for writing. If the file exists, it is truncated to zero length. If the file does not exist, it is created. The stream is positioned at the start of the file.
w+
Open the file for both reading and writing. If the file exists, it is truncated to zero length. If the file does not exist, it is created. The stream is positioned at the start of the file.
a
Open the file for writing in append mode. The file is created if it does not exist. The stream is positioned at the end of the file. All writes will append to the file.
a+
Open the file for both reading and writing in append mode. The file is created if it does not exist. The stream is positioned at the end of the file. All writes will append to the file.
The given mode may also contain the character b, although this value is always ignored on Linux. Some operating systems treat text and binary files differently, and the b mode instructs the file to be opened in binary mode. Linux, as with all POSIX-conforming systems, treats text and binary files identically.
Upon success, fopen( ) returns a valid FILE pointer. On failure, it returns NULL, and sets errno appropriately.
For example, the following code opens /etc/manifest for reading, and associates it with stream:
FILE *stream;

stream = fopen ("/etc/manifest", "r");
if (!stream)
        /* error */
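The difference between the w and a modes is easy to see by writing to the same file twice. This is a minimal sketch with hypothetical helper names, put_line( ) and file_length( ). The second helper uses fseek( ) and ftell( ) from the same library to report the file's length.

```c
#include <assert.h>
#include <stdio.h>

/*
 * put_line - hypothetical helper: opens path with the given fopen()
 * mode, writes text followed by a newline, and closes the stream.
 * Returns 0 on success or -1 on failure.
 */
int put_line (const char *path, const char *mode, const char *text)
{
        FILE *stream;

        assert (path != NULL && mode != NULL && text != NULL);

        stream = fopen (path, mode);
        if (!stream)
                return -1;

        if (fputs (text, stream) == EOF || fputc ('\n', stream) == EOF) {
                fclose (stream);
                return -1;
        }

        /* Buffered data is written out here, so check the result. */
        if (fclose (stream) == EOF)
                return -1;

        return 0;
}

/* file_length - hypothetical helper: length of path in bytes, or -1. */
long file_length (const char *path)
{
        FILE *stream;
        long len;

        stream = fopen (path, "r");
        if (!stream)
                return -1;

        if (fseek (stream, 0, SEEK_END) == -1) {
                fclose (stream);
                return -1;
        }

        len = ftell (stream);
        fclose (stream);
        return len;
}
```

Calling put_line( ) with mode w replaces whatever the file held, while mode a leaves the existing contents in place and adds the new line after them.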
Opening a Stream via File Descriptor
The function fdopen( ) converts an already open file descriptor (fd) to a stream:
#include <stdio.h>

FILE * fdopen (int fd, const char *mode);
The possible modes are the same as for fopen( ), and must be compatible with the modes originally used to open the file descriptor. The modes w and w+ may be specified, but they will not cause truncation. The stream is positioned at the file position associated with the file descriptor.
Once a file descriptor is converted to a stream, I/O should no longer be directly performed on the file descriptor. It is, however, legal to do so. Note that the file descriptor is not duplicated, but is merely associated with a new stream. Closing the stream will close the file descriptor as well.
On success, fdopen( ) returns a valid file pointer; on failure, it returns NULL.
For example, the following code opens /home/kidd/map.txt via the open( ) system call, and then uses the backing file descriptor to create an associated stream:
FILE *stream;
int fd;

fd = open ("/home/kidd/map.txt", O_RDONLY);
if (fd == -1)
        /* error */

stream = fdopen (fd, "r");
if (!stream)
        /* error */
Closing Streams
The fclose( ) function closes a given stream:
#include <stdio.h>

int fclose (FILE *stream);
Any buffered and not-yet-written data is first flushed. On success, fclose( ) returns 0. On failure, it returns EOF and sets errno appropriately.
The fcloseall( ) function closes all streams associated with the current process, including standard in, standard out, and standard error:
#define _GNU_SOURCE

#include <stdio.h>

int fcloseall (void);
Before closing, all streams are flushed. The function always returns 0; it is Linux-specific.
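The flush performed by fclose( ) can be made visible by comparing the file's size on disk before and after the call. This is a minimal sketch with hypothetical helper names, size_on_disk( ) and write_then_close( ), and it assumes the stream is a regular file using the library's default buffering. The exact size seen before the close depends on the C library's buffering, so the code records it rather than assuming a value.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * size_on_disk - hypothetical helper: the file's length as the kernel
 * reports it, or -1 on error.
 */
static long size_on_disk (const char *path)
{
        struct stat sb;

        if (stat (path, &sb) == -1)
                return -1;
        return (long) sb.st_size;
}

/*
 * write_then_close - hypothetical demo: writes text to a new file
 * through a stream and records the on-disk size before and after
 * fclose(). Returns 0 on success or -1 on failure.
 */
int write_then_close (const char *path, const char *text,
                      long *before, long *after)
{
        FILE *stream;

        assert (before != NULL && after != NULL);

        stream = fopen (path, "w");
        if (!stream)
                return -1;

        if (fputs (text, stream) == EOF) {
                fclose (stream);
                return -1;
        }

        /* The text may still sit in the stream's user-space buffer. */
        *before = size_on_disk (path);

        /* fclose() flushes the buffer; a failed flush returns EOF. */
        if (fclose (stream) == EOF)
                return -1;

        *after = size_on_disk (path);
        return 0;
}
```

For a short string, the size recorded before the close is typically smaller than the size recorded after it, because the data has not yet been handed to the kernel. This is also why the return value of fclose( ) is worth checking: a write error may surface only at that point.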
Reading from a Stream