BUY THIS BOOK

Safari Books Online

What is this?

Looking to Reprint this content?


Understanding the Linux Kernel
Understanding the Linux Kernel, Second Edition

By Daniel P. Bovet, Marco Cesati

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Introduction
Linux is a member of the large family of Unix-like operating systems. A relative newcomer experiencing sudden spectacular popularity starting in the late 1990s, Linux joins such well-known commercial Unix operating systems as System V Release 4 (SVR4), developed by AT&T (now owned by the SCO Group); the 4.4 BSD release from the University of California at Berkeley (4.4BSD); Digital Unix from Digital Equipment Corporation (now Hewlett-Packard); AIX from IBM; HP-UX from Hewlett-Packard; Solaris from Sun Microsystems; and Mac OS X from Apple Computer, Inc.
Linux was initially developed by Linus Torvalds in 1991 as an operating system for IBM-compatible personal computers based on the Intel 80386 microprocessor. Linus remains deeply involved with improving Linux, keeping it up to date with various hardware developments and coordinating the activity of hundreds of Linux developers around the world. Over the years, developers have worked to make Linux available on other architectures, including Hewlett-Packard's Alpha, Itanium (the recent Intel's 64-bit processor), MIPS, SPARC, Motorola MC680x0, PowerPC, and IBM's zSeries.
One of the more appealing benefits to Linux is that it isn't a commercial operating system: its source code under the GNU Public License is open and available to anyone to study (as we will in this book); if you download the code (the official site is http://www.kernel.org) or check the sources on a Linux CD, you will be able to explore, from top to bottom, one of the most successful, modern operating systems. This book, in fact, assumes you have the source code on hand and can apply what we say to your own explorations.
Technically speaking, Linux is a true Unix kernel, although it is not a full Unix operating system because it does not include all the Unix applications, such as filesystem utilities, windowing systems and graphical desktops, system administrator commands, text editors, compilers, and so on. However, since most of these programs are freely available under the GNU General Public License, they can be installed onto one of the filesystems supported by Linux.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Linux Versus Other Unix-Like Kernels
The various Unix-like systems on the market, some of which have a long history and show signs of archaic practices, differ in many important respects. All commercial variants were derived from either SVR4 or 4.4BSD, and all tend to agree on some common standards like IEEE's Portable Operating Systems based on Unix (POSIX) and X/Open's Common Applications Environment (CAE).
The current standards specify only an application programming interface (API)—that is, a well-defined environment in which user programs should run. Therefore, the standards do not impose any restriction on internal design choices of a compliant kernel.
To define a common user interface, Unix-like kernels often share fundamental design ideas and features. In this respect, Linux is comparable with the other Unix-like operating systems. Reading this book and studying the Linux kernel, therefore, may help you understand the other Unix variants too.
The 2.4 version of the Linux kernel aims to be compliant with the IEEE POSIX standard. This, of course, means that most existing Unix programs can be compiled and executed on a Linux system with very little effort or even without the need for patches to the source code. Moreover, Linux includes all the features of a modern Unix operating system, such as virtual memory, a virtual filesystem, lightweight processes, reliable signals, SVR4 interprocess communications, support for Symmetric Multiprocessor (SMP) systems, and so on.
By itself, the Linux kernel is not very innovative. When Linus Torvalds wrote the first kernel, he referred to some classical books on Unix internals, like Maurice Bach's The Design of the Unix Operating System (Prentice Hall, 1986). Actually, Linux still has some bias toward the Unix baseline described in Bach's book (i.e., SVR2). However, Linux doesn't stick to any particular variant. Instead, it tries to adopt the best features and design choices of several different Unix kernels.
The following list describes how Linux competes against some well-known commercial Unix kernels:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Hardware Dependency
Linux tries to maintain a neat distinction between hardware-dependent and hardware-independent source code. To that end, both the arch and the include directories include nine subdirectories that correspond to the nine hardware platforms supported. The standard names of the platforms are:
alpha
Hewlett-Packard's Alpha workstations
arm
ARM processor-based computers and embedded devices
cris
"Code Reduced Instruction Set" CPUs used by Axis in its thin-servers, such as web cameras or development boards
i386
IBM-compatible personal computers based on 80 × 86 microprocessors
ia64
Workstations based on Intel 64-bit Itanium microprocessor
m68k
Personal computers based on Motorola MC680 × 0 microprocessors
mips
Workstations based on MIPS microprocessors
mips64
Workstations based on 64-bit MIPS microprocessors
parisc
Workstations based on Hewlett Packard HP 9000 PA-RISC microprocessors
ppc
Workstations based on Motorola-IBM PowerPC microprocessors
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Linux Versions
Linux distinguishes stable kernels from development kernels through a simple numbering scheme. Each version is characterized by three numbers, separated by periods. The first two numbers are used to identify the version; the third number identifies the release.
As shown in Figure 1-1, if the second number is even, it denotes a stable kernel; otherwise, it denotes a development kernel. At the time of this writing, the current stable version of the Linux kernel is 2.4.18, and the current development version is 2.5.22. The 2.4 kernel — which is the basis for this book — was first released in January 2001 and differs considerably from the 2.2 kernel, particularly with respect to memory management. Work on the 2.5 development version started in November 2001.
Figure 1-1: Numbering Linux versions
New releases of a stable version come out mostly to fix bugs reported by users. The main algorithms and data structures used to implement the kernel are left unchanged.
Development versions, on the other hand, may differ quite significantly from one another; kernel developers are free to experiment with different solutions that occasionally lead to drastic kernel changes. Users who rely on development versions for running applications may experience unpleasant surprises when upgrading their kernel to a newer release. This book concentrates on the most recent stable kernel that we had available because, among all the new features being tried in experimental kernels, there's no way of telling which will ultimately be accepted and what they'll look like in their final form.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Basic Operating System Concepts
Each computer system includes a basic set of programs called the operating system . The most important program in the set is called the kernel . It is loaded into RAM when the system boots and contains many critical procedures that are needed for the system to operate. The other programs are less crucial utilities; they can provide a wide variety of interactive experiences for the user—as well as doing all the jobs the user bought the computer for—but the essential shape and capabilities of the system are determined by the kernel. The kernel provides key facilities to everything else on the system and determines many of the characteristics of higher software. Hence, we often use the term "operating system" as a synonym for "kernel."
The operating system must fulfill two main objectives:
  • Interact with the hardware components, servicing all low-level programmable elements included in the hardware platform.
  • Provide an execution environment to the applications that run on the computer system (the so-called user programs).
Some operating systems allow all user programs to directly play with the hardware components (a typical example is MS-DOS). In contrast, a Unix-like operating system hides all low-level details concerning the physical organization of the computer from applications run by the user. When a program wants to use a hardware resource, it must issue a request to the operating system. The kernel evaluates the request and, if it chooses to grant the resource, interacts with the relative hardware components on behalf of the user program.
To enforce this mechanism, modern operating systems rely on the availability of specific hardware features that forbid user programs to directly interact with low-level hardware components or to access arbitrary memory locations. In particular, the hardware introduces at least two different execution modes for the CPU: a nonprivileged mode for user programs and a privileged mode for the kernel. Unix calls these
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
An Overview of the Unix Filesystem
The Unix operating system design is centered on its filesystem, which has several interesting characteristics. We'll review the most significant ones, since they will be mentioned quite often in forthcoming chapters.
A Unix file is an information container structured as a sequence of bytes; the kernel does not interpret the contents of a file. Many programming libraries implement higher-level abstractions, such as records structured into fields and record addressing based on keys. However, the programs in these libraries must rely on system calls offered by the kernel. From the user's point of view, files are organized in a tree-structured namespace, as shown in Figure 1-2.
Figure 1-2: An example of a directory tree
All the nodes of the tree, except the leaves, denote directory names. A directory node contains information about the files and directories just beneath it. A file or directory name consists of a sequence of arbitrary ASCII characters, with the exception of / and of the null character \0. Most filesystems place a limit on the length of a filename, typically no more than 255 characters. The directory corresponding to the root of the tree is called the root directory . By convention, its name is a slash (/). Names must be different within the same directory, but the same name may be used in different directories.
Unix associates a current working directory with each process (see Section 1.6.1 later in this chapter); it belongs to the process execution context, and it identifies the directory currently used by the process. To identify a specific file, the process uses a pathname , which consists of slashes alternating with a sequence of directory names that lead to the file. If the first item in the pathname is a slash, the pathname is said to be
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
An Overview of Unix Kernels
Unix kernels provide an execution environment in which applications may run. Therefore, the kernel must implement a set of services and corresponding interfaces. Applications use those interfaces and do not usually interact directly with hardware resources.
As already mentioned, a CPU can run in either User Mode or Kernel Mode. Actually, some CPUs can have more than two execution states. For instance, the 80 × 86 microprocessors have four different execution states. But all standard Unix kernels use only Kernel Mode and User Mode.
When a program is executed in User Mode, it cannot directly access the kernel data structures or the kernel programs. When an application executes in Kernel Mode, however, these restrictions no longer apply. Each CPU model provides special instructions to switch from User Mode to Kernel Mode and vice versa. A program usually executes in User Mode and switches to Kernel Mode only when requesting a service provided by the kernel. When the kernel has satisfied the program's request, it puts the program back in User Mode.
Processes are dynamic entities that usually have a limited life span within the system. The task of creating, eliminating, and synchronizing the existing processes is delegated to a group of routines in the kernel.
The kernel itself is not a process but a process manager. The process/kernel model assumes that processes that require a kernel service use specific programming constructs called system calls . Each system call sets up the group of parameters that identifies the process request and then executes the hardware-dependent CPU instruction to switch from User Mode to Kernel Mode.
Besides user processes, Unix systems include a few privileged processes called kernel threads with the following characteristics:
  • They run in Kernel Mode in the kernel address space.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Memory Addressing
This chapter deals with addressing techniques. Luckily, an operating system is not forced to keep track of physical memory all by itself; today's microprocessors include several hardware circuits to make memory management both more efficient and more robust in case of programming errors.
As in the rest of this book, we offer details in this chapter on how 80 × 86 microprocessors address memory chips and how Linux uses the available addressing circuits. You will find, we hope, that when you learn the implementation details on Linux's most popular platform you will better understand both the general theory of paging and how to research the implementation on other platforms.
This is the first of three chapters related to memory management; Chapter 7 discusses how the kernel allocates main memory to itself, while Chapter 8 considers how linear addresses are assigned to processes.
Programmers casually refer to a memory address as the way to access the contents of a memory cell. But when dealing with 80 × 86 microprocessors, we have to distinguish three kinds of addresses:
Logical address
Included in the machine language instructions to specify the address of an operand or of an instruction. This type of address embodies the well-known 80 × 86 segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a segment and an offset (or displacement ) that denotes the distance from the start of the segment to the actual address.
Linear address (also known as virtual address)
A single 32-bit unsigned integer that can be used to address up to 4 GB — that is, up to 4,294,967,296 memory cells. Linear addresses are usually represented in hexadecimal notation; their values range from
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Memory Addresses
Programmers casually refer to a memory address as the way to access the contents of a memory cell. But when dealing with 80 × 86 microprocessors, we have to distinguish three kinds of addresses:
Logical address
Included in the machine language instructions to specify the address of an operand or of an instruction. This type of address embodies the well-known 80 × 86 segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a segment and an offset (or displacement ) that denotes the distance from the start of the segment to the actual address.
Linear address (also known as virtual address)
A single 32-bit unsigned integer that can be used to address up to 4 GB — that is, up to 4,294,967,296 memory cells. Linear addresses are usually represented in hexadecimal notation; their values range from 0x00000000 to 0xffffffff.
Physical address
Used to address memory cells in memory chips. They correspond to the electrical signals sent along the address pins of the microprocessor to the memory bus. Physical addresses are represented as 32-bit unsigned integers.
The CPU control unit transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit ; subsequently, a second hardware circuit called a paging unit transforms the linear address into a physical address (see Figure 2-1).
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Segmentation in Hardware
Starting with the 80386 model, Intel microprocessors perform address translation in two different ways called real mode and protected mode . These are described in the next sections. Real mode exists mostly to maintain processor compatibility with older models and to allow the operating system to bootstrap (see Appendix A for a short description of real mode).
A logical address consists of two parts: a segment identifier and an offset that specifies the relative address within the segment. The segment identifier is a 16-bit field called the Segment Selector , while the offset is a 32-bit field.
To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors; these registers are called cs, ss, ds, es, fs, and gs. Although there are only six of them, a program can reuse the same segmentation register for different purposes by saving its content in memory and then restoring it later.
Three of the six segmentation registers have specific purposes:
cs
The code segment register, which points to a segment containing program instructions
ss
The stack segment register, which points to a segment containing the current program stack
ds
The data segment register, which points to a segment containing static and external data
The remaining three segmentation registers are general purpose and may refer to arbitrary data segments.
The cs register has another important function: it includes a 2-bit field that specifies the Current Privilege Level (CPL) of the CPU. The value 0 denotes the highest privilege level, while the value 3 denotes the lowest one. Linux uses only levels 0 and 3, which are respectively called Kernel Mode and User Mode.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Segmentation in Linux
Segmentation has been included in 80 × 86 microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant since both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process, while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:
  • Memory management is simpler when all processes use the same segment register values — that is, when they share the same set of linear addresses.
  • One of the design objectives of Linux is portability to a wide range of architectures; RISC architectures in particular have limited support for segmentation.
The 2.4 version of Linux uses segmentation only when required by the 80 × 86 architecture. In particular, all processes use the same logical addresses, so the total number of segments to be defined is quite limited, and it is possible to store all Segment Descriptors in the Global Descriptor Table (GDT). This table is implemented by the array gdt_table referred to by the gdt variable.
Local Descriptor Tables are not used by the kernel, although a system call called modify_ldt( ) exists that allows processes to create their own LDTs. This turns out to be useful to applications (such as Wine) that execute segment-oriented Microsoft Windows applications.
Here are the segments used by Linux:
  • A kernel code segment. The fields of the corresponding Segment Descriptor in the GDT have the following values:
    • Base = 0x00000000
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Paging in Hardware
The paging unit translates linear addresses into physical ones. It checks the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a Page Fault exception (see Chapter 4 and Chapter 7).
For the sake of efficiency, linear addresses are grouped in fixed-length intervals called pages ; contiguous linear addresses within a page are mapped into contiguous physical addresses. In this way, the kernel can specify the physical address and the access rights of a page instead of those of all the linear addresses included in it. Following the usual convention, we shall use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses.
The paging unit thinks of all RAM as partitioned into fixed-length page frames (sometimes referred to as physical pages ). Each page frame contains a page — that is, the length of a page frame coincides with that of a page. A page frame is a constituent of main memory, and hence it is a storage area. It is important to distinguish a page from a page frame; the former is just a block of data, which may be stored in any page frame or on disk.
The data structures that map linear to physical addresses are called Page Tables ; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.
In 80 × 86 processors, paging is enabled by setting the PG flag of a control register named cr0. When PG = 0, linear addresses are interpreted as physical addresses.
Starting with the 80386, the paging unit of Intel processors handles 4 KB pages.
The 32 bits of a linear address are divided into three fields:
Directory
The most significant 10 bits
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Paging in Linux
As we explained earlier in Section 2.4.5, Linux adopted a three-level paging model so paging is feasible on 64-bit architectures. Figure 2-11 shows the model, which defines three types of paging tables.
  • Page Global Directory
  • Page Middle Directory
  • Page Table
The Page Global Directory includes the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. The linear address is thus split into four parts. Figure 2-11 does not show the bit numbers because the size of each part depends on the computer architecture.
Figure 2-11: The Linux paging model
Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:
  • Assign a different physical address space to each process, ensuring an efficient protection against addressing errors.
  • Distinguish pages (groups of data) from page frames (physical addresses in main memory). This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 16).
As we shall see in Chapter 8, each process has its own Page Global Directory and its own set of Page Tables. When a process switch occurs (see Section 3.3), Linux saves the cr3 control register in the descriptor of the process previously in execution and then loads cr3 with the value stored in the descriptor of the process to be executed next. Thus, when the new process resumes its execution on the CPU, the paging unit refers to the correct set of Page Tables.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Processes
The concept of a process is fundamental to any multiprogramming operating system. A process is usually defined as an instance of a program in execution; thus, if 16 users are running vi at once, there are 16 separate processes (although they can share the same executable code). Processes are often called tasks or threads in the Linux source code.
In this chapter, we discuss static properties of processes and then describe how process switching is performed by the kernel. The last two sections describe how processes can be created and destroyed. We also describe how Linux supports multithreaded applications — as mentioned in Chapter 1, it relies on so-called lightweight processes (LWP).
The term "process" is often used with several different meanings. In this book, we stick to the usual OS textbook definition: a process is an instance of a program in execution. You might think of it as the collection of data structures that fully describes how far the execution of the program has progressed.
Processes are like human beings: they are generated, they have a more or less significant life, they optionally generate one or more child processes, and eventually they die. A small difference is that sex is not really common among processes — each process has just one parent.
From the kernel's point of view, the purpose of a process is to act as an entity to which system resources (CPU time, memory, etc.) are allocated.
When a process is created, it is almost identical to its parent. It receives a (logical) copy of the parent's address space and executes the same code as the parent, beginning at the next instruction following the process creation system call. Although the parent and child may share the pages containing the program code (text), they have separate copies of the data (stack and heap), so that changes by the child to a memory location are invisible to the parent (and vice versa).
While earlier Unix kernels employed this simple model, modern Unix systems do not. They support
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Processes, Lightweight Processes, and Threads
The term "process" is often used with several different meanings. In this book, we stick to the usual OS textbook definition: a process is an instance of a program in execution. You might think of it as the collection of data structures that fully describes how far the execution of the program has progressed.
Processes are like human beings: they are generated, they have a more or less significant life, they optionally generate one or more child processes, and eventually they die. A small difference is that sex is not really common among processes — each process has just one parent.
From the kernel's point of view, the purpose of a process is to act as an entity to which system resources (CPU time, memory, etc.) are allocated.
When a process is created, it is almost identical to its parent. It receives a (logical) copy of the parent's address space and executes the same code as the parent, beginning at the next instruction following the process creation system call. Although the parent and child may share the pages containing the program code (text), they have separate copies of the data (stack and heap), so that changes by the child to a memory location are invisible to the parent (and vice versa).
While earlier Unix kernels employed this simple model, modern Unix systems do not. They support multithreaded applications — user programs having many relatively independent execution flows sharing a large portion of the application data structures. In such systems, a process is composed of several user threads (or simply threads), each of which represents an execution flow of the process. Nowadays, most multithreaded applications are written using standard sets of library functions called pthread (POSIX thread) libraries .
Older versions of the Linux kernel offered no support for multithreaded applications. From the kernel point of view, a multithreaded application was just a normal process. The multiple execution flows of a multithreaded application were created, handled, and scheduled entirely in User Mode, usually by means of a POSIX-compliant
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Process Descriptor
To manage processes, the kernel must have a clear picture of what each process is doing. It must know, for instance, the process's priority, whether it is running on a CPU or blocked on an event, what address space has been assigned to it, which files it is allowed to address, and so on. This is the role of the process descriptor — a task_struct type structure whose fields contain all the information related to a single process. As the repository of so much information, the process descriptor is rather complex. In addition to a large number of fields containing process attributes, the process descriptor contains several pointers to other data structures that, in turn, contain pointers to other structures. Figure 3-1 describes the Linux process descriptor schematically.
Figure 3-1: The Linux process descriptor
The five data structures on the right side of the figure refer to specific resources owned by the process. These resources are covered in future chapters. This chapter focuses on two types of fields that refer to the process state and to process parent/child relationships.
As its name implies, the state field of the process descriptor describes what is currently happening to the process. It consists of an array of flags, each of which describes a possible process state. In the current Linux version, these states are mutually exclusive, and hence exactly one flag of state is set; the remaining flags are cleared. The following are the possible process states:
TASK_RUNNING
The process is either executing on a CPU or waiting to be executed.
TASK_INTERRUPTIBLE
The process is suspended (sleeping) until some condition becomes true. Raising a hardware interrupt, releasing a system resource the process is waiting for, or delivering a signal are examples of conditions that might wake up the process (put its state back to
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Process Switch
To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch , task switch , or context switch . The next sections describe the elements of process switching in Linux.
While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process, the kernel must ensure that each such register is loaded with the value it had when the process was suspended.
The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context . The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.
In the description that follows, we will assume the prev local variable refers to the process descriptor of the process being switched out and next refers to the one being switched in to replace it. We can thus define a process switch as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next. Since process switches occur quite often, it is important to minimize the time spent in saving and loading hardware contexts.
Old versions of Linux took advantage of the hardware support offered by the Intel architecture and performed a process switch through a far jmp instruction to the selector of the Task State Segment Descriptor of the next process. While executing the instruction, the CPU performs a hardware context switch by automatically saving the old hardware context and loading a new one. But Linux 2.4 uses software to perform a process switch for the following reasons:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Creating Processes
Unix operating systems rely heavily on process creation to satisfy user requests. For example, the shell creates a new process that executes another copy of the shell whenever the user enters a command.
Traditional Unix systems treat all processes in the same way: resources owned by the parent process are duplicated in the child process. This approach makes process creation very slow and inefficient, since it requires copying the entire address space of the parent process. The child process rarely needs to read or modify all the resources inherited from the parent; in many cases, it issues an immediate execve( ) and wipes out the address space that was so carefully copied.
Modern Unix kernels solve this problem by introducing three different mechanisms:
  • The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write on a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 8.
  • Lightweight processes allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions.
  • The vfork( ) system call creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent's execution is blocked until the child exits or executes a new program. We'll learn more about the vfork( ) system call in the following section.
Lightweight processes are created in Linux by using a function named clone( ), which uses four parameters:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Destroying Processes
Most processes "die" in the sense that they terminate the execution of the code they were supposed to run. When this occurs, the kernel must be notified so that it can release the resources owned by the process; this includes memory, open files, and any other odds and ends that we will encounter in this book, such as semaphores.
The usual way for a process to terminate is to invoke the exit( ) library function, which releases the resources allocated by the C library, executes each function registered by the programmer, and ends up invoking the _exit( ) system call. The exit( ) function may be inserted by the programmer explicitly. Additionally, the C compiler always inserts an exit( ) function call right after the last statement of the main( ) function.
Alternatively, the kernel may force a process to die. This typically occurs when the process has received a signal that it cannot handle or ignore (see Chapter 10) or when an unrecoverable CPU exception has been raised in Kernel Mode while the kernel was running on behalf of the process (see Chapter 4).
All process terminations are handled by the do_exit( ) function, which removes most references to the terminating process from kernel data structures. The do_exit( ) function executes the following actions:
  1. Sets the PF_EXITING flag in the flag field of the process descriptor to indicate that the process is being eliminated.
  2. Removes, if necessary, the process descriptor from an IPC semaphore queue via the sem_exit( ) function (see Chapter 19) or from a dynamic timer queue via the del_timer_sync( ) function (see Chapter 6).
  3. Examines the process's data structures related to paging, filesystem, open file descriptors, and signal handling, respectively, with the _ _exit_mm( ), _ _exit_files( ), _ _exit_fs( ), and exit_sighand( ) functions. These functions also remove each of these data structures if no other process are sharing them.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Interrupts and Exceptions
An interrupt is usually defined as an event that alters the sequence of instructions executed by a processor. Such events correspond to electrical signals generated by hardware circuits both inside and outside the CPU chip.
Interrupts are often divided into synchronous and asynchronous interrupts:
  • Synchronous interrupts are produced by the CPU control unit while executing instructions and are called synchronous because the control unit issues them only after terminating the execution of an instruction.
  • Asynchronous interrupts are generated by other hardware devices at arbitrary times with respect to the CPU clock signals.
Intel microprocessor manuals designate synchronous and asynchronous interrupts as exceptions and interrupts, respectively. We'll adopt this classification, although we'll occasionally use the term "interrupt signal" to designate both types together (synchronous as well as asynchronous).
Interrupts are issued by interval timers and I/O devices; for instance, the arrival of a keystroke from a user sets off an interrupt. Exceptions, on the other hand, are caused either by programming errors or by anomalous conditions that must be handled by the kernel. In the first case, the kernel handles the exception by delivering to the current process one of the signals familiar to every Unix programmer. In the second case, the kernel performs all the steps needed to recover from the anomalous condition, such as a Page Fault or a request (via an int instruction) for a kernel service.
We start by describing in the next section the motivation for introducing such signals. We then show how the well-known IRQs (Interrupt ReQuests) issued by I/O devices give rise to interrupts, and we detail how 80 × 86 processors handle interrupts and exceptions at the hardware level. Then we illustrate, in Section 4.4, how Linux initializes all the data structures required by the Intel interrupt architecture. The remaining three sections describe how Linux handles interrupt signals at the software level.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Role of Interrupt Signals
As the name suggests, interrupt signals provide a way to divert the processor to code outside the normal flow of control. When an interrupt signal arrives, the CPU must stop what it's currently doing and switch to a new activity; it does this by saving the current value of the program counter (i.e., the content of the eip and cs registers) in the Kernel Mode stack and by placing an address related to the interrupt type into the program counter.
There are some things in this chapter that will remind you of the context switch described in the previous chapter, carried out when a kernel substitutes one process for another. But there is a key difference between interrupt handling and process switching: the code executed by an interrupt or by an exception handler is not a process. Rather, it is a kernel control path that runs on behalf of the same process that was running when the interrupt occurred (see the later section Section 4.3). As a kernel control path, the interrupt handler is lighter than a process (it has less context and requires less time to set up or tear down).
Interrupt handling is one of the most sensitive tasks performed by the kernel, since it must satisfy the following constraints:
  • Interrupts can come at any time, when the kernel may want to finish something else it was trying to do. The kernel's goal is therefore to get the interrupt out of the way as soon as possible and defer as much processing as it can. For instance, suppose a block of data has arrived on a network line. When the hardware interrupts the kernel, it could simply mark the presence of data, give the processor back to whatever was running before, and do the rest of the processing later (such as moving the data into a buffer where its recipient process can find it and then restarting the process). The activities that the kernel needs to perform in response to an interrupt are thus divided into two parts: a top half that the kernel executes right away and a
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Interrupts and Exceptions
The Intel documentation classifies interrupts and exceptions as follows:
  • Interrupts:
    Maskable interrupts
    All Interrupt Requests (IRQs) issued by I/O devices give rise to maskable interrupts. A maskable interrupt can be in two states: masked or unmasked; a masked interrupt is ignored by the control unit as long as it remains masked.
    Nonmaskable interrupts
    Only a few critical events (such as hardware failures) give rise to nonmaskable interrupts. Nonmaskable interrupts are always recognized by the CPU.
  • Exceptions:
    Processor-detected exceptions
    Generated when the CPU detects an anomalous condition while executing an instruction. These are further divided into three groups, depending on the value of the eip register that is saved on the Kernel Mode stack when the CPU control unit raises the exception.
    Faults
    Can generally be corrected; once corrected, the program is allowed to restart with no loss of continuity. The saved value of eip is the address of the instruction that caused the fault, and hence that instruction can be resumed when the exception handler terminates. As we shall see in Section 8.4, resuming the same instruction is necessary whenever the handler is able to correct the anomalous condition that caused the exception.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Nested Execution of Exception and Interrupt Handlers
When handling an interrupt or an exception, the kernel begins a new kernel control path , or separate sequence of instructions. When a process issues a system call request, for instance, the first instructions of the corresponding kernel control path are those that save the content of the registers in the Kernel Mode stack, while the last instructions are those that restore the content of the registers and put the CPU back into User Mode.
Linux design does not allow process switching while the CPU is executing a kernel control path associated with an interrupt. However, such kernel control paths may be arbitrarily nested; an interrupt handler may be interrupted by another interrupt handler, thus giving raise to a nested execution of kernel threads. We emphasize that the current process doesn't change while the kernel is handling a nested set of kernel control paths.
Assuming that the kernel is bug free, most exceptions can occur only while the CPU is in User Mode. Indeed, they are either caused by programming errors or triggered by debuggers. However, the Page Fault exception may occur in Kernel Mode. This happens when the process attempts to address a page that belongs to its address space but is not currently in RAM. While handling such an exception, the kernel may suspend the current process and replace it with another one until the requested page is available. The kernel control path that handles the Page fault exception resumes execution as soon as the process gets the processor again.
Since the Page Fault exception handler never gives rise to further exceptions, at most two kernel control paths associated with exceptions (the first one caused by a system call invocation, the second one caused by a Page Fault) may be stacked, one on top of the other.
In contrast to exceptions, interrupts issued by I/O devices do not refer to data structures specific to the current process, although the kernel control paths that handle them run on behalf of that process. As a matter of fact, it is impossible to predict which process will be running when a given interrupt occurs.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Initializing the Interrupt Descriptor Table
Now that you understand what the Intel processor does with interrupts and exceptions at the hardware level, we can move on to describe how the Interrupt Descriptor Table is initialized.
Remember that before the kernel enables the interrupts, it must load the initial address of the IDT table into the idtr register and initialize all the entries of that table. This activity is done while initializing the system (see Appendix A).
The int instruction allows a User Mode process to issue an interrupt signal that has an arbitrary vector ranging from 0 to 255. Therefore, initialization of the IDT must be done carefully, to block illegal interrupts and exceptions simulated by User Mode processes via int instructions. This can be achieved by setting the DPL field of the Interrupt or Trap Gate Descriptor to 0. If the process attempts to issue one of these interrupt signals, the control unit checks the CPL value against the DPL field and issues a "General protection" exception.
In a few cases, however, a User Mode process must be able to issue a programmed exception. To allow this, it is sufficient to set the DPL field of the corresponding Interrupt or Trap Gate Descriptors to 3 — that is, as high as possible.
Let's now see how Linux implements this strategy.
As mentioned in the earlier section Section 4.2.3, Intel provides three types of interrupt descriptors: Task, Interrupt, and Trap Gate Descriptors. Task Gate Descriptors are irrelevant to Linux, but its Interrupt Descriptor Table contains several Interrupt and Trap Gate Descriptors. Linux classifies them as follows, using a slightly different breakdown and terminology from Intel:
Interrupt gate
An Intel interrupt gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). All Linux interrupt handlers are activated by means of interrupt gates, and all are restricted to Kernel Mode.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Exception Handling
Most exceptions issued by the CPU are interpreted by Linux as error conditions. When one of them occurs, the kernel sends a signal to the process that caused the exception to notify it of an anomalous condition. If, for instance, a process performs a division by zero, the CPU raises a "Divide error" exception and the corresponding exception handler sends a SIGFPE signal to the current process, which then takes the necessary steps to recover or (if no signal handler is set for that signal) abort.
There are a couple of cases, however, where Linux exploits CPU exceptions to manage hardware resources more efficiently. A first case is already described in section Section 3.3.4. The "Device not available" exception is used together with the TS flag of the cr0 register to force the kernel to load the floating point registers of the CPU with new values. A second case refers to the Page Fault exception, which is used to defer allocating new page frames to the process until the last possible moment. The corresponding handler is complex because the exception may, or may not, denote an error condition (see Section 8.4).
Exception handlers have a standard structure consisting of three parts:
  1. Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language).
  2. Handle the exception by means of a high-level C function.
  3. Exit from the handler by means of the ret_from_exception( ) function.
To take advantage of exceptions, the IDT must be properly initialized with an exception handler function for each recognized exception. It is the job of the trap_init( ) function to insert the final values—the functions that handle the exceptions—into all IDT entries that refer to nonmaskable interrupts and exceptions. This is accomplished through the set_trap_gate, set_intr_gate, and set_system_gate macros:
set_trap_gate(0,&divide_error); 
set_trap_gate(1,&debug); 
set_intr_gate(2,&nmi); 
set_system_gate(3,&int3); 
set_system_gate(4,&overflow); 
set_system_gate(5,&bounds); 
set_trap_gate(6,&invalid_op); 
set_trap_gate(7,&device_not_available); 
set_trap_gate(8,&double_fault); 
set_trap_gate(9,&coprocessor_segment_overrun); 
set_trap_gate(10,&invalid_TSS); 
set_trap_gate(11,&segment_not_present); 
set_trap_gate(12,&stack_segment); 
set_trap_gate(13,&general_protection); 
set_intr_gate(14,&page_fault); 
set_trap_gate(16,&coprocessor_error); 
set_trap_gate(17,&alignment_check); 
set_trap_gate(18,&machine_check); 
set_trap_gate(19,&simd_coprocessor_error); 
set_system_gate(128,&system_call); 
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Interrupt Handling
As we explained earlier, most exceptions are handled simply by sending a Unix signal to the process that caused the exception. The action to be taken is thus deferred until the process receives the signal; as a result, the kernel is able to process the exception quickly.
This approach does not hold for interrupts because they frequently arrive long after the process to which they are related (for instance, a process that requested a data transfer) has been suspended and a completely unrelated process is running. So it would make no sense to send a Unix signal to the current process.
Interrupt handling depends on the type of interrupt. For our purposes, we'll distinguish three main classes of interrupts:
I/O interrupts
Some I/O devices require attention; the corresponding interrupt handler must query the device to determine the proper course of action. We cover this type of interrupt in the later section Section 4.6.1.
Timer interrupts
Some timer, either a local APIC timer or an external timer, has issued an interrupt; this kind of interrupt tells the kernel that a fixed-time interval has elapsed. These interrupts are handled mostly as I/O interrupts; we discuss the peculiar characteristics of timer interrupts in Chapter 6.
Interprocessor interrupts
A CPU issued an interrupt to another CPU of a multiprocessor system. We cover such interrupts in the later section Section 4.6.2.
In general, an I/O interrupt handler must be flexible enough to service several devices at the same time. In the PCI bus architecture, for instance, several devices may share the same IRQ line. This means that the interrupt vector alone does not tell the whole story. In the example shown in Table 4-3, the same vector 43 is assigned to the USB port and to the sound card. However, some hardware devices found in older PC architectures (like ISA) do not reliably operate if their IRQ line is shared with other devices.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Softirqs, Taskl