Understanding the Linux Kernel, Second Edition

by Daniel P. Bovet, Marco Cesati

December 2002

Intermediate to advanced

784 pages

27h 7m

English

O'Reilly Media, Inc.

Understanding the Linux Kernel, 2nd Edition
Preface
The Audience for This Book
Organization of the Material
Level of Description
Overview of the Book
Background Information
Conventions in This Book
How to Contact Us
Acknowledgments
1. Introduction
Linux Versus Other Unix-Like Kernels
Hardware Dependency

Linux Versions
Basic Operating System Concepts
Multiuser Systems; Users and Groups; Processes; Kernel Architecture
An Overview of the Unix Filesystem
Files; Hard and Soft Links; File Types; File Descriptor and Inode; Access Rights and File Mode; File-Handling System Calls; Opening a file; Accessing an opened file; Closing a file; Renaming and deleting a file
An Overview of Unix Kernels
The Process/Kernel Model; Process Implementation; Reentrant Kernels; Process Address Space; Synchronization and Critical Regions; Nonpreemptive kernels; Interrupt disabling; Semaphores; Spin locks; Avoiding deadlocks; Signals and Interprocess Communication; Process Management; Zombie processes; Process groups and login sessions; Memory Management; Virtual memory; Random access memory usage; Kernel Memory Allocator; Process virtual address space handling; Swapping and caching; Device Drivers
2. Memory Addressing
Memory Addresses
Segmentation in Hardware
Segmentation Registers; Segment Descriptors; Fast Access to Segment Descriptors; Segmentation Unit
Segmentation in Linux
Paging in Hardware
Regular Paging; Extended Paging; Hardware Protection Scheme; An Example of Regular Paging; Three-Level Paging; The Physical Address Extension (PAE) Paging Mechanism; Hardware Cache; Translation Lookaside Buffers (TLB)
Paging in Linux
The Linear Address Fields; Page Table Handling; Reserved Page Frames; Process Page Tables; Kernel Page Tables; Provisional kernel Page Tables; Final kernel Page Table when RAM size is less than 896 MB; Final kernel Page Table when RAM size is between 896 MB and 4096 MB; Final kernel Page Table when RAM size is more than 4096 MB; Fix-Mapped Linear Addresses; Handling the Hardware Cache and the TLB; Handling the hardware cache; Handling the TLB
3. Processes
Processes, Lightweight Processes, and Threads
Process Descriptor
Process State; Identifying a Process; Processor descriptors handling; The current macro; The process list; Doubly linked lists; The list of TASK_RUNNING processes; The pidhash table and chained lists; Parenthood Relationships Among Processes; How Processes Are Organized; Wait queues; Handling wait queues; Process Resource Limits
Process Switch
Hardware Context; Task State Segment; The thread field; Performing the Process Switch; Saving the FPU, MMX, and XMM Registers
Creating Processes
The clone( ), fork( ), and vfork( ) System Calls; Kernel Threads; Creating a kernel thread; Process 0; Process 1; Other kernel threads
Destroying Processes
Process Termination; Process Removal
4. Interrupts and Exceptions
The Role of Interrupt Signals
Interrupts and Exceptions
IRQs and Interrupts; The Advanced Programmable Interrupt Controller (APIC); Exceptions; Interrupt Descriptor Table; Hardware Handling of Interrupts and Exceptions
Nested Execution of Exception and Interrupt Handlers
Initializing the Interrupt Descriptor Table
Interrupt, Trap, and System Gates; Preliminary Initialization of the IDT
Exception Handling
Saving the Registers for the Exception Handler; Entering and Leaving the Exception Handler
Interrupt Handling
I/O Interrupt Handling; Interrupt vectors; IRQ data structures; IRQ distribution in multiprocessor systems; Saving the registers for the interrupt handler; The do_IRQ( ) function; Reviving a lost interrupt; Interrupt service routines; Dynamic allocation of IRQ lines; Interprocessor Interrupt Handling
Softirqs, Tasklets, and Bottom Halves
Softirqs; The softirq kernel threads; Tasklets; Bottom Halves; Extending a bottom half
Returning from Interrupts and Exceptions
The ret_from_exception( ) Function; The ret_from_intr( ) Function; The ret_from_sys_call( ) Function; The ret_from_fork( ) Function
5. Kernel Synchronization
Kernel Control Paths
When Synchronization Is Not Necessary
Synchronization Primitives
Atomic Operations; Memory Barriers; Spin Locks; Read/Write Spin Locks; Getting and releasing a lock for reading; Getting and releasing a lock for writing; The Big Reader Lock; Semaphores; Getting and releasing semaphores; Read/Write Semaphores; Completions; Local Interrupt Disabling; Global Interrupt Disabling; Disabling Deferrable Functions
Synchronizing Accesses to Kernel Data Structures
Choosing Among Spin Locks, Semaphores, and Interrupt Disabling; Protecting a data structure accessed by exceptions; Protecting a data structure accessed by interrupts; Protecting a data structure accessed by deferrable functions; Protecting a data structure accessed by exceptions and interrupts; Protecting a data structure accessed by exceptions and deferrable functions; Protecting a data structure accessed by interrupts and deferrable functions; Protecting a data structure accessed by exceptions, interrupts, and deferrable functions
Examples of Race Condition Prevention
Reference Counters; The Global Kernel Lock; Memory Descriptor Read/Write Semaphore; Slab Cache List Semaphore; Inode Semaphore
6. Timing Measurements
Hardware Clocks
Real Time Clock; Time Stamp Counter; Programmable Interval Timer; CPU Local Timers
The Linux Timekeeping Architecture
Timekeeping Architecture in Uniprocessor Systems; PIT’s interrupt service routine; The TIMER_BH bottom half; Timekeeping Architecture in Multiprocessor Systems; Initialization of the timekeeping architecture; The local timer interrupt handler
CPU’s Time Sharing
Updating the Time and Date
Updating System Statistics
Checking the Current Process CPU Resource Limit; Keeping Track of System Load; Profiling the Kernel Code; Checking the NMI Watchdogs
Software Timers
Dynamic Timers; Dynamic timers and race conditions; Dynamic timers handling; An Application of Dynamic Timers
System Calls Related to Timing Measurements
The time( ), ftime( ), and gettimeofday( ) System Calls; The adjtimex( ) System Call; The setitimer( ) and alarm( ) System Calls
7. Memory Management
Page Frame Management
Page Descriptors; Memory Zones; Non-Uniform Memory Access (NUMA); Initialization of the Memory Handling Data Structures; Requesting and Releasing Page Frames; Kernel Mappings of High-Memory Page Frames; Permanent kernel mappings; Temporary kernel mappings; The Buddy System Algorithm; Data structures; Allocating a block; Freeing a block
Memory Area Management
The Slab Allocator; Cache Descriptor; Slab Descriptor; General and Specific Caches; Interfacing the Slab Allocator with the Buddy System; Allocating a Slab to a Cache; Releasing a Slab from a Cache; Object Descriptor; Aligning Objects in Memory; Slab Coloring; Local Array of Objects in Multiprocessor Systems; Allocating an Object in a Cache; The uniprocessor case; The multiprocessor case; Releasing an Object from a Cache; The uniprocessor case; The multiprocessor case; General Purpose Objects
Noncontiguous Memory Area Management
Linear Addresses of Noncontiguous Memory Areas; Descriptors of Noncontiguous Memory Areas; Allocating a Noncontiguous Memory Area; Releasing a Noncontiguous Memory Area
8. Process Address Space
The Process’s Address Space
The Memory Descriptor
Memory Descriptor of Kernel Threads
Memory Regions
Memory Region Data Structures; Memory Region Access Rights; Memory Region Handling; Finding the closest region to a given address: find_vma( ); Finding a region that overlaps a given interval: find_vma_intersection( ); Finding a free interval: arch_get_unmapped_area( ); Inserting a region in the memory descriptor list: insert_vm_struct( ); Allocating a Linear Address Interval; Releasing a Linear Address Interval; First phase: scanning the memory regions; Second phase: updating the Page Tables
Page Fault Exception Handler
Handling a Faulty Address Outside the Address Space; Handling a Faulty Address Inside the Address Space; Demand Paging; Copy On Write; Handling Noncontiguous Memory Area Accesses
Creating and Deleting a Process Address Space
Creating a Process Address Space; Deleting a Process Address Space
Managing the Heap
9. System Calls
POSIX APIs and System Calls
System Call Handler and Service Routines
Initializing System Calls; The system_call( ) Function; Parameter Passing; Verifying the Parameters; Accessing the Process Address Space; Dynamic Address Checking: The Fixup Code; The exception tables; Generating the exception tables and the fixup code
Kernel Wrapper Routines
10. Signals
The Role of Signals
Actions Performed upon Delivering a Signal; Data Structures Associated with Signals; Operations on Signal Data Structures
Generating a Signal
The send_sig_info( ) and send_sig( ) Functions; The force_sig_info( ) and force_sig( ) Functions
Delivering a Signal
Ignoring the Signal; Executing the Default Action for the Signal; Catching the Signal; Setting up the frame; Evaluating the signal flags; Starting the signal handler; Terminating the signal handler; Reexecution of System Calls
System Calls Related to Signal Handling
The kill( ) System Call; Changing a Signal Action; Examining the Pending Blocked Signals; Modifying the Set of Blocked Signals; Suspending the Process; System Calls for Real-Time Signals
11. Process Scheduling
Scheduling Policy
Process Preemption; How Long Must a Quantum Last?
The Scheduling Algorithm
Data Structures Used by the Scheduler; Process descriptor; CPU’s data structures; The schedule( ) Function; Direct invocation; Lazy invocation; Actions performed by schedule( ) before a process switch; Actions performed by schedule( ) after a process switch; How good is a runnable process?; Scheduling on multiprocessor systems; Performance of the Scheduling Algorithm; The algorithm does not scale well; The predefined quantum is too large for high system loads; I/O-bound process boosting strategy is not optimal; Support for real-time applications is weak
System Calls Related to Scheduling
The nice( ) System Call; The getpriority( ) and setpriority( ) System Calls; System Calls Related to Real-Time Processes; The sched_getscheduler( ) and sched_setscheduler( ) system calls; The sched_getparam( ) and sched_setparam( ) system calls; The sched_yield( ) system call; The sched_get_priority_min( ) and sched_get_priority_max( ) system calls; The sched_rr_get_interval( ) system call
12. The Virtual Filesystem
The Role of the Virtual Filesystem (VFS)
The Common File Model; System Calls Handled by the VFS
VFS Data Structures
Superblock Objects; Inode Objects; File Objects; dentry Objects; The dentry Cache; Files Associated with a Process
Filesystem Types
Special Filesystems; Filesystem Type Registration
Filesystem Mounting
Mounting the Root Filesystem; Mounting a Generic Filesystem; Unmounting a Filesystem
Pathname Lookup
Standard Pathname Lookup; Parent Pathname Lookup; Lookup of Symbolic Links
Implementations of VFS System Calls
The open( ) System Call; The read( ) and write( ) System Calls; The close( ) System Call
File Locking
Linux File Locking; File-Locking Data Structures; FL_FLOCK Locks; FL_POSIX Locks
13. Managing I/O Devices
I/O Architecture
I/O Ports; Accessing I/O ports; I/O Interfaces; Custom I/O interfaces; General-purpose I/O interfaces; Device Controllers; Mapping addresses of I/O shared memory; Accessing the I/O shared memory; Direct Memory Access (DMA); Putting DMA to work
Device Files
Old-Style Device Files; Devfs Device Files; VFS Handling of Device Files
Device Drivers
Levels of Kernel Support; Buffering Strategies of Device Drivers; Registering a Device Driver; Initializing a Device Driver; Monitoring I/O Operations; Polling mode; Interrupt mode
Block Device Drivers
Keeping Track of Block Device Drivers; Initializing a Block Device Driver; Sectors, Blocks, and Buffers; Buffer Heads; An Overview of Block Device Driver Architecture; Request descriptors; Request queue descriptors; Block device low-level driver descriptor; The ll_rw_block( ) Function; Scheduling the activation of the strategy routine; Extending the request queue; Low-Level Request Handling; Block and Page I/O Operations; Block I/O operations; Page I/O operations
Character Device Drivers
14. Disk Caches
The Page Cache
The address_space Object; Page Cache Data Structures; The page hash table; The lists of page descriptors in the address_space object; Page descriptor fields related to the page cache; Page Cache Handling Functions
The Buffer Cache
Buffer Head Data Structures; The list of unused buffer heads; Lists of buffer heads for cached buffers; The hash table of cached buffer heads; Buffer usage counter; Buffer Pages; Allocating buffer pages; The getblk( ) Function; Writing Dirty Buffers to Disk; The bdflush kernel thread; The kupdate kernel thread; The sync( ), fsync( ), and fdatasync( ) system calls
15. Accessing Files
Reading and Writing a File
Reading from a File; The readpage method for regular files; The readpage method for block device files; Read-Ahead of Files; The accessed page is locked (synchronous read-ahead); The accessed page is unlocked (asynchronous read-ahead); Writing to a File; The prepare_write and commit_write methods for regular files; The prepare_write and commit_write methods for block device files
Memory Mapping
Memory Mapping Data Structures; Creating a Memory Mapping; Destroying a Memory Mapping; Demand Paging for Memory Mapping; Flushing Dirty Memory Mapping Pages to Disk
Direct I/O Transfers
16. Swapping: Methods for Freeing Memory
What Is Swapping?
Which Kind of Page to Swap Out; How to Distribute Pages in the Swap Areas; How to Select the Page to Be Swapped Out; When to Perform Page Swap Out
Swap Area
Swap Area Descriptor; Swapped-Out Page Identifier; Activating and Deactivating a Swap Area; The sys_swapon( ) service routine; The sys_swapoff( ) service routine; The try_to_unuse( ) function; Allocating and Releasing a Page Slot; The scan_swap_map( ) function; The get_swap_page( ) function; The swap_free( ) function
The Swap Cache
Swap Cache Helper Functions
Transferring Swap Pages
The rw_swap_page( ) Function; The read_swap_cache_async( ) Function; The rw_swap_page_nolock( ) Function
Swapping Out Pages
The try_to_swap_out( ) Function
Swapping in Pages
The do_swap_page( ) Function
Reclaiming Page Frame
Outline of the Page Frame Reclaiming Algorithm; The Least Recently Used (LRU) Lists; Moving pages across the LRU lists; The try_to_free_pages( ) Function; The shrink_caches( ) Function; The shrink_cache( ) Function; Reclaiming Page Frames from the Dentry and Inode Caches; Reclaiming page frames from the dentry cache; Reclaiming page frames from the inode cache; The kswapd Kernel Thread
17. The Ext2 and Ext3 Filesystems
General Characteristics of Ext2
Ext2 Disk Data Structures
Superblock; Group Descriptor and Bitmap; Inode Table; How Various File Types Use Disk Blocks; Regular file; Directory; Symbolic link; Device file, pipe, and socket
Ext2 Memory Data Structures
The ext2_sb_info and ext2_inode_info Structures; Bitmap Caches
Creating the Ext2 Filesystem
Ext2 Methods
Ext2 Superblock Operations; Ext2 Inode Operations; Ext2 File Operations
Managing Ext2 Disk Space
Creating Inodes; Deleting Inodes; Data Blocks Addressing; File Holes; Allocating a Data Block; Releasing a Data Block
The Ext3 Filesystem
Journaling Filesystems; The Ext3 Journaling Filesystem; The Journaling Block Device Layer; Log records; Atomic operation handles; Transactions; How Journaling Works
18. Networking
Main Networking Data Structures
Network Architectures; Network Interface Cards; BSD Sockets; INET Sockets; The Destination Cache; Routing Data Structures; The Forwarding Information Base (FIB); The routing cache; The neighbor cache; The Socket Buffer
System Calls Related to Networking
The socket( ) System Call; Socket initialization; Socket’s files; The bind( ) System Call; The connect( ) System Call; Writing Packets to a Socket; Transport layer: the udp_sendmsg( ) function; Transport and network layers: the ip_build_xmit( ) function; Data link layer: composing the hardware header; Data link layer: enqueueing the socket buffer for transmission
Sending Packets to the Network Card
Receiving Packets from the Network Card
19. Process Communication
Pipes
Using a Pipe; Pipe Data Structures; The pipefs special filesystem; Creating and Destroying a Pipe; Reading from a Pipe; Writing into a Pipe
FIFOs
Creating and Opening a FIFO
System V IPC
Using an IPC Resource; The ipc( ) System Call; IPC Semaphores; Undoable semaphore operations; The queue of pending requests; IPC Messages; IPC Shared Memory; Swapping out pages of IPC shared memory regions; Demand paging for IPC shared memory regions
20. Program Execution
Executable Files
Process Credentials and Capabilities; Process capabilities; Command-Line Arguments and Shell Environment; Libraries; Program Segments and Process Memory Regions; Execution Tracing
Executable Formats
Execution Domains
The exec Functions
A. System Startup
Prehistoric Age: The BIOS
Ancient Age: The Boot Loader
Booting Linux from Floppy Disk; Booting Linux from Hard Disk
Middle Ages: The setup( ) Function
Renaissance: The startup_32( ) Functions
Modern Age: The start_kernel( ) Function
B. Modules
To Be (a Module) or Not to Be?
Module Implementation
Module Usage Counter; Exporting Symbols; Module Dependency
Linking and Unlinking Modules
Linking Modules on Demand
The modprobe Program; The request_module( ) Function
C. Source Code Structure
21. Bibliography
Books on Unix Kernels
Books on the Linux Kernel
Books on PC Architecture and Technical Manuals on Intel Microprocessors
Other Online Documentation Sources
Index
Colophon

Content preview from Understanding the Linux Kernel, Second Edition

Returning from Interrupts and Exceptions

We will finish the chapter by examining the termination phase of interrupt and exception handlers. Although the main objective is clear (namely, to resume execution of some program), several issues must be considered before doing so:

Number of kernel control paths being concurrently executed: If there is just one, the CPU must switch back to User Mode.
Pending process switch requests: If there is any request, the kernel must perform process scheduling; otherwise, control is returned to the current process.
Pending signals: If a signal is sent to the current process, it must be handled.
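The three checks above can be sketched as a small decision function in C. This is only an illustrative user-space model, not the kernel's assembly code: the `return_state` structure, the `ACT_*` flags, and `plan_return()` are hypothetical names introduced here, and the sketch assumes that when other kernel control paths are still in execution, the interrupted one is simply resumed without rescheduling or signal handling.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what is examined when a handler terminates. */
struct return_state {
    int  nested_paths;     /* kernel control paths currently in execution */
    bool resched_pending;  /* a process switch request is pending */
    bool signal_pending;   /* a signal is pending for the current process */
};

/* Hypothetical action flags; several may apply to a single return. */
enum {
    ACT_RESUME_KERNEL = 1 << 0, /* resume the interrupted kernel control path */
    ACT_SCHEDULE      = 1 << 1, /* perform process scheduling first */
    ACT_HANDLE_SIGNAL = 1 << 2, /* handle the pending signal first */
    ACT_RESUME_USER   = 1 << 3  /* switch back to User Mode */
};

/* Decide what must happen before execution of some program resumes. */
unsigned int plan_return(const struct return_state *s)
{
    unsigned int actions;

    /* Assumption: with nested paths, only the outer path is resumed. */
    if (s->nested_paths > 1)
        return ACT_RESUME_KERNEL;

    /* Just one kernel control path: the CPU goes back to User Mode. */
    actions = ACT_RESUME_USER;
    if (s->resched_pending)
        actions |= ACT_SCHEDULE;
    if (s->signal_pending)
        actions |= ACT_HANDLE_SIGNAL;
    return actions;
}
```

For instance, a state with one kernel control path, a pending process switch request, and no pending signal yields `ACT_RESUME_USER | ACT_SCHEDULE` in this model.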

The kernel assembly language code that accomplishes all these things is not, technically speaking, a function, since control is never returned to the functions that invoke it. It is a piece of code with four different entry points called ret_from_intr, ret_from_exception, ret_from_sys_call, and ret_from_fork. Since this makes the description simpler, we will refer to these four entry points as four different functions:

ret_from_exception( ): Terminates all exceptions except the 0x80 ones
ret_from_intr( ): Terminates interrupt handlers
ret_from_sys_call( ): Terminates system calls (i.e., kernel control paths engendered by 0x80 programmed exceptions)
ret_from_fork( ): Terminates the fork( ), vfork( ), or clone( ) system calls (in the child process only)
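The mapping in the list above, from the kind of kernel control path to the entry point that terminates it, can be written as a small lookup in C. This is a hedged sketch for illustration only: the `path_kind` enumeration and `exit_entry_point()` are hypothetical names, and the macro `SYSCALL_VECTOR` is used here simply to name the 0x80 vector mentioned in the text.

```c
/* Vector of the programmed exception used for system calls (from the text). */
#define SYSCALL_VECTOR 0x80

/* Hypothetical classification of the path that is terminating. */
enum path_kind {
    PATH_EXCEPTION,   /* any exception, including the 0x80 one */
    PATH_INTERRUPT,   /* an interrupt handler */
    PATH_FORK_CHILD   /* the child of fork( ), vfork( ), or clone( ) */
};

/* Return the name of the entry point that terminates the given path. */
const char *exit_entry_point(enum path_kind kind, unsigned int vector)
{
    switch (kind) {
    case PATH_INTERRUPT:
        return "ret_from_intr";
    case PATH_FORK_CHILD:
        return "ret_from_fork";
    case PATH_EXCEPTION:
        /* 0x80 exceptions are system calls; all others share one exit. */
        return vector == SYSCALL_VECTOR ? "ret_from_sys_call"
                                        : "ret_from_exception";
    }
    return "unknown";
}
```

In this model the vector matters only for exceptions; for the other two kinds it is ignored.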

The general flow diagram with the corresponding four entry points is ...

ISBN: 0596002130
