Understanding the Linux Kernel, Second Edition

by Daniel P. Bovet, Marco Cesati

December 2002

Intermediate to advanced

784 pages

27h 7m

English

O'Reilly Media, Inc.

Content preview from Understanding the Linux Kernel, Second Edition

Softirqs, Tasklets, and Bottom Halves

We mentioned earlier in Section 4.6 that several tasks among those executed by the kernel are not critical: they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often there should be no occurrence of an interrupt until the corresponding interrupt handler has terminated. Conversely, the deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep kernel response time small. This is a very important property for many time-critical applications that expect their interrupt requests to be serviced in a few milliseconds.

Linux 2.4 meets this challenge with three kinds of deferrable and interruptible kernel functions (in short, deferrable functions): softirqs, tasklets, and bottom halves. Although these three kinds of deferrable functions work in different ways, they are closely related. Tasklets are implemented on top of softirqs, and bottom halves are implemented by means of tasklets. As a matter of fact, the term “softirq,” which appears in the kernel source code, often denotes all kinds of deferrable functions.
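The layering described above can be illustrated with a small user-space model. This is only a sketch and not the Linux 2.4 implementation: every name prefixed with `toy_` is hypothetical, and the model leaves out per-CPU state, locking, and interrupt masking, all of which the real kernel needs. It shows two ideas: an urgent path merely marks work as pending in a bit mask, and a tasklet queue is drained by a handler that occupies one softirq slot.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model of deferred work; NOT kernel code.
 * All toy_* names are hypothetical and exist only for illustration. */

#define TOY_NR_SOFTIRQS     4
#define TOY_TASKLET_SOFTIRQ 1   /* slot assumed to be reserved for tasklets */

typedef void (*toy_action_t)(void *data);

struct toy_softirq {
    toy_action_t action;        /* deferred function for this slot */
    void *data;
};

struct toy_tasklet {
    struct toy_tasklet *next;   /* link in the pending tasklet queue */
    int scheduled;              /* nonzero while queued */
    toy_action_t func;
    void *data;
};

static struct toy_softirq toy_vec[TOY_NR_SOFTIRQS];
static unsigned int toy_pending;             /* one bit per pending slot */
static struct toy_tasklet *toy_tasklet_head; /* queued tasklets */

/* Register the deferred function for slot nr. */
void toy_open_softirq(int nr, toy_action_t action, void *data)
{
    assert(nr >= 0 && nr < TOY_NR_SOFTIRQS);
    toy_vec[nr].action = action;
    toy_vec[nr].data = data;
}

/* The urgent path does only this: it records that work is pending. */
void toy_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < TOY_NR_SOFTIRQS);
    toy_pending |= 1u << nr;
}

/* Later, at a safe point, run each pending slot once.
 * Returns the number of deferred functions executed. */
int toy_do_softirq(void)
{
    unsigned int pending = toy_pending;
    int nr, ran = 0;

    toy_pending = 0;            /* slots raised from now on wait for the next pass */
    for (nr = 0; nr < TOY_NR_SOFTIRQS; nr++) {
        if ((pending & (1u << nr)) && toy_vec[nr].action) {
            toy_vec[nr].action(toy_vec[nr].data);
            ran++;
        }
    }
    return ran;
}

/* Softirq handler that drains the tasklet queue: this is the sense in
 * which tasklets sit on top of softirqs in this model. */
static void toy_tasklet_action(void *unused)
{
    struct toy_tasklet *t = toy_tasklet_head;

    (void)unused;
    toy_tasklet_head = NULL;
    while (t) {
        struct toy_tasklet *next = t->next;

        t->next = NULL;
        t->scheduled = 0;       /* may be scheduled again from now on */
        t->func(t->data);
        t = next;
    }
}

/* Queue a tasklet (at most once) and raise the slot that will run it. */
void toy_tasklet_schedule(struct toy_tasklet *t)
{
    if (t->scheduled)
        return;
    t->scheduled = 1;
    t->next = toy_tasklet_head;
    toy_tasklet_head = t;
    toy_raise_softirq(TOY_TASKLET_SOFTIRQ);
}

/* Install the tasklet handler in its softirq slot. */
void toy_init(void)
{
    toy_open_softirq(TOY_TASKLET_SOFTIRQ, toy_tasklet_action, NULL);
}

/* Example deferred function: increments the int that data points to. */
void toy_count(void *data)
{
    ++*(int *)data;
}
```

In this model, a caller runs `toy_init()` once, schedules a tasklet whose function is `toy_count`, and sees no effect until `toy_do_softirq()` runs. Scheduling the same tasklet twice before that point still runs it only once, which is a simplification chosen for the sketch.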

As a general rule, no softirq can be interrupted to run another softirq on the same CPU; the same rule holds for tasklets and bottom halves built on top of softirqs. On a multiprocessor system, however, several deferrable functions can run concurrently on different CPUs. The ...