There are a thousand hacking at the branches of evil to one who is striking at the root.
Going faster seems to be a central part of human development. As little as three hundred years ago, the fastest you could expect to go was a few tens of miles an hour, aboard a fast clipper ship with a stiff wind behind you. Now, we must expand our view to things such as the fastest achievable speed while remaining in the Earth's atmosphere -- perhaps fifteen hundred miles an hour, twice the speed of sound, if you are a civilian without access to the latest, fastest military aircraft. The journey that used to take three weeks under sail from London to New York City now takes as little as three hours, sipping champagne the whole way.
This innate human desire to go fast is expressed in many ways: microwave ovens let us cook dinner quickly, high-performance automobiles and motorcycles give us a wonderful thrill, email lets us communicate at almost the speed of thought. But what happens when that email server is overwhelmed, as when we all log in at eight o’clock in the morning to check what has gone on while we’ve slept? Or when the procurement system for the company that distributes microwave ovens is only able to handle half of the workload, or when a mechanical engineer’s CAD system runs so slowly that the car engine she’s designing won’t be ready in time for the new model year?
Sometimes these problems are subtle, sometimes not. We must carefully consider the changes in the system that caused it to become unacceptably slow. These influences might come from within, such as a heavier load placed on the system, or from without, such as a new operating system revision subtly changing -- or utterly replacing -- an algorithm for resource management that is critically important to our software. The solutions are sometimes quick (a matter of adjusting one “knob” slightly), or sometimes slow and painful (a task for weeks of analysis, consultations with vendors, and careful redesign of an infrastructure).
When I originally started work on this book, it seemed straightforward: take Mike Loukides’s excellent but dated first edition and overhaul it for modern computer systems. As I progressed further along, I realized that this book is actually about much more than simply performance tuning. It really covers two distinct areas:
Both areas are underpinned by the science of computer architecture. This book does not concentrate on application design; rather, it focuses on the operating system, the underlying hardware, and their interactions.
To most systems administrators, a computer is really a black box. This is perfectly reasonable for many tasks: after all, it’s certainly not necessary to understand how the operating system manages free memory to configure and maintain a mail server. However, in performance tuning -- which is, at heart, very much about the underlying hardware and how it is abstracted -- truly understanding the behavior of the system involves a detailed knowledge of the inner workings of the machine. In this chapter, we’ll briefly discuss some of the most important concepts of computer architecture, and then go into the fundamental principles of performance tuning.
A full discussion of computer architecture is far beyond the level of this text. Periodically, we’ll go into architectural matters, in order to provide the conceptual underpinnings of the system under discussion. However, if this sort of thing interests you, there are a great many excellent texts on the topic. Perhaps the most commonly used are two textbooks by John Hennessy and David Patterson: they are titled Computer Organization and Design: The Hardware/Software Interface and Computer Architecture: A Quantitative Approach (both published by Morgan Kaufmann).
In this section, we’ll focus on the two most important general concepts of architecture: the general means by which we approach a problem (the levels of transformation), and the essential model around which computers are designed.
When we approach a problem, we must reduce it to something that a computer can understand and work with: this might be anything from a set of logic gates, solving the fundamental problem of “How do we build a general-purpose computing machine?” to a few million bits worth of binary code. As we proceed through these logical steps, we transform the problem into a “simpler” one (at least from the computer’s point of view). These steps are the levels of transformation.
When faced with a problem where we think a computer will be of assistance, we first develop an algorithm for completing the task in question. An algorithm is, very simply, a precise, step-by-step set of instructions for performing a particular task -- for example, a clerk inspecting and routing incoming mail follows an algorithm for how to properly sort the mail.
This algorithm must then be translated by a programmer into a program written in a language. Generally, this is a high-level language, such as C or Perl, although it might be a low-level language, such as assembler. The language layer exists to make our lives easier: the structure and grammar of high-level languages lets us easily write complex programs. This high-level language program, which is usually portable between different systems, is then transformed by a compiler into the low-level instructions required by a specific system. These instructions are specified by the Instruction Set Architecture.
The Instruction Set Architecture, or ISA, is the fundamental language of the microprocessor: it defines the basic, indivisible instructions that we can execute. The ISA serves as the interface between software and hardware. Examples of instruction set architectures include IA-32, which is used by Intel and AMD CPUs; MIPS, which is implemented in the Silicon Graphics/MIPS R-series microprocessors (e.g., the R12000); and the SPARC V9 instruction set used by the Sun Microsystems UltraSPARC series.
At this level, we are firmly in the grasp of electrical and computer engineering. We concern ourselves with functional units of microarchitecture and the efficiency of our design. Below the microarchitectural level, we worry about how to implement the functional units through circuit design: the problems of electrical interference become very real. A full discussion of the hardware layer is far beyond us here; tuning the implementations of microprocessors is not something we are generally able to do.
The von Neumann model has served as the basic design model for all modern computing systems: it provides a framework upon which we can hang the abstractions and flesh generated by the levels of transformation. The model consists of four core components:
A memory system, which stores both instructions and data. This is known as a stored program computer. This memory is accessed by means of the memory address register (MAR), where the system puts the address of a location in memory, and a memory data register (MDR), where the memory subsystem puts the data stored at the requested location. I discuss memory in more detail in Chapter 4.
At least one processing unit, known as the arithmetic and logic unit (ALU); together with the control logic, this is more commonly called the central processing unit (CPU). It is responsible for the execution of all instructions. The processor also has a small amount of very fast storage space, called the register file. I discuss processors in detail in Chapter 3.
A control unit, which is responsible for controlling cross-component operations. It maintains a program counter, which contains the next instruction to be loaded, and an instruction register, which contains the current instruction. The peculiarities of control design are beyond the scope of this text.
The system needs a nonvolatile way to store data, as well as ways to represent it to the user and to accept input. This is the domain of the input/output (I/O) subsystem. This book primarily concerns itself with disk drives as a mechanism for I/O; I discuss them in Chapter 5. I also discuss network I/O in Chapter 7.
Despite all the advances in computing over the last sixty years, modern systems still fit into this framework. That is a very powerful statement: despite the fact that computers are orders of magnitude faster now, and are being used in ways that weren't even imaginable at the end of the Second World War, the basic ideas, as formulated by von Neumann and his colleagues, are still applicable today.
As you’ll see in Section 1.2 later in this chapter, one of the principles of performance tuning is that there are always trade-offs. This problem was recognized by the pioneers in the field, and we still do not have a perfect solution today. In the case of data storage, we are often presented with the choice between cost, speed, and size. (Physical parameters, such as heat dissipation, also play a role, but for this discussion, they’re usually subsumed into the other variables.) It is possible to build extremely large, extremely fast memory systems -- for example, the Cray 1S supercomputer used very fast static RAM exclusively for memory. This is not something that can be adapted across the spectrum of computing devices.
The problem we are trying to solve is that storage size tends to be inversely proportional to performance, particularly relative to the next highest level of price/performance. A modern microprocessor might have a cycle time measured in fractions of a nanosecond, while making the trip to main memory can easily be fifty times slower.
To try and work around this problem, we employ something known as the memory hierarchy. It is based on creating a tree of storage areas (Figure 1-1). At the top of the pyramid, we have very small areas of storage that are exceedingly fast. As we progress down the pyramid, things become increasingly slow, but correspondingly larger. At the foundation of the pyramid, we might have storage in a tape library: many terabytes, but it might take minutes to access the information we are looking for.
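To see why the hierarchy pays off, consider the effective access time it produces. The following sketch computes the average time per access; the latencies and hit fractions are purely illustrative assumptions, not measured figures from any real system:

```python
# Effective access time for a simple memory hierarchy.
# Each entry: (level name, access time in seconds, fraction of all
# accesses satisfied at that level). The numbers are illustrative only.
levels = [
    ("register",    1e-9,  0.40),
    ("cache",       5e-9,  0.55),
    ("main memory", 50e-9, 0.0499),
    ("disk",        5e-3,  0.0001),
]

# The weighted sum of latencies gives the expected time per access.
effective = sum(latency * fraction for _, latency, fraction in levels)
print(f"effective access time: {effective * 1e9:.1f} ns")
```

Note how the arithmetic is dominated by the rare trips to the slowest level: even though only one access in ten thousand goes to disk in this model, disk accounts for nearly all of the average access time. This is exactly why caching aggressively at the upper levels matters.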
From the point of view of the microprocessor, main memory is very slow. Anything that makes us go to main memory is bad -- unless we’re going to main memory to prevent going to an even slower storage medium (such as disk).
The function of the pyramid is to cache the most frequently used data and instructions in the higher levels. For example, if we keep accessing the same file on tape, we might want to store a temporary copy on the next fastest level of storage (disk). We can similarly store a file we keep accessing from disk in main memory, taking advantage of main memory’s substantial performance benefit over disk.
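The replacement policy most hierarchies approximate is "keep what was used most recently." Here is a minimal sketch of a least-recently-used cache; the capacity and keys are hypothetical, but the eviction logic is the same idea the hierarchy applies between tape, disk, and memory:

```python
from collections import OrderedDict

class LRUCache:
    """A minimal least-recently-used cache."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                    # miss: caller must go to the slower level
        self.data.move_to_end(key)         # hit: mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")           # "a" becomes most recently used
cache.put("c", 3)        # capacity exceeded; "b" is evicted
print(cache.get("b"))    # -> None
print(cache.get("a"))    # -> 1
```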
Companies that produce computer hardware and software often make a point of mentioning the size of their systems' address space (typically 32 or 64 bits). In the last five years, the shift from 32-bit to 64-bit microprocessors and operating systems has caused a great deal of hype to be generated by various marketing departments. The truth is that although in certain cases 64-bit architectures run significantly faster than 32-bit architectures, in general, performance is equivalent.
The number of "bits" refers to the width of a data path. However, what this actually means depends on context. For example, we might refer to a 16-bit data path (for example, UltraSCSI). This means that the interconnect can transfer 16 bits of information at a time. With all other things held constant, it would be twice as fast as an interconnect with an 8-bit data path.
The "bitness" of a memory system refers to how many wires are used to transfer a memory address. For example, if we had an 8-bit path to the memory address, and we wanted the 19th location in memory, we would turn on the appropriate wires (1, 2, and 5; we derive this from writing 19 in binary, which gives 00010011 -- everywhere there is a one, we turn on that wire). Note, however, that since we only have 8 bits worth of addressing, we are limited to 256 (2^8) addresses in memory. 32-bit systems are, therefore, limited to 4,294,967,296 (2^32) locations in memory. Since memory is typically accessible in 1-byte blocks, this means that the system can't directly access more than 4 GB of memory. The shift to 64-bit operating systems and hardware raises the maximum amount of addressable memory to 2^64 bytes, or about 16 exabytes, which is probably sufficient for the immediately foreseeable future.
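The arithmetic is easy to check. This short sketch computes the number of byte-addressable locations for various address widths, and verifies the binary encoding of the 19th location discussed above:

```python
def addressable_bytes(address_bits):
    """Number of directly addressable locations given the number of
    address lines, assuming byte-addressable memory."""
    return 2 ** address_bits

for bits in (8, 32, 64):
    print(f"{bits:2d}-bit addressing: {addressable_bytes(bits):,} bytes")

# 19 decimal is 00010011 in 8-bit binary: address lines 1, 2, and 5
# (counting from 1, starting at the least significant bit) are asserted.
assert format(19, "08b") == "00010011"
```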
Unfortunately, it’s often not quite this simple in practice. A 32-bit SPARC system is actually capable of having more than 4 GB of memory installed, but, in Solaris, no single process can use more than 4 GB. This is because the hardware that controls memory management actually uses a 44-bit addressing scheme, but the Solaris operating system can only give any one process the amount of memory addressable in 32 bits.
The change from 32-bit to 64-bit architectures, then, expanded the size of main memory and the amount of memory a single process can have. An obvious question is, how did applications benefit from this? Here are some kinds of applications that benefitted from larger memory spaces:
Applications that could not use the most time-efficient algorithm for a problem because that algorithm would use more than 4 GB of memory.
Applications where caching large data sets is critically important, and therefore the more memory available to the process, the more can be cached.
Applications where the system is short on memory due to overwhelming utilization (many small processes). Note that in SPARC systems, this was not a problem: each process could only see 4 GB, but the system could have much more installed.
In general, the biggest winners from 64-bit systems are high-performance computing and corporate database engines. For the average desktop workstation, 32 bits is plenty.
Unfortunately, the change to 64-bit systems also meant that the underlying operating system and system calls needed to be modified, which sometimes resulted in a slight slowdown (for example, more data needs to be manipulated during pointer operations). This means that there may be a very slight performance penalty associated with running in 64-bit mode.
In this book, I present a few “rules of thumb.” As an old IBM technical bulletin says, rules of thumb come from people who live out of town and who have no production experience.
Keeping that in mind, I’ve found that much of the conceptual basis behind system performance tuning can be summarized in five principles: understand your environment, nothing is ever free, throughput and latency are distinct measurements, resources should not be overutilized, and one must design experiments carefully.
If you don’t understand your environment, you can’t possibly fix the problem. This is why there is such an emphasis on conceptual material throughout this text: even though the specific details of how an algorithm is implemented might change, or even the algorithm itself, the abstract knowledge of what the problem is and how it is approached is still valid. You have a much more powerful tool at your disposal if you understand the problem than if you know a solution without understanding why the problem exists.
Certainly, at some level, problems are well beyond the scope of the average systems administrator: tuning network performance at the level discussed here does not really require an in-depth understanding of how the TCP/IP stack is implemented in terms of modules and function calls. If you are interested in the inner details of how the different variants of the Unix operating system work, there are a few excellent texts that I strongly suggest you read: Solaris Internals by Richard McDougall and James Mauro, The Design of the UNIX Operating System by Maurice J. Bach, and Operating Systems: Design and Implementation by Andrew Tanenbaum and Albert Woodhull (all published by Prentice Hall).
The ultimate reference, of course, is the source code itself. As of this writing, the source code for all the operating systems I describe in detail (Solaris and Linux) is available for free.
TANSTAAFL means There Ain't No Such Thing As A Free Lunch. At heart, performance tuning is about making trade-offs between various attributes. The classic formulation is a list of three desirable attributes -- of which we can pick only two.
One example comes from tuning the TCP network layer, where Nagle's algorithm provides a way to sacrifice latency, or the time required to deliver a single packet, in exchange for increased throughput, or how much data can be effectively pushed down the wire. (We'll discuss Nagle's algorithm in greater detail in Chapter 7.)
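On systems where small-packet latency matters more than bulk throughput, this trade-off can be made explicitly: Nagle's algorithm is disabled on a per-socket basis with the standard TCP_NODELAY socket option. A minimal sketch in Python:

```python
import socket

# Create a TCP socket and disable Nagle's algorithm, trading some
# throughput efficiency for lower per-packet latency.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify that the option took effect before using the socket.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("Nagle's algorithm disabled:", bool(nodelay))
sock.close()
```

Whether this helps or hurts depends entirely on the workload: an application that writes many small messages and waits for responses benefits, while one that streams bulk data generally does not.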
This principle often necessitates making real, significant, and difficult choices.
Systems administrators who are evaluating computer systems are often like adolescent males evaluating cars. This is unfortunate. In both cases, there is a certain set of metrics, and we try to find the highest value for the most "important" metric: typically "bulk data throughput" for computers and "horsepower" for cars.
The lengths to which some people will go to obtain maximum horsepower from their four-wheeled vehicles are often somewhat ludicrous. A slight change in perspective would reveal that there are other facets to the performance gem. A lot of effort is spent in optimizing a single thing that may not actually be a problem. I would like to illustrate this by means of a simplistic comparison (please remember that we are comparing performance alone). Vehicle A puts out about 250 horsepower, whereas Vehicle B only produces about 95. One would assume that Vehicle A exhibits significantly "better" performance in the real world than Vehicle B. The astute reader may ask what the weights of the vehicles involved are: Vehicle A weighs about 3600 pounds, whereas Vehicle B weighs about 450 pounds. It is then apparent that Vehicle B is actually quite a bit faster (0-60 mph in about three and a half seconds, as opposed to Vehicle A's leisurely five and a half seconds). However, when we compare how fast the vehicles actually travel in crowded traffic down Highway 101 (which runs down the San Francisco Peninsula), Vehicle B wins by an even wider margin, since motorcycles in California are allowed to ride between the lanes ("lane splitting").
Perhaps the most often neglected consideration in this world of tradeoffs is latency. For example, let's consider a fictional email server application -- call it SuperDuperMail. The SuperDuperMail marketing documentation claims that it is capable of handling over one million messages an hour. This might seem pretty reasonable: this rate is far faster than most companies need. In other words, the throughput is generally good. A different way of looking at the performance of this mail server would be to ask how long it takes to process a single message. After some pointed questions to the SuperDuperMail marketing department, they reveal that it takes half an hour to process a message. Taken naively, this seems contradictory: if each message took half an hour of exclusive processing, the software could handle at most two messages an hour. However, it turns out that the SuperDuperMail product is based on a series of bins internally, and moves messages to the next bin only when that bin is completely full. In this example, despite the fact that throughput is acceptable, the latency is horrible. Who wants to send a message to the person two offices down and have it take half an hour to get there?
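A toy model makes the distinction concrete. The bin size below is purely hypothetical (SuperDuperMail is, after all, fictional), but it shows how an impressive hourly throughput figure can coexist with terrible per-message latency:

```python
# Toy model of a batching mail server: messages advance only when an
# internal bin fills completely. All figures are hypothetical.
arrival_rate = 1_000_000 / 3600   # offered load: one million messages an hour
bin_size = 500_000                # messages held before the bin is forwarded

time_to_fill = bin_size / arrival_rate   # seconds a message may wait in the bin
throughput = arrival_rate * 3600         # messages per hour, unaffected by batching
latency_minutes = time_to_fill / 60

print(f"throughput: {throughput:,.0f} messages/hour")
print(f"worst-case latency for a message: about {latency_minutes:.0f} minutes")
```

Throughput measures how much work completes per unit time; latency measures how long any one unit of work takes. Optimizing one says nothing about the other.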
Anyone who has driven on a heavily-travelled highway has seen a very common problem in computing emerge in a different area: the "minimum speed limit" signs often seem like a cruel joke! Clearly, many factors go into designing and maintaining an interstate, particularly one that is heavily used by commuters: the peak traffic is significantly different from the average traffic, funding is always a problem, and so on. Furthermore, adding another lane to the highway, assuming that space and funding are available, usually involves temporarily closing at least one lane of the active roadway. This invariably frustrates commuters even more. The drive to provide "sufficient" capacity is always there, but the barriers to implementing change are such that it often takes quite a while to add capacity.
In the abstract, the drive to expand is strongest when collapse is imminent or occurring. This principle usually makes sense: why should we build excess capacity when we aren't fully using what we have? Unfortunately, there are some cases where complete utilization is not optimal. This is true in computing, yet people often push their systems to 100% utilization before considering upgrades or expansion.
Overutilization is a dangerous thing. As a general rule of thumb, something should be no more than 70% busy or consumed at any given time: this will provide a margin of safety before serious degradation occurs.
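Elementary queueing theory motivates the 70% figure. For a single server with service time S and utilization U, the expected response time of an open queue grows roughly as S/(1 - U), so delay explodes as utilization approaches 100%. A quick sketch (the 10 ms service time is illustrative):

```python
service_time = 10.0   # ms per request, an illustrative figure

def response_time(utilization, service_time):
    """Expected response time of a simple single-server (M/M/1-style)
    queue: S / (1 - U). Valid for 0 <= U < 1."""
    return service_time / (1.0 - utilization)

for u in (0.50, 0.70, 0.90, 0.95, 0.99):
    print(f"{u:.0%} busy -> {response_time(u, service_time):7.1f} ms")
```

Running this shows the knee in the curve: doubling utilization from 35% to 70% merely doubles response time, but the climb from 90% to 99% busy makes response time ten times worse. Keeping headroom below about 70% stays on the flat part of the curve.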
For example, suppose you want to measure how fast you can transfer a large file between two hosts connected by a gigabit network, so you time an FTP transfer:

% ls -l bigfile
-rw-------   1 jqpublic staff  134217728 Jul 10 20:18 bigfile
% ftp franklin
Connected to franklin.
220 franklin FTP server (SunOS 5.8) ready.
Name (franklin:jqpublic): jqpublic
331 Password required for jqpublic.
Password: <secret>
230 User jqpublic logged in.
ftp> bin
200 Type set to I.
ftp> prompt
Interactive mode off.
ftp> mput bigfile
200 PORT command successful.
150 Binary data connection for bigfile (192.168.254.2,34788).
226 Transfer complete.
local: bigfile remote: bigfile
134217728 bytes sent in 13 seconds (9781.08 Kbytes/s)
ftp>
If you saw this sort of performance, you would probably be very upset: you were expecting to see a transfer rate on the order of 120 MB per second! Instead, you are barely getting a tenth of that. What on earth happened? Is there a parameter that’s not set properly somewhere, did you get a bad network card? You could waste a lot of time searching for the answer to that question. The truth is that what you probably measured was either how fast you can push data out of /home, or how fast the remote host could accept that data. The network layer is not the limiting factor here; it’s something entirely different: it could be the CPU, the disks, the operating system, etc.
A great deal of this book’s emphasis on explaining how and why things work is so that you can design experiments that enable you to measure what you think you are.
There is a huge amount of hearsay regarding performance analysis. Your only hope of making sense of it is to understand the issues, design tests, and gather data.
The moral of this example is to think very, very carefully when you design performance measurement experiments. All sorts of things might be going on just underneath the surface of the measurement that are causing you to measure something that has little bearing on what you're actually interested in measuring. If you really want to measure something, try to find a tool written specifically to test that component. Even then, be careful: it's very easy to get burnt.
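One habit that helps: never trust a single run. The sketch below times a workload several times and reports the median and spread, so that one anomalous run (a cold cache, a background cron job) does not masquerade as the true figure. The workload here is just a placeholder:

```python
import statistics
import time

def measure(func, trials=5):
    """Run func() several times; return (median, min, max) wall-clock
    durations in seconds. The spread hints at measurement noise."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples), max(samples)

# Placeholder workload: something CPU-bound and repeatable.
median, low, high = measure(lambda: sum(range(100_000)))
print(f"median {median * 1e3:.2f} ms (min {low * 1e3:.2f}, max {high * 1e3:.2f})")
```

If the spread is large relative to the median, the experiment is telling you that you are not yet measuring what you think you are.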
This book largely focuses on the performance tuning of systems in dynamic conditions. I focus almost exclusively on how the system’s performance corresponds to the stresses placed on the system. Another sort of performance tuning exists, however, that focuses on static (that is, workload-independent) factors. These factors tend to degrade system performance no matter what the workload is, and are not generally tied to resource contention problems.
By far the largest culprit in static performance issues is the naming service: the means by which information about an entity is retrieved. Examples of naming services are NIS+, LDAP, and DNS.
The symptoms are often vague: logins are slow, the windowing environment or a web browser feels sluggish, a new window locks at startup, or the X or CDE login screen hangs. Here are a few places to check, in approximate order of likelihood:
The naming service configuration file (/etc/nsswitch.conf on Solaris and Linux) is read only once for each process that is using a naming service, so a reboot may be necessary to make changes take effect.
In /etc/resolv.conf, are the nameservers and the domain specified correctly? Incorrect or lengthy (having many subcomponents) domain specifications may cause DNS to generate a great many requests. The nameservers should be sorted by latency of response.
This daemon, nscd, caches name service-provided information for a significant boost to performance. It is standard-issue on Solaris and available for Linux. Historically, nscd has been blamed for many problems, but most of those have long since been fixed: you should probably run it. The exception is when troubleshooting, as it can mask underlying name service problems.
alpha:/home
bravo:/home/projects
delta:/home/projects/system-performance-tuning-2nd-ed
Reading the file /home/projects/system-performance-tuning-2nd-ed/README requires all three NFS servers to be up, and generates quite a bit of excess traffic. It would be much faster to have separate mount points for everything.
Duplicate IP addresses on a network, as well as many other host interface misconfigurations, can cause all sorts of problems. Try to maintain strict control over IP addressing. Sometimes cables fail or begin to generate errors: you can check the error rates for an interface with netstat -i.
Sometimes processors will fail. This will almost always induce a panic. However, in some high-end Sun systems (the E3500-E6500 and E10000, for example), the system will automatically reboot, attempt to isolate the fault, and carry on. The result is that a system could seem to spontaneously reboot, but come back up short a few processors (their failure had induced the reboot). It’s good practice to use the psrinfo command to confirm that all your processors are online after a reboot, no matter what the cause.
There are many subtle hardware problems that can induce static performance problems. If you suspect that this is occurring, it's usually much easier on your sanity to simply replace the failed hardware. If that's not possible, expect a long and arduous troubleshooting session; it's almost always helpful to have a whiteboard handy to draw out a matrix of configurations to identify which situations work and which don't.
Approaching any complex task usually requires some measure of grounding in fundamentals, and a basic understanding of the underlying principles. We've touched upon what performance tuning is, the basic ideas behind computer architecture, the big questions surrounding 64-bit environments, static performance tuning techniques, and some fundamental principles of performance. An introductory chapter can hardly match the effectiveness of, say, ground school in pilot training, but I hope this chapter has given you some sense of the currency in which the rest of this book trades.
I have a suggestion for an exercise, which I think is particularly applicable if you are intent on reading this book cover-to-cover. Go back and reread the five principles of performance tuning (see Section 1.2 earlier in this chapter), put this book down for an hour, and think about what those principles mean and what their implications are. They are very broad, general concepts that are relevant far beyond the context in which they’ve been presented. Their application is left entirely to you -- in some sense the most difficult part -- although I hope you will find the rest of this book to be an effective guide. When you sit down to analyze a problem, ask yourself a few questions: do I understand what is going on? If not, how can I design meaningful tests to validate my theories? Am I, or my customers, searching for a “free lunch”? What tradeoffs am I making to get the performance I want? Am I overconsuming a resource? Are the metrics I am using to measure performance ones I have developed? Are these metrics measuring what I think they are?
These are all difficult questions, no matter how simple they may appear: they are innocent-looking trapdoors that lead into dark, confusing dungeons. It’s not unusual to find eight or ten performance experts in a meeting discussing what exactly is going on in a system, formulating theories, and designing experiments to test those theories. If there is anything you take from this book, I earnestly hope it is those five principles. It’s hard to go too far wrong if you use them as your guiding light.
 It has been conjectured that mathematicians are devices for transforming coffee into theorems. If this is true, then perhaps programmers are devices for transforming caffeine and algorithms into source code.
 A good book to read more about the von Neumann model is William Aspray’s John von Neumann and the Origins of Modern Computing (MIT Press).
 In modern implementations, the “CPU” includes both the central processing unit itself and the control unit.
 Heat dissipation in the memory was the primary reason that the system was liquid cooled. The memory subsystem also comprised about three-quarters of the cost of the machine in a typical installation.
 MVS Performance Management, GG22-9351-00.
 With apologies to Robert A. Heinlein’s “The Moon Is A Harsh Mistress.”
 For the curious, Vehicle A is an Audi S4 sedan (2.7 liter twin-turbocharged V6). Vehicle B is a Honda VFR800FI Interceptor motorcycle (781cc normally-aspirated V4). Both vehicles are model year 2000.
 Which is not necessarily the same as “How fast can this data be written to disk?” although it might be.
 I have run across these problems most in coaxial-cable (10base2) environments. These sorts of problems are often very hard to track down and are very frustrating. Buy good-quality cable, and label thoroughly.