System Performance Tuning
System Performance Tuning, Second Edition

By Gian-Paolo D. Musumeci, Mike Loukides


Table of Contents

Chapter 1: An Introduction to Performance Tuning
There are a thousand hacking at the branches of evil to one who is striking at the root.
—Henry David Thoreau, 1854
Going faster seems to be a central part of human development. As little as three hundred years ago, the fastest you could expect to go was a few tens of miles an hour, aboard a fast clipper ship with a stiff wind behind you. Now, we must expand our view to things such as the fastest achievable speed while remaining in the Earth's atmosphere -- perhaps fifteen hundred miles an hour, twice the speed of sound, if you are a civilian without access to the latest, fastest military aircraft. The journey that used to take three weeks under sail from London to New York City now takes us as little as two and a half hours, sipping champagne the whole way.
This innate human desire to go fast is expressed in many ways: microwave ovens let us cook dinner quickly, high-performance automobiles and motorcycles give us a wonderful thrill, email lets us communicate at almost the speed of thought. But what happens when that email server is overwhelmed, as when we all log in at eight o'clock in the morning to check what has gone on while we've slept? Or when the procurement system for the company that distributes microwave ovens is only able to handle half of the workload, or when a mechanical engineer's CAD system runs so slowly that the car engine she's designing won't be ready in time for the new model year?
These are the problems facing us in performance tuning: it is the adaptation of the speed of a computer system to the speed requirements imposed by the real world.
Sometimes these problems are subtle, sometimes not. We must carefully consider the changes in the system that caused it to become unacceptably slow. These influences might come from within, such as a heavier load placed on the system, or from without, such as a new operating system revision subtly changing -- or utterly replacing -- an algorithm for resource management that is critically important to our software. The solutions are sometimes quick (a matter of adjusting one "knob" slightly), or sometimes slow and painful (a task for weeks of analysis, consultations with vendors, and careful redesign of an infrastructure).
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
An Introduction to Computer Architecture
A full discussion of computer architecture is far beyond the level of this text. Periodically, we'll go into architectural matters, in order to provide the conceptual underpinnings of the system under discussion. However, if this sort of thing interests you, there are a great many excellent texts on the topic. Perhaps the most commonly used are two textbooks by John Hennessy and David Patterson: they are titled Computer Organization and Design: The Hardware/Software Interface and Computer Architecture: A Quantitative Approach (both published by Morgan Kaufmann).
In this section, we'll focus on the two most important general concepts of architecture: the general means by which we approach a problem (the levels of transformation), and the essential model around which computers are designed.
When we approach a problem, we must reduce it to something that a computer can understand and work with: this might be anything from a set of logic gates, solving the fundamental problem of "How do we build a general-purpose computing machine?" to a few million bits worth of binary code. As we proceed through these logical steps, we transform the problem into a "simpler" one (at least from the computer's point of view). These steps are the levels of transformation.

Section 1.1.1.1: Software: algorithms and languages

When faced with a problem where we think a computer will be of assistance, we first develop an algorithm for completing the task in question. An algorithm is, very simply, a repeatable set of instructions for performing a particular task -- for example, a clerk inspecting and routing incoming mail follows an algorithm for how to properly sort the mail.
This algorithm must then be translated by a programmer into a program written in a language. Generally, this is a high-level language, such as C or Perl, although it might be a low-level language, such as assembler. The language layer exists to make our lives easier: the structure and grammar of high-level languages let us easily write complex programs. This high-level language program, which is usually portable between different systems, is then transformed by a compiler into the low-level instructions required by a specific system. These instructions are specified by the Instruction Set Architecture.
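As a small illustration of the step from algorithm to program, the mail clerk's procedure might be written down as follows. This is only a sketch: the routing rules, department names, and function names are invented for the example and do not come from any real system.

```python
# Hypothetical routing rules for a mail clerk; the keywords and
# departments are invented for illustration.
ROUTING_RULES = [
    ("invoice", "Accounting"),
    ("resume", "Human Resources"),
    ("complaint", "Customer Service"),
]


def route(subject):
    """Return the department for one piece of mail, checking rules in order."""
    lowered = subject.lower()
    for keyword, department in ROUTING_RULES:
        if keyword in lowered:
            return department
    return "General Delivery"  # fallback when no rule matches


def sort_mail(subjects):
    """Apply the same routing step to every item and group by department."""
    bins = {}
    for subject in subjects:
        bins.setdefault(route(subject), []).append(subject)
    return bins
```

Calling `sort_mail(["Invoice 1", "My resume"])` groups the two items under "Accounting" and "Human Resources". To glimpse the next level of transformation, `import dis; dis.dis(route)` prints the lower-level instructions the Python interpreter executes for the same function; these target Python's virtual machine rather than a hardware instruction set, so treat them as an analogy for compiler output, not the real thing.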
Principles of Performance Tuning
In this book, I present a few "rules of thumb." As an old IBM technical bulletin says, rules of thumb come from people who live out of town and who have no production experience.
Keeping that in mind, I've found that much of the conceptual basis behind system performance tuning can be summarized in five principles: understand your environment, nothing is ever free, throughput and latency are distinct measurements, resources should not be overutilized, and one must design experiments carefully.
If you don't understand your environment, you can't possibly fix the problem. This is why there is such an emphasis on conceptual material throughout this text: even though the specific details of how an algorithm is implemented might change, or even the algorithm itself, the abstract knowledge of what the problem is and how it is approached is still valid. You have a much more powerful tool at your disposal if you understand the problem than if you know a solution without understanding why the problem exists.
Certainly, at some level, problems are well beyond the scope of the average systems administrator: tuning network performance at the level discussed here does not really require an in-depth understanding of how the TCP/IP stack is implemented in terms of modules and function calls. If you are interested in the inner details of how the different variants of the Unix operating system work, there are a few excellent texts that I strongly suggest you read: Solaris Kernel Internals by Richard McDougall and James Mauro, The Design of the UNIX Operating System by Maurice J. Bach, and Operating Systems: Design and Implementation by Andrew Tanenbaum and Albert Woodhull (all published by Prentice Hall).
The ultimate reference, of course, is the source code itself. As of this writing, the source code for all the operating systems I describe in detail (Solaris and Linux) is available for free.
Static Performance Tuning
This book largely focuses on the performance tuning of systems in dynamic conditions. I focus almost exclusively on how the system's performance corresponds to the stresses placed on the system. Another sort of performance tuning exists, however, that focuses on static (that is, workload-independent) factors. These factors tend to degrade system performance no matter what the workload is, and are not generally tied to resource contention problems.
By far the largest culprits in static performance issues are problems with naming services: the means by which information is retrieved about an entity. Examples of naming services are NIS+, LDAP, and DNS.
The symptoms are often vague: logins are slow, the windowing environment or a web browser feels sluggish, a new window locks at startup, or the X or CDE login screen hangs. Here are a few places to check, in approximate order of likelihood:
/etc/nsswitch.conf
This file is read only once for each process that is using a naming service, so a reboot may be necessary to make changes take effect.
/etc/resolv.conf
Are the nameservers and the domain specified correctly? Incorrect or lengthy (having many subcomponents) domain specifications may cause DNS to generate a great many requests. The nameservers should be sorted by latency of response.
Is the name service cache daemon (nscd) running?
This process caches name service-provided information for a significant boost to performance. This daemon is standard-issue on Solaris, and available for Linux. Historically, nscd has been blamed for many problems. Most of these problems have been long since fixed: you should probably run it. One exception is when troubleshooting, however, as it can mask underlying name service problems.
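The /etc/resolv.conf questions above can be partly automated. The sketch below parses resolv.conf-style text and flags two of the conditions mentioned: no nameservers, and a lengthy search list. The function name and the threshold of three search domains are assumptions chosen for illustration, not recommended values, and the parser handles only the `nameserver`, `domain`, and `search` keywords.

```python
def check_resolv_conf(text, max_search_domains=3):
    """Return (nameservers, search_domains, warnings) for resolv.conf-style text.

    max_search_domains is a hypothetical threshold for "lengthy".
    """
    nameservers, search, warnings = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword in ("domain", "search"):
            search = args  # a later domain/search line replaces an earlier one
    if not nameservers:
        warnings.append("no nameserver entries")
    if len(search) > max_search_domains:
        warnings.append("long search list: %d domains" % len(search))
    return nameservers, search, warnings
```

On a live system you might call it as `check_resolv_conf(open("/etc/resolv.conf").read())`. It does not measure nameserver latency; ordering the nameservers by response time still has to be checked by hand or with a separate timing test.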
Concluding Thoughts
Approaching any complex task usually requires some measure of grounding in fundamentals, and a basic understanding of the underlying principles. We've touched upon what performance tuning is, the basic ideas behind computer architecture, the big questions surrounding 64-bit environments, static performance tuning techniques, and some fundamental principles of performance. While it's very difficult for an introductory chapter to approach the effectiveness of, say, ground school in pilot training, I hope this chapter has given you some sense of the currency in which the rest of this book trades.
I have a suggestion for an exercise, which I think is particularly applicable if you are intent on reading this book cover-to-cover. Go back and reread the five principles of performance tuning (see Section 1.2 earlier in this chapter), put this book down for an hour, and think about what those principles mean and what their implications are. They are very broad, general concepts that are relevant far beyond the context in which they've been presented. Their application is left entirely to you -- in some sense the most difficult part -- although I hope you will find the rest of this book to be an effective guide. When you sit down to analyze a problem, ask yourself a few questions: do I understand what is going on? If not, how can I design meaningful tests to validate my theories? Am I, or my customers, searching for a "free lunch"? What tradeoffs am I making to get the performance I want? Am I overconsuming a resource? Are the metrics I am using to measure performance ones I have developed? Are these metrics measuring what I think they are?
These are all difficult questions, no matter how simple they may appear: they are innocent-looking trapdoors that lead into dark, confusing dungeons. It's not unusual to find eight or ten performance experts in a meeting discussing what exactly is going on in a system, formulating theories, and designing experiments to test those theories. If there is anything you take from this book, I earnestly hope it is those five principles. It's hard to go too far wrong if you use them as your guiding light.
Chapter 2: Workflow Management
The more precisely the position is determined, the less precisely the momentum is known in this instant, and vice versa.
-- Werner Heisenberg, 1927
The topic of "workflow management" is slippery: it can be interpreted to mean many different things. In this chapter, I describe practical means to accomplish the zeroth principle of performance tuning: understanding the environment as it exists. This is the beating heart of dynamic performance tuning. The rest of this book exists simply to improve your understanding about possible environments.
We concern ourselves, as mentioned, primarily with dynamic performance analysis: the system we are measuring is changing beneath us. It is, in some sense, like watching a pond. There might be a creek that flows into the pond; how does that affect the life in the pond? What happens when some beavers build a dam across that creek, when some children find the pond and throw rocks into it, or when someone dumps the ashes of old love letters in the middle of it?
To further complicate things, we are governed by an inviolable principle of physics. Heisenberg's Principle of Uncertainty says that no matter how carefully we try, we will always perturb the system when we measure it, and some piece of knowledge will remain outside of our grasp. We can minimize the perturbation, however, and throughout this chapter we'll concern ourselves with how significant a perturbation our measurements are inducing.
Things at this point seem pretty daunting. Not only are we measuring something that's always changing in ways we don't understand, driven by factors we can't control, but our mere observation of the system induces further change in it! Why do we bother to measure, then? The answer is simple: without careful, methodical measurements, it becomes practically impossible to provide reasoned, analytical solutions to problems, because we don't understand why the problem exists or what happened. It's always better to know something about the situation. If we didn't strive to understand, we would be doomed to be Cargo Cult Systems Administrators, sitting at our desks with wooden keyboards and terminals, waiting for the magicians to come back in their shiny silver birds at a few hundred dollars an hour to fix our problems for us. That's not a good solution for anyone but the magicians.
Workflow Characterization
Performance experts are, in some sense, the gurus atop mountains of the computer world. Every time some humble supplicant comes to them with a problem, the guru pokes around at things for a while, then sends the supplicant off to come up with more information. Characterization is the process of trying to gather as much information about the system as possible, so that trends and patterns can be determined. These patterns will prove to be vitally important the first time that performance falls through the floor; we will be able to piece together what the perturbations in the patterns are, and from that figure out what caused them -- rather like studying the wake of a passing ship to see what kind of vessel it was and where it's headed.
An analogy has been drawn between workload management and financial transactions. The first time I read of this was in Adrian Cockcroft's excellent book Sun Performance and Tuning (Prentice Hall). The essential idea is that workload management on computer systems is analogous to a department with a capital budget, staff, etc., performing a task. There are basically three possible outcomes:
  1. If there is no plan and no effective controls on the staff, then the staff will run wild, grabbing as much budget as they can to ensure that their own projects will be well-funded. Some staff will end up with no funding whatsoever, while other staff take up "fact-finding missions" to Maui. The project as a whole ends up a complete mess. (This is known as the "startup model.")
  2. Management is overzealous, and creates a huge bureaucratic staff to plan, assess the plans, and replan. The entire budget is consumed by this bureaucratic middle layer, which micromanages those responsible for actually doing the work, often by demanding daily status reports. The administrative overheads involved make it very difficult to spend any money. The work ends up a mess, because by the time it was supposed to be finished, it's barely been started. (This is known as the "government model.")
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Workload Control
The next step in the realm of workloads is imposing workload management. This step is often technically straightforward, but it can carry with it political problems. Forcibly attempting to control the users of a system is often not a good idea, and in my opinion should be restricted to cases where education and attempting to secure the cooperation of your user base has failed abjectly. If you are getting some pressure from management to do this, I think you would do well to get it in writing.
Why do I submit these warnings? It is significantly less of a problem in academic environments or in small companies, where everyone knows the systems administrator and it's easy to walk down to their office and ask them what's going on. However, in large professional environments or when the management of the users is divorced to a degree from the management of the administrators, things often get quite complex. It is often very frustrating to users (who, perhaps, designed the hardware or software that you are administrating) to have their abilities restricted: it reduces their flexibility in solving their own problems. If they appeal up through management, you could find yourself trying to explain what seemed like a reasonable technical decision at the time to a few very irate senior vice presidents. You really need to be careful if you are going to be forceful.
On the other hand, user education is one of the most powerful techniques available to you to control workloads. In this section, I discuss user education and the closely related area of written performance agreements, as well as some of the more direct techniques for limiting system resource consumption.
The most powerful tool available to you in order to control the workloads on your systems is user education. Enforcing strict CPU time or disk quotas, while effective, often adds to a "resentment of the mystery" phenomenon. This leaves users feeling rather like medieval serfs: there are certain things they just can't do, like encourage the rain in a dry season, and all they can do is go and beg some rather mysterious people who usually live in caves to try and fix the problem for them. The end result is that the users get very frustrated.
Benchmarking
A benchmark measures the performance of a standard workload. By holding the workload fixed, you can vary the underlying system parameters to generate relative performance numbers.
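The idea of holding the workload fixed and comparing configurations can be sketched in a few lines. The helper names here are invented for illustration; the sketch times a caller-supplied workload and then expresses each configuration's result relative to a chosen baseline.

```python
import time


def time_workload(workload, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def relative_performance(timings, baseline):
    """Map each configuration to its speed relative to the baseline.

    timings maps a configuration name to elapsed seconds; a result of
    2.0 means that configuration finished in half the baseline's time.
    """
    base = timings[baseline]
    return {name: base / elapsed for name, elapsed in timings.items()}
```

For example, `relative_performance({"old": 2.0, "new": 1.0}, "old")` reports the new configuration as 2.0 times the baseline. Taking the best of several runs is one common choice for reducing noise; a mean or median would be equally defensible depending on what you want the number to represent.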
In general, there are three types of benchmarks: component-specific benchmarks, like SPECint and SPECfp for microprocessors; whole-system benchmarks designed to emulate commercial environments, like the TPC series and SPECweb99; and user-developed benchmarks. In this section, I touch on these different approaches to benchmarking, providing examples of real-world benchmarks throughout.
This book is not about tuning systems to produce world-record benchmark results. In general, producing a world record requires an immense amount of work, strong support from the engineers responsible for the underlying subsystems, and an understanding that the end result is a world record (which may not necessarily be directly applicable to a customer environment).
Before we get started in a discussion of benchmarks, we need to discuss a little bit of terminology and history: the MIPS and megaflops metrics.

Section 2.3.1.1: MIPS

One of the first ways of estimating the performance of an application on a computer system was to look at how fast the system could execute instructions. However, this is a dismal metric. The execution rate of a processor (typically expressed in millions of instructions per second, or MIPS) is strongly correlated with the clock rate of the processor. It's probably safe to assume that a 100-MIPS CPU will outperform a 5-MIPS CPU, but a 50-MIPS processor might well actually outperform a 100-MIPS CPU. Using MIPS to compare the performance of different computers is flawed in a basic way: a million instructions on one microprocessor might not accomplish the same work as the same number of instructions on another. For example, let's say you have a processor that does all floating-point work in software, and takes 50 integer instructions to perform 1 floating-point operation; your 50-MIPS CPU now performs only 1 million useful floating-point operations per second. Coupled with this is the fact that different instructions can take different lengths of time to complete, especially on CISC processors. Were those million instructions "do-nothing" operations, or floating-point multiplies?
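The software floating-point example can be worked through numerically. In this sketch the 50-MIPS processor and its 50-instruction emulation cost come from the text; the 25-MIPS processor with floating-point hardware is a hypothetical comparison point added for illustration.

```python
def useful_ops_per_second(mips, instructions_per_useful_op):
    """Convert a raw instruction rate into a rate of useful operations."""
    return mips * 1_000_000 / instructions_per_useful_op


# From the text: 50 MIPS, with 50 integer instructions per floating-point operation.
software_fp = useful_ops_per_second(50, 50)

# Hypothetical: a CPU with half the MIPS rating but one instruction per FP operation.
hardware_fp = useful_ops_per_second(25, 1)

print(f"50-MIPS CPU, software FP: {software_fp:,.0f} FP ops/sec")
print(f"25-MIPS CPU, hardware FP: {hardware_fp:,.0f} FP ops/sec")
```

Under these assumptions the processor with the lower MIPS rating completes 25 times as much floating-point work per second, which is the sense in which a lower-MIPS machine can outperform a higher-MIPS one.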
Concluding Thoughts
I hope this chapter has provided some theoretical and practical insight into understanding and managing workflows. I've discussed means of understanding the workload impressed upon a system, as well as some means of enforcing limits upon that workload. I have also talked about benchmarks as means of comparing the performance of systems under standardized workloads.
Parts of this chapter may seem almost childlike in their simplicity. As we progress into the performance analysis of dynamic systems, we are presented with simple things that are actually complex. Part of this book's purpose is to examine some of the simplifications, so that the reasons behind them become apparent. This exploration lets you make better decisions about where to direct your tuning efforts.
Chapter 3: Processors
Quote performance in terms of processor utilization, parallel speed-ups, or MFLOPS per dollar.
—D. H. Bailey, "Twelve ways to fool the masses when giving performance results on parallel supercomputers." Supercomputing Review, 1991
The increase in microprocessor performance over the last 20 years has been phenomenal. The prediction made by Gordon Moore, a cofounder of Intel, that transistor counts on microprocessors would double every 18 months (and thus performance would increase proportionately), has certainly held true, if not been exceeded. In the early 1970s, Intel's first microprocessor, the 4004, was clocked at less than 1 MHz; by comparison, in mid-2001, the latest generations of IA-32 processors from Intel and AMD are running at clock rates of about 2.0 GHz -- and execute more than one instruction per clock cycle (I survey some of the key elements in microprocessor design later in this chapter). This jump is so huge that it is hard to comprehend. If automobile performance followed the same growth curve as microprocessors over the last 20 years, the successor to the DeLorean DMC-12 (1981-1983), which had a top speed of perhaps 140 miles per hour, would have doubled 12 times: it would travel at about 573 thousand miles per hour, which is equivalent to roughly 750 times the speed of sound, or a little under a tenth of a percent of the speed of light. Even that's pretty difficult to comprehend. If a pack of bubble gum cost a quarter in 1983, and prices increased at the same rate as microprocessor performance, that same pack of gum would cost just over a thousand dollars.
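The compounding behind these comparisons is easy to check: a quantity that doubles n times grows by a factor of 2 to the power n. The sketch below assumes the 1983 to 2001 span and the 18-month doubling period used above; the speed-of-sound and speed-of-light figures are approximate constants supplied for the example.

```python
SPEED_OF_SOUND_MPH = 767            # approximate, sea level at about 20 degrees C
SPEED_OF_LIGHT_MPH = 670_616_629    # approximate


def after_doublings(start, doublings):
    """Value of a quantity after it has doubled the given number of times."""
    return start * 2 ** doublings


months = (2001 - 1983) * 12
doublings = months // 18            # one doubling per 18 months -> 12

car_mph = after_doublings(140, doublings)
gum_usd = after_doublings(0.25, doublings)

print(f"doublings:        {doublings}")
print(f"car top speed:    {car_mph:,} mph")
print(f"  in Mach numbers: {car_mph / SPEED_OF_SOUND_MPH:,.0f}")
print(f"  fraction of c:   {car_mph / SPEED_OF_LIGHT_MPH:.6f}")
print(f"pack of gum:      ${gum_usd:,.2f}")
```

Because each doubling changes the result by a factor of two, being off by a single doubling halves or doubles every figure, which is worth keeping in mind whenever growth-curve comparisons like these are quoted.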
It's not clear whether this rate of increase is sustainable. It has often been predicted that the rate will have to slow. However, enterprising engineers have found ways to improve the technologies used for processor fabrication and design so that this has not yet become the case.
The fact that microprocessor performance increases have far outpaced the performance increases of the other components in a computer system is a unifying theme for this book: how do we make the other components go as fast as possible, to try and optimize the efficiency of the processor? Processor performance is without a doubt the most-quoted figure in computer performance. This can easily be verified simply by going to the local computer store and looking at their advertisements; the information in the largest print, often set apart and highlighted, is the clock rate of the microprocessor. It is much more appealing to have the latest 2.0-GHz processor than a few very fast hard drives. However, while it is absolutely true that microprocessor performance is a critical component of overall system performance, it is often overrated.
Microprocessor Architecture
At a fundamental level, a microprocessor is the physical implementation of a set of rules. These rules, called instructions, specify exactly what tasks the microprocessor is allowed to perform. The set of all the rules together is called the instruction set; in combination with other information, they define the instruction set architecture, or ISA, which contains all the information you'd need to write a program that runs correctly. For a more detailed discussion of these abstract underpinnings, see Chapter 1.
In the last 20 years, microprocessor designers have split into two competing camps based on design philosophies. One camp espoused creating an instruction set in which individual instructions were very powerful, almost on the level of C primitives. These instructions are quite complex, and this approach came to be known as complex instruction set computing (CISC). The other camp decided to create instruction sets in which the individual instructions were minimal, but their simplicity allowed many optimizations to be performed. This is known as reduced instruction set computing (RISC) processor design. Both methods accomplish the same amount of work. The difference is that a RISC design will probably take more instructions, but those instructions will most likely execute in less time.
RISC has won the war of performance. Increasingly, CISC processor designs are becoming more RISC-like. For example, Intel's P6 processor core actually translates the complex IA-32 instructions into a much simpler internal format before executing them. Another good example is the Transmeta Crusoe processor, which uses a software layer (called Code Morphing) to translate IA-32 instructions into the Crusoe's internal instructions.
Lately, designers have started to add extra instructions, typically for graphics support. Sun's VIS and Intel's MMX extensions are good examples of this. These instructions are based around the idea that while a memory word is typically 32 bits long, we might only operate on 8 of those bits, so we perform 4 memory operations and 4 arithmetic operations when the data would have fit in 1 memory operation. So, these extensions work by packing the full 32 bits of data in, so that only 1 memory operation and only 1 arithmetic operation are required. However, using these extensions to maximum effect involves quite a bit of effort on the part of the compiler design team. Figure 3-1 shows multimedia extension parallelism.
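The packing idea can be imitated with ordinary integer arithmetic. The sketch below is not MMX or VIS code; it is a plain-Python emulation, with invented function names, that packs four 8-bit values into one 32-bit word and adds all four lanes with a single pass of word-wide operations, keeping carries from spilling between lanes.

```python
def pack4(values):
    """Pack four 8-bit values into one 32-bit word (first value in the low byte)."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256."""
    # Add the low 7 bits of every lane; the sum cannot carry out of its lane.
    low7 = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # Fold in each lane's top bit with XOR, discarding the carry out of the lane.
    return (low7 ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF
```

For instance, `unpack4(packed_add(pack4([250, 1, 127, 200]), pack4([10, 2, 1, 100])))` yields `[4, 3, 128, 44]`: four additions performed on one word, with the first and last lanes wrapping around. Real multimedia extensions perform the lane separation in hardware and typically offer saturating as well as wrapping arithmetic.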
Caching
A major concern of our microprocessor optimization strategy is that pipeline stalls cause serious performance problems. If every memory request has to be satisfied from main memory, the pipeline will stall very frequently. Modern microprocessors universally implement caches to try and solve this problem. In some processors, this cache is organized as a single unit, called a unified cache architecture. More commonly, however, the cache is divided up into two areas: one for data and one for instructions. This is called a Harvard cache architecture. Harvard caches tend to be significantly more efficient in the real world, especially for small caches.
Memory speed has increased at a much slower rate than processor speed: processor performance doubles roughly every eighteen months, but memory performance doubles roughly every seven years. The important thing to realize is that as processor speed has outpaced memory speed, even static RAM caches located off the processor die are not fast enough. The total time required to move data to or from memory is more than just the memory's physical access time, as there is overhead in moving data between the chips, as well as keeping the caches synchronized with main memory. The most common mechanism for minimizing access time is to create a hierarchy of caches with steadily increasing size and steadily decreasing performance:
  • The processor has separate caches for instructions and data; they are referred to as the i-cache and the d-cache, respectively. These are often about 16 KB each, and the fastest caches in the system. They are called level 1 or L1 caches.
  • There is a larger, combined data/instruction second-level cache, which is generally between 256 KB and 4 MB in size. A cache miss at the L1 cache can still produce a cache hit in this cache -- while it takes a few cycles longer it is still very efficient. This is the
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
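The doubling rates quoted earlier in this section (roughly eighteen months for processors and seven years for memory) compound into a steadily widening gap, which is what motivates the cache hierarchy. A minimal sketch of that arithmetic, treating both rates as exact and using invented function names:

```python
def growth(years, doubling_period_years):
    """Growth factor for something that doubles once per doubling period."""
    return 2 ** (years / doubling_period_years)


def processor_memory_gap(years):
    """How much faster processors have grown than memory over the given span."""
    return growth(years, 1.5) / growth(years, 7.0)


for span in (7, 14, 21):
    print(f"after {span:2d} years the gap has grown by about "
          f"{processor_memory_gap(span):,.0f}x")
```

Over 21 years, for example, processor performance doubles 14 times while memory doubles 3 times, so the ratio between them grows by a factor of 2 to the 11th power, or 2,048. The real rates are only rough averages, so the output is an order-of-magnitude illustration rather than a prediction.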
Process Scheduling
Another factor that affects processor performance is process scheduling. Process scheduling is how the operating system determines which process to run on the CPU. The next four sections look at different process scheduling models used in the Unix world.
Linux implements a relatively simple, standard Unix scheduling model. It supports process preemption (interrupting a running process with a higher priority process), although its kernel is nonpreemptable.
The Linux kernel scheduling algorithm works by breaking the CPU time into units called epochs. In a given epoch, each process has a specified time quantum whose duration is computed when the epoch begins; therefore, each process can have a different quantum. This quantum represents the most CPU time that is consumable by that process during that epoch. When a process has exhausted its quantum, it is removed from the processor and replaced by another runnable process; of course, a process can be selected by the scheduler several times, as long as its quantum has not been exhausted (for example, it might have voluntarily relinquished the processor to wait for some disk I/O activity to complete).
When the scheduler is deciding what process to run next, it considers the priority of each process. There are two types of priority: static priority and dynamic priority. Static priority is assigned by users to real-time processes, and is never adjusted. Dynamic priority applies only to conventional processes, and is the sum of the base time quantum (also known as the base priority of the process) and the number of ticks of CPU time left to the process before its quantum expires in the current epoch. The static priority of a real-time process is always higher than any dynamic priority: the scheduler will only run a conventional process if there are no runnable real-time processes.
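The selection rule described above can be sketched as follows. This is a simplified model under stated assumptions, not the kernel's code: the class and function names are invented, every process simply receives its base quantum at the start of an epoch, and real-time processes are chosen purely by static priority.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    name: str
    runnable: bool = True
    realtime: bool = False
    static_priority: int = 0   # meaningful only for real-time processes
    base_quantum: int = 0      # ticks granted at the start of each epoch
    ticks_left: int = 0        # unused ticks in the current epoch

    def dynamic_priority(self) -> int:
        """Base time quantum plus ticks remaining in this epoch."""
        return self.base_quantum + self.ticks_left


def start_epoch(processes: List[Process]) -> None:
    """Begin a new epoch (simplified: each process gets its base quantum)."""
    for p in processes:
        p.ticks_left = p.base_quantum


def pick_next(processes: List[Process]) -> Optional[Process]:
    """Choose the next process to run, or None if a new epoch is needed."""
    runnable = [p for p in processes if p.runnable]
    realtime = [p for p in runnable if p.realtime]
    if realtime:
        # Any runnable real-time process beats every conventional process.
        return max(realtime, key=lambda p: p.static_priority)
    conventional = [p for p in runnable if p.ticks_left > 0]
    if not conventional:
        return None
    return max(conventional, key=Process.dynamic_priority)
```

With an "editor" process holding a base quantum of 20 ticks and a "batch" process holding 10, `pick_next` selects the editor at the start of an epoch (dynamic priority 40 versus 20), switches to the batch process once the editor's quantum is exhausted, and returns `None` when both are spent, signaling that a new epoch should begin.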
Multiprocessing
Shared-memory multiprocessor systems were considered rather exotic in the mid-1980s, and they were expensive. Since then, we have seen hardware costs drop, much more robust multiprocessor support in operating systems, and an increased demand for affordable high-throughput systems at a workgroup and even desktop level. These multiprocessor systems are described as uniform memory access (UMA) architecture, which means that all physical memory is accessible to all processors at the same rate. Some large-scale systems use a nonuniform memory access (NUMA) architecture, in which a specific processor can access some regions of physical memory faster than others. Figure 3-5 illustrates the UMA and NUMA architectures.
Figure 3-5: UMA and NUMA architectures
We'll restrict our discussion here to UMA architectures.
In order for the human body to function normally, it is vital that information can be conveyed to and from the brain. Communication is also essential for successful operation of a multiprocessor system. If we draw an analogy between computers and the human body, and we think of the processor as the brain, then the "spinal cord" carries information between the processor and both main memory and peripheral devices. In computer architecture, this role is played by buses or crossbars, which also arbitrate communication so that the components can figure out who can talk at any given point in time. It is important to note that we speak here only of interconnections between processors and main memory; for details on how interconnections between peripherals are handled, see Section 3.5 later in this chapter.

Section 3.4.1.1: Buses

A bus consists of three parts: a protocol for communication, a set of parallel wires that link all the system components, and some supporting hardware. Buses are cheap and easy to build, but as a bus becomes more heavily populated and the load increases, it begins to be a performance bottleneck. As a result, cache size and performance are very important in bus-based multiprocessor configurations. Typically, this sort of architecture supports at most four processors, but larger implementations will support a few tens of processors. It is possible that the bus or memory subsystem will become saturated before the maximum number of supported processors has been installed!
Additional content appearing in this section has been removed.
Peripheral Interconnects
The last means of interconnection that we need to discuss is between the bus or crossbar that facilitates the communication between processors and main memory (see Section 3.4.1 earlier in this chapter) and the bus that interacts with peripheral devices.
There are two bus architectures in widespread use: Sun's SBus and the industry-standard Peripheral Component Interconnect, or PCI. SBus is being phased out in favor of the PCI standard, but I discuss it here in light of its very large installed base.
SBus and PCI are fairly similar architecturally: both are designed to be I/O buses (rather than having more general applications), both have small form factors requiring a high degree of integration, and both have roughly similar performance characteristics. If PCI had been available in 1988 when Sun was developing SBus for the SPARCstation 1, SBus probably would not have been developed. Every Sun system implements SBus slightly differently, making a different set of choices between performance, cost, ease of implementation, etc. However, in practice, all current SBus implementations can meet almost all requirements.
SBus is a parallel-transfer (many bits are transferred concurrently) bus architecture. In the original specification, there were 32 address lines and 32 data lines; the Rev B.0 SBus specification permits the address lines to be used during the data cycle to transfer 64 bits; this is implemented in the SPARCstation 10SX, the SPARCstation 20, and all the UltraSPARC-based systems. This arrangement means that 64-bit and 32-bit SBus implementations share the same form factor. However, bus width (while an important factor) is often overestimated in its relevance to performance; very few devices can exceed the 32-bit bus capacity. These are largely very high performance graphics framebuffers and very fast network interfaces (faster than about 250 Mbits/second).

Section 3.5.1.1: Clock speed

Additional content appearing in this section has been removed.
Processor Performance Tools
Concretely detecting whether a system is short of processor time is often more complicated than it first seems. Regular indications of CPU shortage can be attributed to many things: disk configuration problems, peripheral devices being bound to a single processor for interrupt management, processor starvation due to a lack of memory, etc. In a sense, the best way to tell when you're legitimately processor-bound is that you can exclude all other possibilities.
There is also a philosophical issue involved. Some people like seeing a low load average on their systems, or more than 50% of the processor idle. This makes some sense in terms of the system accommodating bursts of load. Similarly, some people like seeing a load average that approaches the number of processors, meaning the system is running full-tilt. This also makes some sense, because you are using everything that you have paid for in terms of processor performance. My personal philosophy varies with the kind of system: on a computational server, I would aim for full utilization of all processors; on a development system or a database server, I would aim for a degree of overcapacity to accommodate spikes in load.
There are quite a few tools for monitoring processor performance: the load average, vmstat, mpstat, prstat and/or top, and lockstat. There are also two tools in Solaris for managing processors in multiprocessor systems: psrinfo and psrset. We'll cover all of these in some detail.
The system maintains a figure called the load average, which is reported by the uptime command:
% uptime
  4:02pm  up 19 day(s),  9:28,  119 users,  load average: 1.22, 1.21, 1.25
The load average is represented as the average number of runnable jobs (the sum of the run queue length and the number of jobs currently running) over the last one, five, and fifteen minutes. This gives a quick estimate of how heavily loaded the system is, but it is not exclusively a measurement of processor load; a system that has a memory shortage will usually have a high load average.
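If you want to compare the load average against the number of processors in a script, a small parser is enough. The following Python sketch is one way to do it; the function names are my own, and the regular expression assumes the uptime output format shown above (other systems may format the line differently).

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1-, 5-, and 15-minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_per_processor(load_average, processors):
    """Scale a load average by the processor count (supplied by the caller)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return load_average / processors


line = "  4:02pm  up 19 day(s),  9:28,  119 users,  load average: 1.22, 1.21, 1.25"
one_minute, five_minute, fifteen_minute = parse_load_averages(line)
```

On Unix-like systems, Python's os.getloadavg() returns the same three figures without parsing. Either way, keep in mind that a high value per processor is only a hint, since a memory shortage can raise the load average as well.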
Additional content appearing in this section has been removed.
Concluding Thoughts
Microprocessors are some of the most complex devices per square centimeter that humanity has ever constructed, and understanding how they function is extremely challenging. I hope this chapter has given you some sense of that complexity, along with furthering your understanding and helping you prepare to measure a processor's performance in a dynamic system.
We've covered a lot, from the fundamentals of microprocessor design, to bus and crossbar architectures, to discussing caching and peripheral interconnects. This is a lot of information to grasp, but the payoff of understanding how these components function and interact with each other is commensurately large.
Additional content appearing in this section has been removed.
Chapter 4: Memory
Ideally one would desire an indefinitely large memory capacity such that any particular. . . word would be immediately available. . . .
—A. W. Burks, H. H. Goldstine, and J. von Neumann: Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946
Physical memory is a set of integrated circuits designed to store binary data. This storage has two characteristic properties: it is transient, as all stored information vanishes when electrical power is lost, and randomly accessible, meaning any bit can be accessed as fast as any other bit. In addition to physical memory, most systems implement virtual memory, which acts to manage physical memory and provide a simple interface to application developers. Virtual memory is consumed by the system kernel, filesystem caches, intimately shared memory, and processes.
Memory performance begins to affect overall system performance in two instances. The first instance occurs when the system is unable to retrieve and store data from physical memory fast enough, or when the system is forced to travel to main memory frequently. This sort of problem can be attacked by tuning the algorithm that is responsible or by buying a system with faster access to main memory. The second, and more likely, case is that the demand for physical memory by all currently running applications, including the kernel, exceeds the available amount. The system is then forced to begin paging, or writing unused pieces of memory to disk. If the low memory condition worsens, the memory consumed by entire processes will be written to disk, which is called swapping. Memory conditions fall into four categories:
  • Sufficient memory is available, and the system performs optimally.
  • Memory is constrained (one likely culprit, especially on older Solaris systems, is the filesystem cache). Performance begins to suffer as the system attempts to scavenge memory that is not in active use.
Additional content appearing in this section has been removed.
Implementations of Physical Memory
Let's start by looking at how memory is physically implemented in modern systems. All modern, fast memory implementations are accomplished via semiconductors, of which there are two major types: dynamic random access memory (DRAM) and static random access memory (SRAM). The difference between them is how each memory cell is designed. Dynamic cells are charge-based, where each bit is represented by a charge stored in a tiny capacitor. The charge leaks away in a short period of time, so the memory must be continually refreshed to prevent data loss. The act of reading a bit also serves to drain the capacitor, so it's not possible to read that bit again until it has been refreshed. Static cells, however, are based on gates, and each bit is stored in four or six connected transistors. SRAM memories retain data as long as they have power; refreshing is not required. In general, DRAM is substantially cheaper and offers the highest densities of cells per chip; it is smaller, less power-intensive, and runs cooler. However, SRAM is as much as an order of magnitude faster, and therefore is used in high-performance environments. Interestingly, the Cray-1S supercomputer had a main memory constructed entirely from SRAM. The heat generated by the memory subsystem was the primary reason that system was liquid-cooled.
There are two primary performance specifications for memory. The first represents the amount of time required to read or write a given location in memory, and is called the memory access time. The second, the memory cycle time, describes how frequently you can repeat a memory reference. They sound identical, but they are often quite different due to phenomena such as the need to refresh DRAM cells.
There is quite a gap between the speed of memory and the speed of microprocessors. In the early 1980s, the access time of commonly available DRAM was about 200 ns, which was shorter than the clock cycle of the commonly used 4.77 MHz (210 ns) microprocessors of the day. Fast-forwarding two decades, the clock cycle time of the average home microprocessor is down to about a nanosecond (1 GHz), but memory access times are hovering around 50 ns.
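The arithmetic behind that comparison is worth working through once. This short Python sketch converts a clock rate into a cycle time and expresses a memory access as a number of processor cycles; the function names are mine, and the inputs are just the figures quoted above.

```python
def cycle_time_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def cycles_per_access(access_time_ns, clock_hz):
    """How many processor cycles elapse during one memory access."""
    return access_time_ns / cycle_time_ns(clock_hz)


early_1980s = cycles_per_access(200, 4.77e6)  # 200 ns DRAM, 4.77 MHz processor
two_decades_on = cycles_per_access(50, 1e9)   # 50 ns DRAM, 1 GHz processor
```

The first figure comes out just under one cycle (about 0.95), while the second is 50 cycles: memory became four times faster, but the processor now waits roughly fifty times longer, measured in its own cycles, for each access that goes all the way to main memory.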
Additional content appearing in this section has been removed.
Virtual Memory Architecture
A virtual memory system exists to provide a framework for the system to manage memory on the behalf of various processes. The virtual memory system provides two primary benefits. It allows software developers to write to a simple memory model, which shields the programmer from the memory subsystem's hardware architecture and allows the use of memory sizes substantially greater than physical memory through backing stores. It also permits processes to have nonfragmented address spaces, regardless of how physical memory is organized or fragmented. In order to implement such a scheme, four key functions are required.
First, each process is presented with its own virtual address space; that is, it can "see" and potentially use a range of memory. This range of memory is equal to the maximum address size of the machine. For example, a process running on a 32-bit system will have a virtual address space of about 4 GB (2^32 bytes). The virtual memory system is responsible for mapping the used portions of this virtual address space onto physical memory.
Second, several processes might have substantial sharing between their address spaces. For example, let's say that two copies of the shell /bin/csh are running. Both copies will have separate virtual address spaces, each with its own copy of the executable itself, the libc library, and other shared resources. The virtual memory system transparently maps these shared segments to the same area of physical memory, so that multiple copies are not stored. This is analogous to making a hard link in a filesystem instead of duplicating the file.
Third, physical memory will sometimes become insufficient to hold the used portions of all virtual address spaces. In this case, the virtual memory system selects less frequently used portions of memory and pushes them out to secondary storage (e.g., disk) in order to optimize the use of physical memory.
Finally, the virtual memory system plays one of the roles of an elementary school teacher: keeping the children in his care from interfering with each other's private things. Hardware facilities in the memory management unit perform this function by preventing a process from accessing memory outside its own address space.
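A toy model can make the mapping, sharing, and protection roles easier to picture. The Python sketch below is purely illustrative: the AddressSpace class, the ProtectionFault exception, the 8 KB page size, and the frame numbers are all assumptions of mine, and the model leaves out paging to secondary storage entirely.

```python
PAGE_SIZE = 8192  # assumed page size in bytes, for illustration only


class ProtectionFault(Exception):
    """Raised when a process touches an address it has not mapped."""


class AddressSpace:
    """One process's view of memory: virtual pages mapped to physical frames."""

    def __init__(self):
        self.page_table = {}  # virtual page number -> physical frame number

    def map_page(self, virtual_page, frame):
        self.page_table[virtual_page] = frame

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.page_table:
            # Stands in for the hardware check done by the MMU.
            raise ProtectionFault(hex(virtual_address))
        return self.page_table[page] * PAGE_SIZE + offset


# Two copies of a shell: the program text shares one frame,
# while each process gets a private frame for its own data.
csh1, csh2 = AddressSpace(), AddressSpace()
csh1.map_page(0, 7)   # shared text page
csh2.map_page(0, 7)   # same physical frame, stored only once
csh1.map_page(1, 12)  # private data page
csh2.map_page(1, 30)  # private data page in a different frame
```

Both processes use the same virtual addresses, yet only the shared page resolves to the same physical location. Any address outside a process's own mappings raises ProtectionFault rather than reaching another process's memory.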
Additional content appearing in this section has been removed.
Paging and Swapping
Paging and swapping are terms that are often used interchangeably, but they are quite distinct. A system that is paging is writing selected, infrequently used pages of memory to disk, while a system that is swapping is writing entire processes from memory to disk. Let's say that you are working on your automobile, and you only have a small amount of space available for tools. Paging is equivalent to putting the 8 mm socket back in the toolchest so you have enough room for a pair of pliers; swapping is like putting your entire socket set away.
Many people feel that their systems should never nontrivially page (that is, perform paging on a pre-Solaris 8 system that is not simply filesystem activity). It's important to realize that paging and swapping allow the system to continue getting work done despite adverse memory conditions. Paging is not necessarily indicative of a problem; it is the action of the page scanner trying to increase the size of the free list by moving inactive pages to disk. A process, as a general rule, spends about 80% of its time running about 20% of its code; since the entire process doesn't need to be in memory at once, writing some pages out to disk won't affect performance substantially. Performance only begins to suffer when a memory shortage continues or worsens.
Historically, Unix systems implemented a time-based swapping mechanism, whereby a process that was idle for more than 20 seconds would be swapped out; this isn't done anymore. Swapping is now used only to address the most severe memory shortages. If you come across a system where vmstat reports a nonzero swap queue, the only conclusion you can draw is that, at some indeterminate time in the past, the system was short enough on memory to swap out a process.
Memory shortages also tend to appear worse than they are because of the nature of the memory reclamation mechanism. When a system is desperately swapping jobs out, it tries to avoid a major performance decrease by keeping active jobs in memory for as long as possible. Unfortunately, programs that directly interact with users (such as shells, editors, or anything else that is dependent on user-supplied input) are inactive while waiting for the user to type something. As a result, these interactive processes are likely to be targeted for memory reclamation. For example, when you pause momentarily in typing a command (such as when you're looking through the process table trying to find the ID of the process that is hogging all your memory so you can kill it), your shell will have to make the long trip from disk back into memory before your characters can be echoed back. Even worse, the disk subsystem is probably under heavy load from all the paging and swapping activity! The upshot is that in memory shortages, interactive performance falls through the floor.
Additional content appearing in this section has been removed.
Consumers of Memory
Memory is consumed by four things: the kernel, filesystem caches, processes, and intimately shared memory. When the system starts, it takes a small amount (generally less than 4 MB) of memory for itself. As it dynamically loads modules and requires additional memory, it claims pages from the free list. These pages are locked in physical memory, and cannot be paged out except in the most severe of memory shortages. Sometimes, on a system that is very short of memory, you can hear a pop from the speaker. This is actually the speaker being turned off as the audio device driver is being unloaded from the kernel. However, a module won't be unloaded if a process is actually using the device; otherwise, the disk driver could be paged out, causing difficulties. Occasionally, however, a system will experience a kernel memory allocation error. While there is a limit on the size of kernel memory, the problem is caused by the kernel trying to get memory when the free list is completely exhausted. Since the kernel cannot always wait for memory to become available, this can cause operations to fail rather than be delayed. One of the subsystems that cannot wait for memory is the streams facility; if a large number of users try to log into a system at the same time, some logins may fail. Starting with Solaris 2.5.1, changes were made to expand the free list on large systems, which helps prevent the free list from ever being totally empty.
Processes have private memory to hold their stack space, heap, and data areas. The only way to see how much memory a process is actively using is to use /usr/proc/bin/pmap -x process-id, which is available in Solaris 2.6 and later releases.
Intimately shared memory is a technique for allowing the sharing of low-level kernel information about pages, rather than by sharing the memory pages themselves. This is a significant optimization in that it removes a great deal of redundant mapping information. It is of primary use in database applications such as Oracle, which benefit from having a very large shared memory cache. There are three special things worth noting about intimately shared memory. First, all the intimately shared memory is locked, and cannot ever be paged out. Second, the memory management structures that are usually created independently for each process are only created once, and shared between all processes. Third, the kernel tries to find large pieces of contiguous physical memory (4 MB) that can be used as large pages, which substantially reduces MMU overhead.
Additional content appearing in this section has been removed.
Tools for Memory Performance Analysis
Tools for memory performance analysis can be classed under three basic issues: how fast is memory, how constrained is memory in a given system, and how much memory does a specific process consume? In this section, I examine some tools for approaching each of these questions.
In general, monitoring memory performance is a function of monitoring memory constraints. Tools that benchmark the speed of the memory subsystem do exist, but they are largely of academic interest, as it is unlikely that much tuning will be able to increase these numbers. The one exception to this rule is that users can carefully tune interleaving as appropriate. Most systems handle interleaving by purely physical means, so you may have to purchase additional memory: consult your system hardware manual for more information. Nonetheless, it is often important to be aware of relative memory subsystem performance in order to make meaningful comparisons.

Section 4.5.1.1: STREAM

The STREAM tool is simple; it measures the time required to copy regions of memory. This measures "real-world" sustainable bandwidth, not the theoretical "peak bandwidth" that most computer vendors provide. It was developed by John McCalpin while he was a professor at the University of Delaware.
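The idea of timing a memory copy can be sketched in a few lines of Python. This is not STREAM and will not reproduce its numbers, since interpreter overhead and cache effects differ; the function name, buffer size, and repeat count are my own choices. It simply times a bulk copy several times and keeps the best result.

```python
import time


def copy_bandwidth_mb_per_s(size_bytes=8_000_000, repeats=10):
    """Time a bulk memory copy and report the best observed rate in MB/s."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    if best <= 0:
        raise RuntimeError("timer too coarse; use a larger buffer")
    # Each copy reads size_bytes and writes size_bytes.
    return 2 * size_bytes / best / 1e6


if __name__ == "__main__":
    print("Copy: %.1f MB/s" % copy_bandwidth_mb_per_s())
```

Keeping the best time rather than the average filters out runs that were disturbed by other activity on the system. The buffers should be much larger than the processor caches if the figure is meant to reflect main memory.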
The benchmark itself is easy to run in single-processor mode (the multiprocessor mode is quite a bit more complex; consult the benchmark documentation for current details). Here's an example from an Ultra 2 Model 2200:
$ ./stream
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 40803 microseconds.
   (= 40803 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         226.1804       0.0709       0.0707       0.0716
Scale:        227.6123       0.0704       0.0703       0.0705
Add:          276.5741       0.0869       0.0868       0.0871
Triad:        239.6189       0.1003       0.1002       0.1007
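The rate column can be checked against the minimum times by hand. The Python sketch below assumes the usual STREAM accounting, in which Copy and Scale touch two arrays per iteration and Add and Triad touch three, with a megabyte taken as 10^6 bytes; the dictionary and function names are mine, and the times are the rounded values from the table above.

```python
ARRAY_SIZE = 1_000_000  # elements per array, from the output above
BYTES_PER_WORD = 8      # bytes per DOUBLE PRECISION word, from the output above

# Arrays read or written per iteration (assumed STREAM convention).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Minimum times in seconds, as rounded in the table above.
MIN_TIME = {"Copy": 0.0707, "Scale": 0.0703, "Add": 0.0868, "Triad": 0.1002}


def rate_mb_per_s(kernel):
    """Bytes moved per iteration divided by the best time, in MB/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * ARRAY_SIZE
    return bytes_moved / MIN_TIME[kernel] / 1e6


# Three arrays of doubles, expressed in units of 2**20 bytes.
total_memory_mb = 3 * BYTES_PER_WORD * ARRAY_SIZE / 2**20
```

The recomputed rates land within a fraction of a percent of the reported ones (the small differences come from the rounding of the printed times), and the three arrays account for the 22.9 MB total. This also shows why Add reports a higher rate than Copy despite taking longer: it moves half again as many bytes per iteration.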
Additional content appearing in this section has been removed.
Concluding Thoughts
There's a simple moral to this story: your memory requirements depend heavily on how many users there are and what they are doing. Large technical applications such as computational fluid dynamics or protein structure calculations can require incredible amounts of memory. It is easy to imagine