Chapter 1. Cluster Architecture
Computing speed isn’t just a convenience. Faster computers allow us to solve larger problems, and to find solutions more quickly, with greater accuracy, and at a lower cost. All this adds up to a competitive advantage. In the sciences, this may mean the difference between being the first to publish and not publishing. In industry, it may determine who’s first to the patent office.
Traditional high-performance clusters have proved their worth in a variety of uses—from predicting the weather to industrial design, from molecular dynamics to astronomical modeling. High-performance computing (HPC) has created a new approach to science—modeling is now a viable and respected alternative to the more traditional experiential and theoretical approaches.
Clusters are also playing a greater role in business. High performance is a key issue in data mining or in image rendering. Advances in clustering technology have led to high-availability and load-balancing clusters. Clustering is now used for mission-critical applications such as web and FTP servers. For example, Google uses an ever-growing cluster composed of tens of thousands of computers.
Modern Computing and the Role of Clusters
Because of the expanding role that clusters are playing in distributed computing, it is worth considering this question briefly. There is a great deal of ambiguity, and the terms used to describe clusters and distributed computing are often used inconsistently. This chapter doesn’t provide a detailed taxonomy—it doesn’t include a discussion of Flynn’s taxonomy or of cluster topologies. This has been done quite well a number of times and too much of it would be irrelevant to the purpose of this book. However, this chapter does try to explain the language used. If you need more general information, see the Appendix A for other sources. High Performance Computing, Second Edition (O’Reilly), by Dowd and Severance is a particularly readable introduction.
When computing, there are three basic approaches to improving performance—use a better algorithm, use a faster computer, or divide the calculation among multiple computers. A very common analogy is that of a horse-drawn cart. You can lighten the load, you can get a bigger horse, or you can get a team of horses. (We’ll ignore the option of going into therapy and learning to live with what you have.) Let’s look briefly at each of these approaches.
First, consider what you are trying to calculate. All too often, improvements in computing hardware are taken as a license to use less efficient algorithms, to write sloppy programs, or to perform meaningless or redundant calculations rather than carefully defining the problem. Selecting appropriate algorithms is a key way to eliminate instructions and speed up a calculation. The quickest way to finish a task is to skip it altogether.
If you need only a modest improvement in performance, then buying a faster computer may solve your problems, provided you can find something you can afford. But just as there is a limit on how big a horse you can buy, there are limits on the computers you can buy. You can expect rapidly diminishing returns when buying faster computers. While there are no hard and fast rules, it is not unusual to see a quadratic increase in cost with a linear increase in performance, particularly as you move away from commodity technology.
The third approach is parallelism, i.e., executing instructions simultaneously. There are a variety of ways to achieve this. At one end of the spectrum, parallelism can be integrated into the architecture of a single CPU (which brings us back to buying the best computer you can afford). At the other end of the spectrum, you may be able to divide the computation up among different computers on a network, each computer working on a part of the calculation, all working at the same time. This book is about that approach—harnessing a team of horses.
The traditional classification of computers based on size and performance, i.e., classifying computers as microcomputers, workstations, minicomputers, mainframes, and supercomputers, has become obsolete. The ever-changing capabilities of computers means that today’s microcomputers now outperform the mainframes of the not-too-distant past. Furthermore, this traditional classification scheme does not readily extend to parallel systems and clusters. Nonetheless, it is worth looking briefly at the capabilities and problems associated with more traditional computers, since these will be used to assemble clusters. If you are working with a team of horses, it is helpful to know something about a horse.
Regardless of where we place them in the traditional classification, most computers today are based on an architecture often attributed to the Hungarian mathematician John von Neumann. The basic structure of a von Neumann computer is a CPU connected to memory by a communications channel or bus. Instructions and data are stored in memory and are moved to and from the CPU across the bus. The overall speed of a computer depends on both the speed at which its CPU can execute individual instructions and the overhead involved in moving instructions and data between memory and the CPU.
Several technologies are currently used to speed up the processing speed of CPUs. The development of reduced instruction set computer (RISC) architectures and post-RISC architectures has led to more uniform instruction sets. This eliminates cycles from some instructions and allows a higher clock-rate. The use of RISC technology and the steady increase in chip densities provide great benefits in CPU speed.
Superscalar architectures and pipelining have also increased processor speeds. Superscalar architectures execute two or more instructions simultaneously. For example, an addition and a multiplication instruction, which use different parts of the CPU, might be executed at the same time. Pipelining overlaps the different phase of instruction execution like an assembly line. For example, while one instruction is executed, the next instruction can be fetched from memory or the results from the previous instructions can be stored.
Memory bandwidth , basically the rate at which bits are transferred from memory over the bus, is a different story. Improvements in memory bandwidth have not kept up with CPU improvements. It doesn’t matter how fast the CPU is theoretically capable of running if you can’t get instructions and data into or out of the CPU fast enough to keep the CPU busy. Consequently, memory access has created a performance bottleneck for the classical von Neumann architecture: the von Neumann bottleneck .
Computer architects and manufacturers have developed a number of techniques to minimize the impact of this bottleneck. Computers use a hierarchy of memory technology to improve overall performance while minimizing cost. Frequently used data is placed in very fast cache memory, while less frequently used data is placed in slower but cheaper memory. Another alternative is to use multiple processors so that memory operations are spread among the processors. If each processor has its own memory and its own bus, all the processors can access their own memory simultaneously.
Traditionally, supercomputers have been pipelined, superscalar processors with a single CPU. These are the “big iron” of the past, often requiring “forklift upgrades” and multiton air conditioners to prevent them from melting from the heat they generate. In recent years we have come to augment that definition to include parallel computers with hundreds or thousands of CPUs, otherwise known as multiprocessor computers. Multiprocessor computers fall into two basic categories—centralized multiprocessors (or single enclosure multiprocessors) and multicomputers.
With centralized multiprocessors, there are two architectural approaches based on how memory is managed—uniform memory access (UMA) and nonuniform memory access (NUMA) machines. With UMA machines, also called symmetric multiprocessors (SMP) , there is a common shared memory. Identical memory addresses map, regardless of the CPU, to the same location in physical memory. Main memory is equally accessible to all CPUs, as shown in Figure 1-1. To improve memory performance, each processor has its own cache.
There are two closely related difficulties when designing a UMA machine. The first problem is synchronization. Communications among processes and access to peripherals must be coordinated to avoid conflicts. The second problem is cache consistency . If two different CPUs are accessing the same location in memory and one CPU changes the value stored in that location, then how is the cache entry for the other CPU updated? While several techniques are available, the most common is snooping . With snooping, each cache listens to all memory accesses. If a cache contains a memory address that is being written to in main memory, the cache updates its copy of the data to remain consistent with main memory.
A closely related architecture is used with NUMA machines. Roughly, with this architecture, each CPU maintains its own piece of memory, as shown in Figure 1-2. Effectively, memory is divided among the processors, but each process has access to all the memory. Each individual memory address, regardless of the processor, still references the same location in memory. Memory access is nonuniform in the sense that some parts of memory will appear to be much slower than other parts of memory since the bank of memory “closest” to a processor can be accessed more quickly by that processor. While this memory arrangement can simplify synchronization, the problem of memory coherency increases.
Operating system support is required with either multiprocessor scheme. Fortunately, most modern operating systems, including Linux, provide support for SMP systems, and support is improving for NUMA architectures.
When dividing a calculation among processors, an important concern is granularity , or the smallest piece that a computation can be broken into for purposes of sharing among different CPUs. Architectures that allow smaller pieces of code to be shared are said to have a finer granularity (as opposed to a coarser granularity). The granularity of each of these architectures is the thread. That is, the operating system can place different threads from the same process on different processors. Of course, this implies that, if your computation generates only a single thread, then that thread can’t be shared between processors but must run on a single CPU. If the operating system has nothing else for the other processors to do, they will remain idle and you will see no benefit from having multiple processors.
A third architecture worth mentioning in passing is processor array , which, at one time, generated a lot of interest. A processor array is a type of vector computer built with a collection of identical, synchronized processing elements. Each processor executes the same instruction on a different element in a data array.
Numerous issues have arisen with respect to processor arrays. While some problems map nicely to this architecture, most problems do not. This severely limits the general use of processor arrays. The overall design doesn’t work well for problems with large serial components. Processor arrays are typically designed around custom VLSI processors, resulting in much higher costs when compared to more commodity-oriented multiprocessor designs. Furthermore, processor arrays typically are single user, adding to the inherent cost of the system. For these and other reasons, processor arrays are no longer popular.
A multicomputer configuration, or cluster, is a group of computers that work together. A cluster has three basic elements—a collection of individual computers, a network connecting those computers, and software that enables a computer to share work among the other computers via the network.
For most people, the most likely thing to come to mind when speaking of multicomputers is a Beowulf cluster . Thomas Sterling and Don Becker at NASA’s Goddard Space Flight Center built a parallel computer out of commodity hardware and freely available software in 1994 and named their system Beowulf. While this is perhaps the best-known type of multicomputer, a number of variants now exist.
First, both commercial multicomputers and commodity clusters are available. Commodity clusters, including Beowulf clusters, are constructed using commodity, off-the-shelf (COTS) computers and hardware. When constructing a commodity cluster, the norm is to use freely available, open source software. This translates into an extremely low cost that allows people to build a cluster when the alternatives are just too expensive. For example, the “Big Mac” cluster built by Virginia Polytechnic Institute and State University was initially built using 1100 dual-processor Macintosh G5 PCs. It achieved speeds on the order of 10 teraflops, making it one of the fastest supercomputers in existence. But while supercomputers in that class usually take a couple of years to construct and cost in the range of $100 million to $250 million, Big Mac was put together in about a month and at a cost of just over $5 million. (A list of the fastest machines can be found at http://www.top500.org. The site also maintains a list of the top 500 clusters.)
In commodity clusters, the software is often mix-and-match. It is not unusual for the processors to be significantly faster than the network. The computers within a cluster can be dedicated to that cluster or can be standalone computers that dynamically join and leave the cluster. Typically, the term Beowulf is used to describe a cluster of dedicated computers, often with minimal hardware. If no one is going to use a node as a standalone machine, there is no need for that node to have a dedicated keyboard, mouse, video card, or monitor. Node computers may or may not have individual disk drives. (Beowulf is a politically charged term that is avoided in this book.) While a commodity cluster may consist of identical, high-performance computers purchased specifically for the cluster, they are often a collection of recycled cast-off computers, or a pile-of-PCs (POP) .
Commercial clusters often use proprietary computers and software. For example, a SUN Ultra is not generally thought of as a COTS computer, so an Ultra cluster would typically be described as a proprietary cluster. With proprietary clusters, the software is often tightly integrated into the system, and the CPU performance and network performance are well matched. The primary disadvantage of commercial clusters is, as you no doubt guessed, their cost. But if money is not a concern, then IBM, Sun Microsystems, or any number of other companies will be happy to put together a cluster for you. (The salesman will probably even take you to lunch.)
A network of workstations (NOW) , sometimes called a cluster of workstations (COW) , is a cluster composed of computers usable as individual workstations. A computer laboratory at a university might become a NOW on the weekend when the laboratory is closed. Or office machines might join a cluster in the evening after the daytime users leave.
Software is an integral part of any cluster. A discussion of cluster software will constitute the bulk of this book. Support for clustering can be built directly into the operating system or may sit above the operating system at the application level, often in user space. Typically, when clustering support is part of the operating system, all nodes in the cluster need to have identical or nearly identical kernels; this is called a single system image (SSI) . At best, the granularity is the process. With some software, you may need to run distinct programs on each node, resulting in even coarser granularity. Since each computer in a cluster has its own memory (unlike a UMA or NUMA computer), identical addresses on individual CPUs map different physical memory locations. Communication is more involved and costly.
It’s tempting to think of a cluster as just a bunch of interconnected machines, but when you begin constructing a cluster, you’ll need to give some thought to the internal structure of the cluster. This will involve deciding what roles the individual machines will play and what the interconnecting network will look like.
The simplest approach is a symmetric cluster . With a symmetric cluster (Figure 1-3) each node can function as an individual computer. This is extremely straightforward to set up. You just create a subnetwork with the individual machines (or simply add the computers to an existing network) and add any cluster-specific software you’ll need. You may want to add a server or two depending on your specific needs, but this usually entails little more than adding some additional software to one or two of the nodes. This is the architecture you would typically expect to see in a NOW, where each machine must be independently usable.
There are several disadvantages to a symmetric cluster. Cluster management and security can be more difficult. Workload distribution can become a problem, making it more difficult to achieve optimal performance.
For dedicated clusters, an asymmetric architecture is more common. With asymmetric clusters (Figure 1-4) one computer is the head node or frontend . It serves as a gateway between the remaining nodes and the users. The remaining nodes often have very minimal operating systems and are dedicated exclusively to the cluster. Since all traffic must pass through the head, asymmetric clusters tend to provide a high level of security. If the remaining nodes are physically secure and your users are trusted, you’ll only need to harden the head node.
The head often acts as a primary server for the remainder of the clusters. Since, as a dual-homed machine, it will be configured differently from the remaining nodes, it may be easier to keep all customizations on that single machine. This simplifies the installation of the remaining machines. In this book, as with most descriptions of clusters, we will use the term public interface to refer to the network interface directly connected to the external network and the term private interface to refer to the network interface directly connected to the internal network.
The primary disadvantage of this architecture comes from the performance limitations imposed by the cluster head. For this reason, a more powerful computer may be used for the head. While beefing up the head may be adequate for small clusters, its limitations will become apparent as the size of the cluster grows. An alternative is to incorporate additional servers within the cluster. For example, one of the nodes might function as an NFS server, a second as a management station that monitors the health of the clusters, and so on.
I/O represents a particular challenge. It is often desirable to distribute a shared filesystem across a number of machines within the cluster to allow parallel access. Figure 1-5 shows a more fully specified cluster.
Network design is another key issue. With small clusters, a simple switched network may be adequate. With larger clusters, a fully connected network may be prohibitively expensive. Numerous topologies have been studied to minimize connections (costs) while maintaining viable levels of performance. Examples include hyper-tree, hyper-cube, butterfly, and shuffle-exchange networks. While a discussion of network topology is outside the scope of this book, you should be aware of the issue.
Heterogeneous networks are not uncommon. Although not shown in the figure, it may be desirable to locate the I/O servers on a separate parallel network. For example, some clusters have parallel networks allowing administration and user access through a slower network, while communications for processing and access to the I/O servers is done over a high-speed network.
Types of Clusters
Originally, “clusters” and “high-performance computing” were synonymous. Today, the meaning of the word “cluster” has expanded beyond high-performance to include high-availability (HA) clusters and load-balancing (LB) clusters . In practice, there is considerable overlap among these—they are, after all, all clusters. While this book will focus primarily on high-performance clusters, it is worth taking a brief look at high-availability and load-balancing clusters.
High-availability clusters, also called failover clusters, are often used in mission-critical applications. If you can’t afford the lost business that will result from having your web server go down, you may want to implement it using a HA cluster. The key to high availability is redundancy. An HA cluster is composed of multiple machines, a subset of which can provide the appropriate service. In its purest form, only a single machine or server is directly available—all other machines will be in standby mode. They will monitor the primary server to insure that it remains operational. If the primary server fails, a secondary server takes its place.
The idea behind a load-balancing cluster is to provide better performance by dividing the work among multiple computers. For example, when a web server is implemented using LB clustering, the different queries to the server are distributed among the computers in the clusters. This might be accomplished using a simple round-robin algorithm. For example, Round-Robin DNS could be used to map responses to DNS queries to the different IP addresses. That is, when a DNS query is made, the local DNS server returns the addresses of the next machine in the cluster, visiting machines in a round-robin fashion. However, this approach can lead to dynamic load imbalances. More sophisticated algorithms use feedback from the individual machines to determine which machine can best handle the next task.
Keep in mind, the term “load-balancing” means different things to different people. A high-performance cluster used for scientific calculation and a cluster used as a web server would likely approach load-balancing in entirely different ways. Each application has different critical requirements.
To some extent, any cluster can provide redundancy, scalability, and improved performance, regardless of its classification. Since load-balancing provides greater availability, it is not unusual to see both load-balancing and high-availability in the same cluster. The Linux Virtual Server Project (LVSR) is an example of combining these two approaches. An LVSR server is a high-availability server implemented by distributing tasks among a number of real servers. Interested readers are encouraged to visit the web pages for the Linux Virtual Server Project (http://www.linux-vs.org) and the High-Availability Linux Project (http://www.linux-ha.org) and to read the relevant HOWTOs. OSCAR users will want to visit the High-Availability OSCAR web site http://www.openclustergroup.org/HA-OSCAR/.
Distributed Computing and Clusters
While the term parallel is often used to describe clusters, they are more correctly described as a type of distributed computing . Typically, the term parallel computing refers to tightly coupled sets of computation. Distributed computing is usually used to describe computing that spans multiple machines or multiple locations. When several pieces of data are being processed simultaneously in the same CPU, this might be called a parallel computation, but would never be described as a distributed computation. Multiple CPUs within a single enclosure might be used for parallel computing, but would not be an example of distributed computing. When talking about systems of computers, the term parallel usually implies a homogenous collection of computers, while distributed computing typically implies a more heterogeneous collection. Computations that are done asynchronously are more likely to be called distributed than parallel. Clearly, the terms parallel and distributed lie at either end of a continuum of possible meanings. In any given instance, the exact meanings depend upon the context. The distinction is more one of connotations than of clearly established usage.
Since cluster computing is just one type of distributed computing, it is worth briefly mentioning the alternatives. The primary distinction between clusters and other forms of distributed computing is the scope of the interconnecting network and the degree of coupling among the individual machines. The differences are often ones of degree.
Clusters are generally restricted to computers on the same subnetwork or LAN. The term grid computing is frequently used to describe computers working together across a WAN or the Internet. The idea behind the term “grid” is to invoke a comparison between a power grid and a computational grid. A computational grid is a collection of computers that provide computing power as a commodity. This is an active area of research and has received (deservedly) a lot of attention from the National Science Foundation. The most significant differences between cluster computing and grid computing are that computing grids typically have a much larger scale, tend to be used more asynchronously, and have much greater access, authorization, accounting, and security concerns. From an administrative standpoint, if you build a grid, plan on spending a lot of time dealing with security-related issues. Grid computing has the potential of providing considerably more computing power than individual clusters since a grid may combine a large number of clusters.
Peer-to-peer computing provides yet another approach to distributed computing. Again this is an ambiguous term. Peer-to-peer may refer to sharing cycles, to the communications infrastructure, or to the actual data distributed across a WAN or the Internet. Peer-to-peer cycle sharing is best exemplified by SETI@Home , a project to analyze radio telescope data for signs of extraterrestrial intelligence. Volunteers load software onto their Internet-connected computers. To the casual PC or Mac user, the software looks like a screensaver. When a computer becomes idle, the screensaver comes on and the computer begins analyzing the data. If the user begins using the computer again, the screensaver closes and the data analysis is suspended. This approach has served as a model for other research, including the analysis of cancer and AIDS data.
Data or file-sharing peer-to-peer networks are best exemplified by Napster, Gnutella, or Kazaa technologies. With some peer-to-peer file-sharing schemes, cycles may also be provided for distributed computations. That is, by signing up and installing the software for some services, you may be providing idle cycles to the service for other uses beyond file sharing. Be sure you read the license before you install the software if you don’t want your computers used in this way.
Other entries in the distributed computing taxonomy include federated clusters and constellations . Federated clusters are clusters of clusters, while constellations are clusters where the number of CPUs is greater than the number of nodes. A four-node cluster of SGI Altrix computers with 128 CPUs per node is a constellation. Peer-to-peer, grids, federated clusters, and constellations are outside the scope of this book.
While clusters have a lot to offer, they are not panaceas. There is a limit to how much adding another computer to a problem will speed up a calculation. In the ideal situation, you might expect a calculation to go twice as fast on two computers as it would on one. Unfortunately, this is the limiting case and you can only approach it.
Any calculation can be broken into blocks of code or instructions that can be classified in one of two exclusive ways. Either a block of code can be parallelized and shared among two or more machines, or the code is essentially serial and the instructions must be executed in the order they are written on a single machine. Any code that can’t be parallelized won’t benefit from any additional processors you may have.
There are several reasons why some blocks of code can’t be parallelized and must be executed in a specific order. The most obvious example is I/O, where the order of operations is typically determined by the availability, order, and format of the input and the format of the desired output. If you are generating a report at the end of a program, you won’t want the characters or lines of output printed at random.
Another reason some code can’t be parallelized comes from the data dependencies within the code. If you use the value of x to calculate the value of y, then you’ll need to calculate x before you calculate y. Otherwise, you won’t know what value to use in the calculation. Basically, to be able to parallelize two instructions, neither can depend on the other. That is, the order in which the two instructions finish must not matter.
Thus, any program can be seen as a series of alternating sections—sections that can be parallelized and effectively run on different machines interspersed with sections that must be executed as written and that effectively can only be run on a single machine. If a program spends most of its time in code that is essentially serial, parallel processing will have limited value for this code. In this case, you will be better served with a faster computer than with parallel computers. If you can’t change the algorithm, big iron is the best approach for this type of problem.
As just noted, the amount of code that must be executed serially limits how much of a speedup you can expect from parallel execution. This idea has been formalized by what is known as Amdahl's Law , named after Gene Amdahl, who first stated the law in the late sixties. In a nutshell, Amdahl’s Law states that the serial portion of a program will be the limiting factor in how much you can speed up the execution of the program using multiple processors.
An example should help clarify Amdahl’s Law. Let’s assume you have a computation that takes 10 hours to complete on a currently available computer and that 90 percent of your code can be parallelized. In other words, you are spending one hour doing instructions that must be done serially and nine hours doing instructions that can be done in parallel. Amdahl’s Law states that you’ll never be able to run this code on this class of computers in less than one hour, regardless of how many additional computers you have available. To see this, imagine that you had so many computers that you could execute all the parallel code instantaneously. You would still have the serial code to execute, which has to be done on a single computer, and it would still take an hour.
In practice, you won’t have an unlimited number of processors, so your total time will always be longer. Figure 1-6 shows the amount of time needed for this example, depending on the number of processors you have available.
You should also remember that Amdahl’s law is an ideal. In practice, there is the issue of the overhead introduced by parallelizing the code. For example, coordinating communications among the various processes will require additional code. This adds to the overall execution time. And if there is contention for the network, this can stall processes, further slowing the calculation. In other words, Amdahl’s Law is the best speedup you can hope for, but not the actual speedup you’ll see.
What can you do if you need to do this calculation in less than one hour? As I noted earlier, you have three choices when you want to speed up a calculation—better algorithms, faster computers, or more computers. If more computers won’t take you all the way, your remaining choices are better algorithms and faster computers. If you can rework your code so that a larger fraction can be done in parallel, you’ll see an increased benefit from a parallel approach. Otherwise, you’ll need to dig deep into your pocket for faster computers.
Surprisingly, a fair amount of controversy still surrounds what should be obvious once you think about it. This stems in large part from the misapplication of Amdahl’s Law over the years. For example, Amdahl’s Law has been misused as an argument favoring faster computers over parallel computing.
The most common misuse is based on the assumption that the amount of speedup is independent of the size of the problem. Amdahl’s Law simply does not address how problems scale. The fraction of the code that must be executed serially usually changes as the size of the problem changes. So, it is a mistake to assume that a problem’s speedup factor will be the same when the scale of the problem changes. For instance, if you double the length of a simulation, you may find that the serial portions of the simulation, such as the initialization and report phases, are basically unchanged, while the parallelizable portion of the code is what doubles. Hence, the fraction of the time spent in the serial code will decrease and Amdahl’s Law will specify a greater speedup. This is good news! After all, it’s when problems get bigger that we most need the speedup. For most problems, the speedup factor will depend upon the problem size. As the problem size changes, so does the speedup factor. The amount will depend on the nature of the individual problem, but typically, the speedup will increase as the size of the problem increases. As the problem size grows, it is not unusual to the see a linear increase in the amount of time spent in the serial portion of the code and a quadratic increase in the amount of time spent in the parallelizable portion of the code. Unfortunately, if you only apply Amdahl’s Law to the smaller problem size, you’ll underestimate the benefit of a parallel approach.
Having said this, it is important to remember that Amdahl’s Law does clearly state a limitation of parallel computing. But this limitation varies not only from problem to problem, but with the size of the problem as well.
One last word about the limitations of clusters—the limitations are often tied to a particular approach. It is often possible to mix approaches and avoid limitations. For example, in constructing your clusters, you’ll want to use the best computers you can afford. This will lessen the impact of inherently serial code. And don’t forget to look at your algorithms!
The material covered in this book reflects three of my biases, of which you should be aware. I have tried to write a book to help people get started with clusters. As such, I have focused primarily on mainstream, high-performance computing, using open source software. Let me explain why.
First, there are many approaches and applications for clusters. I do not believe that it is feasible for any book to address them all, even if a less-than-exhaustive approach is used. In selecting material for this book, I have tried to use the approaches and software that are the most useful for the largest number of people. I feel that it is better to cover a limited number of approaches than to try to say too much and risk losing focus. However, I have tried to justify my decisions and point out options along the way so that if your needs don’t match my assumptions, you’ll at least have an idea where to start looking.
Second, in keeping with my goal of addressing mainstream applications of clusters, the book primarily focuses on high-performance computing. This is the application from which clusters grew and remains one of their dominant uses. Since high availability and load balancing tend to be used with mission-critical applications, they are beyond the scope of a book focusing on getting started with clusters. You really should have some basic experience with generic clusters before moving on to such mission-critical applications. And, of course, improved performance lies at the core of all the other uses for clusters.
Finally, I have focused on open source software. There are a number of proprietary solutions available, some of which are excellent. But given the choice between comparable open source software and proprietary software, my preference is for open source. For clustering, I believe that high-quality, robust open source software is readily available and that there is little justification for considering proprietary software for most applications.
While I’ll cover the basics of clusters here, you would do well to study the specifics of clusters that closely match your applications as well. There are a number of well-known clusters that have been described in detail. A prime example is Google, with literally tens of thousands of computers. Others include clusters at Fermilab, Argonne National Laboratory (Chiba City cluster), and Oak Ridge National Laboratory. Studying the architecture of clusters similar to what you want to build should provide additional insight. Hopefully, this book will leave you well prepared to do just that.
One last comment—if you keep reading, I promise not to mention horses again.
 If you think back to English lit, you will recall that the epic hero Beowulf was described as having “the strength of many.”
 While Amdahl’s Law is the most widely known and most useful metric for describing parallel performance, there are others. These include Gustafson-Barsus’s, Sun’s, and Ni’s Laws and the Karp-Flat and the Isoefficiency Metrics.
 For those of you who love algebra, the speedup factor is equal to 1/(s + p/N), where s is the fraction of the code that is inherently serial, p is the fraction of the code that can be parallelized, and N is the number of processors available. Clearly, p + s = 1. As the number of processors becomes very large, p/N becomes very small, and the speedup becomes essentially 1/s. So if s is 0.1, the largest speedup you can expect is a factor of 10, no matter how many processors you have available.