Chapter 4. Eight Simple Rules for Designing Multithreaded Applications
Since it is right there in the title of this book, the following sentence shouldn’t come as any surprise: Concurrent programming is still more art than science. This chapter gives eight simple rules that you can add to your toolkit of threading design methods. I’ve tried to organize the rules in a semichronological way, but there’s no hard and fast order to the rules. It’s like being confronted with, “No running by the pool,” and, “No diving in the shallow end.” Both good ideas, but not diving can come before not running or vice versa.
By following these rules, you will have more success in writing the best and most efficient threaded implementation of your applications. You may recognize some of these, since I’ve mentioned a few of them in previous chapters. In upcoming chapters, when discussing the design and implementation of specific algorithms, I’ll try to drop in a relevant reference to one or more of these eight rules to show that they’re not just here to fill out an extra chapter.
Rule 1: Identify Truly Independent Computations
I’ve already covered this first rule seven ways to Sunday, but since it’s the crux of the whole matter, it bears repeating at least one more time. You can’t execute anything concurrently unless the operations that would be executed can be run independently of each other. I can easily think of different real-world instances of independent actions being performed to satisfy a single goal. Consider, for example, a DVD rental warehouse. Orders for movies are collected and then distributed to the workers, who go out to where all the disks are stored and find copies to satisfy their assigned orders. When one worker pulls out a classic musical comedy, it does not interfere with another worker who is looking for the latest science fiction masterpiece, nor will it interfere with a worker trying to locate episodes from the second season of a popular crime drama series (I assume that any conflicts resulting from unavailable inventory have been dealt with before orders were transmitted to the warehouse). Also, the packaging and mailing of each order will not interfere with disk searches or the shipping and handling of any other order.
There are cases in which you will have exclusively sequential computations that cannot be made concurrent; many of these will be dependencies between loop iterations or steps that must be carried out in a specific order. A list of common situations was covered earlier in What’s Not Parallel.
Rule 2: Implement Concurrency at the Highest Level Possible
There are two directions you can take when approaching the threading of a serial code. These are bottom-up and top-down. When initially analyzing your code, you are looking for the computational hotspots that account for the most execution time. Running those portions in parallel will give you the best chance of achieving the maximum performance possible.
In a bottom-up approach, you consider threading the hotspots in your code directly. If this is not possible, search up the call stack of the application to determine whether there is another place in the code that can execute the hotspots in parallel. If your hotspot is the innermost loop of a nested loop structure, examine each successive layer of loop nesting, from the innermost to the outermost, to see whether that level can be made concurrent. Even if it is possible to employ concurrency at the hotspot code, you should still look to see whether it would be possible to implement that concurrency at a point in the code higher up in the call stack. This can increase the granularity of the execution done by each thread.
To illustrate this rule, consider threading a video encoding application. If your hotspot is the computation of individual pixels, you can look to parallelize the loop(s) that deal with each pixel computation within a single frame of video. Looking further “up” from this, you might find that the loop over the frames of video can be executed concurrently by independently processing groups of frames. If the video encoding application is expected to process multiple videos, expressing your concurrency by assigning a different stream to each thread will be the highest level of possible concurrency.
The other approach to threading is top-down, where you first consider the whole application and what the computation is coded to accomplish (all the parts of the application that combine to realize that computation). While there is no obvious concurrency, distill the parts of the computation that still contain execution of the hotspot into successively smaller parts until you can identify independent computations.
For the video encoding application, if your hotspot is the computation of individual pixels, the top-down approach would first consider that the application handles encoding of multiple, independent video streams (which all include the pixel computations). If you can parallelize the application there, you’ve found your highest level. If not, working “down” to the individual pixel will take you through frames within a single stream and then to pixels within a frame.
The objective of this rule is to find the highest level where concurrency can be implemented so that your hotspot of code will be executed concurrently. This is all predicated on the belief that “higher” levels in the layers of your algorithms will equal more (independent) work, much like the way that layers of a parfait accumulate mass the higher up in the glass you go. Placing concurrency at the highest possible level around a hotspot is one of the best ways to achieve that all-important coarse-grained division of work to be assigned to threads.
Rule 3: Plan Early for Scalability to Take Advantage of Increasing Numbers of Cores
As I’m writing this, quad-core processors are becoming the default multicore chip. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an application’s ability to handle changes, typically increases, in system resources (e.g., number of cores, memory size, bus speed) or data set sizes. In the face of more cores being available, you must write flexible code that can take advantage of different numbers of cores.
To paraphrase C. Northcote Parkinson, “Data expands to fill the processing power available.” This means that as the amount of computational power increases (more cores), the more likely it will be that the data to be processed will expand. There are always more computations to be done. Whether it is increasing the model fidelity in scientific simulations, processing an HD stream instead of standard video, or searching through multiple and larger databases, if you are given additional processing resources, someone will always have more data to process.
Designing and implementing concurrency by data decomposition methods will give you more scalable solutions. Task decomposition solutions will suffer from the fact that the number of independent functions or code segments in an application is likely limited and fixed during execution. After each independent task has a thread and core to execute on, increasing the number of threads to take advantage of more cores will not increase performance of the application. Since data sizes are more likely to increase than the number of independent computations in an application, data decomposition designs will have the best chance for scalability.
Even though an application has been written with threads assigned to independent functions, when the input workload increases, you may still be able to utilize more threads. Consider building a grocery store where there are a finite number of separate tasks to be done. If the developer buys adjacent land and the floor space of the store to be built is doubled, you can expect extra workers to be assigned within some of those tasks. That is, extra painters, extra roofers, extra floor tilers, and extra electricians can be used. Therefore, you should be aware of the data decomposition possibilities that can arise from increased data sets, even within solutions that have been decomposed by tasks, and plan for the use of extra threads on extra cores.
Rule 4: Make Use of Thread-Safe Libraries Wherever Possible
If your hotspot computations can be executed through a library call, you should strongly consider using an equivalent library function instead of executing handwritten code. Even for serial applications, it’s never a good idea to “reinvent the wheel” by writing code that performs calculations already encapsulated by optimized library routines. Many libraries, such as the Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP), have functions that are threaded to take advantage of multicore processors.
Even more important than using threaded library routines, though, is ensuring that all library calls used are thread-safe. If you have replaced the hotspot in your serial code with a call to a library function, it may still be the case that some point higher in the call tree of your application can be divided into independent computations. When you have concurrent computations executing library function calls, especially third-party libraries, routines that reference and update shared variables within the library may cause data races. Check the library documentation for the thread-safety of any library you are using within concurrent execution. When writing and using your own library routines that will be executed concurrently, be sure the routines are reentrant. If this is not possible, you will need to add synchronization in order to protect access to shared resources.
Rule 5: Use the Right Threading Model
If threaded libraries are insufficient to cover all the concurrency of an application and you must employ user-controlled threads, don’t use explicit threads if an implicit threading model (e.g., OpenMP or Intel Threading Building Blocks) has all the functionality you need. Explicit threads do allow for finer control of the threading implementation. However, if you are only parallelizing compute-intensive loops or don’t need the extra flexibility you can get with explicit threads, there’s probably no reason to do more work than necessary. The more complex the implementation, the easier it will be to make a mistake and the harder it will be to maintain such code later.
OpenMP is focused on data decomposition methods, especially targeted to threading loops that range over large data sets. Even if this is the only type of parallelism that you can introduce into an application, there may be external requirements (such as engineering practices dictated by your employer or management preferences) that will prohibit your use of OpenMP. In that case, you will need to implement your threading with an approved (explicit) model. In such a situation, I recommend that you use OpenMP to prototype the planned concurrency and estimate the potential performance gains, possible scalability, and how much effort will be needed to thread the serial code with explicit threads.
Rule 6: Never Assume a Particular Order of Execution
With serial computations, it is easy to predict the statement that will be executed following any other statement in a program. On the other hand, execution order of threads is nondeterministic and controlled by the OS scheduler. This means that there is no reliable way of predicting the order of threads running from one execution to another, or even which thread will be scheduled to run next. This is done primarily to hide execution latency within an application, especially when run on a system with fewer cores than threads. If a thread blocks because it needs memory that is not located in cache or to process an I/O request, the scheduler will swap out the blocked thread and swap in a thread that is ready to run.
Data races are a direct result of this scheduling nondeterminism. If you assume that one thread will write a value into a shared variable before another thread will read that value, you may be right all of the time, you may be right some of the time, or you may be right none of the time. Sometimes, if you’re lucky, the order of thread execution remains unchanged on a specific platform each and every time you run an application. Every difference between systems (bit locations on the disk or memory speed or frequency of the AC power coming out of the wall sockets) has the potential to alter the thread schedule. Code that relies on a particular order of execution among threads that is enforced through nothing more than positive thinking may be plagued by problems such as data races and deadlock.
From a performance perspective, it is best to allow threads to run as unencumbered as possible, like greyhounds or thoroughbreds in a race. Don’t try to enforce a particular order of execution unless it is absolutely necessary. You need to recognize those times when it is absolutely necessary, and implement some form of synchronization to coordinate the execution order of threads relative to each other.
Consider a relay race team. The first runner starts off running as fast as possible. However, to successfully complete the race, the second, third, and anchor runners must wait to receive the baton before they can begin to run their assigned portions of the race. The baton passing is a synchronization between consecutive runners that controls the order of “execution” between stages in the race.
Rule 7: Use Thread-Local Storage Whenever Possible or Associate Locks to Specific Data
Synchronization is overhead that does not contribute to the furtherance of the computation, except to guarantee the correct answers are produced from the parallel execution of an application. Synchronization is a necessary evil. Even so, you should actively seek to keep the amount of synchronization to a minimum. You can do this by using storage that is local to threads or using exclusive memory locations (such as an array element indexed by thread ID).
Temporary work variables are rarely shared between threads, and should be declared or allocated locally to each thread. Variables that hold partial results for each thread should also be local to threads. Combining the partial results into a shared location will require some synchronization. Ensuring that the shared updates are done as infrequently as possible will keep the amount of overhead to a minimum. If you are using explicit threads, you can use the available thread-local storage APIs to enable the persistence of data local to threads from one threaded region to another or from one threaded function call to the next execution of the same function.
If local storage for each thread is not a valid option and you must coordinate access to shared resources through synchronization objects (such as a lock), be sure to properly associate (or “attach”) locks to data items. The easiest way to do this is to have a one to one (1:1) relationship of locks to data items. If you have multiple shared variables that are always accessed together, use a single lock to allow exclusive access to all critical regions involving these variables. In later chapters, I’ll discuss some of the tradeoffs and alternative synchronization techniques that you can employ, especially if you have to protect access to a large collection of data (for example, an array of 10,000 items).
However you decide to associate locks with data items, never
associate more than one lock to a single data object. Segal’s Law states,
“A man with a watch knows what time it is. A man with two watches is never
sure.” If two different lock objects—say,
lockB—protect access to the same variable, one
part of the code may use
access while another section of code can use
lockB. Threads executing in these two code
portions will create a data race, since each will assume it has exclusive
access to the contested data.
Rule 8: Dare to Change the Algorithm for a Better Chance of Concurrency
For comparing performance of applications, both serial and concurrent, the bottom line is wall clock execution time. When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution. This metric will almost always correlate with an application’s relative performance to another. That is, with everything else held constant, an application that uses an O(n log n) algorithm (like Quicksort) will run faster than an O(n2) algorithm (such as selection sort) and will generate the same results.
In concurrent applications, algorithms with a better asymptotic order of execution will run faster, too. Nonetheless, there will be times when the best serial algorithm will not be amenable to parallelization. If you cannot easily turn a hotspot into threaded code (and you can’t find a point higher in the call stack of the hotspot that can be made concurrent), you should consider using a suboptimal serial algorithm to transform, rather than the algorithm currently in the code.
For example, consider the linear algebra operation for the multiplication of two square matrixes. Strassen’s Algorithm has one of the best asymptotic orders of execution, O(n2.81). This is better than the O(n3) of the traditional triple-nested loop algorithm. Strassen’s method divides up each of the matrixes into four chunks (or submatrixes) and uses seven recursive calls to multiply the n/2 × n/2 submatrixes. To parallelize these recursive calls, you could create a new thread to execute each of the seven independent submatrix multiplications. The number of threads will increase exponentially (much like the wives, sacks, cats, and kittens coming from St. Ives). As the submatrixes get smaller and smaller, the granularity of the assigned work given to a newly created thread will get finer and finer. When the submatrixes achieve a given size, switch to a serial algorithm.
A much easier means to parallelize matrix multiplication is to use the asymptotically inferior triple-nested loop algorithm. There are several ways to perform a data decomposition on the matrixes (divide by rows, divide by columns, or divide by blocks) and assign the necessary computations to threads. You can do this using OpenMP pragmas at one of the loop levels or by using explicit threads that implement the division of the loop indexes as needed. Less code modification is required for the simpler serial algorithm, and the structure of the code would likely be left more intact than it would be if you attempted to thread Strassen’s Algorithm. Better yet, follow Simple Rule 4 and use a concurrent library function that performs the matrix-matrix multiplication.
I’ve given you eight simple rules that you should keep in mind when designing the threading that will transform a serial application into a concurrent version. By following the rules presented here, I’ve been able to more easily create concurrent solutions that are more robust, less likely to contain threading problems, and that move toward optimal performance with less development time. I’m sure you will, too.