By James Reinders
Book Price: $34.99 USD
£21.99 GBP
PDF Price: $27.99
Cover | Table of Contents | Colophon
parallel_for loop is tedious with existing threading packages. Writing an efficient scalable program is much harder. Scalability embodies the concept that a program should see benefits in performance as the number of processor cores increases.parallel_for loop is tedious with existing threading packages. Writing an efficient scalable program is much harder. Scalability embodies the concept that a program should see benefits in performance as the number of processor cores increases.
X = 44. Thread A executes X = X + 10. Thread B executes X = X - 12. If we add locking () so that only Thread A or Thread B can execute its statement at a time, we end up with X = 42. If both threads try to obtain a lock at the same time, one will be excluded and will have to wait before the lock is granted. shows how long Thread B might have to wait if it requested the lock at the same time as Thread A but did not get the lock because Thread A had it first.
X = 32 or X = 54. X = 42 can still occur as well (). Three results are now possible because each statement reads X, does a computation, and writes to X. Without locking, there is no guarantee that a thread reads the value of X before or after the other thread writes a value.
(A+B+C+D)as ((A+B)+(C+D))enables A+B and C+D to be computed in parallel, but the final sum may be slightly different from other evaluations such as (((A+B)+C)+D). Even the parallel results can differ from run to run, depending on the order of the operations.parallel_for and parallel_reduceparallel_scan
y[i]=y [i-1] op x[i])parallel_for recursively. The recursion is implicit because of Threading Building Blocks' inherent concept of splitting, as embodied in the parallel iterator.parallel_for template with a recursive range instead of using a recursion template, you will understand a great deal about how to apply Threading Building Blocks to your applications.tbb namespace. For brevity's sake, the namespace is explicit in the first mention of a component in this book, but implicit afterward.tbb::task_scheduler_init object. A thread may have more than one of these objects initialized at a time. The task scheduler shuts down when all task_scheduler_init objects terminate. By default, the constructor for task_ scheduler_init does the initialization and the destructor does the termination. Thus, declaring a task_scheduler_init in main(), as in , both starts and shuts down the scheduler.#include "tbb/task_scheduler_init.h"
using namespace tbb;
int main() {
task_scheduler_init init;
...
return 0;
}using directive in the example enables you to use the library identifiers without having to write out the namespace prefix tbb before each identifier. The rest of the examples assume that such a using directive is present.task_scheduler_init objects if your program creates threads itself using another interface.task_scheduler_init takes an optional parameter that specifies the number of desired threads, including the calling thread. The optional parameter can be one of the following:task_scheduler_init::automatic, which is the default when the parameter is not specified. It exists for the sake of the method task_scheduler_ init::initializeparallel_for
and
parallel_reduce
parallel_scan
y[i] = y[i-1] op x[i])Foo to each element of an array, and it is safe to process each element concurrently. shows the sequential code to do this.void SerialApplyFoo( float a[], size_t n ) {
for( size_t i=0; i>n; ++i )
Foo(a[i]);
}size_t, and it goes from 0 to n-1. The template function tbb::parallel_for breaks this iteration space into chunks and runs each chunk on a separate thread.operator() processes a chunk. declares the body object.#include "tbb/blocked_range.h"
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i )
Foo(a[i]);
}
ApplyFoo( float a[] ) :
my_a(a)
{}
};operator().A blocked_range<T> is a template class provided by the library. It describes a one-dimensional iteration space over type T. Class parallel_for works with other kinds of iteration spaces, too. The library provides blocked_range2d for two-dimensional spaces. A little later in this chapter, in the section "Advanced Topic: Other Kinds of Iteration Spaces," I will explain how you can define your own spaces.ApplyFoo needs member fields that remember all the local variables that were defined outside the original loop but were used inside it. Usually, the constructor for the body object will initialize these fields, though |
Pseudosignature
|
Semantics
|
|---|---|
X::X(X& x, split)
|
Split
x into two parts, one reassigned to x and the other to the newly constructed object. |
split, which is defined by the library. The dummy argument distinguishes the splitting constructor from a copy constructor. After the constructor runs, x and the newly constructed object should represent the two pieces of the original x. The library uses splitting constructors in two contexts:blocked_range and blocked_range2d represent splittable ranges. For each of these, splitting partitions the range into two subranges.parallel_reduce, parallel_scan, simple_partitioner, and auto_partitioner must be splittable. For each of these, splitting results in two bodies that can run concurrently.#include "tbb/tbb_stddef.h" class split;
split is used to distinguish a splitting constructor from a copy constructor. namespace tbb {
class split {
};
}Range type R. |
Pseudosignature
|
Semantics
|
|---|---|
R::R( const R& )
|
Copy constructor
|
R::~R()
|
Destructor
|
bool R::empty() const
|
True if range is empty |
bool R::is_divisible() const
|
True if range can be partitioned into two subranges |
R::R( R& r, split )
|
Split
r into two subranges |
Range can be recursively subdivided into two parts. It is recommended that the division be into nearly equal parts, but it is not required. Splitting as evenly as possible typically yields the best parallelism. Ideally, a range is recursively splittable until the parts represent portions of work that are more efficient to execute serially rather than split furtherparallel_while
pipeline
parallel_sort
O(n log n) on a single processor and approaching O(N) as more processors are used. When worker threads are available, parallel_sort creates subtasks that may be executed concurrently.parallel_while
pipeline
parallel_while
pipeline
tbb::parallel_while.parallel_while in a situation where the serial form is as shown in .void SerialApplyFooToList( Item*root ) {
for( Item* ptr=root; ptr!=NULL; ptr=ptr->next )
Foo(pointer->data);
}Foo takes at least a few thousand instructions to run, you can get parallel speedup by converting the loop to use parallel_while. Unlike the templates described earlier, parallel_while is a class, not a function, and it requires two user-defined objects. The first object defines the stream of items. The object must have a method, pop_if_ present, such that when bool b =pop_if_present(v) is invoked, it sets v to the next iteration value if there is one and returns true. If there are no more iterations, it returns concurrent_queue<T> implements a concurrent queue with values of type T. Multiple threads may simultaneously push and pop elements from the queue.concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.push method. There are blocking and nonblocking flavors of pop:pop_if_present
pop
pop(item) and not while(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size() is defined as the number of push operations started minus the number of pop operations started. If pops out-number pushes, size() becomes negative. For example, if a concurrent_queue is empty and there are n pending pop operations, size() returns –n. This provides an easy way for producers to know how many consumers are waiting on the queue. In particular, consumers may find the empty() method useful; it is defined to be true if and only if size() is not positive.concurrent_queue<T> is unbounded. It may hold any number of values until memory runs out. It can be bounded by setting the queue capacity with the set_capacity method. Setting the capacity causes push to block until there is room in the queue. Bounded queues are slower than unbounded queues, so if there is a constraint elsewhere in your program that prevents the queue from becoming too large, it is better not to set the capacity.concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it. However, be careful not to let another task access an element that is under construction or is otherwise being modified. For safe concurrent growing, concurrent_vector has two methods for resizing that support common uses of dynamic arrays: grow_by and grow_to_at_least. The index of the first element is 0. The method grow_by(n) enables you to safely append n consecutive elements to a vector, and returns the index of the first appended element. Each element is initialized with T(). So for instance, safely appends a C string to a shared vector.void Append( concurrent_vector<char>& vector, const char* string ) {
size_t n = strlen(string)+1;
memcpy( &vector[vector.grow_by(n)], string, n+1 );
}grow_to_at_least(n) grows a vector to size n if it is shorter. Con-current calls to grow_by and grow_to_at_least do not necessarily return in the order that elements are appended to the vector.size() method returns the number of elements in the vector, which may include elements that are still undergoing concurrent construction by grow_by and grow_to_ at_least. Also, it is safe to use iterators while the concurrent_vector is being grown, as long as the iterators never go past the current value of end(). However, the iterator may reference an element undergoing concurrent construction. You must synchronize construction and access of an element.concurrent_vector<T> never moves an element until the array is cleared, which can be an advantage over the STL std::vector (which can move elements to resize the vector), even for single-threaded code. However, concurrent_vector does have more overhead than std::vector. Use concurrent_vector only if you really need to dynamically resize it while other accesses are (or might be) in flight, or if you require that an element never move.concurrent_vector are concurrency-safe with respect to growing, but are not safe for clearing or destroying a vector. Never invoke