By James Reinders
Book Price: $34.99 USD
£21.99 GBP
PDF Price: $23.99
Cover | Table of Contents | Colophon
[ A ], [ B ], [ C ], [ D ], [ E ], [ F ], [ G ], [ H ], [ I ], [ J ], [ K ], [ L ], [ M ], [ N ], [ O ], [ P ], [ Q ], [ R ], [ S ], [ T ], [ U ], [ V ], [ W ],
A[ Top ]
A-B-A problem, 125
abstraction, 8, 25-27
accessor, 96
class, 97
acquire method for locks, 113
Aha! factor in examples, 177
algorithm structures, 26
algorithm templates, 29-64, 65-79, 169
aligned_space template class, 109
allocator arguments, 81
Allocator Concept, 106
Amdahl, Gene, 14, 292
Amdahl's Law, 14-18
Gustafson's observations and, 16-18
Application Layer Gateway (ALG), 242
assembly languages of parallelism, 4, 136
atomic and constructors, 125
atomic operations, 22, 112, 122-129
convoying, minimizing, 118
interleaving, 123
mutual exclusion, versus, 123
priority inversion, minimizing, 118
thread-safe reference counting, 123
atomic template class, 127
auto_partitioner class, 38, 45, 52, 59
automata, 192
automatic grain size, 38
B[ Top ]
bibliography, 292-295
blocked_range, 35
template class, 54
blocked_range2d, 49
template class, 56-58
blocking, 133
blocking style, 141
children, with, 145
body object, 33
Boost Threads, 4
breadth-first execution, 142
breadth-first theft and depth-first work, 144
C[ Top ]
C++, 1, 5
cache, 8, 289
aligned allocator, 103
cooling, 290
lines, 102
calloc, replacing, 104
Chare Kernel, xvi, 283
Charm++, xvi, 284
Cilk, 285
coarse-grained parallelism, 11, 20
compare_and_swap, 124
comparison sort (see parallel_sort)
compatibility, 3
Concepts
generic programming, in, 287
modeling a concept, 288
Sortable, 287
concurrency, 21, 23, 26, 171
concurrent access, 96
concurrent operations, 88
erase, 98
find, 98
insert, 98
concurrent_hash_map, 91-99, 209-213
template class, 94
concurrent_queue, 81-83
iterating for debugging, 82
template class, 83-86
unbounded, 82
concurrent_vector, 86-90
template class, 87
const_accessor, 96
const_iterator, 82
const_range_type, 99
containers, 80-100
context switching, 290
continuation passing, 141, 145-153
children, with, 146
task, 147
convoying, 118, 290
Conway, John Horton, 190
CopyConstructible Concept, 288
correctness, 8, 23-24
CountStrings, 209-213
D[ Top ]
data parallelism, 9
data-parallel programming, 3, 136
deadlock, 24, 117
debugging, 165, 169
enabling, 172
release versus, 172, 174
decomposition, 8, 9-13
delete operators, replacing, 105
design patterns, 25-27
design spaces
algorithm structures, 26
finding concurrency, 26
implementation mechanisms, 27
supporting structures, 27
destructor, 32
Divide and Conquer, 26
domain decomposition, 266
do-nothing subroutines, 174
Dummy Tasks to the Rescue, 230
dynamic scheduling, 5
E[ Top ]
ECMA CLI parallel profile, 285
embarrassingly parallel, 9, 18
empty tasks, 151, 230
empty_task, 151
class, 166
erase from hash map, 98
errata, 177
Event-Based Coordination, 26
examples, 177-280
concurrent_hash_map, 209-213
download the source code, 177
filter classes, 247-255
game threading, 262
parallel_for, 180-189
parallel_reduce, 199-205
pipeline, 233
explicit synchronization, 22
explicit task destruction, 160
F[ Top ]
fair mutex, 115
fair scheduling, 135
fairness and efficiency, 135
false sharing, 102
feeding two from the same task in a pipeline, 233-236
fences, 126
fetch-and-add, 123
fetch-and-store, 123
Fibonacci numbers, 137-140
filter class, 70, 72, 77
examples, 247-255
final_scan_tag class, 63
find in hash map, 98
finding concurrency, 26
fine-grained locking, 80, 111
fine-grained parallelism, 12, 20
FJTask, 284
for (see parallel_for)
Fortran, 5, 21, 286
forwarding class, 255
frame loops, 265
free, replacing, 104
function object, 43
G[ Top ]
Game of Life, 190-199
automata, 192
implementation, 193-198
cells, live or dead, 191
generation, 192
implementation, 192
Gardner, Martin, 190
gateway class, 254
Gaussian elimination, 223
generic programming, 1, 4, 286
Concepts, 287
models, 288
pseudosignatures, 287
geometric decomposition, 26
get_next_packet class, 251
grain size, 13, 30, 36-39, 179
automatic, 38
ParallelSum and, 200
guided scheduling, 5
Gustafson, John, 16
Gustafson's observations, 16-18
H[ Top ]
half-open intervals, 35
happy parallelism, 9
hash map (see concurrent_hash_map)
hash table, 91
HashCompare, 93
high coding overhead, 136
High Performance Computing (HPC), 285
highly concurrent containers, 80-100
history and related projects, 283-291
hot in cache data, 290
hotspots, 276
I[ Top ]
implementation mechanisms, 27
implicit synchronization, 236
initializing and terminating the library, 30-32
initializing the library, 137
insert in hash map, 98
instruction pointer, 20
Integrated Performance Primitives (IPP) library, 3
Intel Thread Checker, 169, 173
Intel Thread Profiler, 169
intuition, 8
intuitively obvious locking, 111
IP addresses, local and worldwide, 239
iteration spaces, 45, 66, 286
J[ Top ]
Java Specification Request #166 (JSR-166), 284
join method, 41
reverse_join, 51
split-join sequence, 42
K[ Top ]
keys to success, 169-176
L[ Top ]
lambda functions, 291
languages, Cilk, 285
latency, 75
lazy copying, 152
Lea, Doug, 284
libraries, 283
Chare Kernel, 283
ECMA CLI parallel profile, 285
Java Specification Request #166 (JSR-166), 284
McRT-Malloc, 285
Standard Template Adaptive Parallel Library (STAPL), 285
Standard Template Library (STL), 284
linear or order of n scaling (O(n)), 16
linear pipelines, 75
linked lists, 66
dynamic arrays versus, 66
load balancing, 136, 268
lock preemption, 290
lock-free algorithms, 80
locks, 22, 80, 110, 112, 171, 178
convoying, 118
deadlock, 117
fine-grained locking, 111
intuitively obvious locking, 111
naming, 114
pathologies, 117
priority inversion, 118
reader locks, 116
upgrading and downgrading, 116
writer locks, 116
logical threads, 134
loop parallelization, 29, 32-52
parallel_for, 33-40
parallel_reduce, 40-45
parallel_scan, 49
rule of thumb, 38
loop templates, 133
M[ Top ]
malloc, 101
replacing, 104-105
Math Kernel Library (MKL), 3
matrix multiply, 183, 223-224
Mattson, Tim, 26, 293
McRT-Malloc, 285
memory, weak memory consistency, 126
memory allocation, 101-109, 257-260
cache lines, 102
false sharing, 102
replacing malloc, calloc, realloc, and free, 260
replacing new and delete, 257-260
memory allocators, 103
memory consistency, 126-127
merge sort, 289
Message Passing Interface (MPI), 4
molecular dynamics (MD), 269
multi-core processors, xix, 7
multitasking, 19
mutex, 80, 112-122, 133
class, 119
Concept, 119
fair, 115
flavors, 114
non-reentrant, 115
reentrant, 115
scalable, 115
scoped locking pattern, 119
sleeping, 115
spinning in user space, 115
typedef and, 113
unfair, 115
mutual exclusion, 21, 110-129
avoiding, 172
N[ Top ]
NAMD application, 269
naming conventions of variables and functions, 176
nested parallelism, 2, 134
network address translation (NAT), 238, 241
network interface controller (NIC), 242
new operators, replacing, 105, 257-261
nonblocking algorithms, 124
nondeterminism, 24
nonlinear pipelines, 75, 233-236
O[ Top ]
object-oriented programming, 25
objects that share internal state, 111
Open Dynamics Engine (ODE), 275-282
OpenMP, 3, 5, 174, 286
Taskqueue, 5, 286
operator( ), 50
output_packet class, 251
overlapping window strategy, 70
oversubscription, 134
P[ Top ]
packet forwarding, 242
packet processing pipeline, 237-255
parallel algorithms for streams, 65-79
parallel iteration (see iteration spaces)
parallel prefix, 49
parallel range, splitting, 177
parallel_for, 29, 33-40, 178
template function, 61
parallel_reduce, 40-45
parallel_for versus, 41
ParallelPrime, 200
ParallelSum, 199
partitioner with, 45
sequential execution, 42
template function, 62
parallel_scan, 49-52
floating point addition and, 51
partitioner with, 51
template function, 63
parallel_sort, 78
comparison sort, 79
template function, 79
parallel_while, 66-69, 83
scalability, 68
template class, 68
ParallelAverage, 180
parallelism, xix, 1, 7-28, 199
achieving, 12
assembly languages of, 4, 136
coarse-grained, 11
data, 9
embarrassingly parallel, 9
fine-grained, 12
glass half empty versus glass half full view, 16
happy parallelism, 9
hybrid, 12
intuition about, 27
long lines, and, 8
loop, 32-52
lots of repetitive work, and, 9
nested, 2, 134
pipelining, 10
recursive, 140
task, 10
ParallelMerge, 184, 289
ParallelPrime, 200
ParallelSum, 199
grain size and, 200
Partitioner Concept, 58
partitioners, 38, 45, 51
patterns, 8, 25-27
Patterns for Parallel Programming, 25
performance, 3, 271
physical threads, 134
physics interaction and update, 271
physics threads, 269
pipeline, 66, 83, 233
class, 70, 76
components for the local network router, 241-247
pipelines, 26, 70, 74, 237-255
pipelining, 10-13, 69-78
latency, 75
nonlinear, 75
throughput, 74
tokens, 71
pop_if_present for parallel_while, 66-69
pop_if_present from queue, 82, 84-85
port mapping, 248
portability, 137
POSIX threads (pthreads), 4, 136
pre_scan_tag class, 63
preemptive scheduler, 19
prime (see ParallelPrime)
priority inversion, 118
pseudosignatures, 287
real signatures versus, 288
valid expressions versus, 288
push on a queue, 82
Q[ Top ]
queue (see concurrent_queue)
queues, when not to use, 83
queuing, 116
queuing_mutex, 116
class, 120
queuing_rw_mutex, 116
class, 122
quick sort, 215-219, 289
R[ Top ]
race conditions, 22, 24, 169
Range Concept, 53
range_type, 99
raw threads, 134
reader locks, 116
ReaderWriterMutex Concept, 120
ready pool, 143, 153
realloc, replacing, 104
recursive, 30
blocked, 48
chain reaction, 147
functions, 178, 218
parallel ranges (pRanges), 285
parallelism, 140
range specifications, 52-64
ranges, 286
recursive ranges
splitting, 6, 178
reduce (see parallel_reduce)
reduction operation, 40
reentrant mutex, 115
relaxed sequential execution, 169
model, 170
release, 113
release libraries, 174
reverse_join, 51
Robison, Arch, 283
S[ Top ]
scalability, 2, 7, 13-19, 30, 136, 264
weak scaling, 18
scalable
allocators, 81, 103
memory allocation, 101-109
memory allocator, 169
mutex, 115
template class, 107
scaling, 8, 13
scan (see parallel_scan)
scheduler bypass, 149
scoped lock, 114
scoped locking pattern, 119
seismic wave simulation (wave propagation), 181
sequential execution, 170
sequential overhead, 18
serial parts of a parallel system, 16
serial versus parallel algorithms, 18
SerialMatrixMultiply, 183
shared memory, 178, 233
Sieve of Eratosthenes, 201
simple_partitioner class, 59
sort (see parallel_sort; quick sort)
spawn_and_wait_for_all, 147
speedup, 13-19, 69
spin_mutex, 113, 115
class, 120
spin_rw_mutex, 116
class, 122
split class, 53
split/join patterns, 141
blocking style, 141
continuation passing, 141
Splittable Concept, 52
Splittable Ranges, 52
splitting, 29
splitting constructor, 41, 45, 52
stack, 20
Standard Template Adaptive Parallel Library (STAPL), 285
Standard Template Library (STL), xix, 4, 21, 33, 80, 284
static scheduling, 5
Stepanov, Alexander, 286
Strassen multiply algorithm, 223
StrassenMultiply, 223-224
Stroustrup, Bjarne, 286
substring matching, 186
SubstringFinder, 186-189
supporting structures, 27
synchronization, 13, 20, 24, 163-164, 170, 178, 236
barrier, 265
explicit, 22
T[ Top ]
task class, 155
allocation, 158
context, 165
debugging, 165
depth, 162
derivation, 158
destruction, 161
recycling, 161
synchronization, 163
task graph, 142
task parallelism, 10, 26
task recurrence patterns, 145-147
task scheduler, 31, 133-168, 179, 218
blocking style with children, 145
breadth-first execution, 142
breadth-first theft and depth-first work, 144
continuation passing, 147
continuation-passing style with children, 145
empty tasks, 151
explicit task destruction, 160
how it works, 142
interface, 278
interfaces, 153-164
lazy copying, 152
load balancing and, 136
making best use of the scheduler, 147-153
minimize space, 142
overview, 140
ready pool, 143, 153
recursive chain reaction, 147
recycling, 150
parent as a child, 146
parent as the continuation, 145
scheduler bypass, 149
strike when the cache is hot, 142
tree-structured task graphs, 147
task stealing, 6, 30, 218, 266, 268
task_list class, 167
task_scheduler_init, 31
class, 153
tasks versus threads, 2
_ _TBB prefix, 176
tbb namespace, 30
tbb::internal namespace, 176
tbb::split, 46
TBB_DO_ASSERT macro, 172
TBB_DO_THREADING_TOOLS macro, 173
templates, 2
terminating the library, 30-32
Think Parallel, 7, 169
thread management, 21
Threading Building Blocks
benefits, 2
download, 1
overview, 1-6
threading packages, other, 174
threads, 8, 19-21
architecture, 263
physics interactions, 263
rendering, 263
game example, 262
instruction pointer, 20
Intel Thread Profiler, 169
logical, 134
physical, 134
physics, 269
POSIX, 136
processes versus, 20
programming for parallelism, 20
raw native, 134
ready pool, 143
stack, 20
testing, 169
Windows, 136
thread-safe, 21
code, 110
libraries, 21
tick_count, 130-132
class, 131
time slicing, 290
cache cooling and, 290
context switching and, 290
convoying and, 290
hot in cache data, 290
lock preemption and, 290
timing, 130-132
tree-structured task graphs, 147
U[ Top ]
undersubscription, 134
unfair mutex, 115
unstable sort, 78
using directive, 31
V[ Top ]
variable swap, 170
variables and functions, naming conventions, 176
vector (see concurrent_vector)
VTune Performance Analyzer, 276
W[ Top ]
weak memory consistency, 126
weak scaling, 18
while (see parallel_while)
whole vector operations, 88
whole-table operations, 95
Windows threads, 136
work stealing, 5, 285
worldwide and local IP addresses, 239
writer locks, 116
Return to Intel Threading Building Blocks