BUY THIS BOOK
Add to Cart

Print Book $34.99


Add to Cart

PDF $23.99

What is this?

Add to UK Cart

Print Book £21.99

What is this?

Looking to Reprint or License this content?

Intel Threading Building Blocks
Intel Threading Building Blocks Outfitting C++ for Multi-core Processor Parallelism

By James Reinders
Book Price: $34.99 USD
£21.99 GBP
PDF Price: $23.99

Cover | Table of Contents | Colophon


Index


[ A ], 
[ B ], 
[ C ], 
[ D ], 
[ E ], 
[ F ], 
[ G ], 
[ H ], 
[ I ], 
[ J ], 
[ K ], 
[ L ], 
[ M ], 
[ N ], 
[ O ], 
[ P ], 
[ Q ], 
[ R ], 
[ S ], 
[ T ], 
[ U ], 
[ V ], 
[ W ], 


A[ Top ]
A-B-A problem, 125
abstraction, 8, 25-27
accessor, 96
      class, 97
acquire method for locks, 113
Aha! factor in examples, 177
algorithm structures, 26
algorithm templates, 29-64, 65-79, 169
aligned_space template class, 109
allocator arguments, 81
Allocator Concept, 106
Amdahl, Gene, 14, 292
Amdahl's Law, 14-18
      Gustafson's observations and, 16-18
Application Layer Gateway (ALG), 242
assembly languages of parallelism, 4, 136
atomic and constructors, 125
atomic operations, 22, 112, 122-129
      convoying, minimizing, 118
      interleaving, 123
      mutual exclusion, versus, 123
      priority inversion, minimizing, 118
      thread-safe reference counting, 123
atomic template class, 127
auto_partitioner class, 38, 45, 52, 59
automata, 192
automatic grain size, 38

B[ Top ]
bibliography, 292-295
blocked_range, 35
      template class, 54
blocked_range2d, 49
      template class, 56-58
blocking, 133
blocking style, 141
      children, with, 145
body object, 33
Boost Threads, 4
breadth-first execution, 142
breadth-first theft and depth-first work, 144

C[ Top ]
C++, 1, 5
cache, 8, 289
      aligned allocator, 103
      cooling, 290
      lines, 102
calloc, replacing, 104
Chare Kernel, xvi, 283
Charm++, xvi, 284
Cilk, 285
coarse-grained parallelism, 11, 20
compare_and_swap, 124
comparison sort (see parallel_sort)
compatibility, 3
Concepts
      generic programming, in, 287
      modeling a concept, 288
      Sortable, 287
concurrency, 21, 23, 26, 171
concurrent access, 96
concurrent operations, 88
      erase, 98
      find, 98
      insert, 98
concurrent_hash_map, 91-99, 209-213
      template class, 94
concurrent_queue, 81-83
      iterating for debugging, 82
      template class, 83-86
      unbounded, 82
concurrent_vector, 86-90
      template class, 87
const_accessor, 96
const_iterator, 82
const_range_type, 99
containers, 80-100
context switching, 290
continuation passing, 141, 145-153
      children, with, 146
      task, 147
convoying, 118, 290
Conway, John Horton, 190
CopyConstructible Concept, 288
correctness, 8, 23-24
CountStrings, 209-213

D[ Top ]
data parallelism, 9
data-parallel programming, 3, 136
deadlock, 24, 117
debugging, 165, 169
      enabling, 172
      release versus, 172, 174
decomposition, 8, 9-13
delete operators, replacing, 105
design patterns, 25-27
design spaces
      algorithm structures, 26
      finding concurrency, 26
      implementation mechanisms, 27
      supporting structures, 27
destructor, 32
Divide and Conquer, 26
domain decomposition, 266
do-nothing subroutines, 174
Dummy Tasks to the Rescue, 230
dynamic scheduling, 5

E[ Top ]
ECMA CLI parallel profile, 285
embarrassingly parallel, 9, 18
empty tasks, 151, 230
empty_task, 151
      class, 166
erase from hash map, 98
errata, 177
Event-Based Coordination, 26
examples, 177-280
      concurrent_hash_map, 209-213
      download the source code, 177
      filter classes, 247-255
      game threading, 262
      parallel_for, 180-189
      parallel_reduce, 199-205
      pipeline, 233
explicit synchronization, 22
explicit task destruction, 160

F[ Top ]
fair mutex, 115
fair scheduling, 135
fairness and efficiency, 135
false sharing, 102
feeding two from the same task in a pipeline, 233-236
fences, 126
fetch-and-add, 123
fetch-and-store, 123
Fibonacci numbers, 137-140
filter class, 70, 72, 77
      examples, 247-255
final_scan_tag class, 63
find in hash map, 98
finding concurrency, 26
fine-grained locking, 80, 111
fine-grained parallelism, 12, 20
FJTask, 284
for (see parallel_for)
Fortran, 5, 21, 286
forwarding class, 255
frame loops, 265
free, replacing, 104
function object, 43

G[ Top ]
Game of Life, 190-199
      automata, 192
            implementation, 193-198
      cells, live or dead, 191
      generation, 192
      implementation, 192
Gardner, Martin, 190
gateway class, 254
Gaussian elimination, 223
generic programming, 1, 4, 286
      Concepts, 287
      models, 288
      pseudosignatures, 287
geometric decomposition, 26
get_next_packet class, 251
grain size, 13, 30, 36-39, 179
      automatic, 38
      ParallelSum and, 200
guided scheduling, 5
Gustafson, John, 16
Gustafson's observations, 16-18

H[ Top ]
half-open intervals, 35
happy parallelism, 9
hash map (see concurrent_hash_map)
hash table, 91
HashCompare, 93
high coding overhead, 136
High Performance Computing (HPC), 285
highly concurrent containers, 80-100
history and related projects, 283-291
hot in cache data, 290
hotspots, 276

I[ Top ]
implementation mechanisms, 27
implicit synchronization, 236
initializing and terminating the library, 30-32
initializing the library, 137
insert in hash map, 98
instruction pointer, 20
Integrated Performance Primitives (IPP) library, 3
Intel Thread Checker, 169, 173
Intel Thread Profiler, 169
intuition, 8
intuitively obvious locking, 111
IP addresses, local and worldwide, 239
iteration spaces, 45, 66, 286

J[ Top ]
Java Specification Request #166 (JSR-166), 284
join method, 41
      reverse_join, 51
      split-join sequence, 42

K[ Top ]
keys to success, 169-176

L[ Top ]
lambda functions, 291
languages, Cilk, 285
latency, 75
lazy copying, 152
Lea, Doug, 284
libraries, 283
      Chare Kernel, 283
      ECMA CLI parallel profile, 285
      Java Specification Request #166 (JSR-166), 284
      McRT-Malloc, 285
      Standard Template Adaptive Parallel Library (STAPL), 285
      Standard Template Library (STL), 284
linear or order of n scaling (O(n)), 16
linear pipelines, 75
linked lists, 66
      dynamic arrays versus, 66
load balancing, 136, 268
lock preemption, 290
lock-free algorithms, 80
locks, 22, 80, 110, 112, 171, 178
      convoying, 118
      deadlock, 117
      fine-grained locking, 111
      intuitively obvious locking, 111
      naming, 114
      pathologies, 117
      priority inversion, 118
      reader locks, 116
      upgrading and downgrading, 116
      writer locks, 116
logical threads, 134
loop parallelization, 29, 32-52
      parallel_for, 33-40
      parallel_reduce, 40-45
      parallel_scan, 49
      rule of thumb, 38
loop templates, 133

M[ Top ]
malloc, 101
      replacing, 104-105
Math Kernel Library (MKL), 3
matrix multiply, 183, 223-224
Mattson, Tim, 26, 293
McRT-Malloc, 285
memory, weak memory consistency, 126
memory allocation, 101-109, 257-260
      cache lines, 102
      false sharing, 102
      replacing malloc, calloc, realloc, and free, 260
      replacing new and delete, 257-260
memory allocators, 103
memory consistency, 126-127
merge sort, 289
Message Passing Interface (MPI), 4
molecular dynamics (MD), 269
multi-core processors, xix, 7
multitasking, 19
mutex, 80, 112-122, 133
      class, 119
      Concept, 119
      fair, 115
      flavors, 114
      non-reentrant, 115
      reentrant, 115
      scalable, 115
      scoped locking pattern, 119
      sleeping, 115
      spinning in user space, 115
      typedef and, 113
      unfair, 115
mutual exclusion, 21, 110-129
      avoiding, 172

N[ Top ]
NAMD application, 269
naming conventions of variables and functions, 176
nested parallelism, 2, 134
network address translation (NAT), 238, 241
network interface controller (NIC), 242
new operators, replacing, 105, 257-261
nonblocking algorithms, 124
nondeterminism, 24
nonlinear pipelines, 75, 233-236

O[ Top ]
object-oriented programming, 25
objects that share internal state, 111
Open Dynamics Engine (ODE), 275-282
OpenMP, 3, 5, 174, 286
      Taskqueue, 5, 286
operator( ), 50
output_packet class, 251
overlapping window strategy, 70
oversubscription, 134

P[ Top ]
packet forwarding, 242
packet processing pipeline, 237-255
parallel algorithms for streams, 65-79
parallel iteration (see iteration spaces)
parallel prefix, 49
parallel range, splitting, 177
parallel_for, 29, 33-40, 178
      template function, 61
parallel_reduce, 40-45
      parallel_for versus, 41
      ParallelPrime, 200
      ParallelSum, 199
      partitioner with, 45
      sequential execution, 42
      template function, 62
parallel_scan, 49-52
      floating point addition and, 51
      partitioner with, 51
      template function, 63
parallel_sort, 78
      comparison sort, 79
      template function, 79
parallel_while, 66-69, 83
      scalability, 68
      template class, 68
ParallelAverage, 180
parallelism, xix, 1, 7-28, 199
      achieving, 12
      assembly languages of, 4, 136
      coarse-grained, 11
      data, 9
      embarrassingly parallel, 9
      fine-grained, 12
      glass half empty versus glass half full view, 16
      happy parallelism, 9
      hybrid, 12
      intuition about, 27
      long lines, and, 8
      loop, 32-52
      lots of repetitive work, and, 9
      nested, 2, 134
      pipelining, 10
      recursive, 140
      task, 10
ParallelMerge, 184, 289
ParallelPrime, 200
ParallelSum, 199
      grain size and, 200
Partitioner Concept, 58
partitioners, 38, 45, 51
patterns, 8, 25-27
Patterns for Parallel Programming, 25
performance, 3, 271
physical threads, 134
physics interaction and update, 271
physics threads, 269
pipeline, 66, 83, 233
      class, 70, 76
      components for the local network router, 241-247
pipelines, 26, 70, 74, 237-255
pipelining, 10-13, 69-78
      latency, 75
      nonlinear, 75
      throughput, 74
      tokens, 71
pop_if_present for parallel_while, 66-69
pop_if_present from queue, 82, 84-85
port mapping, 248
portability, 137
POSIX threads (pthreads), 4, 136
pre_scan_tag class, 63
preemptive scheduler, 19
prime (see ParallelPrime)
priority inversion, 118
pseudosignatures, 287
      real signatures versus, 288
      valid expressions versus, 288
push on a queue, 82

Q[ Top ]
queue (see concurrent_queue)
queues, when not to use, 83
queuing, 116
queuing_mutex, 116
      class, 120
queuing_rw_mutex, 116
      class, 122
quick sort, 215-219, 289

R[ Top ]
race conditions, 22, 24, 169
Range Concept, 53
range_type, 99
raw threads, 134
reader locks, 116
ReaderWriterMutex Concept, 120
ready pool, 143, 153
realloc, replacing, 104
recursive, 30
      blocked, 48
      chain reaction, 147
      functions, 178, 218
      parallel ranges (pRanges), 285
      parallelism, 140
      range specifications, 52-64
      ranges, 286
recursive ranges
      splitting, 6, 178
reduce (see parallel_reduce)
reduction operation, 40
reentrant mutex, 115
relaxed sequential execution, 169
      model, 170
release, 113
release libraries, 174
reverse_join, 51
Robison, Arch, 283

S[ Top ]
scalability, 2, 7, 13-19, 30, 136, 264
      weak scaling, 18
scalable
      allocators, 81, 103
      memory allocation, 101-109
      memory allocator, 169
      mutex, 115
      template class, 107
scaling, 8, 13
scan (see parallel_scan)
scheduler bypass, 149
scoped lock, 114
scoped locking pattern, 119
seismic wave simulation (wave propagation), 181
sequential execution, 170
sequential overhead, 18
serial parts of a parallel system, 16
serial versus parallel algorithms, 18
SerialMatrixMultiply, 183
shared memory, 178, 233
Sieve of Eratosthenes, 201
simple_partitioner class, 59
sort (see parallel_sort; quick sort)
spawn_and_wait_for_all, 147
speedup, 13-19, 69
spin_mutex, 113, 115
      class, 120
spin_rw_mutex, 116
      class, 122
split class, 53
split/join patterns, 141
      blocking style, 141
      continuation passing, 141
Splittable Concept, 52
Splittable Ranges, 52
splitting, 29
splitting constructor, 41, 45, 52
stack, 20
Standard Template Adaptive Parallel Library (STAPL), 285
Standard Template Library (STL), xix, 4, 21, 33, 80, 284
static scheduling, 5
Stepanov, Alexander, 286
Strassen multiply algorithm, 223
StrassenMultiply, 223-224
Stroustrup, Bjarne, 286
substring matching, 186
SubstringFinder, 186-189
supporting structures, 27
synchronization, 13, 20, 24, 163-164, 170, 178, 236
      barrier, 265
      explicit, 22

T[ Top ]
task class, 155
      allocation, 158
      context, 165
      debugging, 165
      depth, 162
      derivation, 158
      destruction, 161
      recycling, 161
      synchronization, 163
task graph, 142
task parallelism, 10, 26
task recurrence patterns, 145-147
task scheduler, 31, 133-168, 179, 218
      blocking style with children, 145
      breadth-first execution, 142
      breadth-first theft and depth-first work, 144
      continuation passing, 147
      continuation-passing style with children, 145
      empty tasks, 151
      explicit task destruction, 160
      how it works, 142
      interface, 278
      interfaces, 153-164
      lazy copying, 152
      load balancing and, 136
      making best use of the scheduler, 147-153
      minimize space, 142
      overview, 140
      ready pool, 143, 153
      recursive chain reaction, 147
      recycling, 150
            parent as a child, 146
            parent as the continuation, 145
      scheduler bypass, 149
      strike when the cache is hot, 142
      tree-structured task graphs, 147
task stealing, 6, 30, 218, 266, 268
task_list class, 167
task_scheduler_init, 31
      class, 153
tasks versus threads, 2
_ _TBB prefix, 176
tbb namespace, 30
tbb::internal namespace, 176
tbb::split, 46
TBB_DO_ASSERT macro, 172
TBB_DO_THREADING_TOOLS macro, 173
templates, 2
terminating the library, 30-32
Think Parallel, 7, 169
thread management, 21
Threading Building Blocks
      benefits, 2
      download, 1
      overview, 1-6
threading packages, other, 174
threads, 8, 19-21
      architecture, 263
            physics interactions, 263
            rendering, 263
      game example, 262
      instruction pointer, 20
      Intel Thread Profiler, 169
      logical, 134
      physical, 134
      physics, 269
      POSIX, 136
      processes versus, 20
      programming for parallelism, 20
      raw native, 134
      ready pool, 143
      stack, 20
      testing, 169
      Windows, 136
thread-safe, 21
      code, 110
      libraries, 21
tick_count, 130-132
      class, 131
time slicing, 290
      cache cooling and, 290
      context switching and, 290
      convoying and, 290
      hot in cache data, 290
      lock preemption and, 290
timing, 130-132
tree-structured task graphs, 147

U[ Top ]
undersubscription, 134
unfair mutex, 115
unstable sort, 78
using directive, 31

V[ Top ]
variable swap, 170
variables and functions, naming conventions, 176
vector (see concurrent_vector)
VTune Performance Analyzer, 276

W[ Top ]
weak memory consistency, 126
weak scaling, 18
while (see parallel_while)
whole vector operations, 88
whole-table operations, 95
Windows threads, 136
work stealing, 5, 285
worldwide and local IP addresses, 239
writer locks, 116


Return to Intel Threading Building Blocks