
110 Designing Scientific Applications on GPUs
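[Figure: timeline of a multithreaded CPU program. The CPU main thread creates two CPU threads. Thread 1 (comm) performs the synchronous internode CPU communications (MPI comms). Thread 2 (comput) performs, in sequence, a synchronous CPU-to-GPU data transfer, an asynchronous GPU parallel computation (GPU thread exec., with the CPU waiting), and a synchronous GPU-to-CPU data transfer. Both threads meet at a CPU thread synchronization barrier, after which the CPU main thread runs its next instructions.]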
FIGURE 7.2. Overlap of internode CPU communications with a sequence of
CPU/GPU data transfers and GPU computations.
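The scheme of Figure 7.2 can be illustrated with a minimal sketch in standard C++. It is not the implementation discussed in this chapter: the names `exchange_with_neighbors`, `transfer_compute_transfer`, `StepResult`, and `overlapped_step` are hypothetical, and the MPI and GPU operations are replaced by plain CPU stand-ins so that the example stays self-contained. Only the thread structure (two CPU threads, then a join acting as the synchronization barrier) reflects the figure.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for thread 1 of the figure (synchronous internode
// CPU communications). A real code would call MPI here; this only copies.
void exchange_with_neighbors(const std::vector<double>& send,
                             std::vector<double>& recv) {
    recv = send;
}

// Hypothetical stand-in for thread 2 of the figure. The three steps run
// in order, as operations issued to a single GPU stream would.
void transfer_compute_transfer(const std::vector<double>& host_in,
                               std::vector<double>& host_out) {
    std::vector<double> device_buf = host_in;  // stands for the CPU-to-GPU transfer
    for (double& x : device_buf) x *= 2.0;     // stands for the GPU computation
    host_out = device_buf;                     // stands for the GPU-to-CPU transfer
}

struct StepResult {
    std::vector<double> received;  // data obtained by the communication thread
    std::vector<double> computed;  // results produced by the computation thread
};

// Runs both activities concurrently and returns once both have finished.
StepResult overlapped_step(const std::vector<double>& send,
                           const std::vector<double>& local) {
    StepResult r;
    // CPU thread creation: each thread writes to a distinct output buffer.
    std::thread comm([&] { exchange_with_neighbors(send, r.received); });
    std::thread comput([&] { transfer_compute_transfer(local, r.computed); });
    // The two joins play the role of the CPU thread synchronization barrier.
    comm.join();
    comput.join();
    assert(r.computed.size() == local.size());
    return r;  // the main thread continues with its next instructions
}
```

Because the two threads write to separate buffers, no lock is needed in this sketch; the joins are the only synchronization point, as in the figure.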
ensures implicit synchronization of all operations involving the same
GPU stream, such as the default stream in this example. The transfer of the
results has to wait un