
120 Designing Scientific Applications on GPUs
can be more complex to maintain, but this extra development cost is justified when better performance is the goal.
7.3 General scheme of asynchronous parallel code with
computation/communication overlapping
In the previous section, we saw how to efficiently overlap computations (on both the CPU and the GPU) with communications (GPU transfers and internode communications). However, as shown in previous works [3, 4, 11], for some parallel iterative algorithms it can be even more efficient to use an asynchronous iteration scheme. In that case, the nodes do not wait for each other but they perform ...