Development methodologies for GPU and cluster of GPUs 115
tion on line 49. This synchronization is not mandatory, but it will make the
implementation more robust and will make debugging easier: all GPU
computations launched by OpenMP thread number 1 will have completed before
this thread enters a new loop iteration, or before the computation loop
ends.
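The pattern described above can be sketched as follows. This is a minimal illustration, not the listing the text refers to: the names sum_kernel and run_iterations, the reduction kernel, and the problem sizes are all assumptions made for the example. OpenMP thread number 1 launches its work on a dedicated stream and synchronizes on that stream before entering the next iteration; the partial result is then read back with a plain cudaMemcpy, which is synchronous and uses the default stream.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each block reduces its slice of `in` in shared memory,
// then adds its block sum to *out with atomicAdd. Launch with 256 threads/block.
__global__ void sum_kernel(const float *in, int n, float *out)
{
    __shared__ float cache[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, cache[0]);
}

// Hypothetical driver: returns 0 if all n_iter partial results are correct.
int run_iterations(int n_iter, int n)
{
    assert(n > 0 && n_iter >= 0);
    std::vector<float> h_in(n, 1.0f);          // expected sum is exactly n
    float *d_in = nullptr, *d_out = nullptr;
    cudaStream_t stream;
    int errors = 0, done = 0;                  // written by thread 1 only

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaStreamCreate(&stream);

    #pragma omp parallel num_threads(2)
    {
        // Only OpenMP thread number 1 drives the GPU computations here.
        if (omp_get_thread_num() == 1) {
            for (int it = 0; it < n_iter; ++it) {
                cudaMemsetAsync(d_out, 0, sizeof(float), stream);
                sum_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, n, d_out);

                // Not strictly mandatory, but guarantees that all work of
                // this iteration has completed before the thread goes on.
                if (cudaStreamSynchronize(stream) != cudaSuccess) ++errors;

                // Synchronous transfer of the partial result (default stream).
                float partial = 0.0f;
                cudaMemcpy(&partial, d_out, sizeof(float), cudaMemcpyDeviceToHost);
                if (fabsf(partial - (float)n) > 0.5f) ++errors;
                ++done;
            }
        }
    }

    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return (errors == 0 && done == n_iter) ? 0 : 1;
}
```

Calling run_iterations(4, 1 << 16) from a main function should return 0, assuming the file is compiled with OpenMP enabled (for example, nvcc -Xcompiler -fopenmp) and a CUDA device is available. The explicit stream synchronization also gives a natural place to check for errors once per iteration, which is one way it helps debugging.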
If a partial result has to be transferred from GPU to CPU memory at
the end of each loop iteration (for example, the result of one reduction per
iteration), this transfer is performed synchronously on the default stream (no
particular stream is specified) on lines 51–54. The availability of the result
values is ensured by the synchronization implemented on line 49. However, if a
partial result