
240 Designing Scientific Applications on GPUs
where $C_{ins}$ denotes the number of cycles to execute instruction $ins \in \{add, div, mult, cmp\}$.
Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for each of the $N_{col}$ columns that a block has to process. We must also take into account that the scheduler hides some latency by switching between warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$. Thus, the number of cycles relative to memory accesses is given by
$$C_{Accesses} = \frac{N_{col} \cdot (N_{el} + 1) \cdot C_{latency}}{N_W} \qquad (10.6)$$
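As a quick numerical check of Equation (10.6), the estimate can be sketched as a small Python function. The function name, parameter names, and the sample values below are illustrative assumptions and are not taken from the text; only the formula itself comes from the model above.

```python
def memory_access_cycles(n_col, n_el, c_latency, n_warps):
    """Estimate the cycles spent on memory accesses, following Eq. (10.6).

    n_col     -- number of columns handled by a block (N_col)
    n_el      -- variables loaded per thread for its partial sum (N_el)
    c_latency -- latency of one memory access, in cycles (C_latency)
    n_warps   -- number of warps available to hide latency (N_W)
    """
    if n_warps <= 0:
        raise ValueError("n_warps must be positive")
    # (n_el + 1): N_el loads for the partial sum plus one load of c_j.
    return n_col * (n_el + 1) * c_latency / n_warps


# Hypothetical values, chosen only to illustrate the arithmetic.
cycles = memory_access_cycles(n_col=4, n_el=8, c_latency=400, n_warps=6)
print(cycles)  # 4 * 9 * 400 / 6 = 2400.0
```

With these assumed values, doubling the number of warps halves the estimate, which reflects the latency-hiding assumption built into the model.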
At the end of the execution