
360 Designing Scientific Applications on GPUs
15.3 Single GPU implementation
In this section we describe the steps taken to enable Ludwig for the GPU.
There are a number of crucial issues: first, the minimization of data traffic
between host and device; second, the optimal mapping of available parallelism
onto the architecture; and third, the issue of memory coalescing. We discuss
each of these in turn.
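To make the third issue concrete, the following is a minimal sketch in plain C (it is not taken from Ludwig) of how the choice of data layout determines whether adjacent GPU threads touch adjacent memory locations. It assumes one thread per lattice site and a hypothetical velocity set of size NVEL = 19; the names index_aos, index_soa, site_stride_aos and site_stride_soa are invented for illustration only.

```c
#include <stddef.h>

/* Assumption for illustration: 19 distribution values per lattice site. */
enum { NVEL = 19 };

/* Array-of-structures layout: the NVEL values belonging to one site
 * are stored contiguously. nsites is unused but kept so that both
 * index functions share the same signature. */
size_t index_aos(size_t site, size_t vel, size_t nsites)
{
    (void) nsites;
    return site * NVEL + vel;
}

/* Structure-of-arrays layout: the values of one velocity component
 * for all sites are stored contiguously. */
size_t index_soa(size_t site, size_t vel, size_t nsites)
{
    return vel * nsites + site;
}

/* Distance, in array elements, between the locations accessed by two
 * threads that handle consecutive sites and read the same velocity. */
size_t site_stride_aos(size_t nsites)
{
    return index_aos(1, 0, nsites) - index_aos(0, 0, nsites);
}

size_t site_stride_soa(size_t nsites)
{
    return index_soa(1, 0, nsites) - index_soa(0, 0, nsites);
}
```

Under these assumptions, threads handling consecutive sites are NVEL elements apart in the first layout and one element apart in the second. Unit stride across threads is the access pattern that memory coalescing favors.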
While the collision stage is the most important part of the LB algorithm in
terms of floating-point performance, it cannot be the only consideration for
a GPU implementation. It is essential to offload all computational activity
that involves the main data structures ...