
case. Moreover, as noted above, the long latency of global memory can
often be hidden by running many threads that the hardware timeshares:
while one warp is waiting for data from memory, another warp can be
executing, so that little or no time is lost to the long fetch delay.
For these reasons, CUDA programmers typically employ a large number of
threads, each of which does only a small amount of work. This is, again,
quite a contrast to something like OpenMP.
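To make the "many threads, little work each" style concrete, here is a minimal sketch in which every thread updates exactly one array element. The names scaleAdd and runScaleAdd, the constant 3.0, and the block size of 256 are illustrative assumptions, not taken from the text; the code also assumes n is positive and that a CUDA-capable device is present.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: one thread per array element, so each thread does very little work.
// With many more threads than cores, the hardware can switch to another warp
// while one warp waits on global memory.
__global__ void scaleAdd(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // skip threads past the end
        y[i] = a * x[i] + y[i];
}

// Hypothetical host helper (not from the text): fills x with 1 and y with 2,
// computes y = 3*x + y on the GPU, and returns the number of elements that
// differ from the expected value 5, or -1 if an allocation or CUDA call failed.
int runScaleAdd(int n)
{
    size_t bytes = (size_t) n * sizeof(float);
    float *hx = (float *) malloc(bytes);
    float *hy = (float *) malloc(bytes);
    float *dx = NULL, *dy = NULL;
    int bad = -1;                        // stays -1 if any step fails
    int ok = (hx != NULL && hy != NULL);

    if (ok)
        for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Allocate device arrays and copy the inputs to global memory.
    ok = ok && cudaMalloc((void **) &dx, bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void **) &dy, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threadsPerBlock = 256;       // assumed value, chosen arbitrarily
        // Round up so that there is at least one thread per element.
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleAdd<<<blocks, threadsPerBlock>>>(dx, dy, 3.0f, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // This copy waits for the kernel to finish before reading the results.
    ok = ok && cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    if (ok) {
        bad = 0;
        for (int i = 0; i < n; i++)
            if (hy[i] != 5.0f) bad++;    // 3*1 + 2 is exactly 5 in float
    }

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return bad;
}
```

For a million-element array this launches about a million threads, far more than the GPU has cores, which is what gives the hardware other warps to run while some are stalled on memory. An OpenMP version of the same loop would more typically use a handful of threads, each handling a large chunk of the array.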
6.4.2.5 Grid Configuration Choices
In choosing the number of blocks and the number of threads per block, one
typically knows the number of threads