Coalesced versus uncoalesced global memory access

To use global memory effectively, it is important to understand the concept of a warp in the CUDA programming model, which we have ignored so far. A warp is the unit of thread scheduling and execution in streaming multiprocessors (SMs). Once a block has been assigned to an SM, it is divided into groups of 32 threads, each known as a warp. The warp is the basic execution unit in CUDA programming.

To demonstrate the concept of a warp, let's look at an example. If two blocks are assigned to an SM and each block has 128 threads, then each block contains 128/32 = 4 warps, and the total number of warps on the SM is 4 * 2 = 8. The following diagram shows how a CUDA block gets divided and scheduled on a GPU SM:

How the ...
