Parallel reduction with warp primitives

Let's see how this can benefit our parallel reduction implementation. This recipe uses the shfl_down() function from Cooperative Groups and the __shfl_down_sync() warp primitive function. The following figure shows how the shift down operation works with __shfl_down_sync():

In this collective operation, CUDA threads in a warp can shift a specified register value to another thread in the same warp and synchronize with it. Specifically, the collective operation has two required steps, plus an optional third:

  1. Identifying, masking, or ballot sourcing the CUDA threads in a warp that will take part in the operation.
  2. Letting CUDA ...
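
As a rough illustration of the shift down operation described above, the following sketch sums an array by reducing each warp with __shfl_down_sync() and then combining the per-warp results with atomicAdd(). The names warp_reduce_sum, warp_sum_kernel, and reduce_with_shuffle are hypothetical and chosen only for this example, and the atomicAdd() step is an assumption made to keep the sketch short. It is not necessarily how the recipe combines the warp results.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: sums the values held by the lanes of one warp.
__inline__ __device__ float warp_reduce_sum(float val)
{
    // Mask naming the participating lanes; 0xffffffff selects all 32.
    const unsigned int full_mask = 0xffffffffu;

    // Each lane adds the register value of the lane `offset` positions
    // above it. Halving the offset folds 32 values into one in 5 rounds.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(full_mask, val, offset);

    return val; // only lane 0 holds the complete warp sum
}

// Hypothetical kernel: one atomicAdd per warp accumulates the total.
__global__ void warp_sum_kernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 so that every lane named in
    // the mask still executes the shuffle.
    float val = (idx < n) ? in[idx] : 0.0f;
    val = warp_reduce_sum(val);

    if (threadIdx.x % warpSize == 0) // lane 0 of each warp
        atomicAdd(out, val);
}

// Hypothetical host wrapper (error checking omitted for brevity).
float reduce_with_shuffle(const std::vector<float> &host_in)
{
    const int n = static_cast<int>(host_in.size());
    if (n == 0)
        return 0.0f;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int threads = 256; // a multiple of the warp size
    const int blocks = (n + threads - 1) / threads;
    warp_sum_kernel<<<blocks, threads>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling reduce_with_shuffle() on a std::vector<float> from a host program returns the sum of its elements. The exchange happens entirely in registers, so the warp-level step needs neither shared memory nor __syncthreads(). With Cooperative Groups, the same loop can be written against a 32-thread tile obtained from tiled_partition<32>(this_thread_block()), calling tile.shfl_down(val, offset) in place of __shfl_down_sync(); in that form the tile identifies the participating threads, so no explicit mask is passed.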

