Parallel reduction with warp primitives

Let's see how this can benefit our parallel reduction implementation. This recipe uses the shfl_down() function in Cooperative Groups and the __shfl_down_sync() warp primitive function. The following figure shows how the shift-down operation works with __shfl_down_sync():

In this collective operation, CUDA threads in a warp can shift a specified register value to another thread in the same warp and synchronize with it. Specifically, the collective operation has two required steps, plus an optional third:

  1. Identifying the CUDA threads in a warp that will take part in the operation, by masking or ballot sourcing.
  2. Letting CUDA ...
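
As a rough illustration of the mask-then-shift pattern, here is a minimal sketch of a warp-level sum built on __shfl_down_sync(). It is not the book's implementation: the names warp_reduce_sum, reduce_kernel, and reduce_on_device are hypothetical, the block size of 256 is an assumption (any multiple of 32 keeps every warp full), and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: sums `val` across the 32 lanes of a warp.
// After the loop, lane 0 holds the total for the whole warp.
__inline__ __device__ float warp_reduce_sum(unsigned mask, float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        // Each lane reads the register value held by lane (lane_id + offset).
        val += __shfl_down_sync(mask, val, offset);
    }
    return val;
}

// Hypothetical kernel: one partial sum per warp, combined with atomicAdd.
__global__ void reduce_kernel(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Name the participating lanes. Every lane takes part here;
    // lanes past the end of the input simply contribute 0.
    const unsigned mask = 0xffffffffu;
    float val = (idx < n) ? in[idx] : 0.0f;

    val = warp_reduce_sum(mask, val);

    // Lane 0 of each warp adds its warp's total to the global result.
    if ((threadIdx.x & (warpSize - 1)) == 0) {
        atomicAdd(out, val);
    }
}

// Hypothetical host wrapper (no error checking, for brevity).
float reduce_on_device(const std::vector<float>& host_in) {
    const int n = static_cast<int>(host_in.size());
    if (n == 0) return 0.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int threads = 256;  // assumed block size, a multiple of warpSize
    const int blocks = (n + threads - 1) / threads;
    reduce_kernel<<<blocks, threads>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling reduce_on_device(std::vector<float>(1000, 1.0f)) should return 1000. The same shift can be written with Cooperative Groups by creating a 32-thread tile with tiled_partition<32>() and calling its shfl_down() member, which needs no explicit mask because the tile itself defines the participating threads.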
