We will now use Nsight to step through some code to help us better understand the CUDA GPU architecture, and in particular how branching within a kernel is handled. This will give us some insight into how to write more efficient CUDA kernels. By branching, we mean how the GPU handles control flow statements such as if, else, or switch within a CUDA kernel. In particular, we are interested in how branch divergence is handled within a kernel, which is what happens when one thread in a kernel satisfies the condition of an if statement while another thread doesn't and falls through to the corresponding else block: the threads are divergent because they are executing different pieces of code.
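Before we dive in, here is a minimal sketch of what a divergent kernel might look like; the even/odd condition and the printf calls are chosen purely for illustration, not taken from the example we will build next:

```
#include <cstdio>

// A minimal sketch of branch divergence: threads in the same warp take
// different branches depending on whether their thread index is even or
// odd, so the warp has to execute both paths, masking off the threads
// that are inactive on each one.
__global__ void divergence_example()
{
    if (threadIdx.x % 2 == 0)
        printf("Thread %d took the if branch.\n", threadIdx.x);
    else
        printf("Thread %d took the else branch.\n", threadIdx.x);
}

int main()
{
    // Launch a single warp's worth of threads (32) in one block.
    divergence_example<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Half of the 32 threads in the warp satisfy the condition and half don't, so the two branch bodies cannot run simultaneously across the warp; this is exactly the behavior we want to observe when we step through a kernel with Nsight.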
Let's write a small CUDA-C ...