In this chapter, we reviewed the major aspects of application performance on a CUDA device: global memory access coalescing, memory parallelism, control flow divergence, dynamic resource partitioning, and instruction mixes. Each of these aspects is rooted in the hardware limitations of the devices. Building on these concepts, we introduced techniques for analyzing code for memory coalescing, channel/bank utilization, and control divergence. More importantly, we introduced techniques for converting poorly performing code into well-performing code: corner-turning, active thread index consolidation, and thread granularity coarsening.
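As a reminder of what the coalescing analysis looks like in practice, the following is a minimal host-side C++ sketch (not device code) that counts how many DRAM bursts one warp touches for a given access pattern. The function name `bursts_per_warp` and the constants are hypothetical and chosen for illustration; the sketch assumes a 32-thread warp, 4-byte elements, and a 128-byte burst, and real devices may differ.

```cpp
#include <cstddef>

// Illustrative assumptions, not properties of any particular device.
constexpr std::size_t kWarpSize   = 32;   // threads per warp
constexpr std::size_t kElemBytes  = 4;    // size of one float element
constexpr std::size_t kBurstBytes = 128;  // assumed DRAM burst size

// Counts the distinct bursts touched when thread t of a warp accesses
// element (base_elem + t * stride_elems) of a linear array.
// Addresses never decrease with t, so counting changes in the burst
// index is enough to count distinct bursts.
constexpr std::size_t bursts_per_warp(std::size_t base_elem,
                                      std::size_t stride_elems) {
    std::size_t count = 0;
    std::size_t prev_burst = 0;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byte_addr =
            (base_elem + t * stride_elems) * kElemBytes;
        const std::size_t burst = byte_addr / kBurstBytes;
        if (t == 0 || burst != prev_burst) {
            ++count;
            prev_burst = burst;
        }
    }
    return count;
}
```

Under these assumptions, `bursts_per_warp(0, 1)` gives 1: consecutive threads read consecutive elements, so the whole warp is served by a single burst. With a row-major matrix of width 1024 in which each thread reads down a column, `bursts_per_warp(0, 1024)` gives 32, one burst per thread. That uncoalesced pattern is the one corner-turning is meant to remove.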
Compute-bound; memory-bound; bottleneck; memory bandwidth; DRAM burst; ...