November 2016
Intermediate to advanced
576 pages
18h 22m
English
In this chapter, we reviewed the major aspects of application performance on a CUDA device: global memory access coalescing, memory parallelism, control flow divergence, dynamic resource partitioning, and instruction mixes. Each of these aspects is rooted in the hardware limitations of the device. Based on these concepts, we introduced techniques for analyzing code for memory coalescing, channel/bank utilization, and control divergence. More importantly, we introduced techniques for converting poorly performing code into well-performing code: corner-turning, active thread index consolidation, and thread granularity coarsening.
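The coalescing analysis mentioned in this summary can be illustrated with a small host-side model. The following is a minimal sketch in plain C++, not device code and not taken from the chapter. It assumes a warp of 32 threads, a 128-byte DRAM burst, a row-major matrix of `float` starting at byte address 0, and hypothetical helper names (`bursts_touched`, `row_access`, `column_access`). Actual burst sizes and coalescing rules vary by device.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed values for illustration only; real devices may differ.
constexpr std::size_t kWarpSize = 32;     // threads per warp
constexpr std::uint64_t kBurstBytes = 128; // bytes per aligned DRAM burst

// Count the distinct burst-aligned segments touched by one warp's accesses.
// Fewer segments means the accesses coalesce better.
std::size_t bursts_touched(const std::vector<std::uint64_t>& byte_addresses) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : byte_addresses) {
        segments.insert(addr / kBurstBytes);
    }
    return segments.size();
}

// Thread t reads element (row, t): consecutive threads read consecutive floats.
std::vector<std::uint64_t> row_access(std::size_t row, std::size_t width) {
    std::vector<std::uint64_t> addresses;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        addresses.push_back(
            static_cast<std::uint64_t>(row * width + t) * sizeof(float));
    }
    return addresses;
}

// Thread t reads element (t, col): consecutive threads are a full row apart.
std::vector<std::uint64_t> column_access(std::size_t col, std::size_t width) {
    std::vector<std::uint64_t> addresses;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        addresses.push_back(
            static_cast<std::uint64_t>(t * width + col) * sizeof(float));
    }
    return addresses;
}
```

Under these assumptions, a warp reading along a row of a 1024-wide matrix touches a single burst, while a warp reading down a column touches 32 separate bursts. That gap is the kind of access pattern that a transformation such as corner-turning is meant to remove.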
Compute-bound; memory-bound; bottleneck; memory bandwidth; DRAM burst; ...