This chapter introduces key concepts of the data parallel execution model in CUDA. It first gives an overview of the multidimensional organization of CUDA threads, blocks, and grids. It then elaborates on the use of thread indices and block indices to map threads to different parts of the data, illustrated with a 2D image blur example. Next, it introduces barrier synchronization as a mechanism for coordinating the execution of threads within a block, followed by an introduction to resource queries. The chapter ends with an introduction to the concepts of transparent scalability, thread scheduling, and latency tolerance.
Execution configuration parameters; ...
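To make the index-mapping idea concrete, the following is a minimal sketch (not taken from the chapter itself) of a CUDA kernel that uses 2D block and thread indices to assign one thread to each pixel of an image; the kernel name, the halving operation, and the 16x16 block size are illustrative assumptions.

```cuda
// Hypothetical example: each thread computes its global (col, row)
// coordinates from blockIdx, blockDim, and threadIdx, then processes
// one pixel of a width x height grayscale image.
__global__ void scalePixels(unsigned char* img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x coordinate in the image
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y coordinate in the image
    if (col < width && row < height) {                // guard: extra threads do nothing
        img[row * width + col] /= 2;                  // example operation: darken pixel
    }
}

// Host-side launch with execution configuration parameters:
// a 2D grid of 16x16 blocks, rounded up to cover the whole image.
// dim3 dimBlock(16, 16);
// dim3 dimGrid((width + 15) / 16, (height + 15) / 16);
// scalePixels<<<dimGrid, dimBlock>>>(d_img, width, height);
```

The bounds check matters because the grid is rounded up to a whole number of blocks, so some threads may fall outside the image.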