May 2022
Intermediate to advanced
580 pages
18h 32m
English
This chapter introduces the on-chip memory architecture of GPUs, the concept of memory-bound applications, and techniques for improving the performance of memory-bound applications. The chapter uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses. It then introduces the tiling technique by which barrier synchronization is used to coordinate the timing of executing threads for improved locality and reduced global memory accesses. However, the tiling techniques involve additional complexities in boundary checks. The chapter uses matrix multiplication to illustrate the additional boundary checks that are needed for a tiled kernel to be applicable ...
Read now
Unlock full access