November 2016
Intermediate to advanced
576 pages
18h 22m
English
This chapter introduces the concept of memory-bound applications. It uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses. It then introduces tiling, a technique in which barrier synchronization coordinates the execution timing of threads to improve locality and reduce global memory accesses. Tiling, however, adds complexity in the form of boundary checks. The chapter uses matrix multiplication to illustrate the boundary checks that a tiled kernel needs in order to handle arbitrary matrix sizes. The chapter concludes with an overview of how usage of shared memory and registers can affect the number of thread ...
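As a rough illustration of the tiling technique the abstract describes, the sketch below shows a tiled matrix multiplication kernel with barrier synchronization and boundary checks. It is not taken from the book: CUDA is assumed as the programming model, and the names `TILE_WIDTH`, `tiledMatMulKernel`, and `tiledMatMul` are hypothetical. Error checking on the runtime calls is omitted to keep the sketch short.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

#define TILE_WIDTH 16  // assumed tile size; not specified by the source

// Computes P = M * N for row-major matrices:
// M is rowsM x inner, N is inner x colsN, P is rowsM x colsN.
__global__ void tiledMatMulKernel(const float* M, const float* N, float* P,
                                  int rowsM, int inner, int colsN) {
    // One tile of M and one tile of N, shared by all threads in the block.
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;  // row of P this thread computes
    int col = blockIdx.x * TILE_WIDTH + tx;  // column of P this thread computes

    float acc = 0.0f;
    int numTiles = (inner + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int t = 0; t < numTiles; ++t) {
        int mCol = t * TILE_WIDTH + tx;
        int nRow = t * TILE_WIDTH + ty;
        // Boundary checks: elements outside the matrices are loaded as 0,
        // so they contribute nothing to the dot product.
        Ms[ty][tx] = (row < rowsM && mCol < inner) ? M[row * inner + mCol] : 0.0f;
        Ns[ty][tx] = (nRow < inner && col < colsN) ? N[nRow * colsN + col] : 0.0f;
        __syncthreads();  // barrier: both tiles fully loaded before use

        for (int k = 0; k < TILE_WIDTH; ++k) {
            acc += Ms[ty][k] * Ns[k][tx];
        }
        __syncthreads();  // barrier: all threads done before tiles are overwritten
    }
    // Boundary check on the output: threads outside P do not write.
    if (row < rowsM && col < colsN) {
        P[row * colsN + col] = acc;
    }
}

// Hypothetical host helper: copies inputs to the device, launches the
// kernel, and returns the product as a host vector.
std::vector<float> tiledMatMul(const std::vector<float>& M,
                               const std::vector<float>& N,
                               int rowsM, int inner, int colsN) {
    assert(M.size() == static_cast<size_t>(rowsM) * inner);
    assert(N.size() == static_cast<size_t>(inner) * colsN);
    std::vector<float> P(static_cast<size_t>(rowsM) * colsN);

    float *dM = nullptr, *dN = nullptr, *dP = nullptr;
    cudaMalloc(&dM, M.size() * sizeof(float));
    cudaMalloc(&dN, N.size() * sizeof(float));
    cudaMalloc(&dP, P.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N.data(), N.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    // Round up so partial tiles at the right and bottom edges are covered.
    dim3 grid((colsN + TILE_WIDTH - 1) / TILE_WIDTH,
              (rowsM + TILE_WIDTH - 1) / TILE_WIDTH);
    tiledMatMulKernel<<<grid, block>>>(dM, dN, dP, rowsM, inner, colsN);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(P.data(), dP, P.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dN);
    cudaFree(dP);
    return P;
}
```

In this sketch each element of M and N is read from global memory once per tile rather than once per multiply-add, which is where the reduction in global memory accesses comes from. The zero-fill on load is one common way to handle matrix dimensions that are not multiples of the tile width; other approaches are possible.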