Memory architecture and data locality
Abstract
This chapter introduces the on-chip memory architecture of GPUs, the concept of memory-bound applications, and techniques for improving the performance of such applications. It uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses, and then introduces tiling, a technique in which barrier synchronization coordinates the timing of executing threads to improve locality and reduce global memory accesses. Tiling, however, adds complexity in the form of boundary checks. The chapter uses matrix multiplication to illustrate the additional boundary checks that are needed for a tiled kernel to be applicable ...
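As a rough illustration of the ideas named in the abstract, the following is a minimal sketch of a tiled matrix multiplication kernel with boundary checks. It is not the chapter's own listing. The names `tiledMatMulKernel` and `tiledMatMul`, the tile size of 16, and the row-major layout are assumptions made for this example, and error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size; the abstract does not specify one

// Computes P = M * N for row-major matrices:
// M is rowsM x inner, N is inner x colsN, P is rowsM x colsN.
__global__ void tiledMatMulKernel(const float* M, const float* N, float* P,
                                  int rowsM, int inner, int colsN) {
    // On-chip shared memory holding one tile of M and one tile of N per block.
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float acc = 0.0f;
    int numPhases = (inner + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int ph = 0; ph < numPhases; ++ph) {
        int mCol = ph * TILE_WIDTH + tx;
        int nRow = ph * TILE_WIDTH + ty;
        // Boundary checks: out-of-range elements are loaded as 0 so they
        // contribute nothing to the dot product.
        Mds[ty][tx] = (row < rowsM && mCol < inner) ? M[row * inner + mCol] : 0.0f;
        Nds[ty][tx] = (nRow < inner && col < colsN) ? N[nRow * colsN + col] : 0.0f;
        __syncthreads();  // barrier: both tiles fully loaded before use

        for (int k = 0; k < TILE_WIDTH; ++k) {
            acc += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();  // barrier: all threads done before tiles are overwritten
    }
    // Boundary check on the output: threads outside P write nothing.
    if (row < rowsM && col < colsN) {
        P[row * colsN + col] = acc;
    }
}

// Hypothetical host helper: copies inputs to the device, launches the kernel,
// and returns the product as a host vector.
std::vector<float> tiledMatMul(const std::vector<float>& M, const std::vector<float>& N,
                               int rowsM, int inner, int colsN) {
    assert(M.size() == static_cast<std::size_t>(rowsM) * inner);
    assert(N.size() == static_cast<std::size_t>(inner) * colsN);
    std::size_t bytesM = M.size() * sizeof(float);
    std::size_t bytesN = N.size() * sizeof(float);
    std::size_t bytesP = static_cast<std::size_t>(rowsM) * colsN * sizeof(float);

    float *dM = nullptr, *dN = nullptr, *dP = nullptr;
    cudaMalloc(&dM, bytesM);
    cudaMalloc(&dN, bytesN);
    cudaMalloc(&dP, bytesP);
    cudaMemcpy(dM, M.data(), bytesM, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N.data(), bytesN, cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid((colsN + TILE_WIDTH - 1) / TILE_WIDTH, (rowsM + TILE_WIDTH - 1) / TILE_WIDTH);
    tiledMatMulKernel<<<grid, block>>>(dM, dN, dP, rowsM, inner, colsN);

    std::vector<float> P(static_cast<std::size_t>(rowsM) * colsN);
    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(P.data(), dP, bytesP, cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dN);
    cudaFree(dP);
    return P;
}
```

In this sketch, each input element is read from global memory once per block and then reused from shared memory by up to `TILE_WIDTH` threads, which is where the reduction in global memory accesses comes from. The guarded loads and the guarded store let the kernel handle matrix dimensions that are not multiples of the tile size.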
Get Programming Massively Parallel Processors, 4th Edition now with the O’Reilly learning platform.