Chapter 4

Memory and data locality

Abstract

This chapter introduces the concepts of memory bound application. It uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses. It then introduces the tiling technique where barrier synchronization is used to coordinate the timing of executing threads for improved locality and reduced global memory accesses. The tiling techniques, however, involve additional complexities in boundary checks. The chapter uses matrix multiplication to illustrate the additional boundary checks needed for a tiled kernel to be applicable to arbitrary matrix sizes. The chapter concludes with an overview of how usage of shared memory and registers can affect the number of thread ...

Get Programming Massively Parallel Processors, 3rd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.