Chapter 5

CUDA Memories

Chapter Outline

5.1 Importance of Memory Access Efficiency

5.2 CUDA Device Memory Types

5.3 A Strategy for Reducing Global Memory Traffic

5.4 A Tiled Matrix–Matrix Multiplication Kernel

5.5 Memory as a Limiting Factor to Parallelism

5.6 Summary

5.7 Exercises

So far, we have learned to write a CUDA kernel function that is executed by a massive number of threads. The data to be processed by these threads is first transferred from the host memory to the device global memory. The threads then access their portion of the data from the global memory using their block IDs and thread IDs. We have also learned more details of the assignment and scheduling of threads for execution. Although this is a very good start, these simple ...

Get Programming Massively Parallel Processors, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.