Chapter 6. Streams and Events

CUDA is best known for enabling fine-grained concurrency, with hardware facilities that enable threads to closely collaborate within blocks using a combination of shared memory and thread synchronization. But it also has hardware and software facilities that enable more coarse-grained concurrency:

CPU/GPU concurrency: Since they are separate devices, the CPU and GPU can operate independently of each other.

Memcpy/kernel processing concurrency: For GPUs that have one or more copy engines, host↔device memcpy can be performed while the SMs are processing kernels.

Kernel concurrency: SM 2.x-class and later hardware can run up to 4 kernels in parallel.

Multi-GPU concurrency: For problems with enough computational ...
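The second and third forms of concurrency above are both expressed through CUDA streams: operations issued into different streams may overlap, while operations within one stream execute in order. The following is a minimal sketch of memcpy/kernel overlap using two streams; the kernel name `scale` and the buffer sizes are illustrative, not from the text, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical name): multiply each element by k.
__global__ void scale(float *d, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

int main(void) {
    const int N = 1 << 20;
    float *h0, *h1, *d0, *d1;
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Async copies require pinned (page-locked) host memory.
    cudaHostAlloc(&h0, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&h1, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d0, N * sizeof(float));
    cudaMalloc(&d1, N * sizeof(float));

    // Work issued into different streams may overlap: on hardware with
    // a copy engine, the copy in s1 can proceed while the kernel in s0
    // is running on the SMs.
    cudaMemcpyAsync(d0, h0, N * sizeof(float), cudaMemcpyHostToDevice, s0);
    scale<<<(N + 255) / 256, 256, 0, s0>>>(d0, 2.0f, N);
    cudaMemcpyAsync(d1, h1, N * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(N + 255) / 256, 256, 0, s1>>>(d1, 2.0f, N);

    cudaDeviceSynchronize();   // CPU/GPU concurrency: host waits here,
                               // having run ahead of the GPU until now.
    cudaFreeHost(h0); cudaFreeHost(h1);
    cudaFree(d0); cudaFree(d1);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    return 0;
}
```

Note that within each stream the copy still completes before the kernel that consumes its data launches; the overlap happens only across the two streams.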
