CUDA is best known for enabling fine-grained concurrency, with hardware facilities that enable threads to closely collaborate within blocks using a combination of shared memory and thread synchronization. But it also has hardware and software facilities that enable more coarse-grained concurrency:
• CPU/GPU concurrency: Since they are separate devices, the CPU and GPU can operate independently of each other.
• Memcpy/kernel processing concurrency: For GPUs that have one or more copy engines, host↔device memcpy can be performed while the SMs are processing kernels.
• Kernel concurrency: SM 2.x-class (Fermi) and later hardware can run multiple kernels concurrently, provided they are launched in separate streams and sufficient resources are available.
• Multi-GPU concurrency: For problems with enough computational ...
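The memcpy/kernel and kernel/kernel concurrency above are typically expressed with CUDA streams: work submitted to different streams may overlap, and host↔device copies issued with cudaMemcpyAsync from pinned memory can proceed while the SMs execute kernels. The following is a minimal sketch, not a definitive implementation; the kernel name, chunk count, and sizes are illustrative and not from the text, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: scales each element of a chunk in place.
__global__ void scale(float *d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    const int N = 1 << 20, NSTREAMS = 4, CHUNK = N / NSTREAMS;
    float *h, *d;
    // Pinned host memory is required for memcpy to be truly asynchronous.
    cudaHostAlloc(&h, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

    // Each stream copies a chunk down, processes it, and copies it back.
    // The copy engine(s) and the SMs can work on different chunks at the
    // same time, and kernels in different streams may run concurrently.
    for (int s = 0; s < NSTREAMS; s++) {
        int off = s * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    // All launches above are asynchronous, so the CPU is free to do
    // other work here (CPU/GPU concurrency) before synchronizing.
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);

    for (int s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Note that operations within a single stream still execute in submission order; only work in different streams is eligible to overlap.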