What's in this chapter?
- Understanding the nature of streams and events
- Exploiting grid level concurrency
- Overlapping kernel execution and data transfer
- Overlapping CPU and GPU execution
- Understanding synchronization mechanisms
- Avoiding unwanted synchronization
- Adjusting stream priorities
- Registering device callback functions
- Displaying application execution timelines with the NVIDIA Visual Profiler
Generally speaking, there are two levels of concurrency in CUDA C programming:
- Kernel level concurrency
- Grid level concurrency
Up to this point, your focus has been solely on kernel level concurrency, in which a single task, or kernel, is executed in parallel by many threads on the GPU. Several ways to improve kernel performance have been covered from the perspectives of the programming model, execution model, and memory model. You have also developed your ability to dissect and analyze kernel behavior using the command-line profiler.
This chapter will examine grid level concurrency. In grid level concurrency, multiple kernel launches are executed simultaneously on a single device, often leading to better device utilization. In this chapter, you will learn how to use CUDA streams to implement grid level concurrency. ...
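To make the idea concrete before diving in, the following sketch (not taken from this chapter; the kernel, sizes, and stream count are illustrative assumptions) shows the basic pattern behind grid level concurrency: issuing kernel launches into several non-default CUDA streams so that, on devices that support it, the grids may execute concurrently.

```cuda
// Illustrative sketch: one kernel launched into four non-default streams.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int nStreams = 4;          // assumed stream count for illustration
    const int n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int i = 0; i < nStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d_data[i], n * sizeof(float));
    }

    // Launches in different non-default streams have no implicit ordering
    // between them, so the device is free to overlap their execution.
    for (int i = 0; i < nStreams; i++) {
        scale<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_data[i], n);
    }

    cudaDeviceSynchronize();         // wait for all streams to finish

    for (int i = 0; i < nStreams; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_data[i]);
    }
    return 0;
}
```

Each loop iteration enqueues work into its own stream; the chapter develops this pattern in detail, including how to overlap these launches with data transfers and host computation.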