To maximize performance, CUDA uses different types of memory, depending on the expected usage. Host memory refers to the memory attached to the CPU(s) in the system. CUDA provides APIs that enable faster access to host memory by page-locking and mapping it for the GPU(s). Device memory is attached to the GPU and accessed by a dedicated memory controller, and, as every beginning CUDA developer knows, data must be copied explicitly between host and device memory in order to be processed by the GPU.
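The host-memory and copy patterns above can be sketched with the CUDA runtime API; this is a minimal illustration, not a complete program (error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 1 << 20;
    float *hostPtr, *devPtr;

    // Page-locked ("pinned") host allocation: the driver can DMA to and
    // from it directly, and it may be mapped into the GPU's address space.
    cudaHostAlloc(&hostPtr, N * sizeof(float), cudaHostAllocDefault);

    // Device memory allocation, followed by the explicit
    // host-to-device copy that kernels require before processing.
    cudaMalloc(&devPtr, N * sizeof(float));
    cudaMemcpy(devPtr, hostPtr, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(devPtr);
    cudaFreeHost(hostPtr);
    return 0;
}
```

`cudaHostAlloc` replaces an ordinary `malloc` here; a pageable allocation would also work with `cudaMemcpy`, but the pinned copy avoids an intermediate staging buffer.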
Device memory can be allocated and accessed in a variety of ways.
• Global memory may be allocated statically or dynamically and accessed via pointers in CUDA kernels; these pointer accesses compile to global load/store instructions.
• Constant memory is read-only to kernels and cached on-chip; it is optimized for the broadcast case, where all threads in a warp read the same address.
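The distinction between these memory types is visible in kernel code. The sketch below (names are illustrative, not from the text) shows a pointer parameter backed by dynamically allocated global memory alongside a `__constant__` variable initialized from the host:

```cuda
#include <cuda_runtime.h>

// Resides in constant memory: read-only to the kernel and served
// by a cache optimized for all threads reading the same address.
__constant__ float scale;

// The pointer parameter refers to global memory; the array accesses
// below compile to global load and store instructions.
__global__ void scaleArray(float *data, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= scale;   // global load, then global store
}

int main(void)
{
    const size_t N = 1 << 20;
    float *devPtr;
    float s = 2.0f;

    cudaMalloc(&devPtr, N * sizeof(float));        // dynamic global allocation
    cudaMemcpyToSymbol(scale, &s, sizeof(float));  // host writes constant memory

    scaleArray<<<(N + 255) / 256, 256>>>(devPtr, N);
    cudaDeviceSynchronize();
    cudaFree(devPtr);
    return 0;
}
```

Note that kernels cannot write `scale`; constant memory is updated only from the host, here via `cudaMemcpyToSymbol`.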