The CUDA Handbook: A Comprehensive Guide to GPU Programming

by Nicholas Wilt

June 2013

Intermediate to advanced

528 pages

13h 11m

English

Addison-Wesley Professional

1.1. Our Approach; 1.2. Code; 1.3. Administrative Items; 1.4. Road Map
2.1. CPU Configurations; 2.2. Integrated GPUs; 2.3. Multiple GPUs; 2.4. Address Spaces in CUDA; 2.5. CPU/GPU Interactions; 2.6. GPU Architecture; 2.7. Further Reading
3.1. Software Layers; 3.2. Devices and Initialization; 3.3. Contexts; 3.4. Modules and Functions; 3.5. Kernels (Functions); 3.6. Device Memory; 3.7. Streams and Events; 3.8. Host Memory; 3.9. CUDA Arrays and Texturing; 3.10. Graphics Interoperability; 3.11. The CUDA Runtime and CUDA Driver API
4.1. nvcc—CUDA Compiler Driver; 4.2. ptxas—the PTX Assembler; 4.3. cuobjdump; 4.4. nvidia-smi; 4.5. Amazon Web Services
5.1. Host Memory; 5.2. Global Memory; 5.3. Constant Memory; 5.4. Local Memory; 5.5. Texture Memory; 5.6. Shared Memory; 5.7. Memory Copy
6.1. CPU/GPU Concurrency: Covering Driver Overhead; 6.2. Asynchronous Memcpy; 6.3. CUDA Events: CPU/GPU Synchronization; 6.4. CUDA Events: Timing; 6.5. Concurrent Copying and Kernel Processing; 6.6. Mapped Pinned Memory; 6.7. Concurrent Kernel Processing; 6.8. GPU/GPU Synchronization: cudaStreamWaitEvent(); 6.9. Source Code Reference
7.1. Overview; 7.2. Syntax; 7.3. Blocks, Threads, Warps, and Lanes; 7.4. Occupancy; 7.5. Dynamic Parallelism
8.1. Memory; 8.2. Integer Support; 8.3. Floating-Point Support; 8.4. Conditional Code; 8.5. Textures and Surfaces; 8.6. Miscellaneous Instructions; 8.7. Instruction Sets
9.1. Overview; 9.2. Peer-to-Peer; 9.3. UVA: Inferring Device from Address; 9.4. Inter-GPU Synchronization; 9.5. Single-Threaded Multi-GPU; 9.6. Multithreaded Multi-GPU
10.1. Overview; 10.2. Texture Memory; 10.3. 1D Texturing; 10.4. Texture as a Read Path; 10.5. Texturing with Unnormalized Coordinates; 10.6. Texturing with Normalized Coordinates; 10.7. 1D Surface Read/Write; 10.8. 2D Texturing; 10.9. 2D Texturing: Copy Avoidance; 10.10. 3D Texturing; 10.11. Layered Textures; 10.12. Optimal Block Sizing and Performance; 10.13. Texturing Quick References
11.1. Device Memory; 11.2. Asynchronous Memcpy; 11.3. Streams; 11.4. Mapped Pinned Memory; 11.5. Performance and Summary
12.1. Overview; 12.2. Two-Pass Reduction; 12.3. Single-Pass Reduction; 12.4. Reduction with Atomics; 12.5. Arbitrary Block Sizes; 12.6. Reduction Using Arbitrary Data Types; 12.7. Predicate Reduction; 12.8. Warp Reduction with Shuffle
13.1. Definition and Variations; 13.2. Overview; 13.3. Scan and Circuit Design; 13.4. CUDA Implementations; 13.5. Warp Scans; 13.6. Stream Compaction; 13.7. References (Parallel Scan Algorithms); 13.8. Further Reading (Parallel Prefix Sum Circuits)
14.1. Introduction; 14.2. Naïve Implementation; 14.3. Shared Memory; 14.4. Constant Memory; 14.5. Warp Shuffle; 14.6. Multiple GPUs and Scalability; 14.7. CPU Optimizations; 14.8. Conclusion; 14.9. References and Further Reading
15.1. Overview; 15.2. Naïve Texture-Texture Implementation; 15.3. Template in Constant Memory; 15.4. Image in Shared Memory; 15.5. Further Optimizations; 15.6. Source Code; 15.7. Performance and Further Reading; 15.8. Further Reading
A.1. Timing; A.2. Threading; A.3. Driver API Facilities; A.4. Shmoos; A.5. Command Line Parsing; A.6. Error Handling

Content preview from The CUDA Handbook: A Comprehensive Guide to GPU Programming

Chapter 9. Multiple GPUs

This chapter describes CUDA’s facilities for multi-GPU programming, including threading models, peer-to-peer, and inter-GPU synchronization. As an example, we’ll first explore inter-GPU synchronization using CUDA streams and events by implementing a peer-to-peer memcpy that stages through portable pinned memory. We then discuss how to implement the N-body problem (fully described in Chapter 14) with single- and multithreaded implementations that use multiple GPUs.

9.1. Overview

Systems with multiple GPUs generally contain multi-GPU boards with a PCI Express bridge chip (such as the GeForce GTX 690) or multiple PCI Express slots, or both, as described in Section 2.3. Each GPU in such a system is separated by PCI Express ...