The CUDA Handbook: A Comprehensive Guide to GPU Programming

Name: The CUDA Handbook: A Comprehensive Guide to GPU Programming
Author: Nicholas Wilt
ISBN: 9780133261516

June 2013

Intermediate to advanced

528 pages

13h 11m

English

Addison-Wesley Professional

About This eBook
Title Page
Copyright Page
Dedication Page
Contents
Preface
Acknowledgments
About the Author
Part I
Chapter 1. Background
1.1. Our Approach
1.2. Code
1.3. Administrative Items
1.4. Road Map

Chapter 2. Hardware Architecture
2.1. CPU Configurations
2.2. Integrated GPUs
2.3. Multiple GPUs
2.4. Address Spaces in CUDA
2.5. CPU/GPU Interactions
2.6. GPU Architecture
2.7. Further Reading
Chapter 3. Software Architecture
3.1. Software Layers
3.2. Devices and Initialization
3.3. Contexts
3.4. Modules and Functions
3.5. Kernels (Functions)
3.6. Device Memory
3.7. Streams and Events
3.8. Host Memory
3.9. CUDA Arrays and Texturing
3.10. Graphics Interoperability
3.11. The CUDA Runtime and CUDA Driver API
Chapter 4. Software Environment
4.1. nvcc—CUDA Compiler Driver
4.2. ptxas—the PTX Assembler
4.3. cuobjdump
4.4. nvidia-smi
4.5. Amazon Web Services
Part II
Chapter 5. Memory
5.1. Host Memory
5.2. Global Memory
5.3. Constant Memory
5.4. Local Memory
5.5. Texture Memory
5.6. Shared Memory
5.7. Memory Copy
Chapter 6. Streams and Events
6.1. CPU/GPU Concurrency: Covering Driver Overhead
6.2. Asynchronous Memcpy
6.3. CUDA Events: CPU/GPU Synchronization
6.4. CUDA Events: Timing
6.5. Concurrent Copying and Kernel Processing
6.6. Mapped Pinned Memory
6.7. Concurrent Kernel Processing
6.8. GPU/GPU Synchronization: cudaStreamWaitEvent()
6.9. Source Code Reference
Chapter 7. Kernel Execution
7.1. Overview
7.2. Syntax
7.3. Blocks, Threads, Warps, and Lanes
7.4. Occupancy
7.5. Dynamic Parallelism
Chapter 8. Streaming Multiprocessors
8.1. Memory
8.2. Integer Support
8.3. Floating-Point Support
8.4. Conditional Code
8.5. Textures and Surfaces
8.6. Miscellaneous Instructions
8.7. Instruction Sets
Chapter 9. Multiple GPUs
9.1. Overview
9.2. Peer-to-Peer
9.3. UVA: Inferring Device from Address
9.4. Inter-GPU Synchronization
9.5. Single-Threaded Multi-GPU
9.6. Multithreaded Multi-GPU
Chapter 10. Texturing
10.1. Overview
10.2. Texture Memory
10.3. 1D Texturing
10.4. Texture as a Read Path
10.5. Texturing with Unnormalized Coordinates
10.6. Texturing with Normalized Coordinates
10.7. 1D Surface Read/Write
10.8. 2D Texturing
10.9. 2D Texturing: Copy Avoidance
10.10. 3D Texturing
10.11. Layered Textures
10.12. Optimal Block Sizing and Performance
10.13. Texturing Quick References
Part III
Chapter 11. Streaming Workloads
11.1. Device Memory
11.2. Asynchronous Memcpy
11.3. Streams
11.4. Mapped Pinned Memory
11.5. Performance and Summary
Chapter 12. Reduction
12.1. Overview
12.2. Two-Pass Reduction
12.3. Single-Pass Reduction
12.4. Reduction with Atomics
12.5. Arbitrary Block Sizes
12.6. Reduction Using Arbitrary Data Types
12.7. Predicate Reduction
12.8. Warp Reduction with Shuffle
Chapter 13. Scan
13.1. Definition and Variations
13.2. Overview
13.3. Scan and Circuit Design
13.4. CUDA Implementations
13.5. Warp Scans
13.6. Stream Compaction
13.7. References (Parallel Scan Algorithms)
13.8. Further Reading (Parallel Prefix Sum Circuits)
Chapter 14. N-Body
14.1. Introduction
14.2. Naïve Implementation
14.3. Shared Memory
14.4. Constant Memory
14.5. Warp Shuffle
14.6. Multiple GPUs and Scalability
14.7. CPU Optimizations
14.8. Conclusion
14.9. References and Further Reading
Chapter 15. Image Processing: Normalized Correlation
15.1. Overview
15.2. Naïve Texture-Texture Implementation
15.3. Template in Constant Memory
15.4. Image in Shared Memory
15.5. Further Optimizations
15.6. Source Code
15.7. Performance and Further Reading
15.8. Further Reading
Appendix A. The CUDA Handbook Library
A.1. Timing
A.2. Threading
A.3. Driver API Facilities
A.4. Shmoos
A.5. Command Line Parsing
A.6. Error Handling
Glossary / TLA Decoder
Index

Content preview from The CUDA Handbook: A Comprehensive Guide to GPU Programming

Chapter 8. Streaming Multiprocessors

The streaming multiprocessors (SMs) are the part of the GPU that runs our CUDA kernels. Each SM contains the following.

• Thousands of registers that can be partitioned among threads of execution

• Several caches:

– Shared memory for fast data interchange between threads

– Constant cache for fast broadcast of reads from constant memory

– Texture cache to aggregate bandwidth from texture memory

– L1 cache to reduce latency to local or global memory

• Warp schedulers that can quickly switch contexts between threads and issue instructions to warps that are ready to execute

• Execution cores for integer and floating-point operations:

– Integer and single-precision floating point operations

– Double-precision floating ...
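
The first item in the list above, a register file "partitioned among threads of execution," implies a simple piece of arithmetic: the more registers each thread needs, the fewer threads an SM can keep resident. The following host-side C++ sketch works through that bound. It is not taken from the book; the helper names `maxThreadsByRegisters` and `maxWarpsByRegisters` are hypothetical, and the register-file size used in the usage note (65,536 32-bit registers per SM) is an assumed, illustrative figure rather than a property of any particular GPU. The sketch also ignores the other limits real hardware enforces, such as warp-granular register allocation, shared memory capacity, and fixed caps on resident threads.

```cpp
#include <cassert>
#include <cstdint>

// Threads per warp on NVIDIA GPUs.
constexpr std::uint32_t kWarpSize = 32;

// Hypothetical helper: upper bound on resident threads per SM imposed by the
// register file alone. registersPerSM is an assumed input, not a queried value.
std::uint32_t maxThreadsByRegisters(std::uint32_t registersPerSM,
                                    std::uint32_t registersPerThread) {
    assert(registersPerThread > 0);  // a thread must use at least one register
    return registersPerSM / registersPerThread;
}

// The same bound expressed in whole warps (partial warps are rounded down).
std::uint32_t maxWarpsByRegisters(std::uint32_t registersPerSM,
                                  std::uint32_t registersPerThread) {
    return maxThreadsByRegisters(registersPerSM, registersPerThread) / kWarpSize;
}
```

Under the assumed 65,536-register file, `maxThreadsByRegisters(65536, 32)` gives 2048 threads, while doubling per-thread usage to 64 registers halves the bound to 1024 threads, or 32 warps. This is the trade-off that makes per-thread register usage an input to occupancy.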
