Programming Massively Parallel Processors, 2nd Edition

Book description

Programming Massively Parallel Processors: A Hands-on Approach, Second Edition, teaches students how to program massively parallel processors. It offers a detailed discussion of various techniques for constructing parallel programs. Case studies are used to demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.

This guide shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. This revised edition contains more parallel programming examples, coverage of commonly used libraries such as Thrust, and explanations of the latest tools; its major additions are summarized in the list below.

This book should be a valuable resource for advanced students, software engineers, programmers, and hardware engineers.

  • New coverage of CUDA 5.0, improved performance, enhanced development tools, increased hardware support, and more
  • Increased coverage of related technology (OpenCL) and new material on algorithm patterns, GPU clusters, host programming, and data parallelism
  • Two new case studies (on MRI reconstruction and molecular visualization) exploring the latest applications of CUDA and GPUs for scientific research and high-performance computing
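
To give a flavor of the book's hands-on approach, below is a minimal sketch of the kind of CUDA C vector addition program developed in Chapter 3 (kernel, device global memory, data transfer, and thread organization). The names (vecAdd, h_a, d_a) and sizes are illustrative assumptions, not the book's own listings.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Each thread computes one element of the output vector (Section 3.5).
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)  // guard threads that fall past the end of the array
            c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Allocate and initialize host arrays.
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate device global memory and copy the inputs over (Section 3.4).
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        // Copy the result back and spot-check it.
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);  // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }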

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Preface
    1. Target Audience
    2. How to Use the Book
    3. Online Supplements
  6. Acknowledgements
  7. Dedication
  8. Chapter 1. Introduction
    1. 1.1 Heterogeneous Parallel Computing
    2. 1.2 Architecture of a Modern GPU
    3. 1.3 Why More Speed or Parallelism?
    4. 1.4 Speeding Up Real Applications
    5. 1.5 Parallel Programming Languages and Models
    6. 1.6 Overarching Goals
    7. 1.7 Organization of the Book
    8. References
  9. Chapter 2. History of GPU Computing
    1. 2.1 Evolution of Graphics Pipelines
    2. 2.2 GPGPU: An Intermediate Step
    3. 2.3 GPU Computing
    4. References and Further Reading
  10. Chapter 3. Introduction to Data Parallelism and CUDA C
    1. 3.1 Data Parallelism
    2. 3.2 CUDA Program Structure
    3. 3.3 A Vector Addition Kernel
    4. 3.4 Device Global Memory and Data Transfer
    5. 3.5 Kernel Functions and Threading
    6. 3.6 Summary
    7. 3.7 Exercises
    8. References
  11. Chapter 4. Data-Parallel Execution Model
    1. 4.1 CUDA Thread Organization
    2. 4.2 Mapping Threads to Multidimensional Data
    3. 4.3 Matrix-Matrix Multiplication—A More Complex Kernel
    4. 4.4 Synchronization and Transparent Scalability
    5. 4.5 Assigning Resources to Blocks
    6. 4.6 Querying Device Properties
    7. 4.7 Thread Scheduling and Latency Tolerance
    8. 4.8 Summary
    9. 4.9 Exercises
  12. Chapter 5. CUDA Memories
    1. 5.1 Importance of Memory Access Efficiency
    2. 5.2 CUDA Device Memory Types
    3. 5.3 A Strategy for Reducing Global Memory Traffic
    4. 5.4 A Tiled Matrix–Matrix Multiplication Kernel
    5. 5.5 Memory as a Limiting Factor to Parallelism
    6. 5.6 Summary
    7. 5.7 Exercises
  13. Chapter 6. Performance Considerations
    1. 6.1 Warps and Thread Execution
    2. 6.2 Global Memory Bandwidth
    3. 6.3 Dynamic Partitioning of Execution Resources
    4. 6.4 Instruction Mix and Thread Granularity
    5. 6.5 Summary
    6. 6.6 Exercises
    7. References
  14. Chapter 7. Floating-Point Considerations
    1. 7.1 Floating-Point Format
    2. 7.2 Representable Numbers
    3. 7.3 Special Bit Patterns and Precision in IEEE Format
    4. 7.4 Arithmetic Accuracy and Rounding
    5. 7.5 Algorithm Considerations
    6. 7.6 Numerical Stability
    7. 7.7 Summary
    8. 7.8 Exercises
    9. References
  15. Chapter 8. Parallel Patterns: Convolution: With an Introduction to Constant Memory and Caches
    1. 8.1 Background
    2. 8.2 1D Parallel Convolution—A Basic Algorithm
    3. 8.3 Constant Memory and Caching
    4. 8.4 Tiled 1D Convolution with Halo Elements
    5. 8.5 A Simpler Tiled 1D Convolution—General Caching
    6. 8.6 Summary
    7. 8.7 Exercises
  16. Chapter 9. Parallel Patterns: Prefix Sum: An Introduction to Work Efficiency in Parallel Algorithms
    1. 9.1 Background
    2. 9.2 A Simple Parallel Scan
    3. 9.3 Work Efficiency Considerations
    4. 9.4 A Work-Efficient Parallel Scan
    5. 9.5 Parallel Scan for Arbitrary-Length Inputs
    6. 9.6 Summary
    7. 9.7 Exercises
    8. Reference
  17. Chapter 10. Parallel Patterns: Sparse Matrix–Vector Multiplication: An Introduction to Compaction and Regularization in Parallel Algorithms
    1. 10.1 Background
    2. 10.2 Parallel SpMV Using CSR
    3. 10.3 Padding and Transposition
    4. 10.4 Using Hybrid to Control Padding
    5. 10.5 Sorting and Partitioning for Regularization
    6. 10.6 Summary
    7. 10.7 Exercises
    8. References
  18. Chapter 11. Application Case Study: Advanced MRI Reconstruction
    1. 11.1 Application Background
    2. 11.2 Iterative Reconstruction
    3. 11.3 Computing FHD
    4. 11.4 Final Evaluation
    5. 11.5 Exercises
    6. References
  19. Chapter 12. Application Case Study: Molecular Visualization and Analysis
    1. 12.1 Application Background
    2. 12.2 A Simple Kernel Implementation
    3. 12.3 Thread Granularity Adjustment
    4. 12.4 Memory Coalescing
    5. 12.5 Summary
    6. 12.6 Exercises
    7. References
  20. Chapter 13. Parallel Programming and Computational Thinking
    1. 13.1 Goals of Parallel Computing
    2. 13.2 Problem Decomposition
    3. 13.3 Algorithm Selection
    4. 13.4 Computational Thinking
    5. 13.5 Summary
    6. 13.6 Exercises
    7. References
  21. Chapter 14. An Introduction to OpenCL™
    1. 14.1 Background
    2. 14.2 Data Parallelism Model
    3. 14.3 Device Architecture
    4. 14.4 Kernel Functions
    5. 14.5 Device Management and Kernel Launch
    6. 14.6 Electrostatic Potential Map in OpenCL
    7. 14.7 Summary
    8. 14.8 Exercises
    9. References
  22. Chapter 15. Parallel Programming with OpenACC
    1. 15.1 OpenACC Versus CUDA C
    2. 15.2 Execution Model
    3. 15.3 Memory Model
    4. 15.4 Basic OpenACC Programs
    5. 15.5 Future Directions of OpenACC
    6. 15.6 Exercises
  23. Chapter 16. Thrust: A Productivity-Oriented Library for CUDA
    1. 16.1 Background
    2. 16.2 Motivation
    3. 16.3 Basic Thrust Features
    4. 16.4 Generic Programming
    5. 16.5 Benefits of Abstraction
    6. 16.6 Programmer Productivity
    7. 16.7 Best Practices
    8. 16.8 Exercises
    9. References
  24. Chapter 17. CUDA FORTRAN
    1. 17.1 CUDA FORTRAN and CUDA C Differences
    2. 17.2 A First CUDA FORTRAN Program
    3. 17.3 Multidimensional Array in CUDA FORTRAN
    4. 17.4 Overloading Host/Device Routines With Generic Interfaces
    5. 17.5 Calling CUDA C Via iso_c_binding
    6. 17.6 Kernel Loop Directives and Reduction Operations
    7. 17.7 Dynamic Shared Memory
    8. 17.8 Asynchronous Data Transfers
    9. 17.9 Compilation and Profiling
    10. 17.10 Calling Thrust from CUDA FORTRAN
    11. 17.11 Exercises
  25. Chapter 18. An Introduction to C++ AMP
    1. 18.1 Core C++ AMP Features
    2. 18.2 Details of the C++ AMP Execution Model
    3. 18.3 Managing Accelerators
    4. 18.4 Tiled Execution
    5. 18.5 C++ AMP Graphics Features
    6. 18.6 Summary
    7. 18.7 Exercises
  26. Chapter 19. Programming a Heterogeneous Computing Cluster
    1. 19.1 Background
    2. 19.2 A Running Example
    3. 19.3 MPI Basics
    4. 19.4 MPI Point-to-Point Communication Types
    5. 19.5 Overlapping Computation and Communication
    6. 19.6 MPI Collective Communication
    7. 19.7 Summary
    8. 19.8 Exercises
    9. Reference
  27. Chapter 20. CUDA Dynamic Parallelism
    1. 20.1 Background
    2. 20.2 Dynamic Parallelism Overview
    3. 20.3 Important Details
    4. 20.4 Memory Visibility
    5. 20.5 A Simple Example
    6. 20.6 Runtime Limitations
    7. 20.7 A More Complex Example
    8. 20.8 Summary
    9. Reference
  28. Chapter 21. Conclusion and Future Outlook
    1. 21.1 Goals Revisited
    2. 21.2 Memory Model Evolution
    3. 21.3 Kernel Execution Control Evolution
    4. 21.4 Core Performance
    5. 21.5 Programming Environment
    6. 21.6 Future Outlook
    7. References
  29. Appendix A. Matrix Multiplication Host-Only Version Source Code
    1. Appendix Outline
    2. A.1 matrixmul.cu
    3. A.2 matrixmul_gold.cpp
    4. A.3 matrixmul.h
    5. A.4 assist.h
    6. A.5 Expected Output
  30. Appendix B. GPU Compute Capabilities
    1. Appendix Outline
    2. B.1 GPU Compute Capability Tables
    3. B.2 Memory Coalescing Variations
  31. Index

Product information

  • Title: Programming Massively Parallel Processors, 2nd Edition
  • Author(s): David B. Kirk, Wen-mei W. Hwu
  • Release date: December 2012
  • Publisher(s): Morgan Kaufmann
  • ISBN: 9780123914187