Programming Massively Parallel Processors, 4th Edition

Name: Programming Massively Parallel Processors, 4th Edition
ISBN: 9780323984638

by Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj

May 2022

Intermediate to advanced

580 pages

18h 32m

English

Morgan Kaufmann

Cover image
Title page
Table of Contents
Copyright
Dedication
Foreword
Preface
How to use the book; A two-phased approach; Tying it all together: the final project; The design document; The project report and symposium; Class competition; Course resources
Acknowledgments
Chapter 1. Introduction
Abstract; Chapter Outline; 1.1 Heterogeneous parallel computing; 1.2 Why more speed or parallelism?; 1.3 Speeding up real applications; 1.4 Challenges in parallel programming; 1.5 Related parallel programming interfaces; 1.6 Overarching goals; 1.7 Organization of the book; References
Part I: Fundamental Concepts

Chapter 2. Heterogeneous data parallel computing
Abstract; Chapter Outline; 2.1 Data parallelism; 2.2 CUDA C program structure; 2.3 A vector addition kernel; 2.4 Device global memory and data transfer; 2.5 Kernel functions and threading; 2.6 Calling kernel functions; 2.7 Compilation; 2.8 Summary; Exercises; References
Chapter 3. Multidimensional grids and data
Abstract; Chapter Outline; 3.1 Multidimensional grid organization; 3.2 Mapping threads to multidimensional data; 3.3 Image blur: a more complex kernel; 3.4 Matrix multiplication; 3.5 Summary; Exercises
Chapter 4. Compute architecture and scheduling
Abstract; Chapter Outline; 4.1 Architecture of a modern GPU; 4.2 Block scheduling; 4.3 Synchronization and transparent scalability; 4.4 Warps and SIMD hardware; 4.5 Control divergence; 4.6 Warp scheduling and latency tolerance; 4.7 Resource partitioning and occupancy; 4.8 Querying device properties; 4.9 Summary; Exercises; References
Chapter 5. Memory architecture and data locality
Abstract; Chapter Outline; 5.1 Importance of memory access efficiency; 5.2 CUDA memory types; 5.3 Tiling for reduced memory traffic; 5.4 A tiled matrix multiplication kernel; 5.5 Boundary checks; 5.6 Impact of memory usage on occupancy; 5.7 Summary; Exercises
Chapter 6. Performance considerations
Abstract; Chapter Outline; 6.1 Memory coalescing; 6.2 Hiding memory latency; 6.3 Thread coarsening; 6.4 A checklist of optimizations; 6.5 Knowing your computation’s bottleneck; 6.6 Summary; Exercises; References
Part II: Parallel Patterns
Chapter 7. Convolution: An introduction to constant memory and caching
Abstract; Chapter Outline; 7.1 Background; 7.2 Parallel convolution: a basic algorithm; 7.3 Constant memory and caching; 7.4 Tiled convolution with halo cells; 7.5 Tiled convolution using caches for halo cells; 7.6 Summary; Exercises
Chapter 8. Stencil
Abstract; Chapter Outline; 8.1 Background; 8.2 Parallel stencil: a basic algorithm; 8.3 Shared memory tiling for stencil sweep; 8.4 Thread coarsening; 8.5 Register tiling; 8.6 Summary; Exercises
Chapter 9. Parallel histogram: An introduction to atomic operations and privatization
Abstract; Chapter Outline; 9.1 Background; 9.2 Atomic operations and a basic histogram kernel; 9.3 Latency and throughput of atomic operations; 9.4 Privatization; 9.5 Coarsening; 9.6 Aggregation; 9.7 Summary; Exercises; References
Chapter 10. Reduction: And minimizing divergence
Abstract; Chapter Outline; 10.1 Background; 10.2 Reduction trees; 10.3 A simple reduction kernel; 10.4 Minimizing control divergence; 10.5 Minimizing memory divergence; 10.6 Minimizing global memory accesses; 10.7 Hierarchical reduction for arbitrary input length; 10.8 Thread coarsening for reduced overhead; 10.9 Summary; Exercises
Chapter 11. Prefix sum (scan): An introduction to work efficiency in parallel algorithms
Abstract; Chapter Outline; 11.1 Background; 11.2 Parallel scan with the Kogge-Stone algorithm; 11.3 Speed and work efficiency consideration; 11.4 Parallel scan with the Brent-Kung algorithm; 11.5 Coarsening for even more work efficiency; 11.6 Segmented parallel scan for arbitrary-length inputs; 11.7 Single-pass scan for memory access efficiency; 11.8 Summary; Exercises; References
Chapter 12. Merge: An introduction to dynamic input data identification
Abstract; Chapter Outline; 12.1 Background; 12.2 A sequential merge algorithm; 12.3 A parallelization approach; 12.4 Co-rank function implementation; 12.5 A basic parallel merge kernel; 12.6 A tiled merge kernel to improve coalescing; 12.7 A circular buffer merge kernel; 12.8 Thread coarsening for merge; 12.9 Summary; Exercises; References
Part III: Advanced Patterns and Applications
Chapter 13. Sorting
Abstract; Chapter Outline; 13.1 Background; 13.2 Radix sort; 13.3 Parallel radix sort; 13.4 Optimizing for memory coalescing; 13.5 Choice of radix value; 13.6 Thread coarsening to improve coalescing; 13.7 Parallel merge sort; 13.8 Other parallel sort methods; 13.9 Summary; Exercises; References
Chapter 14. Sparse matrix computation
Abstract; Chapter Outline; 14.1 Background; 14.2 A simple SpMV kernel with the COO format; 14.3 Grouping row nonzeros with the CSR format; 14.4 Improving memory coalescing with the ELL format; 14.5 Regulating padding with the hybrid ELL-COO format; 14.6 Reducing control divergence with the JDS format; 14.7 Summary; Exercises; References
Chapter 15. Graph traversal
Abstract; Chapter Outline; 15.1 Background; 15.2 Breadth-first search; 15.3 Vertex-centric parallelization of breadth-first search; 15.4 Edge-centric parallelization of breadth-first search; 15.5 Improving efficiency with frontiers; 15.6 Reducing contention with privatization; 15.7 Other optimizations; 15.8 Summary; Exercises; References
Chapter 16. Deep learning
Abstract; Chapter Outline; 16.1 Background; 16.2 Convolutional neural networks; 16.3 Convolutional layer: a CUDA inference kernel; 16.4 Formulating a convolutional layer as GEMM; 16.5 CUDNN library; 16.6 Summary; Exercises; References
Chapter 17. Iterative magnetic resonance imaging reconstruction
Abstract; Chapter Outline; 17.1 Background; 17.2 Iterative reconstruction; 17.3 Computing FHD; 17.4 Summary; Exercises; References
Chapter 18. Electrostatic potential map
Abstract; Chapter Outline; 18.1 Background; 18.2 Scatter versus gather in kernel design; 18.3 Thread coarsening; 18.4 Memory coalescing; 18.5 Cutoff binning for data size scalability; 18.6 Summary; Exercises; References
Chapter 19. Parallel programming and computational thinking
Abstract; Chapter Outline; 19.1 Goals of parallel computing; 19.2 Algorithm selection; 19.3 Problem decomposition; 19.4 Computational thinking; 19.5 Summary; References
Part IV: Advanced Practices
Chapter 20. Programming a heterogeneous computing cluster: An introduction to CUDA streams
Abstract; Chapter Outline; 20.1 Background; 20.2 A running example; 20.3 Message passing interface basics; 20.4 Message passing interface point-to-point communication; 20.5 Overlapping computation and communication; 20.6 Message passing interface collective communication; 20.7 CUDA aware message passing interface; 20.8 Summary; Exercises; References
Chapter 21. CUDA dynamic parallelism
Abstract; Chapter Outline; 21.1 Background; 21.2 Dynamic parallelism overview; 21.3 An example: Bezier curves; 21.4 A recursive example: quadtrees; 21.5 Important considerations; 21.6 Summary; Exercises; A21.1 Support code for quadtree example; References
Chapter 22. Advanced practices and future evolution
Abstract; Chapter Outline; 22.1 Model of host/device interaction; 22.2 Kernel execution control; 22.3 Memory bandwidth and compute throughput; 22.4 Programming environment; 22.5 Future outlook; References
Chapter 23. Conclusion and outlook
Abstract; Chapter Outline; 23.1 Goals revisited; 23.2 Future outlook
Appendix A. Numerical considerations
A.1 Floating-point data representation; A.2 Representable numbers; A.3 Special bit patterns and precision in IEEE format; A.4 Arithmetic accuracy and rounding; A.5 Algorithm considerations; A.6 Linear solvers and numerical stability; A.7 Summary; Exercises
Index

Content preview from Programming Massively Parallel Processors, 4th Edition

Chapter 5

Memory architecture and data locality

Abstract

This chapter introduces the on-chip memory architecture of GPUs, the concept of memory-bound applications, and techniques for improving the performance of memory-bound applications. The chapter uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses. It then introduces the tiling technique by which barrier synchronization is used to coordinate the timing of executing threads for improved locality and reduced global memory accesses. However, the tiling techniques involve additional complexities in boundary checks. The chapter uses matrix multiplication to illustrate the additional boundary checks that are needed for a tiled kernel to be applicable ...
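
The tiling idea summarized in the abstract can be illustrated with a small sequential sketch. It is not a kernel from the book: it is written in Python purely for readability, and the names `tiled_matmul`, `TILE`, `Ms`, and `Ns` are hypothetical. The two local buffers stand in for on-chip shared memory, the comments mark where a GPU kernel would need barrier synchronization, and the conditionals show the kind of boundary checks required when the matrix dimensions are not multiples of the tile width.

```python
TILE = 2  # tile width; an illustrative value, not one taken from the book


def tiled_matmul(M, N, tile=TILE):
    """Multiply M (rows x inner) by N (inner x cols) one tile at a time."""
    rows, inner, cols = len(M), len(N), len(N[0])
    P = [[0.0] * cols for _ in range(rows)]
    num_phases = (inner + tile - 1) // tile  # ceiling division

    # Each (by, bx) pair plays the role of one thread block.
    for by in range(0, rows, tile):
        for bx in range(0, cols, tile):
            for ph in range(num_phases):
                k0 = ph * tile
                # Load phase: copy one tile of M and one tile of N into small
                # local buffers (stand-ins for shared memory). Elements that
                # fall outside the matrices are loaded as 0.0; this is the
                # boundary check on the inputs.
                Ms = [[M[by + ty][k0 + tx]
                       if by + ty < rows and k0 + tx < inner else 0.0
                       for tx in range(tile)] for ty in range(tile)]
                Ns = [[N[k0 + ty][bx + tx]
                       if k0 + ty < inner and bx + tx < cols else 0.0
                       for tx in range(tile)] for ty in range(tile)]
                # On a GPU a barrier would sit here, so that no thread reads
                # a tile before every thread has finished loading it.
                for ty in range(tile):
                    for tx in range(tile):
                        r, c = by + ty, bx + tx
                        if r < rows and c < cols:  # boundary check on output
                            P[r][c] += sum(Ms[ty][k] * Ns[k][tx]
                                           for k in range(tile))
                # A second barrier would sit here, before the next phase
                # overwrites the buffers.
    return P
```

Calling `tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` exercises the load-time check, because the inner dimension 3 is not a multiple of the tile width 2. In a real kernel the payoff is that each input element is fetched from global memory once per tile rather than once per output element, which reduces global memory traffic by roughly a factor of the tile width.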


