
OpenACC for Programmers: Concepts and Strategies, First Edition

Book Description

The Complete Guide to OpenACC for Massively Parallel Programming

Scientists and technical professionals can use OpenACC to leverage the immense power of modern GPUs without the complexity traditionally associated with programming them. OpenACC™ for Programmers is one of the first comprehensive and practical overviews of OpenACC for massively parallel programming.

This book integrates contributions from 19 leading parallel-programming experts from academia, public research organizations, and industry. The authors and editors explain each key concept behind OpenACC, demonstrate how to use essential OpenACC development tools, and thoroughly explore each OpenACC feature set.

Throughout, you’ll find realistic examples, hands-on exercises, and case studies showcasing the efficient use of OpenACC language constructs. You’ll discover how compilers translate OpenACC’s constructs to maximize application performance, and how its standard interface can target multiple platforms from widely used programming languages.

Each chapter builds on what you’ve already learned, helping you build practical mastery one step at a time, whether you’re a GPU programmer, scientist, engineer, or student. All example code and exercise solutions are available for download on GitHub.

  • Discover how OpenACC makes scalable parallel programming easier and more practical
  • Walk through the OpenACC spec and learn how OpenACC directive syntax is structured
  • Get productive with OpenACC code editors, compilers, debuggers, and performance analysis tools
  • Build your first real-world OpenACC programs
  • Exploit loop-level parallelism in OpenACC, understand the levels of parallelism available, and balance accuracy against performance
  • Learn how OpenACC programs are compiled
  • Master OpenACC programming best practices
  • Overcome common performance, portability, and interoperability challenges
  • Efficiently distribute tasks across multiple processors

Register your product at informit.com/register for convenient access to downloads, updates, and corrections as they become available.

Table of Contents

  1. Title Page
  2. Copyright Page
  3. Dedication Page
  4. Contents
  5. Foreword
  6. Preface
  7. Acknowledgments
  8. About the Contributors
  9. Chapter 1: OpenACC in a Nutshell
    1. 1.1 OpenACC Syntax
      1. 1.1.1 Directives
      2. 1.1.2 Clauses
      3. 1.1.3 API Routines and Environment Variables
    2. 1.2 Compute Constructs
      1. 1.2.1 Kernels
      2. 1.2.2 Parallel
      3. 1.2.3 Loop
      4. 1.2.4 Routine
    3. 1.3 The Data Environment
      1. 1.3.1 Data Directives
      2. 1.3.2 Data Clauses
      3. 1.3.3 The Cache Directive
      4. 1.3.4 Partial Data Transfers
    4. 1.4 Summary
    5. 1.5 Exercises
  10. Chapter 2: Loop-Level Parallelism
    1. 2.1 Kernels Versus Parallel Loops
    2. 2.2 Three Levels of Parallelism
      1. 2.2.1 Gang, Worker, and Vector Clauses
      2. 2.2.2 Mapping Parallelism to Hardware
    3. 2.3 Other Loop Constructs
      1. 2.3.1 Loop Collapse
      2. 2.3.2 Independent Clause
      3. 2.3.3 Seq and Auto Clauses
      4. 2.3.4 Reduction Clause
    4. 2.4 Summary
    5. 2.5 Exercises
  11. Chapter 3: Programming Tools for OpenACC
    1. 3.1 Common Characteristics of Architectures
    2. 3.2 Compiling OpenACC Code
    3. 3.3 Performance Analysis of OpenACC Applications
      1. 3.3.1 Performance Analysis Layers and Terminology
      2. 3.3.2 Performance Data Acquisition
      3. 3.3.3 Performance Data Recording and Presentation
      4. 3.3.4 The OpenACC Profiling Interface
      5. 3.3.5 Performance Tools with OpenACC Support
      6. 3.3.6 The NVIDIA Profiler
      7. 3.3.7 The Score-P Tools Infrastructure for Hybrid Applications
      8. 3.3.8 TAU Performance System
    4. 3.4 Identifying Bugs in OpenACC Programs
    5. 3.5 Summary
    6. 3.6 Exercises
  12. Chapter 4: Using OpenACC for Your First Program
    1. 4.1 Case Study
      1. 4.1.1 Serial Code
      2. 4.1.2 Compiling the Code
    2. 4.2 Creating a Naive Parallel Version
      1. 4.2.1 Find the Hot Spot
      2. 4.2.2 Is It Safe to Use kernels?
      3. 4.2.3 OpenACC Implementations
    3. 4.3 Performance of OpenACC Programs
    4. 4.4 An Optimized Parallel Version
      1. 4.4.1 Reducing Data Movement
      2. 4.4.2 Extra Clever Tweaks
      3. 4.4.3 Final Result
    5. 4.5 Summary
    6. 4.6 Exercises
  13. Chapter 5: Compiling OpenACC
    1. 5.1 The Challenges of Parallelism
      1. 5.1.1 Parallel Hardware
      2. 5.1.2 Mapping Loops
      3. 5.1.3 Memory Hierarchy
      4. 5.1.4 Reductions
      5. 5.1.5 OpenACC for Parallelism
    2. 5.2 Restructuring Compilers
      1. 5.2.1 What Compilers Can Do
      2. 5.2.2 What Compilers Can’t Do
    3. 5.3 Compiling OpenACC
      1. 5.3.1 Code Preparation
      2. 5.3.2 Scheduling
      3. 5.3.3 Serial Code
      4. 5.3.4 User Errors
    4. 5.4 Summary
    5. 5.5 Exercises
  14. Chapter 6: Best Programming Practices
    1. 6.1 General Guidelines
      1. 6.1.1 Maximizing On-Device Computation
      2. 6.1.2 Optimizing Data Locality
    2. 6.2 Maximize On-Device Compute
      1. 6.2.1 Atomic Operations
      2. 6.2.2 Kernels and Parallel Constructs
      3. 6.2.3 Runtime Tuning and the If Clause
    3. 6.3 Optimize Data Locality
      1. 6.3.1 Minimum Data Transfer
      2. 6.3.2 Data Reuse and the Present Clause
      3. 6.3.3 Unstructured Data Lifetimes
      4. 6.3.4 Array Shaping
    4. 6.4 A Representative Example
      1. 6.4.1 Background: Thermodynamic Tables
      2. 6.4.2 Baseline CPU Implementation
      3. 6.4.3 Profiling
      4. 6.4.4 Acceleration with OpenACC
      5. 6.4.5 Optimized Data Locality
      6. 6.4.6 Performance Study
    5. 6.5 Summary
    6. 6.6 Exercises
  15. Chapter 7: OpenACC and Performance Portability
    1. 7.1 Challenges
    2. 7.2 Target Architectures
      1. 7.2.1 Compiling for Specific Platforms
      2. 7.2.2 x86_64 Multicore and NVIDIA
    3. 7.3 OpenACC for Performance Portability
      1. 7.3.1 The OpenACC Memory Model
      2. 7.3.2 Memory Architectures
      3. 7.3.3 Code Generation
      4. 7.3.4 Data Layout for Performance Portability
    4. 7.4 Code Refactoring for Performance Portability
      1. 7.4.1 HACCMK
      2. 7.4.2 Targeting Multiple Architectures
      3. 7.4.3 OpenACC over NVIDIA K20x GPU
      4. 7.4.4 OpenACC over AMD Bulldozer Multicore
    5. 7.5 Summary
    6. 7.6 Exercises
  16. Chapter 8: Additional Approaches to Parallel Programming
    1. 8.1 Programming Models
      1. 8.1.1 OpenACC
      2. 8.1.2 OpenMP
      3. 8.1.3 CUDA
      4. 8.1.4 OpenCL
      5. 8.1.5 C++ AMP
      6. 8.1.6 Kokkos
      7. 8.1.7 RAJA
      8. 8.1.8 Threading Building Blocks
      9. 8.1.9 C++17
      10. 8.1.10 Fortran
    2. 8.2 Programming Model Components
      1. 8.2.1 Parallel Loops
      2. 8.2.2 Parallel Reductions
      3. 8.2.3 Tightly Nested Loops
      4. 8.2.4 Hierarchical Parallelism (Non-Tightly Nested Loops)
      5. 8.2.5 Task Parallelism
      6. 8.2.6 Data Allocation
      7. 8.2.7 Data Transfers
    3. 8.3 A Case Study
      1. 8.3.1 Serial Implementation
      2. 8.3.2 The OpenACC Implementation
      3. 8.3.3 The OpenMP Implementation
      4. 8.3.4 The CUDA Implementation
      5. 8.3.5 The Kokkos Implementation
      6. 8.3.6 The TBB Implementation
      7. 8.3.7 Some Performance Numbers
    4. 8.4 Summary
    5. 8.5 Exercises
  17. Chapter 9: OpenACC and Interoperability
    1. 9.1 Calling Native Device Code from OpenACC
      1. 9.1.1 Example: Image Filtering Using DFTs
      2. 9.1.2 The host_data Directive and the use_device Clause
      3. 9.1.3 API Routines for Target Platforms
    2. 9.2 Calling OpenACC from Native Device Code
    3. 9.3 Advanced Interoperability Topics
      1. 9.3.1 acc_map_data
      2. 9.3.2 Calling CUDA Device Routines from OpenACC Kernels
    4. 9.4 Summary
    5. 9.5 Exercises
  18. Chapter 10: Advanced OpenACC
    1. 10.1 Asynchronous Operations
      1. 10.1.1 Asynchronous OpenACC Programming
      2. 10.1.2 Software Pipelining
    2. 10.2 Multidevice Programming
      1. 10.2.1 Multidevice Pipeline
      2. 10.2.2 OpenACC and MPI
    3. 10.3 Summary
    4. 10.4 Exercises
  19. Chapter 11: Innovative Research Ideas Using OpenACC, Part I
    1. 11.1 Sunway OpenACC
      1. 11.1.1 The SW26010 Manycore Processor
      2. 11.1.2 The Memory Model in the Sunway TaihuLight
      3. 11.1.3 The Execution Model
      4. 11.1.4 Data Management
      5. 11.1.5 Summary
    2. 11.2 Compiler Transformation of Nested Loops for Accelerators
      1. 11.2.1 The OpenUH Compiler Infrastructure
      2. 11.2.2 Loop-Scheduling Transformation
      3. 11.2.3 Performance Evaluation of Loop Scheduling
      4. 11.2.4 Other Research Topics in OpenUH
  20. Chapter 12: Innovative Research Ideas Using OpenACC, Part II
    1. 12.1 A Framework for Directive-Based High-Performance Reconfigurable Computing
      1. 12.1.1 Introduction
      2. 12.1.2 Baseline Translation of OpenACC-to-FPGA
      3. 12.1.3 OpenACC Extensions and Optimization for Efficient FPGA Programming
      4. 12.1.4 Evaluation
      5. 12.1.5 Summary
    2. 12.2 Programming Accelerated Clusters Using XcalableACC
      1. 12.2.1 Introduction to XcalableMP
      2. 12.2.2 XcalableACC: XcalableMP Meets OpenACC
      3. 12.2.3 Omni Compiler Implementation
      4. 12.2.4 Performance Evaluation on HA-PACS
      5. 12.2.5 Summary
  21. Index
  22. Register Your Product