
Hands-On GPU Programming with Python and CUDA

Book Description

Build real-world applications with Python 2.7, CUDA 9, and CUDA 10. We suggest Python 2.7 over Python 3.x because it has stable support across all the libraries used in this book.

Key Features

  • Expand your background in GPU programming—PyCUDA, scikit-cuda, and Nsight
  • Effectively use CUDA libraries such as cuBLAS, cuFFT, and cuSolver
  • Apply GPU programming to modern data science applications


Hands-On GPU Programming with Python and CUDA hits the ground running: you'll start by learning how to apply Amdahl's Law, use a code profiler to identify bottlenecks in your Python code, and set up an appropriate GPU programming environment. You'll then see how to “query” the GPU's features and copy arrays of data to and from the GPU's own memory.
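Amdahl's Law, which the book opens with, bounds the speedup you can expect from parallelizing only part of a program. A minimal sketch in plain Python (the function name here is illustrative, not the book's code):

```python
def amdahl_speedup(p, n):
    """Maximum speedup when a fraction p of the runtime is
    parallelizable across n processors: S = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# If 75% of the runtime parallelizes across 4 cores:
print(round(amdahl_speedup(0.75, 4), 2))  # 2.29
```

Note that even with unlimited processors, the speedup is capped at 1 / (1 - p): the serial fraction dominates, which is why profiling to find the truly parallelizable bottlenecks comes first.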

As you make your way through the book, you'll launch code directly onto the GPU and write full-blown GPU kernels and device functions in CUDA C. You'll get to grips with profiling GPU code effectively, and fully test and debug your code using the Nsight IDE. Next, you'll explore some of the better-known NVIDIA libraries, such as cuFFT and cuBLAS.

With a solid background in place, you'll apply your newfound knowledge to develop your very own GPU-based deep neural network from scratch. You'll then explore advanced topics, such as warp shuffling, dynamic parallelism, and PTX assembly. In the final chapter, you'll see some topics and applications related to GPU programming that you may wish to pursue, including AI, graphics, and blockchain.
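The dense-layer and softmax building blocks that the book later implements as GPU kernels can be sketched on the CPU in plain Python (function names here are illustrative, not the book's API):

```python
import math

def dense_forward(x, weights, biases):
    """Forward pass of one dense layer with ReLU activation:
    out[j] = max(0, sum_i x[i] * weights[j][i] + biases[j])."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(xi * wi for xi, wi in zip(x, w_row)) + b
        out.append(max(0.0, z))  # ReLU
    return out

def softmax(z):
    """Softmax output layer: exponentiate and normalize to probabilities."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# Two inputs, two neurons: the first neuron's pre-activation is negative,
# so ReLU zeroes it out.
hidden = dense_forward([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.5])
print(hidden)           # [0.0, 3.5]
print(softmax(hidden))  # probabilities summing to 1
```

On the GPU, each output neuron's dot product maps naturally onto a CUDA thread, which is the parallelization the book develops.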

By the end of this book, you will be able to apply GPU programming to problems related to data science and high-performance computing.

What you will learn

  • Launch GPU code directly from Python
  • Write effective and efficient GPU kernels and device functions
  • Use libraries such as cuFFT, cuBLAS, and cuSolver
  • Debug and profile your code with Nsight and Visual Profiler
  • Apply GPU programming to data science problems
  • Build a GPU-based deep neural network from scratch
  • Explore advanced GPU hardware features, such as warp shuffling

Who this book is for

Hands-On GPU Programming with Python and CUDA is for developers and data scientists who want to learn the basics of effective GPU programming to improve the performance of their Python code. You should have an understanding of first-year college or university-level engineering mathematics and physics, and have some experience with Python as well as with a C-based programming language such as C, C++, Go, or Java.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Title Page
  2. Copyright and Credits
    1. Hands-On GPU Programming with Python and CUDA
  3. Dedication
  4. About Packt
    1. Why subscribe?
    2. Packt.com
  5. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  7. Why GPU Programming?
    1. Technical requirements
    2. Parallelization and Amdahl's Law
      1. Using Amdahl's Law
      2. The Mandelbrot set
    3. Profiling your code
      1. Using the cProfile module
    4. Summary
    5. Questions
  8. Setting Up Your GPU Programming Environment
    1. Technical requirements
    2. Ensuring that we have the right hardware
      1. Checking your hardware (Linux)
      2. Checking your hardware (Windows)
    3. Installing the GPU drivers
      1. Installing the GPU drivers (Linux)
      2. Installing the GPU drivers (Windows)
    4. Setting up a C++ programming environment
      1. Setting up GCC, Eclipse IDE, and graphical dependencies (Linux)
      2. Setting up Visual Studio (Windows)
      3. Installing the CUDA Toolkit
        1. Installing the CUDA Toolkit (Linux)
        2. Installing the CUDA Toolkit (Windows)
    5. Setting up our Python environment for GPU programming
      1. Installing PyCUDA (Linux)
      2. Creating an environment launch script (Windows)
      3. Installing PyCUDA (Windows)
      4. Testing PyCUDA
    6. Summary
    7. Questions
  9. Getting Started with PyCUDA
    1. Technical requirements
    2. Querying your GPU
      1. Querying your GPU with PyCUDA
    3. Using PyCUDA's gpuarray class
      1. Transferring data to and from the GPU with gpuarray
      2. Basic pointwise arithmetic operations with gpuarray
        1. A speed test
    4. Using PyCUDA's ElementWiseKernel for performing pointwise computations
      1. Mandelbrot revisited
      2. A brief foray into functional programming
      3. Parallel scan and reduction kernel basics
    5. Summary
    6. Questions
  10. Kernels, Threads, Blocks, and Grids
    1. Technical requirements
    2. Kernels
      1. The PyCUDA SourceModule function
    3. Threads, blocks, and grids
      1. Conway's game of life
    4. Thread synchronization and intercommunication
      1. Using the __syncthreads() device function
      2. Using shared memory
    5. The parallel prefix algorithm
      1. The naive parallel prefix algorithm
      2. Inclusive versus exclusive prefix
      3. A work-efficient parallel prefix algorithm
        1. Work-efficient parallel prefix (up-sweep phase)
        2. Work-efficient parallel prefix (down-sweep phase)
      4. Work-efficient parallel prefix — implementation 
    6. Summary
    7. Questions
  11. Streams, Events, Contexts, and Concurrency
    1. Technical requirements
    2. CUDA device synchronization
      1. Using the PyCUDA stream class
      2. Concurrent Conway's game of life using CUDA streams
    3. Events
      1. Events and streams
    4. Contexts
      1. Synchronizing the current context
      2. Manual context creation
      3. Host-side multiprocessing and multithreading
      4. Multiple contexts for host-side concurrency
    5. Summary
    6. Questions
  12. Debugging and Profiling Your CUDA Code
    1. Technical requirements
    2. Using printf from within CUDA kernels
      1. Using printf for debugging
    3. Filling in the gaps with CUDA-C
    4. Using the Nsight IDE for CUDA-C development and debugging
      1. Using Nsight with Visual Studio in Windows
      2. Using Nsight with Eclipse in Linux
      3. Using Nsight to understand the warp lockstep property in CUDA
    5. Using the NVIDIA nvprof profiler and Visual Profiler
    6. Summary
    7. Questions
  13. Using the CUDA Libraries with Scikit-CUDA
    1. Technical requirements
    2. Installing Scikit-CUDA
    3. Basic linear algebra with cuBLAS
      1. Level-1 AXPY with cuBLAS
      2. Other level-1 cuBLAS functions
      3. Level-2 GEMV in cuBLAS
      4. Level-3 GEMM in cuBLAS for measuring GPU performance
    4. Fast Fourier transforms with cuFFT
      1. A simple 1D FFT
      2. Using an FFT for convolution
      3. Using cuFFT for 2D convolution 
    5. Using cuSolver from Scikit-CUDA
      1. Singular value decomposition (SVD)
      2. Using SVD for Principal Component Analysis (PCA)
    6. Summary
    7. Questions
  14. The CUDA Device Function Libraries and Thrust
    1. Technical requirements
    2. The cuRAND device function library
      1. Estimating π with Monte Carlo
    3. The CUDA Math API
      1. A brief review of definite integration
      2. Computing definite integrals with the Monte Carlo method
      3. Writing some test cases
    4. The CUDA Thrust library
      1. Using functors in Thrust
    5. Summary
    6. Questions
  15. Implementation of a Deep Neural Network
    1. Technical requirements
    2. Artificial neurons and neural networks
      1. Implementing a dense layer of artificial neurons
    3. Implementation of the softmax layer
    4. Implementation of Cross-Entropy loss
    5. Implementation of a sequential network
      1. Implementation of inference methods
      2. Gradient descent
      3. Conditioning and normalizing data
    6. The Iris dataset
    7. Summary
    8. Questions
  16. Working with Compiled GPU Code
    1. Launching compiled code with Ctypes
      1. The Mandelbrot set revisited (again)
        1. Compiling the code and interfacing with Ctypes
    2. Compiling and launching pure PTX code
    3. Writing wrappers for the CUDA Driver API
      1. Using the CUDA Driver API
    4. Summary
    5. Questions
  17. Performance Optimization in CUDA
    1. Dynamic parallelism
      1. Quicksort with dynamic parallelism
    2. Vectorized data types and memory access
    3. Thread-safe atomic operations
    4. Warp shuffling
    5. Inline PTX assembly
    6. Performance-optimized array sum 
    7. Summary
    8. Questions
  18. Where to Go from Here
    1. Furthering your knowledge of CUDA and GPGPU programming
      1. Multi-GPU systems
      2. Cluster computing and MPI
      3. OpenCL and PyOpenCL
    2. Graphics
      1. OpenGL
      2. DirectX 12
      3. Vulkan
    3. Machine learning and computer vision
      1. The basics
      2. cuDNN
      3. TensorFlow and Keras
      4. Chainer
      5. OpenCV
    4. Blockchain technology
    5. Summary
    6. Questions
  19. Assessment
    1. Chapter 1, Why GPU Programming?
    2. Chapter 2, Setting Up Your GPU Programming Environment
    3. Chapter 3, Getting Started with PyCUDA
    4. Chapter 4, Kernels, Threads, Blocks, and Grids
    5. Chapter 5, Streams, Events, Contexts, and Concurrency
    6. Chapter 6, Debugging and Profiling Your CUDA Code
    7. Chapter 7, Using the CUDA Libraries with Scikit-CUDA
    8. Chapter 8, The CUDA Device Function Libraries and Thrust
    9. Chapter 9, Implementation of a Deep Neural Network
    10. Chapter 10, Working with Compiled GPU Code
    11. Chapter 11, Performance Optimization in CUDA
    12. Chapter 12, Where to Go from Here
  20. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think