March 2026
Intermediate
534 pages
12h 51m
English
With CUDA kernel development and profiling covered, it's time to learn how to optimize CUDA code for maximum performance.
This chapter breaks down how GPU hardware executes kernels and what drives performance. It dissects the memory hierarchy, instruction scheduling, and the thread execution model to show how each shapes code efficiency. Understanding these mechanics lets us pinpoint bottlenecks and use the hardware more effectively. The discussion also covers actionable optimization techniques, such as improving memory access patterns, warp-level programming, and loop unrolling, to push CUDA applications to their limits.
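To give a flavor of the techniques the chapter names, here is a minimal sketch (not taken from the book) of three of them: coalesced memory access, loop unrolling, and warp-level programming with shuffle intrinsics. The kernel and helper names are illustrative only.

```cuda
// Hypothetical kernel: consecutive threads touch consecutive addresses,
// so each warp's loads and stores coalesce into few memory transactions.
__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Hint the compiler to unroll the grid-stride loop, exposing more
    // independent instructions per thread to hide memory latency.
    #pragma unroll 4
    for (; i < n; i += stride)
        out[i] = in[i] * factor;
}

// Warp-level reduction: lanes exchange values directly through
// registers via __shfl_down_sync, with no shared memory round trip.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}
```

The chapter itself develops these patterns in depth; the sketch above only previews the shape they take in code.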
The learning outcomes of this chapter are as follows: