CUDA Programming

by Shane Cook

December 2012

Intermediate to advanced

600 pages

18h 19m

English

Morgan Kaufmann

Introduction; Von Neumann Architecture; Cray; Connection Machine; Cell Processor; Multinode Computing; The Early Days of GPGPU Coding; The Death of the Single-Core Solution; NVIDIA and CUDA; GPU Hardware; Alternatives to CUDA; Conclusion
Introduction; Traditional Serial Code; Serial/Parallel Problems; Concurrency; Types of Parallelism; Flynn’s Taxonomy; Some Common Parallel Patterns; Conclusion
PC Architecture; GPU Hardware; CPUs and GPUs; Compute Levels
Introduction; Installing the SDK Under Windows; Visual Studio; Linux; Mac; Installing a Debugger; Compilation Model; Error Handling; Conclusion
What it all Means; Threads; Blocks; Grids; Warps; Block Scheduling; A Practical Example: Histograms; Conclusion

Introduction; Caches; Register Usage; Shared Memory; Constant Memory
Texture Memory; Conclusion
Introduction; Serial and Parallel Code; Processing Datasets; Profiling; An Example Using AES; Conclusion; References
Introduction; Locality; Multi-CPU Systems; Multi-GPU Systems; Algorithms on Multiple GPUs; Which GPU?; Single-Node Systems; Streams; Multiple-Node Systems; Conclusion
Strategy 1: Parallel/Serial GPU/CPU Problem Breakdown; Strategy 2: Memory Considerations; Strategy 3: Transfers; Strategy 4: Thread Usage, Calculations, and Divergence
Strategy 6: Resource Contentions; Strategy 7: Self-Tuning Applications; Conclusion
Introduction; Libraries; CUDA Computing SDK; Directive-Based Programming; Writing Your Own Kernels; Conclusion
Introduction; CPU Processor; GPU Device; PCI-E Bus; GeForce cards; CPU Memory; Air Cooling; Liquid Cooling; Desktop Cases and Motherboards; Mass Storage; Power Considerations; Operating Systems; Conclusion
Introduction; Errors With CUDA Directives; Parallel Programming Issues; Algorithmic Issues; Finding and Avoiding Errors; Developing for Future GPUs; Further Resources; Conclusion; References

Content preview from CUDA Programming

Chapter 6 Memory Handling with CUDA

Introduction

The conventional CPU model uses what is called a linear, or flat, memory model: any single CPU core can access any memory location without restriction. In practice, CPU hardware typically places level one (L1), level two (L2), and level three (L3) caches between the core and main memory. Anyone who has optimized CPU code or comes from a high-performance computing (HPC) background will be all too familiar with this hierarchy. Most programmers, however, can easily abstract it away.

Abstraction has been a trend in modern programming languages, where the programmer is further and further removed from the underlying hardware. While this can lead to higher levels of productivity, as problems ...