Chapter 3



This chapter walks through the steps typically taken to optimize a CUDA Fortran application. It first discusses reducing the overhead of transferring data between the host and device, both by reducing the amount of data transferred and by making such transfers as efficient as possible. Once the data are on the device, we discuss at length how to access them efficiently from kernels, covering global memory coalescing, use of on-chip shared memory, and the read-only constant and texture memories. Launching kernels with enough parallelism, whether instruction-level or thread-level, is also discussed. The final section covers instruction optimization. ...
