This chapter goes through the steps that one typically takes to optimize a CUDA Fortran application. Reducing the overhead of transferring data between the host and device is discussed first, both by reducing the amount of data transferred and by making the remaining transfers as efficient as possible. Once the data are on the device, we discuss at length how to access them efficiently from kernels, covering global memory coalescing, use of on-chip shared memory, and the read-only constant and texture memories. Launching kernels with enough parallelism, whether instruction-level or thread-level, is also discussed. The final section covers instruction optimization. ...