Chapter 7: Tuning Instruction-Level Primitives
What's in this chapter?
- Learning about multiple classes of CUDA instructions and their impact on application behavior
- Observing the relative accuracy of single- and double-precision floating-point values
- Experimenting with the performance and accuracy of standard and intrinsic functions
- Uncovering undefined behavior from unsafe memory accesses
- Understanding the significance of arithmetic instructions and the consequences of using them improperly
The primary motivation for using CUDA in an application is usually the computational throughput of GPUs. As you learned in previous chapters, achieving high throughput on a GPU requires understanding which factors limit peak performance. You have already learned about CUDA tools that help you determine whether a workload is sensitive to latency, bandwidth, or arithmetic operations. Based on this understanding, you can generally classify applications into two categories:
- I/O-bound
- Compute-bound
In this chapter, you will focus on tuning compute-bound workloads. The computational throughput of a processor can be measured by the number of operations it performs in a period of ...