The streaming multiprocessors (SMs) are the part of the GPU that runs our CUDA kernels. Each SM contains the following:
• Thousands of registers that can be partitioned among threads of execution
• Several caches:
– Shared memory for fast data interchange between threads
– Constant cache for fast broadcast of reads from constant memory
– Texture cache to aggregate bandwidth from texture memory
– L1 cache to reduce latency to local or global memory
• Warp schedulers that can quickly switch contexts between threads and issue instructions to warps that are ready to execute
• Execution cores for integer and floating-point operations:
– Integer and single-precision floating-point operations
– Double-precision floating ...
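Because an SM's registers are partitioned among threads, the number of registers each thread needs puts a ceiling on how many warps the warp schedulers have to choose from. The host-side sketch below illustrates that arithmetic. The `SmLimits` structure, the `maxResidentWarps` function, and the limit values used with them are hypothetical, chosen only for illustration; real limits vary by architecture, and real hardware also applies allocation granularity, shared memory, and block-count constraints that this sketch ignores.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits, for illustration only. Real values vary
// by GPU architecture and should be queried from the device.
struct SmLimits {
    int registersPerSm;   // size of the SM's register file, in 32-bit registers
    int maxThreadsPerSm;  // hardware cap on resident threads
    int warpSize;         // threads per warp
};

// Upper bound on the warps that can be resident on one SM when every
// thread needs regsPerThread registers. This is a simplification: it
// ignores register allocation granularity, shared memory usage, and
// per-SM block limits, all of which can lower the real figure.
int maxResidentWarps(const SmLimits& sm, int regsPerThread) {
    assert(regsPerThread > 0);
    int threadsByRegs = sm.registersPerSm / regsPerThread;       // register-file limit
    int threads = std::min(threadsByRegs, sm.maxThreadsPerSm);   // apply thread cap
    return threads / sm.warpSize;                                // whole warps only
}
```

Assuming an SM with 65,536 registers, a cap of 2,048 resident threads, and 32-thread warps, `maxResidentWarps(sm, 32)` gives 64 warps, while doubling the per-thread register count to 64 halves that to 32 warps.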