Programming with Matrix-Multiply Assist
This chapter describes Matrix-Multiply Assist (MMA) implementations of various kernels at different levels of precision. Each implementation shown uses a single accumulator.
3.1 Single-precision GEMM using MMA
The innermost kernel of sgemm_kernel_4x4, shown in Example 3-1, loads four elements of A and four elements of B, and then performs one outer-product MMA operation to produce a 4x4 partial result of C in a single accumulator register.
Example 3-1 SGEMM kernel using MMA instructions
.section ...
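
The arithmetic that the innermost kernel performs can be sketched in portable C. This is a scalar model only, not the MMA code itself. The function names outer_product_4x4_acc and sgemm_tile_4x4 are hypothetical, the local array acc stands in for the accumulator register, and the panel layout (four contiguous elements of A and four of B per step) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of one outer-product step on a 4x4 tile:
   acc[i][j] += a[i] * b[j]. The array acc stands in for the accumulator. */
void outer_product_4x4_acc(float acc[4][4], const float a[4], const float b[4])
{
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            acc[i][j] += a[i] * b[j];
}

/* Hypothetical model of the innermost loop: k outer-product steps, each
   consuming four elements of A and four elements of B.
   Assumed layout: step p reads a_panel[4*p .. 4*p+3] and b_panel[4*p .. 4*p+3]. */
void sgemm_tile_4x4(float acc[4][4], const float *a_panel,
                    const float *b_panel, size_t k)
{
    assert(a_panel != NULL && b_panel != NULL);
    for (size_t p = 0; p < k; ++p)
        outer_product_4x4_acc(acc, a_panel + 4 * p, b_panel + 4 * p);
}
```

Calling sgemm_tile_4x4 with a zeroed acc and k steps leaves in acc the 4x4 partial result of C, which is the sum of k outer products. On hardware, each call to outer_product_4x4_acc corresponds to the single outer-product MMA operation described above.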

