Advanced programming concepts
This chapter describes how to optimize the matrix multiplication examples shown in this publication. It also explains techniques for using the compute units more effectively through load reuse and cache blocking.
4.1 Multiple accumulators SGEMM for load value reuse
Consider the SGEMM example shown in Example 3-4 on page 25, which uses one accumulator to generate a 4x4 result and requires two load operations for each gerpp instruction. This type of model restricts the ...
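The load-to-compute ratio mentioned above can be illustrated with a small portable C sketch. It is not the code from Example 3-4 and does not use the MMA built-ins: the helper `rank1_update_4x4` is a hypothetical scalar stand-in for a single gerpp-style 4x4 outer-product accumulate, and the function names, the packed panel layouts, and the 2x2 arrangement of accumulators are assumptions made for illustration only. Each routine returns the number of 4-float vector loads that it issues, so the two approaches can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar stand-in for one gerpp-style operation:
 * acc (4x4) += a (4x1 column) * b (1x4 row). */
static void rank1_update_4x4(float acc[4][4], const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i][j] += a[i] * b[j];
}

/* One accumulator, 4x4 result tile.
 * Assumed layout: a and b each hold 4 floats per k step.
 * Every rank-1 update needs two fresh vector loads.
 * Returns the number of 4-float loads issued. */
size_t tile_4x4_one_acc(size_t k, const float *a, const float *b, float c[4][4])
{
    size_t loads = 0;
    assert(a != NULL && b != NULL);
    for (size_t p = 0; p < k; p++) {
        float va[4], vb[4];
        memcpy(va, a + 4 * p, sizeof va); loads++;
        memcpy(vb, b + 4 * p, sizeof vb); loads++;
        rank1_update_4x4(c, va, vb);
    }
    return loads;
}

/* Four accumulators arranged 2x2, 8x8 result tile.
 * Assumed layout: a and b each hold 8 floats per k step.
 * Four vector loads feed four rank-1 updates, because each loaded
 * vector is reused by two accumulators.
 * Returns the number of 4-float loads issued. */
size_t tile_8x8_four_acc(size_t k, const float *a, const float *b,
                         float c[2][2][4][4])
{
    size_t loads = 0;
    assert(a != NULL && b != NULL);
    for (size_t p = 0; p < k; p++) {
        float va[2][4], vb[2][4];
        memcpy(va[0], a + 8 * p,     sizeof va[0]); loads++;
        memcpy(va[1], a + 8 * p + 4, sizeof va[1]); loads++;
        memcpy(vb[0], b + 8 * p,     sizeof vb[0]); loads++;
        memcpy(vb[1], b + 8 * p + 4, sizeof vb[1]); loads++;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                rank1_update_4x4(c[i][j], va[i], vb[j]);
    }
    return loads;
}
```

In this sketch, `tile_4x4_one_acc` issues 2k loads for k updates (two loads per update), while `tile_8x8_four_acc` issues 4k loads for 4k updates (one load per update). The result tiles must be zero-initialized by the caller before the first call.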
