
7. Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs 347
This notation allows us to write a single iteration over k as
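\[
\begin{gathered}
\bigl(A[i,k]\bigr)_{i=32n}^{32n+31} \times 4, \quad
\bigl(B[4k+0,j]\bigr)_{j=4m}^{4m+3} \times 32, \quad
\bigl(B[4k+1,j]\bigr)_{j=4m}^{4m+3} \times 32, \\
\bigl(B[4k+2,j]\bigr)_{j=4m}^{4m+3} \times 32, \quad
\bigl(B[4k+3,j]\bigr)_{j=4m}^{4m+3} \times 32.
\end{gathered}
\]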
As a cache line has space for four float4 elements, we see that the reads from
A read the first quarter of 32 consecutive cache lines and the reads from B read
four full cache lines. To get full cache lines instead, we consider four consecutive
iterations in k together, and we see that those four iterations read 32 full cache
lines from A and 16 full cache lines from B. For the moment, we restrict ourselves ...
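
The cache-line counts above can be checked with a small model. The following is a minimal sketch, not the chapter's own code: it assumes row-major storage, rows that start on a cache-line boundary, and hypothetical row widths a_width and b_width measured in float4 elements (both multiples of four). The helper names footprint, reads_a, reads_b, and summarize are illustrative only. For each set of reads it reports how many cache lines are touched and how many of those are read in full.

```python
# Model of the cache-line footprint of the reads in the text.
# Assumptions (not from the text): row-major storage, cache-line-aligned rows,
# and hypothetical row widths a_width / b_width counted in float4 elements.

FLOAT4_PER_LINE = 4  # a cache line holds four float4 elements (from the text)


def footprint(reads):
    """Map each touched cache line to the set of float4 slots (0..3) read in it."""
    lines = {}
    for index in reads:
        slot = index % FLOAT4_PER_LINE
        lines.setdefault(index // FLOAT4_PER_LINE, set()).add(slot)
    return lines


def reads_a(n, ks, a_width):
    """float4 indices of A[i, k] for i = 32n .. 32n+31 and each k in ks."""
    return [i * a_width + k for k in ks for i in range(32 * n, 32 * n + 32)]


def reads_b(m, ks, b_width):
    """float4 indices of B[4k+r, j] for r = 0..3, j = 4m .. 4m+3, each k in ks."""
    return [(4 * k + r) * b_width + j
            for k in ks
            for r in range(4)
            for j in range(4 * m, 4 * m + 4)]


def summarize(reads):
    """Return (cache lines touched, cache lines read in full)."""
    lines = footprint(reads)
    full = sum(1 for slots in lines.values() if len(slots) == FLOAT4_PER_LINE)
    return len(lines), full


if __name__ == "__main__":
    a_width, b_width = 64, 64  # hypothetical row widths, multiples of 4
    n, m = 1, 2                # arbitrary values of n and m
    print(summarize(reads_a(n, [0], a_width)))       # one iteration, A: (32, 0)
    print(summarize(reads_b(m, [0], b_width)))       # one iteration, B: (4, 4)
    print(summarize(reads_a(n, range(4), a_width)))  # four iterations, A: (32, 32)
    print(summarize(reads_b(m, range(4), b_width)))  # four iterations, B: (16, 16)
```

Under these assumptions, a single iteration touches 32 cache lines of A without filling any of them, while four consecutive iterations starting at a multiple of four fill all 32 lines of A and 16 lines of B, matching the counts in the text.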