We have spent a good amount of time understanding how matrix multiplication works, and we have seen its sequential form. Now we are going to map it to OpenCL in the most direct way.
The implementation technique here relies on creating 2D thread blocks, in which each thread/work item uses its index in each dimension to access its respective elements in the row and column dimensions.
In this recipe, we are going to use two matrices of dimensions 1024 x 1024, which we call A and B, and we will multiply them together to produce a third 1024 x 1024 matrix, which we call C.
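To make the 2D mapping concrete, here is a minimal host-side sketch in plain C that emulates the idea rather than the recipe's actual kernel. The function names matmul_work_item and matmul_emulated are hypothetical, the matrices are assumed to be square and stored in row-major order, and the choice of which dimension supplies the row and which the column is an assumption. In an OpenCL kernel, the two indices would typically come from get_global_id(0) and get_global_id(1) instead of loop counters.

```c
#include <stddef.h>

/* Hypothetical stand-in for a single work item: it computes exactly one
   element of C. In a real kernel, row and col would be read from the
   2D global ID rather than passed in as arguments. */
static void matmul_work_item(size_t row, size_t col, size_t n,
                             const float *A, const float *B, float *C)
{
    float sum = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        /* Walk along row `row` of A and down column `col` of B. */
        sum += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = sum;
}

/* Emulates an n x n 2D index space on the host: the two loops visit every
   (row, col) pair once, which is the set of work items the device would
   otherwise run in parallel. */
void matmul_emulated(size_t n, const float *A, const float *B, float *C)
{
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            matmul_work_item(row, col, n, A, B, C);
        }
    }
}
```

With n set to 1024, as in this recipe, the index space contains 1024 x 1024 = 1,048,576 work items, each performing 1024 multiply-add steps. Because no work item writes to an element that another one reads or writes, they can all run independently.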
You may wish to refresh your basic matrix theory at this point to convince yourself ...