Before we move on, let's work through an example to see how we can approach debugging a CUDA kernel with printf. There is no exact science to this method, but it is a skill that can be learned through experience. We will start with a CUDA kernel that is meant to perform matrix-matrix multiplication, but that has several bugs in it. (The reader is encouraged to follow along in the code, which is available as the broken_matrix_ker.py file in the 6 directory within the repository.)
Let's briefly review matrix-matrix multiplication before we continue. Suppose we have two matrices, A and B, and we multiply ...
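When hunting for bugs in a kernel like this, it can help to have a slow but obviously correct CPU version to compare against. The following is a minimal sketch in plain Python, assuming the standard definition of the matrix product, in which entry (i, j) of the result is the sum over k of A[i][k] * B[k][j]. The name matmul_reference is hypothetical and is not part of broken_matrix_ker.py.

```python
def matmul_reference(A, B):
    """Multiply two matrices (lists of rows) directly from the definition."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    # The product is only defined when A has as many columns as B has rows.
    if any(len(row) != n_inner for row in A):
        raise ValueError("the number of columns of A must equal the number of rows of B")
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            # Entry (i, j) is the dot product of row i of A and column j of B.
            for k in range(n_inner):
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_reference(A, B))  # prints [[19.0, 22.0], [43.0, 50.0]]
```

Comparing a kernel's output against a small, hand-checkable case like this one is one way to tell whether a printf-guided fix actually moved the result closer to the correct answer.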