
162 CHAPTER 6. SHARED-MEMORY: GPUS
• The kernel function, find1elt() in this case, runs on the GPU, and
is so denoted by the prefix __global__.
• The host code sets up space in the device memory via calls to
cudaMalloc(), and transfers data from host to device or vice versa by
calling cudaMemcpy(). The data on the device side is global to all
threads.
• The host code launches the kernel via the lines
    dim3 dimGrid(n,1);  // n blocks in the grid
    dim3 dimBlock(1,1,1);  // 1 thread per block
    find1elt<<<dimGrid,dimBlock>>>(dm,drs,n);
• Each thread executes the kernel, working on a different row of the
shared input matrix,