
Implementing an efficient convolution operation on GPU 63
FIGURE 5.3. Organization of the data prefetching stage for a 5 × 5 mask
and a thread block size of 8 × 4. The threads in the two top corners of the top
figure are each identified by either a circle or a star symbol. The image tile,
loaded into shared memory, includes the pixels to be updated by the threads of
the block as well as a 2-pixel-wide halo around them. In the image tile, the
circle and star symbols show which pixels are actually loaded into one shared
memory vector by the corresponding thread.
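
The tile geometry described in the caption can be checked with a short host-side sketch. It is not the load pattern drawn in the figure: the function names, the `pixels_per_thread` parameter, and the row-major strided assignment of tile pixels to threads are all assumptions made for illustration. The only facts taken from the caption are the 5 × 5 mask, the 8 × 4 thread block, and the resulting 2-pixel halo.

```python
def halo_radius(mask_size):
    """Halo width for an odd, square mask: 2 for a 5 x 5 mask."""
    if mask_size % 2 == 0:
        raise ValueError("mask size must be odd")
    return (mask_size - 1) // 2


def tile_shape(block_w, block_h, mask_size, pixels_per_thread=1):
    """Width and height of the shared-memory tile: updated pixels plus halo.

    pixels_per_thread is a hypothetical parameter (pixels updated per thread
    along x); the default of 1 is an assumption, not the book's setting.
    """
    r = halo_radius(mask_size)
    return block_w * pixels_per_thread + 2 * r, block_h + 2 * r


def load_assignment(block_w, block_h, mask_size, pixels_per_thread=1):
    """Map each thread (tx, ty) to the tile pixels it would prefetch.

    Assumed scheme: threads sweep the tile in row-major order with a stride
    equal to the number of threads in the block.
    """
    tile_w, tile_h = tile_shape(block_w, block_h, mask_size, pixels_per_thread)
    n_threads = block_w * block_h
    assignment = {}
    for ty in range(block_h):
        for tx in range(block_w):
            tid = ty * block_w + tx  # linear thread index within the block
            assignment[(tx, ty)] = [
                (i % tile_w, i // tile_w)  # (x, y) position inside the tile
                for i in range(tid, tile_w * tile_h, n_threads)
            ]
    return assignment


if __name__ == "__main__":
    print(tile_shape(8, 4, 5))                   # tile width and height
    print(load_assignment(8, 4, 5)[(0, 0)])      # pixels loaded by thread (0, 0)
```

Under the one-pixel-per-thread assumption, the 32 threads of an 8 × 4 block share a 12 × 8 tile, so each thread prefetches three pixels before the block synchronizes and starts the convolution. Setting `pixels_per_thread` above 1 widens the tile accordingly.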
                 Mask size →
Image size ↓     3 × 3    5 × 5    7 × 7    9 × 9    11 × 11    13 × 13
512 × 512        1394     1176     907      670      567        477
1024 × 1024      ...