
You will notice that the shader program is much longer than the direct
approach. The first part of the program loads the input image values to group-
shared memory. But why is this so complicated? The test image is 1024×768.
For the sake of simplicity in the presentation, suppose R = 1, in which case we
are convolving with a 3 × 3 kernel. Suppose that the number of x-threads is
512 and the number of x-groups is 2. Producing an output at pixel $(x_0, y_0)$
requires accessing pixels with x-value satisfying $x_0 - R \le x \le x_0 + R$ and with
y-value satisfying $y_0 - R \le y \le y_0 + R$. If $(x_0, y_0)$ is within R pixels of the pixels
represented by the thread group, ...
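
The index arithmetic behind this can be sketched in C++ along the x-axis. This is only an illustration, not the shader from the text: the names `Range`, `OutputRange`, and `InputRange` are hypothetical, and clamping reads to the image bounds is an assumed border policy.

```cpp
#include <algorithm>

// Inclusive range of pixel indices along one axis.
struct Range
{
    int first;
    int last;
};

// Pixels for which one thread group produces output. Group 'group'
// covers numThreads consecutive pixels starting at group * numThreads.
Range OutputRange(int group, int numThreads)
{
    int const first = group * numThreads;
    return Range{ first, first + numThreads - 1 };
}

// Pixels the group must read to convolve with a (2R+1)-wide kernel:
// the output range widened by 'radius' on each side. Clamping to
// [0, size-1] is an assumption made for this sketch.
Range InputRange(Range output, int radius, int size)
{
    return Range{ std::max(output.first - radius, 0),
                  std::min(output.last + radius, size - 1) };
}
```

For the 1024-pixel-wide test image with 512 x-threads, 2 x-groups, and R = 1, group 0 produces outputs for x in [0, 511] but reads x in [0, 512], and group 1 produces outputs for x in [512, 1023] but reads x in [511, 1023]. Each group therefore needs at least one input pixel that none of its own threads is responsible for, which suggests why loading group-shared memory takes more than one read per thread.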