18 Existing Parallel and Distributed Systems, Challenges, and Solutions
• optimization of shared memory and register usage, which may increase
the parallelization level;
• loop unrolling;
• launching several kernels and overlapping communication and computations;
• using multiple GPUs.
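The loop-unrolling item in the list above can be illustrated with a minimal sketch. It is written in plain C rather than as a CUDA kernel so that it compiles without a GPU toolchain. The function names sum_rolled and sum_unrolled4 and the unroll factor of 4 are assumptions chosen for illustration and do not come from the text. Inside a CUDA kernel the same transformation can be applied by hand, or requested from the nvcc compiler with the `#pragma unroll` hint.

```c
#include <stddef.h>

/* Reference version: one addition and one loop test per element. */
float sum_rolled(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Hypothetical version unrolled by a factor of 4 (the factor is an
 * assumption). The main loop performs four additions per loop test and
 * keeps four independent accumulators, so the additions within one
 * iteration do not depend on each other. */
float sum_unrolled4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: runs while at least four elements remain. */
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Both functions return the same value when every partial sum is exactly representable. For general floating-point input the unrolled version adds the elements in a different order, so the results may differ slightly in rounding. Whether unrolling helps in practice depends on the compiler and on register pressure, which is why it appears in the list next to register usage.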
OpenCL offers an API similar to that of NVIDIA CUDA but generalized
to run not only on GPUs but also on multicore CPUs. This allows us to use
a modern workstation with several multicore CPUs and one or possibly more
GPUs as a cluster of processing cores for highly parallel multithreaded codes.
Because of that, OpenCL requires somewhat more management code for device
discovery and handling, but its kernel and grid concepts remain analogous
to those of NVIDIA CUDA.
Developmen ...