Streaming workloads are among the simplest that can be ported to CUDA: computations where each data element can be computed independently of the others, often with such low computational density that the workload is bandwidth-bound. Streaming workloads do not use many of the hardware resources of the GPU, such as caches and shared memory, that are designed to optimize reuse of data.
Since GPUs give the biggest benefits on workloads with high computational density, it might be useful to review some cases when it still makes sense for streaming workloads to port to GPUs.
• If the input and output are in device memory, it doesn’t make sense to transfer the data back to the CPU just to perform one operation.
• If the ...