The naive implementation can do a good job if our application is the only one using the hardware and transforming each chunk has the same computational cost. This is rarely the case, however; what we want is a good general-purpose parallel implementation.
The following illustrations show the problems we want to avoid. If the computational cost differs from chunk to chunk, the overall running time is determined by the chunk that takes the longest:
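A small numeric sketch can make this bottleneck concrete. It is not a parallel implementation, only a cost model: it assumes one core per chunk, contiguous chunks of equal length, and no scheduling overhead. The function names `chunk_costs` and `makespan` and the cost values are hypothetical, chosen only for illustration.

```python
def chunk_costs(costs, n_chunks):
    """Split per-element costs into contiguous chunks and sum each chunk."""
    size = -(-len(costs) // n_chunks)  # ceiling division: elements per chunk
    return [sum(costs[i:i + size]) for i in range(0, len(costs), size)]


def makespan(costs, n_chunks):
    """Estimated running time if every chunk runs on its own core.

    All cores start together, so the work finishes only when the
    most expensive chunk is done.
    """
    return max(chunk_costs(costs, n_chunks))


# Hypothetical workload: eight elements, the last two are ten times
# as expensive to transform as the others.
costs = [1, 1, 1, 1, 1, 1, 10, 10]

per_chunk = chunk_costs(costs, 4)   # cost of each of the four chunks
slowest = makespan(costs, 4)        # time until the last core finishes
ideal = sum(costs) / 4              # time if the work were perfectly balanced

print(per_chunk, slowest, ideal)
```

Under these assumptions, three of the four cores finish after 2 units of work and then sit idle while the fourth works through 20 units, so the estimated running time is 20 units where a perfectly balanced split would need 6.5.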
If the application and/or the operating system has other processes to handle, ...