Grain size
The third argument, grainsize, specifies the number of iterations in a reasonably sized chunk to deal out to a processor. If the iteration space has more than grainsize iterations, parallel_for splits it into separate subranges that are scheduled separately.
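The splitting rule can be modeled in a few lines of standard C++. This is only a sketch of the idea, not the library's implementation: the names Subrange, split_range, and chunks are hypothetical, and the model assumes that a range is cut in half whenever it holds more than grainsize iterations and that splitting always runs to completion (a real scheduler may stop earlier).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model, not library code: a half-open subrange [first, second).
using Subrange = std::pair<std::size_t, std::size_t>;

// Recursively halve [begin, end) while it holds more than grainsize iterations.
// Each range that is not split further is recorded as one schedulable chunk.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<Subrange>& out) {
    assert(grainsize > 0);
    if (end - begin <= grainsize) {
        out.emplace_back(begin, end);   // small enough: keep as a single chunk
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    split_range(begin, mid, grainsize, out);
    split_range(mid, end, grainsize, out);
}

// Convenience wrapper: the chunks produced for the iteration space [0, n).
std::vector<Subrange> chunks(std::size_t n, std::size_t grainsize) {
    std::vector<Subrange> out;
    split_range(0, n, grainsize, out);
    return out;
}
```

Under this model, chunks(1000, 100) yields 16 subranges of 62 or 63 iterations each, while chunks(100, 100) yields a single subrange because the space does not exceed the grainsize. Note that halving produces chunks somewhere between half the grainsize and the grainsize, not chunks of exactly grainsize iterations.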
The grainsize amortizes parallel scheduling overhead. In common cases, a grainsize chosen independently of the number of processors keeps the parallel scheduling overhead in roughly constant proportion to the real work. This is because the packaging-and-handling overhead is relatively constant per grain and therefore does not depend on the number of processors.
The grainsize enables you to avoid excessive parallel overhead. A parallel loop construct incurs an overhead cost for every subrange. If the subranges are too small, the overhead may exceed the useful work. By specifying a grainsize, you can limit the overhead. The grainsize effectively sets a minimum threshold for parallelization: an iteration space with no more than grainsize iterations is not split at all.
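A back-of-the-envelope cost model makes the trade-off concrete. The sketch below is an illustration under stated assumptions, not a measurement: it assumes a fixed overhead per subrange, a fixed amount of useful work per iteration, and an idealized split into ceil(n / grainsize) subranges. The function names and the cost units are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Idealized number of subranges: ceil(n / grainsize).
std::size_t subrange_count(std::size_t n, std::size_t grainsize) {
    assert(grainsize > 0);
    return (n + grainsize - 1) / grainsize;
}

// Fraction of total time spent on scheduling overhead rather than useful work,
// assuming a constant cost per subrange and a constant cost per iteration.
// Requires n > 0 and positive costs so that the total is nonzero.
double overhead_fraction(std::size_t n, std::size_t grainsize,
                         double work_per_iteration,
                         double overhead_per_subrange) {
    assert(n > 0);
    const double overhead =
        static_cast<double>(subrange_count(n, grainsize)) * overhead_per_subrange;
    const double useful = static_cast<double>(n) * work_per_iteration;
    return overhead / (overhead + useful);
}
```

With made-up costs of 1 unit of work per iteration and 100 units of overhead per subrange, a loop of 1,000 iterations spends about 99% of its time on overhead at a grainsize of 1, and about 9% at a grainsize of 1,000. Real costs have to be measured; the model only shows why subranges that are too small let the overhead dominate.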
Figure 3-1 illustrates the impact of overhead by showing the useful work as lettered squares surrounded by the overhead of a grain of work (the darker surrounding areas). On the left, the problem is broken into four pieces (4X), and on the right, with a finer grain size, the problem is broken into 36 pieces (36X).

Figure 3-1. Packaging versus grain size, same workload
The total work to be done on the system is represented by the light and dark gray ...