Let's think back to our current program and the reason the locks keep it from running fast: every active thread interacts with the same shared counter, which can serve only one thread at a time. The solution is to isolate each thread's interactions with the counter. Specifically, the value we are keeping track of will no longer be represented by a single shared counter object; instead, we will use many local counters, one per thread/process, in addition to the shared global counter that we originally had.
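To make the structure concrete, here is a minimal sketch of one way the local and global counters could be arranged. It is not necessarily the implementation used later in this chapter: the names `ApproximateCounter`, `threshold`, `flush`, `worker`, and `run` are hypothetical, and the rule of folding a local count into the global counter every `threshold` increments is an assumption made for illustration.

```python
import threading


class ApproximateCounter:
    """Illustrative sketch: one shared global count plus one local count per thread."""

    def __init__(self, threshold=1000):
        self.threshold = threshold        # assumed flush interval (hypothetical parameter)
        self.global_value = 0             # the shared global counter
        self.lock = threading.Lock()      # guards global_value only
        self._local = threading.local()   # gives each thread its own private attributes

    def increment(self):
        # Common case: touch only this thread's local counter, with no locking.
        count = getattr(self._local, "count", 0) + 1
        if count >= self.threshold:
            # Rare case: fold the local count into the global counter under the lock.
            with self.lock:
                self.global_value += count
            count = 0
        self._local.count = count

    def flush(self):
        # Push whatever is left in the calling thread's local counter.
        count = getattr(self._local, "count", 0)
        if count:
            with self.lock:
                self.global_value += count
            self._local.count = 0


def worker(counter, n_increments):
    for _ in range(n_increments):
        counter.increment()
    counter.flush()  # make sure no local remainder is lost


def run(n_threads=4, n_increments=10_000, threshold=1000):
    counter = ApproximateCounter(threshold)
    threads = [
        threading.Thread(target=worker, args=(counter, n_increments))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.global_value
```

In this sketch, a thread acquires the lock only once per `threshold` increments rather than on every increment, which is where the reduction in contention would come from. Between flushes the global value lags behind the true total, so a read taken while threads are still running is only approximate.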
The basic idea behind this approach is to distribute the work (incrementing ...