Look for Hotspots
The first step was to use a performance analysis tool (the VTune Performance Analyzer in this case) to identify the hotspot in this library. A hotspot is a place where the program spends a lot of time. Sampling found one in the dynamics simulation engine, in the function dInternalStepFast.
dInternalStepFast is a solver function that works with two connected objects; in essence, it applies forces to them. It contains several loops that do compute-intensive work. A closer look at the code shows that all of these loops have data dependencies and that the work in each one is too fine-grained. Applying parallel_for there is therefore unlikely to give good scalability. When a hotspot discourages us in this way, the thing to do is look higher in the call tree. In this case, we wanted to see which function was calling dInternalStepFast.
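To make the data-dependency problem concrete, here is a minimal sketch of a loop with a loop-carried dependence. It is not the ODE code: the function integrate_velocity and its parameters are hypothetical, chosen only to show the kind of pattern that makes a plain parallel_for a poor fit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a fine-grained solver loop (not taken from ODE).
// Each iteration reads the value produced by the previous iteration, so
// iteration i cannot start until iteration i-1 has finished.
std::vector<double> integrate_velocity(const std::vector<double>& accel,
                                       double v0, double dt) {
    assert(dt > 0.0);
    std::vector<double> v(accel.size());
    double prev = v0;
    for (std::size_t i = 0; i < accel.size(); ++i) {
        prev = prev + accel[i] * dt;  // loop-carried dependence on prev
        v[i] = prev;
    }
    return v;
}
```

Running the iterations of this loop independently would give wrong results, and each iteration does so little work that scheduling overhead would likely dominate even if the dependence were removed.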
Returning to VTune, we used information from a call graph view to identify the higher-level function that calls dInternalStepFast. That function is dInternalIslandStepFast. The call graph timing information showed that dInternalIslandStepFast itself did not take much time to execute: it goes through the list of objects, computes the inertia tensor and rotational force for each one, and then calls dInternalStepFast for the object pairs.
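The general shape of what we are looking for can be sketched as follows: a caller that loops over work items and hands each one to a serial inner routine. This is an illustration only. The Island type and both functions are hypothetical, standard C++ threads stand in for a parallel loop, and the sketch assumes the items are independent of one another, which is something that has to be checked in the real code before parallelizing at any level.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical unit of coarse-grained work (not an ODE data structure).
struct Island {
    std::vector<double> accel;
    double v0;
    double result;
};

// Serial inner step: the loop carries a dependence through v, so it stays serial.
double step_island_serial(const Island& island, double dt) {
    double v = island.v0;
    for (double a : island.accel) v += a * dt;
    return v;
}

// Outer loop: each island is assumed independent, so islands are divided
// among threads. Thread t handles islands t, t + num_threads, and so on,
// and each thread writes only to the islands it owns.
void step_all_islands(std::vector<Island>& islands, double dt,
                      unsigned num_threads) {
    assert(dt > 0.0);
    if (num_threads == 0) num_threads = 1;
    const std::size_t n = islands.size();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&islands, dt, t, num_threads, n] {
            for (std::size_t i = t; i < n; i += num_threads)
                islands[i].result = step_island_serial(islands[i], dt);
        });
    }
    for (auto& w : workers) w.join();
}
```

The inner routine is left untouched. All of the parallelism comes from the caller, where each unit of work is large enough to amortize the cost of running it on another thread.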
Tip
Notice what we are doing: we are walking up the call graph to find as much parallelism as we can. The fact that we started our walk at a hotspot ...