
more than twice as fast as the estimate. Because ALU operation time cannot
be shortened, the only explanation for the superior performance is that the
two-level constraint solver has faster memory access, which is possible only
through better utilization of the cache hierarchy.
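The cache argument can be illustrated with a toy model. The sketch below is not the solver described in this chapter: it assumes a hypothetical chain of rigid bodies joined by pairwise constraints, gives each SIMD a small LRU cache of body data, and simply counts cache hits. The names `LRUCache`, `count_hits_global`, and `count_hits_local`, as well as the cache size and SIMD count, are assumptions made for illustration only. The first function scatters the constraints of each batch randomly over the SIMDs, and the second gives each SIMD one contiguous group of constraints to process in order.

```python
from collections import OrderedDict
import random


class LRUCache:
    """Tiny least-recently-used cache that holds body indices only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, key):
        """Touch a body; return True on a cache hit, False on a miss."""
        if key in self.lines:
            self.lines.move_to_end(key)
            return True
        self.lines[key] = None
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        return False


def make_chain(num_bodies):
    """Hypothetical test scene: body i is constrained to body i + 1."""
    return [(i, i + 1) for i in range(num_bodies - 1)]


def count_hits_global(constraints, num_simds, capacity, seed=0):
    """Two batches with no shared body inside a batch, one kernel per batch.
    Each constraint is assigned to a random SIMD in every kernel."""
    rng = random.Random(seed)
    caches = [LRUCache(capacity) for _ in range(num_simds)]
    batches = [constraints[0::2], constraints[1::2]]
    hits = 0
    for batch in batches:
        for a, b in batch:
            cache = caches[rng.randrange(num_simds)]
            hits += cache.access(a) + cache.access(b)
    return hits


def count_hits_local(constraints, num_simds, capacity):
    """Each SIMD owns one contiguous group of constraints and processes
    the whole group in order, so neighboring constraints share a body."""
    chunk = -(-len(constraints) // num_simds)  # ceiling division
    hits = 0
    for s in range(num_simds):
        cache = LRUCache(capacity)
        for a, b in constraints[s * chunk:(s + 1) * chunk]:
            hits += cache.access(a) + cache.access(b)
    return hits


if __name__ == "__main__":
    chain = make_chain(256)
    print("global hits:", count_hits_global(chain, num_simds=4, capacity=8))
    print("local hits: ", count_hits_local(chain, num_simds=4, capacity=8))
```

In this model the random assignment can hit only on whatever a SIMD happens to retain from the previous kernel, whereas the localized version finds one of the two bodies of almost every constraint already cached. The model ignores real cache line sizes, memory latency, and the actual batching scheme, so it indicates the direction of the effect and not its magnitude.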
Cache utilization is low for the global constraint solver because a rigid body is
processed only once in a kernel and the assignment of a constraint to a SIMD is
random for each kernel. Therefore, it cannot reuse any cached data from previous
kernel executions. In contrast, localized constraint solving and in-SIMD batch
dispatch of the two-level solver enable it to