6 Dealing in Practice with Memory Hierarchy Effects and Instruction Level Parallelism

In the first section, we study the memory disambiguation mechanisms of some high-performance processors. Such mechanisms, coupled with the load/store queues of out-of-order processors, are crucial for exploiting instruction-level parallelism (ILP), especially in memory-bound scientific codes. Ideal memory disambiguation is too complex to implement in hardware because it would require precise comparators over full memory addresses; microprocessors therefore implement simplified, imprecise mechanisms that perform only partial address comparisons. We study the impact of these simplifications on the sustained performance of some real high-performance processors. Despite all the advanced micro-architectural features of these processors, we demonstrate that memory address disambiguation can cause severe program performance loss. We show that, even when the data reside in the lowest cache levels and enough ILP exists, a program may run up to 21 times slower if no care is taken with the streams of accessed addresses it generates. We propose a possible software (compilation) workaround based on classical (and robust) load/store vectorization.
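As an illustration of the kind of access pattern involved, the sketch below (hypothetical C functions, not taken from the chapter; the names copy_interleaved and copy_vectorized and the grouping factor of 4 are our own assumptions) contrasts a copy loop whose loads and stores are interleaved with one whose loads and stores are grouped. When the two arrays are laid out so that corresponding elements share the same low-order address bits, an imprecise load/store queue that compares only those bits may flag false conflicts in the interleaved version; grouping the loads and the stores, in the spirit of load/store vectorization, avoids presenting such spuriously conflicting pairs back to back.

    #include <stddef.h>

    /* Interleaved version: each load from a[i] is immediately followed by a
     * store to b[i]. If a and b are allocated so that &a[i] and &b[i] share
     * the same low-order address bits (e.g. both arrays 4 KB aligned), an
     * imprecise load/store queue that compares only those bits may report
     * false conflicts and serialize the memory operations.                  */
    void copy_interleaved(const double *a, double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            double x = a[i];   /* load  */
            b[i] = x;          /* store: may falsely conflict with the next
                                  iteration's load in the queue             */
        }
    }

    /* Load/store "vectorized" version: loads are grouped together and stores
     * are grouped together, so back-to-back memory operations access
     * consecutive addresses and the partial-bit comparison no longer
     * produces spurious dependences.                                        */
    void copy_vectorized(const double *a, double *b, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            /* packet of loads */
            double x0 = a[i];
            double x1 = a[i + 1];
            double x2 = a[i + 2];
            double x3 = a[i + 3];
            /* packet of stores */
            b[i]     = x0;
            b[i + 1] = x1;
            b[i + 2] = x2;
            b[i + 3] = x3;
        }
        for (; i < n; i++)      /* remainder */
            b[i] = a[i];
    }

In practice such reordering would be carried out by the compiler back end during instruction scheduling; the source-level sketch only makes the intended instruction order explicit.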

In the second section, we study the optimization of cache effects at the instruction level for embedded very long instruction word (VLIW) processors. The introduction of caches inside processors provides micro-architectural means of reducing the memory gap by tolerating ...
