Appendix 6

Experimental Efficiency of Software Data Preloading and Prefetching for Embedded VLIW

For our experiments, we used a cycle-accurate simulator provided by STMicroelectronics. The astiss simulator can model a nonblocking cache. We fixed the number of miss information status holding registers (MSHR), that is, the size of the pending loads queue, at eight. We chose eight MSHR because, during experimentation, we observed that the instruction-level parallelism (ILP) and register pressure reach a limit at that setting; a larger number of MSHR does not yield more performance. We use a simulator for our experiments for several reasons:

– It is not easy to obtain a physical machine based on a very long instruction word (VLIW) ST231 processor. These processors are not sold for workstations; they are part of embedded systems such as mobile phones, DVD recorders, digital TVs, etc. Consequently, we do not have direct access to a workstation for our experiments.
– The ST231 processor has a blocking cache architecture, whereas we conduct our experimental study on a nonblocking cache. Only simulation allows us to consider a nonblocking cache.
– Our experimental study requires a precise performance characterization that is not possible with direct measurement on executions: the hardware performance counters of the ST231 do not allow us to characterize the processor stalls we are focusing on (stalls due to Dcache misses). Only simulation allows us to precisely measure the reasons for the ...
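
To illustrate why the number of MSHR matters for a nonblocking cache, here is a minimal toy sketch in Python. It is not the astiss simulator and does not model the ST231. The function name stall_cycles, the constant miss latency, and the issue pattern are all hypothetical choices made for illustration. The sketch counts only the cycles an in-order core waits because every MSHR is busy. It ignores ILP and register pressure, which are what limit the benefit in the experiments described above.

```python
import heapq


def stall_cycles(issue_gaps, miss_latency, num_mshr):
    """Count cycles lost because no MSHR is free (toy model, hypothetical).

    issue_gaps   -- cycles of independent work before each missing load issues
    miss_latency -- cycles needed to service one miss (assumed constant)
    num_mshr     -- number of misses the cache can track at the same time
    """
    pending = []  # min-heap of completion times of outstanding misses
    now = 0
    stalls = 0
    for gap in issue_gaps:
        now += gap
        # Free the registers of misses that have already completed.
        while pending and pending[0] <= now:
            heapq.heappop(pending)
        if len(pending) == num_mshr:
            # Every MSHR is busy: wait for the earliest miss to complete.
            ready = heapq.heappop(pending)
            stalls += ready - now
            now = ready
        heapq.heappush(pending, now + miss_latency)
    return stalls


if __name__ == "__main__":
    gaps = [1] * 16  # sixteen missing loads, one issued per cycle
    for n in (1, 2, 4, 8, 16):
        print(n, stall_cycles(gaps, miss_latency=10, num_mshr=n))
```

In this toy model, the stall count falls as registers are added and stops falling once enough misses can be outstanding to cover the miss latency. The point at which that happens depends entirely on the assumed latency and issue pattern, so it says nothing about the value of eight measured on the real benchmarks.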
