Appendix 6 Experimental Efficiency of Software Data Preloading and Prefetching for Embedded VLIW

For our experiments, we used a cycle-accurate simulator provided by STMicroelectronics. This simulator, astiss, can model a non-blocking cache. We fixed the number of miss information/status holding registers (MSHR), that is, the size of the pending loads queue, at eight. We chose eight because we observed during experimentation that the instruction-level parallelism (ILP) and the register pressure reach a limit at that setting; a larger number of MSHR does not yield more performance. We use a simulator for our experiments for several reasons:

– It is not easy to obtain a physical machine based on a very long instruction word (VLIW) ST231 processor. These processors are not sold in workstations; they are embedded in systems such as mobile phones, DVD recorders and digital TVs. Consequently, we do not have direct access to an ST231-based workstation for our experiments.

– The ST231 processor has a blocking cache architecture, while we conduct our experimental study on a non-blocking cache. Only simulation allows us to consider a non-blocking cache.

– Our experimental study requires a precise performance characterization that is not possible with direct measurement on real executions: the hardware performance counters of the ST231 do not allow us to characterize the processor stalls we are focusing on (stalls due to Dcache misses). Only simulation allows us to precisely measure the reasons for the ...
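
To make the role of the MSHR count in a non-blocking cache more concrete, here is a minimal toy model in Python. It is not astiss and does not reproduce the ST231 pipeline: the function name `simulate`, the fixed miss latency, the one-load-per-cycle issue rate and the assumption that every load misses are all hypothetical choices made for illustration only. The sketch counts how many cycles the processor stalls because no MSHR is free to track a new outstanding miss.

```python
import heapq


def simulate(num_loads, miss_latency, num_mshr):
    """Toy model of a non-blocking data cache (illustrative assumptions only).

    Every load misses, takes `miss_latency` cycles to complete, and needs one
    free MSHR to be issued. At most one load is issued per cycle.
    Returns (total_cycles, stall_cycles).
    """
    pending = []        # min-heap of completion cycles of in-flight misses
    cycle = 0
    stall_cycles = 0
    for _ in range(num_loads):
        # Release the MSHRs whose misses have completed by now.
        while pending and pending[0] <= cycle:
            heapq.heappop(pending)
        if len(pending) == num_mshr:
            # All MSHRs are busy: the processor stalls until one is released.
            free_at = heapq.heappop(pending)
            stall_cycles += free_at - cycle
            cycle = free_at
        heapq.heappush(pending, cycle + miss_latency)
        cycle += 1      # issuing the load takes one cycle
    # Wait for the last outstanding miss to complete.
    total_cycles = max(pending) if pending else cycle
    return total_cycles, stall_cycles


if __name__ == "__main__":
    # Sweep the number of MSHRs for an arbitrary latency of 10 cycles.
    for mshr in (1, 2, 4, 8, 16):
        total, stalls = simulate(num_loads=64, miss_latency=10, num_mshr=mshr)
        print(f"MSHR={mshr:2d}  total cycles={total:4d}  stall cycles={stalls:4d}")
```

In this toy model, adding MSHRs stops helping once there are enough of them to cover the miss latency at the assumed issue rate. The saturation point observed on real code depends on the available ILP and on register pressure, which this sketch does not model.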

Advanced Backend Optimization by Benoit de Dinechin, Sid Touati
