Chapter 12: High-Performance Optimizations on Tiled Manycore Embedded Systems: A Matrix Multiplication Case Study*

The scaling of complementary metal-oxide-semiconductor (CMOS) transistors into the nanometer regime makes it possible to integrate millions of transistors on a single chip. A major challenge for the computer industry is the efficient utilization of this ever-increasing number of on-chip transistors. Increasing the clock frequency and adding single-core architectural innovations, such as deep pipelines, out-of-order execution, and prefetching, to exploit instruction-level parallelism (ILP) and enhance single-thread performance yield diminishing returns as these techniques hit the power wall and the ILP wall [358]. Consequently, major segments of the computer industry have concluded that future performance improvements must largely come from increasing the number of on-chip processor cores.

The transition of the computer industry from single-core to multicore and subsequently manycore architectures necessitates efficient exploitation of thread-level parallelism (TLP) to attain high performance. The terms manycore and massively multicore are sometimes used to refer to multicore architectures with an especially large number of cores (tens or hundreds) [359, 360]. Manycore technologies aim to exploit concurrency, high computational density (CD), workload distribution, or a combination of these methods to attain high performance. The term high performance refers to attaining ...
