Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems

Computer Sci. and Eng. Dept, Florida Atlantic University, Boca Raton, FL, USA
Microprocessors and Microsystems (Impact Factor: 0.43). 08/2009; 33(5):388-397. DOI: 10.1016/j.micpro.2009.06.001
Source: DBLP


To satisfy the need for ever-increasing processing power, the design of modern computing systems has changed significantly: major chip vendors now deploy multicore and manycore processors across their product lines. Multicore architectures offer a tremendous amount of processing speed, but they also bring challenges for embedded systems, which suffer from limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements of different embedded systems. Normally, a level-1 cache (CL1) is dedicated to each core, whereas the level-2 cache (CL2) can be shared (as in the Intel Xeon and IBM Cell) or distributed (as in the AMD Athlon). In this paper, we investigate the impact of the CL2 organization (shared vs. distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use the VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that replacing a single-core system with an 8-core system reduces the mean delay per core by 64% for distributed CL2 and 53% for shared CL2, with little additional power (15% for distributed CL2 and 18% for shared CL2), for FFT. The results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered, and for other applications with similar code characteristics.
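To make the reported FFT figures concrete, the following minimal sketch normalizes the single-core system's mean delay per core and total power to 1.0 and applies the percentage changes quoted above; the normalization and the helper function are illustrative assumptions, only the percentages come from the abstract.

```python
# Illustrative arithmetic only: the single-core baseline is normalized to
# 1.0 for both metrics; the percentage figures are the FFT results quoted
# in the abstract (delay reduction, extra power for the 8-core system).

BASELINE_DELAY = 1.0   # single-core mean delay per core (normalized)
BASELINE_POWER = 1.0   # single-core total power (normalized)

def eight_core_metrics(delay_reduction_pct, power_increase_pct):
    """Return (mean delay per core, total power) relative to the single-core baseline."""
    delay = BASELINE_DELAY * (1 - delay_reduction_pct / 100)
    power = BASELINE_POWER * (1 + power_increase_pct / 100)
    return delay, power

# Distributed CL2: 64% less delay per core for 15% more power.
dist_delay, dist_power = eight_core_metrics(64, 15)
# Shared CL2: 53% less delay per core for 18% more power.
shared_delay, shared_power = eight_core_metrics(53, 18)
```

On these numbers the distributed organization dominates for FFT: it achieves both the lower relative delay (0.36 vs. 0.47) and the smaller power increase (1.15 vs. 1.18).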



Available from: Abu Asaduzzaman, Jan 29, 2014
    • "Computer hardware architecture has a crucial role to play in the performance aspects of the parallelization scheme. The size and manner in which caches are shared among cores [2] [31], and interconnect bandwidth [30] [14] and latencies [2] [8] [9] determine speed-up and scalability. However, these issues are beyond the scope of this study. "
    ABSTRACT: We discuss the computational bottlenecks in molecular dynamics (MD) and describe the challenges in parallelizing the computation-intensive tasks. We present a hybrid algorithm using MPI (Message Passing Interface) with OpenMP threads for parallelizing a generalized MD computation scheme for systems with short range interatomic interactions. The algorithm is discussed in the context of nano-indentation of Chromium films with carbon indenters using the Embedded Atom Method potential for Cr–Cr interaction and the Morse potential for Cr–C interactions. We study the performance of our algorithm for a range of MPI–thread combinations and find the performance to depend strongly on the computational task and load sharing in the multi-core processor. The algorithm scaled poorly with MPI and our hybrid schemes were observed to outperform the pure message passing scheme, despite utilizing the same number of processors or cores in the cluster. Speed-up achieved by our algorithm compared favourably with that achieved by standard MD packages.
    Full-text · Article · Jan 2013 · Journal of Parallel and Distributed Computing
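The hybrid MPI/OpenMP scheme described in the abstract above can be sketched, in spirit, as a two-level domain decomposition: atoms are first block-partitioned across MPI ranks, and each rank's block is further split among its threads. The Python below is a hypothetical illustration of that partitioning logic only (the cited work uses actual MPI processes and OpenMP threads, not this code).

```python
def block_partition(n_items, n_parts):
    """Split n_items as evenly as possible into n_parts contiguous index ranges."""
    base, extra = divmod(n_items, n_parts)
    ranges, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def hybrid_decomposition(n_atoms, n_ranks, n_threads):
    """Two-level split: atoms -> MPI ranks -> OpenMP threads within each rank."""
    per_rank = block_partition(n_atoms, n_ranks)
    return [
        [(lo + a, lo + b) for (a, b) in block_partition(hi - lo, n_threads)]
        for (lo, hi) in per_rank
    ]

# e.g. 10 atoms over 2 ranks x 2 threads:
# rank 0 holds atoms [0, 5), its threads get [0, 3) and [3, 5);
# rank 1 holds atoms [5, 10), its threads get [5, 8) and [8, 10).
```

The point the abstract makes is that the best (n_ranks, n_threads) split for a fixed core count is workload-dependent, which is why the authors benchmark a range of MPI-thread combinations.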
    • "Recent studies show that the performance of shared and private CL2 organizations depends significantly on the workload. In multi-core systems, stressful workloads with little sharing may perform better in private CL2 organizations (as they reduce the access to contended interconnects and shared cache banks); while lighter workloads with a higher degree of sharing may perform better under a shared CL2 organization [3] [4]. "
    ABSTRACT: We investigate the impact of level-1 cache (CL1) parameters, level-2 cache (CL2) parameters, and cache organizations on the power consumption and performance of multi-core systems. We simulate two 4-core architectures - both with private CL1s, but one with shared CL2 and the other one with private CL2s. Simulation results with MPEG4, H.264, matrix inversion, and DFT workloads show that reductions in total power consumption and mean delay per task of up to 42% and 48%, respectively, are possible with optimized CL1s and CL2s. Total power consumption and the mean delay per task depend significantly on the applications including the code size and locality.
    Full-text · Conference Paper · Jan 2011
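The kind of CL1/CL2 parameter sweep described in the abstract above can be sketched as a grid search over candidate cache sizes scored by a combined power/delay cost. Everything in this sketch is a made-up illustration: the cost model, its coefficients, and the weighting are hypothetical and do not come from the cited paper.

```python
import itertools

# Hypothetical cost model for sweeping cache parameters. Larger caches cost
# more (static) power but reduce misses, so delay falls; the weights trade
# the two objectives off. None of these numbers are from the cited work.

def cost(cl1_kb, cl2_kb, w_power=0.5, w_delay=0.5):
    power = 0.1 * cl1_kb + 0.01 * cl2_kb      # made-up linear power model
    delay = 100.0 / cl1_kb + 1000.0 / cl2_kb  # made-up miss-cost model
    return w_power * power + w_delay * delay

def best_config(cl1_options, cl2_options):
    """Exhaustively evaluate every (CL1, CL2) pair and return the cheapest."""
    return min(itertools.product(cl1_options, cl2_options),
               key=lambda cfg: cost(*cfg))
```

With application-specific hit-rate data in place of the toy models, this is the shape of the optimization behind the "up to 42% power and 48% delay reduction with optimized CL1s and CL2s" result quoted above.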
    • "While there is general consensus on private L1 cache organization, for L2 there is still not a dominant paradigm, unlike in high-performance general-purpose processors. [5] clearly shows that trade-off analysis of L2 on-chip cache architectures for embedded MPSoCs is a hot topic in the computer architecture community. In that paper, the authors perform a complete study to try to determine the best L2 cache architecture in a multi-core system."
    ABSTRACT: On-chip memory organization is one of the most important aspects that can influence overall system behavior in multi-processor systems. Following the trend set by high-performance processors, high-end embedded cores are moving from single-level on-chip caches to a two-level on-chip cache hierarchy. Whereas in the embedded world there is general consensus on private L1 caches, for L2 there is still no dominant architectural paradigm. Cache architectures that work for high-performance computers turn out to be inefficient for embedded systems (mainly due to power-efficiency issues). This paper presents a virtual platform for design space exploration of L2 cache architectures in low-power Multi-Processor Systems-on-Chip (MPSoCs). The tool contains several L2 cache templates, and new architectures can be easily added using our flexible plugin system. Given a set of constraints for a specific system (power, area, performance), our tool performs extensive exploration to find the cache organization that best suits our needs. Through practical experiments, we show how it is possible to select the optimal L2 cache, and how this kind of tool can help designers avoid some common misconceptions. Benchmarking results in the experiments section show that, for a case study with multiple processors running communicating tasks allocated on different cores, the private L2 cache organization still performs better than the shared one.
    Full-text · Article · Jan 2010
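The shared-vs-private L2 trade-off running through all of the works above can be summarized with a simple average-access-time model: a shared L2 pools capacity (higher hit rate) but adds contention stalls, while private L2s are uncontended but smaller per core. The sketch below is purely illustrative; every rate and latency is a made-up number, not a measurement from any of the cited papers.

```python
# Hypothetical average L2 access time model. All values are illustrative
# assumptions chosen to show the trade-off, not data from the cited works.

def amat_l2(hit_rate, l2_latency, mem_latency, contention=0.0):
    """Average access time: L2 hit cost (plus any contention stall) or miss to memory."""
    return hit_rate * (l2_latency + contention) + (1 - hit_rate) * mem_latency

# Shared L2: larger effective capacity (higher hit rate) but contended banks.
shared = amat_l2(hit_rate=0.92, l2_latency=12, mem_latency=100, contention=10)
# Private L2: smaller per-core capacity (lower hit rate), no contention.
private = amat_l2(hit_rate=0.88, l2_latency=12, mem_latency=100)
# With these made-up inputs the private organization comes out ahead,
# matching the qualitative conclusion of the papers above; a workload with
# heavy data sharing could easily tip the balance the other way.
```

This is why the surveyed results disagree on a single winner: the outcome depends on how much the workload's hit-rate gain from pooling outweighs its contention cost.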