Conference Paper

A fully static scheduling approach for fast cycle accurate systemC simulation of MPSoCs

LIP6/UPMC, Univ. Pierre et Marie Curie, Paris
DOI: 10.1109/ICM.2007.4497671 Conference: Microelectronics, 2007. ICM 2007. International Conference on
Source: IEEE Xplore

ABSTRACT This paper presents principles and tools to facilitate the design and modeling of multi-processor systems on chip (MPSoCs), and to speed up cycle-accurate SystemC simulation. We describe an effective way to build a virtual prototype of a hardware architecture, using a library of SystemC simulation models based on communicating synchronous finite state machines. This modeling approach supports a fully static scheduling strategy, based on the analysis of the combinational dependency graph. Our static scheduling algorithm has been implemented in the SystemCASS simulator, and provides a speed-up of one order of magnitude over the standard event-driven SystemC simulation engine. The modeling approach proposed in this paper has been adopted by the SoCLIB French National Project, an open modeling and simulation platform for multi-processor systems on chip.
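As a rough illustration of what scheduling from a combinational dependency graph can look like, the C++ sketch below orders combinational functions once, before simulation starts, using a topological sort (Kahn's algorithm). It is not the SystemCASS algorithm, which the abstract does not detail. The function name `static_schedule`, the integer node identifiers, and the edge convention are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper (not taken from SystemCASS): computes a fixed evaluation
// order for `n` combinational functions. An edge {a, b} means "b reads, within
// the same cycle, a signal that a writes", so a must be evaluated before b.
std::vector<std::size_t> static_schedule(
    std::size_t n,
    const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    std::vector<std::vector<std::size_t>> successors(n);
    std::vector<std::size_t> pending(n, 0);  // unscheduled predecessors per node
    for (const auto& e : edges) {
        assert(e.first < n && e.second < n);
        successors[e.first].push_back(e.second);
        ++pending[e.second];
    }

    // Kahn's algorithm; a min-heap keeps the resulting order deterministic.
    std::priority_queue<std::size_t, std::vector<std::size_t>,
                        std::greater<std::size_t>> ready;
    for (std::size_t v = 0; v < n; ++v)
        if (pending[v] == 0) ready.push(v);

    std::vector<std::size_t> order;
    order.reserve(n);
    while (!ready.empty()) {
        const std::size_t v = ready.top();
        ready.pop();
        order.push_back(v);
        for (std::size_t s : successors[v])
            if (--pending[s] == 0) ready.push(s);
    }

    // Leftover nodes mean a combinational loop: no static order exists.
    if (order.size() != n)
        throw std::runtime_error("combinational loop detected");
    return order;
}
```

Under these assumptions, a simulator would call `static_schedule` once at elaboration time and then evaluate the functions in the returned order on every cycle, with no event queue at run time.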

  • ABSTRACT: Nowadays, single-chip cache-coherent multi-cores with up to 100 cores are a reality, and many-cores with hundreds of cores are planned in the near future. This technological shift undertaken by the high-end computer industry is converging with the design motivations of other domains such as the embedded and HPC industries. In this paper, we propose to investigate the scalability of the same four unmodified, shared-memory, image and signal processing oriented parallel applications on two targets: (i) embedded - TSAR, a single-chip, 256-core, Cycle-Accurate-Bit-Accurate simulated, cc-NUMA many-core; and (ii) high-end - an AMD Opteron Interlagos, 64-core, cc-NUMA many-core. Besides our scalability results on both cc-NUMA targets, our contributions include two operating system mechanisms: (i) a distributed, client/server based scheduler design allowing the kernel to offer scalable inter-thread synchronization mechanisms; and (ii) a kernel-level memory affinity technique named Auto-Next-Touch allowing the kernel to transparently and automatically migrate physical pages in order to enforce the locality of threads' memory accesses. Although these two mechanisms are implemented and evaluated in ALMOS (Advanced Locality Management Operating System) running on the TSAR target, they remain applicable to other shared-memory operating systems.
    Design and Architectures for Signal and Image Processing (DASIP), 2012 Conference on; 01/2012
  • ABSTRACT: Cascade is a cycle-based C++ simulation infrastructure used in the design and verification of two successive versions of Anton, a specialized machine designed for high-speed molecular dynamics computation. Cascade was engineered to address the size and speed challenges inherent in simulating massively parallel special-purpose machines. It provides a lightweight programming interface, rich debugging support, tight Verilog integration, fast multithreaded execution, and low memory overhead. Here, we describe the core features of Cascade that proved most valuable for our simulation efforts.
    50th Design Automation Conference; 06/2013
  • ABSTRACT: Nowadays, single-chip cache-coherent multi-cores with up to 100 cores are a reality. Many-cores with hundreds of cores are planned in the near future. Due to the large number of cores and for power efficiency reasons (performance per watt), cores become simpler, with small caches. To make efficient use of the parallelism offered by these architectures, applications must be multi-threaded. The POSIX Threads (PThreads) standard is the most portable way to use threads across operating systems. It is also used as a low-level layer to support other portable, shared-memory, parallel environments like OpenMP. In this paper, we propose to verify experimentally the scalability of shared-memory, PThreads-based applications on a Cycle-Accurate-Bit-Accurate (CABA) simulated 512-core platform. Using two unmodified, highly multi-threaded applications, SPLASH-2 FFT and EPFilter (a medical image noise-filtering application provided by Philips), our study shows a scalability limitation beyond 64 cores for FFT and 256 cores for EPFilter. Based on hardware event counters, our analysis shows that: (i) the detected scalability limitation is a conceptual problem related to the notion of thread and process; and (ii) the small per-core caches found in many-cores exacerbate the problem. Finally, we present our solution in principle and future work.
