Conference Paper

VIPPE: Native simulation and performance analysis framework for multi-processing embedded systems

Article
This paper presents our performance profiling and the optimizations made to the program presented in the MEMOCODE SW design contest to make it faster when running on the Raspberry Pi platform.
Conference Paper
Recent work has proposed two-phase joint analytical and simulation-based design space exploration (JAS-DSE) approaches. In such approaches, a first analytical phase relies on static performance estimation and on either exhaustive or heuristic search to perform a very fast filtering of the design space. Then, a second phase obtains the Pareto solutions after an exhaustive simulation of the solutions found to be compliant by the analytical phase. However, the capability of such approaches to find solutions close to the actual Pareto set at a reasonable time cost is compromised by current system complexities. This limitation arises because such approaches do not support heuristic exploration in the simulation-based phase. Adding such support is not straightforward because, in the second phase, the heuristic is constrained to consider only the custom set of solutions found in the first phase. This set is in general unconnected and irregularly distributed, which prevents the application of existing heuristics. As a solution, this paper provides a novel search heuristic called ARS (Adaptive Random Sampling). The ARS strategy enables heuristic search in both phases of the JAS-DSE flow by making it applicable in the second phase, regardless of the type of performance estimation done at each phase. Moreover, it enables the definition of N-phase DSE flows. The paper shows, in an experiment focused on predictable multi-core systems, how this enhanced JAS-DSE is capable of finding more efficient solutions and of tuning the trade-off between exploration time and accuracy in finding actual Pareto solutions.
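The baseline two-phase flow described in this abstract (analytical filtering followed by exhaustive simulation and Pareto selection) can be illustrated with a minimal sketch. It is not the ARS heuristic and is not taken from the paper: the function names, the assumption that all objectives are minimized, and the callbacks `analytical_ok` and `simulate` are hypothetical placeholders.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Cost = Tuple[float, ...]  # one value per objective; lower is better (assumption)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs: List[Cost]) -> List[Cost]:
    """Return the non-dominated cost vectors, keeping input order."""
    return [c for c in costs if not any(dominates(o, c) for o in costs)]


def two_phase_dse(
    designs: List[Hashable],
    analytical_ok: Callable[[Hashable], bool],  # hypothetical fast static check
    simulate: Callable[[Hashable], Cost],       # hypothetical slow, accurate estimate
) -> List[Hashable]:
    # Phase 1: cheap analytical filtering of the whole design space.
    candidates = [d for d in designs if analytical_ok(d)]
    # Phase 2: exhaustive simulation of the surviving candidates only.
    results = {d: simulate(d) for d in candidates}
    front = set(pareto_front(list(results.values())))
    return [d for d in candidates if results[d] in front]
```

In this sketch the cost of the second phase grows with the number of candidates that survive the first phase, which is the scalability limit the abstract attributes to exhaustive simulation.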
Data
The growing complexity of embedded applications currently causes a trend towards multi-core processors in the embedded domain. Time-consuming detailed simulations make the design of such systems increasingly sophisticated. In this work, the applicability of Parallel Discrete Event Simulation (PDES) in the context of cycle-accurate Multi-Processor System-on-Chip (MPSoC) simulation is investigated on the Single-chip Cloud Computer (SCC) from Intel. The presented strategy targets asynchronous parallel model execution where only adjacent model partitions need to synchronize with each other in order to advance in simulation time. The performance of the approach is evaluated by means of a scalable cycle-accurate MPSoC model called HeMPS. For an 8x8 RTL model, measurements reveal a speedup of 25.3x versus sequential RTL simulation. When the RTL processing elements are replaced by cycle-accurate simulators, a speedup of 56.3x versus sequential RTL simulation is obtained. These results promise good suitability of the asynchronous strategy for detailed parallel MPSoC simulation on an architecture like the SCC.
Article
The case for developing and using virtual platforms (VPs) has now been made. If developers of complex HW/SW systems are not using VPs for their current design, the complexity of next-generation designs demands their adoption. In addition, the users of these complex systems are asking for either virtual or real platforms in order to develop and validate the software that runs on them, in context with the hardware that is used to deliver some of the functionality. Debugging the erroneous interactions of events and state in a modern platform when things go wrong is hard enough on a VP; on a real platform (such as an emulator or FPGA-based prototype) it can become impossible unless a new level of sophistication is offered. The priority now is to ensure that the capabilities of these platforms meet the requirements of every application domain for electronics and software-based product design, and that all the use cases are satisfied. A key requirement is to keep pace with Moore's Law and the ever-increasing embedded SW complexity by providing novel simulation technologies in every product release. This paper summarizes a special session focused on the latest applications and latest use cases for VPs. It gives an overview of where this technology is going and the impact on complex system design and verification.
Article
This work attempts to provide insight into the problem of executing discrete event simulation in a distributed fashion. The article serves as the state of the art in Parallel Discrete-Event Simulation (PDES) by surveying existing algorithms and analyzing the merits and drawbacks of various techniques. We discuss the main characteristics of existing synchronization methods for parallel and distributed discrete event simulation. The two major categories of synchronization protocols, namely conservative and optimistic, are introduced and various approaches within each category are presented. We also present the latest efforts towards PDES on emerging platforms such as heterogeneous multicore processors and Web services, as well as Grid and Cloud environments.
Chapter
Design Space Exploration for complex, multi-processor embedded systems demands new modeling, simulation, and performance estimation tools and design methodologies. Recently approved as the IEEE 1666 standard, SystemC has proven to be a powerful language for system modeling and simulation. This chapter presents M3-SCoPE: a SystemC framework for platform modeling, SW source-code behavioral simulation and performance estimation of multi-processor embedded systems. Using M3-SCoPE, the application SW running on the different processors of the platform can be simulated efficiently in close interaction with the rest of the platform components. In this way, fast and sufficiently accurate performance metrics of the system are obtained. These metrics are then delivered to the Design Space Exploration (DSE) tools to evaluate the quality of the different configurations in order to select the best ones.
Conference Paper
From a single SoC to a network of embedded devices communicating with a backend cloud-computing server, emerging classes of embedded systems feature an increasing number of heterogeneous components that operate concurrently in a distributed environment. As the scale and complexity of these systems continue to grow, there is a critical need for scalable and efficient simulators. We propose a networked virtual platform as a scalable environment for modeling and simulation. The goal is to support the development and optimization of embedded computing applications by handling heterogeneity at the chip, node, and network level. To illustrate the properties of our approach, we present two very different case studies: the design of an Open MPI scheduler for a heterogeneous distributed embedded system and the development of an application for crowd estimation through the analysis of pictures uploaded from mobile phones.
Conference Paper
With traditional cycle-accurate or instruction-set simulations of processors often being too slow, host-compiled or source-level software execution approaches have recently become popular. Such high-level simulations can achieve order-of-magnitude speedups, but approaches that can achieve highly accurate characterization of both power and performance metrics are lacking. In this paper, we propose a novel host-compiled simulation approach that provides close to cycle-accurate estimation of energy and timing metrics in a retargetable manner, using flexible, architecture description language (ADL) based reference models. Our automated flow considers typical front- and back-end optimizations by working at the compiler-generated intermediate representation (IR). Path-dependent execution effects are accurately captured through pairwise characterization and back-annotation of basic code blocks with all possible predecessors. Results from applying our approach to PowerPC targets running various benchmark suites show that close to native average speeds of 2000 MIPS at more than 98% timing and energy accuracy can be achieved.
Article
Integration of multiple heterogeneous processors into a single system-on-a-chip is a clear trend in embedded devices. Designing and verifying these devices requires high-speed and easy-to-build simulation platforms. Among the software simulation approaches, native simulation is a good candidate since the embedded software is executed natively on the host machine, and no instruction set simulator development effort is necessary. However, in existing native simulation approaches the simulated software shares the memory space of the modeled hardware modules and the host operating system, making it impractical to support legacy code running on the target platform. To overcome this issue, which is seldom mentioned in the literature, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator. For this, we exploit the hardware-assisted virtualization technology now available on most general-purpose processors. Experiments show that this solution does not degrade the native simulation speed, while keeping the ability to accomplish software performance evaluation.
Article
Designing and programming future heterogeneous MPSoCs that include thousands of processor cores is a hard challenge. Means are necessary to program and simulate the dynamic behavior of such systems in order to dimension the hardware design and to verify the software functionality as well as performance goals. Cycle-accurate simulation of multiple parallel applications simultaneously running on different cores of the architecture would be much too slow and is not the desired level of detail. In this paper, we therefore present a novel high-level simulation approach which tackles the complexity and the heterogeneity of such systems and enables the investigation of a new computing paradigm called invasive computing. Here, the workload and its distribution are not known at compile-time but are highly dynamic and have to be adapted to the status (load, temperature, etc.) of the underlying architecture at run-time. We propose an approach for the modeling of tiled MPSoC architectures and the simulation of resource-aware programming concepts on these. This approach delivers important timing information about the parallel execution and also takes into account the computational properties of possibly different types of cores.
Article
In this paper, we present a fast cycle-accurate instruction set simulator (CA-ISS) for system-on-chip development based on QEMU and SystemC. Even though most state-of-the-art commercial tools have tried very hard to provide all the levels of detail needed to satisfy the different requirements of the software designer, the hardware designer, and even the system architect, the hardware/software co-simulation speed is dramatically slow when co-simulating the hardware models at the register-transfer level (RTL) with a full-fledged operating system (OS). Our experimental results show that the combination of QEMU and SystemC can make the co-simulation at the CA level much faster than the conventional RTL simulation, even with a full-fledged operating system up and running. Furthermore, the statistics indicate that with every instruction executed and every memory accessed since power-on traced at the CA level, it takes 28m15.804s on average to boot up a full-fledged Linux kernel, even on a personal computer. Compared to the kernel boot time reported by Xilinx and SiCortex, the proposed CA-ISS is about 6.09 times faster than “SystemC without trace” of Xilinx and about 30.32 times faster than “SystemC models converted from RTL” of SiCortex. The main contributions of this paper are threefold: 1) a hardware/software co-simulation environment capable of running a full-fledged OS at the early stage of the electronic system level design flow at an acceptable simulation speed is proposed; 2) a virtual platform constructed using the proposed CA-ISS as the processor model can be used to estimate the performance of a target system from a system perspective, which all the previous works, such as QEMU-SystemC, do not provide; and 3) such a virtual platform also provides the modeling capability from the transaction level down to the CA level or the other way around.
Conference Paper
Due to the growing complexity of multiprocessor systems-on-chip (MPSoCs), there is an increasing demand for efficient design space exploration techniques. In addition to the analysis of diverse hardware architectures, these techniques should assist the designer in the flexible evaluation of various scheduling policies and application mappings while taking effects of the shared on-chip communication infrastructure into account. Most available simulation approaches are either unable to cover all these aspects jointly or have poor simulation performance. In this paper, we present a framework for timing analysis of MPSoC architectures using abstract and yet accurate traces. The traces capture both precise processing latencies and memory access patterns and represent application- and OS-related workload. Performance estimation is performed by an interleaved execution of the traces on a highly configurable multiprocessor platform modeled in our trace-driven SystemC TLM simulator. Using the flexible scheduler model presented in this paper, various mappings and scheduling policies can be rapidly evaluated while considering on-chip interconnect contention and usage of shared resources. Due to the abstraction of the trace-driven simulations, the proposed framework allows for both fast and accurate explorations of MPSoC design alternatives.