Conference Paper

SIMULTime: Context-sensitive timing simulation on intermediate code representation for rapid platform explorations


Abstract

Nowadays, product lines are common practice in the embedded systems domain, as they allow for substantial reductions in development costs and time-to-market through the consistent application of design paradigms such as variability and structured reuse management. In that context, accurate and fast timing predictions are essential for an early evaluation of all relevant variants of a product line with respect to target platform properties. Context-sensitive simulations provide attractive benefits for timing analysis. Nevertheless, these simulations depend strongly on a single pair of compiler and hardware platform configurations. To cope with this limitation, we present SIMULTime, a new technique for context-sensitive timing simulation based on the software intermediate representation. Simulation throughput increases significantly by simultaneously simulating different hardware platforms and compiler configurations: multiple accurate timing predictions are produced by running the simulator only once. Our novel approach was applied to several applications, showing that SIMULTime increases the average simulation throughput by 90% when at least four configurations are analyzed in parallel.
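To make the core idea concrete, here is a minimal C++ sketch (not the authors' implementation; the configuration names, block IDs and cycle counts are invented) of how a single functional execution trace over IR basic blocks can update cycle counters for several compiler/platform configurations at once:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Configuration {
    std::string name;                  // hypothetical label, e.g. "cortex-m4_O2"
    std::vector<uint64_t> blockCycles; // cycles per IR basic block for this config
    uint64_t total;                    // accumulated timing prediction
};

int main() {
    // Two invented (platform, compiler) configurations, four IR blocks each.
    std::vector<Configuration> configs = {
        {"cortex-m4_O2", {12, 7, 30, 5}, 0},
        {"cortex-a9_O3", { 8, 4, 21, 3}, 0},
    };

    // Basic blocks visited by one functional execution of the IR.
    const std::vector<int> executedBlocks = {0, 1, 2, 1, 2, 3};

    // A single pass over the executed-block trace serves every configuration.
    for (int bb : executedBlocks)
        for (auto& cfg : configs)
            cfg.total += cfg.blockCycles[bb];

    for (const auto& cfg : configs)
        std::cout << cfg.name << ": " << cfg.total << " cycles\n";
}

The point is that the expensive part, determining which blocks execute, happens only once, while adding a further configuration only adds one table lookup per executed block.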


... Finally, a third level of simulation abstraction is based on the source code intermediate representation (IR) commonly internal to compilers. An example has been presented in [8], where the authors rely on the LLVM Compiler Infrastructure [9]. Similarly to binary simulations, context-sensitive IR simulations need an IR execution engine that determines the program flow path and accordingly updates the timing estimate. ...
... The new methodology we propose in this paper belongs to the category of context-sensitive IR simulations and is inspired by the simulation approach previously presented in [8]. The authors of that work implemented a timing simulator, SIMULTime, based on the LLVM IR interpreter execution engine, for evaluating multiple SoC and compiler-optimization configurations in parallel. ...
Conference Paper
Fast and accurate predictions of a program’s execution time are essential during the design space exploration of embedded systems. In this paper, we present a novel approach for efficient context-sensitive timing simulations based on the LLVM IR code representation. Our approach allows evaluating multiple hardware platform configurations simultaneously with only one simulation run. State-of-the-art solutions are improved by speeding up the simulation throughput relying on the fast LLVM IR JIT execution engine. Results show, on average, over 94% prediction accuracy and a speedup of 200 times compared to interpretive simulations. The simulation performance reaches up to 300 MIPS when one HW configuration is assessed and grows to 1 GIPS when four configurations are evaluated in parallel. Additionally, we show that our approach can be utilized for producing early timing estimations that support designers in mapping a system to heterogeneous hardware platforms.
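A minimal sketch of the hook mechanism such a JIT-based approach implies (an assumption for illustration, not the paper's code; the hook name, block costs and program are invented): an instrumentation pass would insert a call to a C-linkage timing hook at the start of every IR basic block before the module is JIT-compiled, so the program runs at near-native speed while the hook does the cycle bookkeeping. Here the JIT-compiled program is stood in for by an ordinary function:

#include <cstdint>
#include <iostream>

static uint64_t gCycles = 0;

// Per-block cycle costs for one hardware configuration (invented numbers).
static const uint64_t kBlockCycles[] = {10, 25, 4};

// Call inserted by the (hypothetical) instrumentation pass at block entry.
extern "C" void timing_hook(int blockId) { gCycles += kBlockCycles[blockId]; }

// Stand-in for the instrumented, natively executed program.
void instrumentedProgram() {
    timing_hook(0);
    for (int i = 0; i < 100; ++i) timing_hook(1);  // loop body block
    timing_hook(2);
}

int main() {
    instrumentedProgram();
    std::cout << "predicted cycles: " << gCycles << "\n";
}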
... For the presented framework, we decided to produce the necessary timing information via SIMULTime [5]. SIMULTime is a hybrid timing analysis tool that combines static timing analysis (STA) and measurement-based timing analysis (MBTA). ...
Conference Paper
In this paper we present MODELTime, a fully automated framework that enables the consideration of target platform-dependent timing during the model-based development of embedded applications using MATLAB/Simulink simulations. MODELTime extends the fast functional evaluation capabilities of MATLAB/Simulink with a fully integrated evaluation of timing, which is a non-functional property. The proposed framework enables the fast and accurate exploration and visualization of timing properties for embedded processor platforms by automatically annotating timing properties using a bottom-up injection of context-sensitive timing behavior. Finally, we show the utility of the proposed approach by analyzing an automotive case study consisting of an air-fuel ratio control system.
Article
Full-text available
Simulators and empirical profiling data are often used to understand how suitable a specific hardware architecture is for an application. However, simulators can be slow, and empirical profiling-based methods can only provide insights about the existing hardware on which the applications are executed. While the insights obtained in this way are valuable, such methods cannot be used to evaluate a large number of system designs efficiently. Analytical performance evaluation models are fast alternatives, particularly well-suited for system design-space exploration. However, to be truly application-specific, they need to be combined with a workload model that captures relevant application characteristics. In this paper we introduce PISA, a framework based on the LLVM infrastructure that is able to generate such a model for sequential and parallel applications by performing hardware-independent characterization. Characteristics such as instruction-level parallelism, memory access patterns and branch behavior are analyzed per thread or process during application execution. To illustrate the potential of the framework, we provide a detailed characterization of a representative benchmark for graph-based analytics, Graph 500. Finally, we analyze how the properties extracted with PISA across Graph 500 and SPEC CPU2006 applications compare to measurements performed on x86 and POWER8 processors.
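As a toy illustration of hardware-independent characterization in this spirit (not PISA's actual trace format or metric set; the opcodes and the trace are invented), a dynamic instruction trace can be reduced to an instruction mix and a branch taken rate, properties that do not depend on any particular target microarchitecture:

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct TraceEntry {
    std::string opcode;  // e.g. "load", "add", "br" (invented categories)
    bool branchTaken;    // only meaningful for branch entries
};

int main() {
    // Invented dynamic instruction trace of one thread.
    const std::vector<TraceEntry> trace = {
        {"load", false}, {"add", false}, {"br", true},
        {"load", false}, {"mul", false}, {"br", false}, {"br", true},
    };

    std::map<std::string, int> mix;  // opcode -> dynamic count
    int branches = 0, taken = 0;
    for (const auto& e : trace) {
        ++mix[e.opcode];
        if (e.opcode == "br") {
            ++branches;
            if (e.branchTaken) ++taken;
        }
    }

    for (const auto& entry : mix)
        std::cout << entry.first << ": " << entry.second << "\n";
    std::cout << "branch taken rate: "
              << (branches ? 100.0 * taken / branches : 0.0) << "%\n";
}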
Article
Full-text available
The static estimation of the energy consumed by program executions is an important challenge, which has applications in program optimization and verification, and is instrumental in energy-aware software development. Our objective is to estimate such energy consumption in the form of functions on the input data sizes of programs. We have developed a tool for experimentation with static analysis which infers such energy functions at two levels, the instruction set architecture (ISA) and the intermediate code (LLVM IR) levels, and reflects it upwards to the higher source code level. This required the development of a translation from LLVM IR to an intermediate representation and its integration with existing components, a translation from ISA to the same representation, a resource analyzer, an ISA-level energy model, and a mapping from this model to LLVM IR. The approach has been applied to programs written in the XC language running on XCore architectures, but is general enough to be applied to other languages. Experimental results show that our LLVM IR level analysis is reasonably accurate (less than 6.4% average error vs. hardware measurements) and more powerful than analysis at the ISA level. This paper provides insights into the trade-off of precision versus analyzability at these levels.
Article
Full-text available
The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Article
Full-text available
Today, designers of next-generation embedded processors and software are increasingly faced with short product lifetimes. The resulting time-to-market constraints are contradicting the continually growing processor complexity. Nevertheless, an extensive design-space exploration and product verification is indispensable for a successful market launch. In the last decade, instruction-set simulators have become an essential development tool for the design of new programmable architectures. Consequently, the simulator performance is a key factor for the overall design efficiency. Motivated by the extremely poor performance of commonly used interpretive simulators, research work on fast compiled instruction-set simulation was started ten years ago. However, due to the restrictiveness of the compiled technique, it has not been able to push through in commercial products. In this paper, we tie up with our previous research on retargetable, compiled simulation techniques, and provide a discussion about their benefits and limitations using a particular compiled scheme, static scheduling, as an example. As a conclusion, we eventually present a novel retargetable simulation technique, which combines the performance of traditional compiled simulators with the flexibility of interpretive simulation. This technique is not limited to any class of architectures or applications and can be utilized from architecture exploration up to end-user software development. We demonstrate workflow and applicability of the so-called just-in-time cache-compiled simulation technique by means of state-of-the-art real-world architectures.
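The cache idea behind just-in-time cache-compiled simulation can be sketched as follows (a simplified assumption, not the cited simulator's implementation; addresses, encodings and instruction semantics are dummies): an instruction is decoded the first time its address is simulated, the decoded form is cached, and later visits reuse it, so hot code approaches compiled-simulation speed while cold or changed code keeps interpretive flexibility.

#include <cstdint>
#include <functional>
#include <iostream>
#include <unordered_map>

struct DecodedInstr {
    std::function<void(uint64_t& pc)> execute;  // pre-bound semantics
};

// Pretend target memory: address -> raw instruction word (invented).
static const std::unordered_map<uint64_t, uint32_t> kProgram = {
    {0x00, 0x1111}, {0x04, 0x2222}, {0x08, 0x3333},
};

// Expensive decode step, performed only on a cache miss.
DecodedInstr decode(uint32_t raw) {
    return {[raw](uint64_t& pc) {
        (void)raw;  // dummy semantics: every instruction just advances the PC
        pc += 4;
    }};
}

int main() {
    std::unordered_map<uint64_t, DecodedInstr> cache;  // decode cache
    uint64_t pc = 0x00, executed = 0;

    while (kProgram.count(pc)) {
        auto it = cache.find(pc);
        if (it == cache.end())  // miss: decode once and cache for reuse
            it = cache.emplace(pc, decode(kProgram.at(pc))).first;
        it->second.execute(pc);
        ++executed;
    }
    std::cout << "simulated " << executed << " instructions, "
              << cache.size() << " decoded entries\n";
}

A real simulator would also invalidate cache entries when the simulated program memory changes, which is what preserves the flexibility of interpretive simulation.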
Conference Paper
Source-level timing simulation (SLTS) is an important technique for early examination of timing behavior, as it is very fast and accurate. A factor occasionally more important than precision is simulation speed, especially in design space exploration or very early phases of development. Additionally, practices like rapid prototyping also benefit from high-performance timing simulation. Therefore, we propose to further reduce simulation run-time by utilizing a method called loop acceleration. Accelerating a loop in the context of SLTS means deriving the timing of a loop prior to simulation to increase the simulation speed of that loop. We integrated this technique in our SLTS framework and conducted a comprehensive evaluation using the Mälardalen benchmark suite. We were able to reduce simulation time by up to 43% of the original time, while the introduced accuracy loss did not exceed 8 percentage points.
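A minimal sketch of the loop-acceleration idea (assumed and simplified; the per-iteration and overhead cycle counts are invented): instead of charging the loop body block by block on every iteration, a per-iteration cost derived before simulation is multiplied by the trip count observed at run time.

#include <cstdint>
#include <iostream>

// Per-iteration cost of the loop body for the target platform
// (hypothetical value, e.g. the sum of the body's block timings).
const uint64_t kCyclesPerIteration = 37;
const uint64_t kLoopOverheadCycles = 5;   // header/exit cost (assumed)

// Accelerated timing of one loop execution with a known trip count.
uint64_t acceleratedLoopCycles(uint64_t tripCount) {
    return kLoopOverheadCycles + tripCount * kCyclesPerIteration;
}

int main() {
    // The functional simulation still determines the trip count, but it no
    // longer charges the body's timing iteration by iteration.
    const uint64_t tripCounts[] = {10, 1000, 1000000};
    for (uint64_t trips : tripCounts)
        std::cout << trips << " iterations -> "
                  << acceleratedLoopCycles(trips) << " cycles\n";
}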
Conference Paper
The static estimation of the energy consumed by program executions is an important challenge, which has applications in program optimization and verification, and is instrumental in energy-aware software development. Our objective is to estimate such energy consumption in the form of functions on the input data sizes of programs. We have developed a tool for experimentation with static analysis which infers such energy functions at two levels, the instruction set architecture (ISA) and the intermediate code (LLVM IR) levels, and reflects it upwards to the higher source code level. This required the development of a translation from LLVM IR to an intermediate representation and its integration with existing components, a translation from ISA to the same representation, a resource analyzer, an ISA-level energy model, and a mapping from this model to LLVM IR. The approach has been applied to programs written in the XC language running on XCore architectures, but is general enough to be applied to other languages. Experimental results show that our LLVM IR level analysis is reasonably accurate (less than 6.4% average error vs. hardware measurements) and more powerful than analysis at the ISA level. This paper provides insights into the trade-off of precision versus analyzability at these levels.
Article
With ever-increasing design complexities, traditional cycle-accurate or instruction-set simulations are often too slow or too inaccurate for system prototyping in early design stages. As an alternative, host-compiled or source-level software simulation has been proposed, but existing approaches have largely focused on timing simulation only. In this paper, we propose a novel source-level simulation infrastructure that provides a full range of performance, energy, reliability, power and thermal (PERPT) estimation. Using a fully automated, retargetable back-annotation framework, intermediate representation code is statically annotated with timing, energy and resource-access information obtained from low-level references at basic block granularity. The annotated model is natively compiled and combined with a cache model and occupancy analyzer to provide target performance, energy, soft-error vulnerability and power estimations. Finally, generated power traces are fed into thermal models for further temperature estimation. Comprehensive evaluations of our source-level models for PERPT estimations are performed. We applied our approach to PowerPC targets running various industry benchmark suites. Source-level simulations are evaluated for different PERPT metrics and with cache models at various levels of detail to explore the speed and accuracy tradeoffs. More than 90% accuracy can be achieved for timing, energy, reliability and power estimation, and an average error of 0.05 K exists in steady-state thermal estimation. Simulation speeds range from 180 to 5740 MIPS for different types of metrics at different abstraction levels.
Conference Paper
We present a fast and accurate timing simulation of binary code execution on complex embedded processors. Underlying block timings are extracted from a preceding hardware execution and differentiated by execution context. Thereby, complex factors, such as caches, can be reflected accurately without explicit modeling. Based on timings observed in one hardware execution, timing of numerous other executions for different inputs can be simulated at an average error below 5% for complex applications on an ARM Cortex-A9 processor.
Article
Virtual Prototypes (VPs) have been now widely adopted by industry as platforms for early SW development, HW/SW co-verification, performance analysis and architecture exploration. Yet, rising design complexity, the need to test an increasing amount of software functionality as well as the verification of timing properties pose a growing challenge in the application of VPs. New approaches overcome the accuracy-speed bottleneck of today's virtual prototyping methods. These next-generation VPs are centered around ultra-fast host-compiled software models. Accuracy is obtained by advanced methods, which reconstruct the execution times of the software and model the timing behavior of the operating system, target processor and memory system. It is shown that simulation speed can further be increased by abstract TLM-based communication models and efficient hardware peripheral models. Additionally, an industrial flow for efficient model development is outlined. This support of ultra-fast and accurate HW/SW co-simulation will be a key enabler for successfully developing tomorrow's multiprocessor system-on-chip (MPSoC) platforms.
Conference Paper
We present an approach to accurately simulate the temporal behavior of binary embedded software based on timing data generated using static analysis. As the timing of an instruction sequence is significantly influenced by the microarchitecture state prior to its execution, which highly depends on the preceding control flow, a sequence must be separately considered for different control flow paths instead of estimating the influence of basic blocks or single instructions in isolation. We handle the thereby arising issue of an excessive or even infinite number of different paths by considering different execution contexts instead of control flow paths. Related approaches using context-sensitive cycle counts during simulation are limited to simulating the control flow that could be considered during analysis. We eliminate this limitation by selecting contexts dynamically, picking a suitable one when no predetermined choice is available, thereby enabling a context-sensitive simulation of unmodified binary code of concurrent programs, including asynchronous events such as interrupts. In contrast to other approximate binary simulation techniques, estimates are conservative, yet tight, making our approach reliable when evaluating performance goals. For a multi-threaded application the simulation deviates only by 0.24% from hardware measurements while the average overhead is only 50% compared to a purely functional simulation.
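The context mechanism described here can be sketched with a small C++ example (assumed data structures, not the authors'; the context is simplified to the immediate predecessor block and all cycle counts are invented): timing is looked up per (block, context) pair, and when simulation reaches a pairing the analysis did not cover, a conservative per-block fallback is charged instead.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

typedef int BlockId;
typedef BlockId Context;  // simplified: context = predecessor block

// (block, context) -> cycles, as produced by a prior static analysis
// (invented numbers).
static const std::map<std::pair<BlockId, Context>, uint64_t> kContextTiming = {
    {{1, 0}, 20}, {{1, 2}, 35}, {{2, 1}, 10}, {{2, 0}, 12},
};

// Conservative fallback: worst observed timing of the block over all contexts.
uint64_t fallbackCycles(BlockId bb) {
    uint64_t worst = 0;
    for (const auto& entry : kContextTiming)
        if (entry.first.first == bb) worst = std::max(worst, entry.second);
    return worst;
}

uint64_t lookup(BlockId bb, Context ctx) {
    auto it = kContextTiming.find(std::make_pair(bb, ctx));
    return it != kContextTiming.end() ? it->second : fallbackCycles(bb);
}

int main() {
    uint64_t total = 0;
    BlockId prev = 0;
    // Simulated path including the edge 1 -> 1, which the analysis above
    // did not cover, so the dynamic fallback is used for that step.
    for (BlockId bb : {1, 2, 1, 1}) {
        total += lookup(bb, prev);
        prev = bb;
    }
    std::cout << "predicted cycles: " << total << "\n";
}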
Conference Paper
Design space exploration (DSE) of complex embedded systems that combine a number of CPUs, dedicated hardware and software is a tedious task for which a broad range of approaches exists, from the use of high-level models to hardware prototyping. Each of these entails different simulation speed/accuracy tradeoffs, and thereby enables exploring a certain subset of the design space in a given time. Some simulation frameworks devoted to CPU-centric systems have been developed over the past decade that either feature near real-time simulation speed or moderate to high speed with quasi-cycle-level accuracy, often by means of instruction-set simulators or binary translation techniques. This paper presents an evaluation, in terms of accuracy in modeling real systems, of the gem5 simulator, which belongs to the first class. Performance figures of a wide range of benchmarks (e.g. in domains such as scientific computing and media applications) are captured and compared to results obtained on real hardware.
Conference Paper
With traditional cycle-accurate or instruction-set simulations of processors often being too slow, host-compiled or source-level software execution approaches have recently become popular. Such high-level simulations can achieve order-of-magnitude speedups, but approaches that can achieve highly accurate characterization of both power and performance metrics are lacking. In this paper, we propose a novel host-compiled simulation approach that provides close to cycle-accurate estimation of energy and timing metrics in a retargetable manner, using flexible, architecture description language (ADL) based reference models. Our automated flow considers typical front- and back-end optimizations by working at the compiler-generated intermediate representation (IR). Path-dependent execution effects are accurately captured through pairwise characterization and back-annotation of basic code blocks with all possible predecessors. Results from applying our approach to PowerPC targets running various benchmark suites show that close to native average speeds of 2000 MIPS at more than 98% timing and energy accuracy can be achieved.
Conference Paper
This paper presents an approach for accurately estimating the execution time of parallel software components in complex embedded systems. Timing annotations obtained from highly optimized binary code are added to the source code of software components which is then integrated into a SystemC transaction-level simulation. This approach allows a fast evaluation of software execution times while being as accurate as conventional instruction set simulators. By simulating binary-level control flow in parallel to the original functionality of the software, even compiler optimizations heavily modifying the structure of the generated code can be modeled accurately. Experimental results show that the presented method produces timing estimates within the same level of accuracy as an established commercial tool for cycle-accurate instruction set simulation while being at least 20 times faster.
Conference Paper
Simulation and tracing tools help in the analysis, design, and tuning of both hardware and software systems. Simulators can execute code for hardware that does not yet exist, can provide access to internal state that may be invisible on real hardware, can give deterministic execution in the face of races, and can produce “stress test” situations that are hard to produce on the real hardware [4]. Tracing tools can provide detailed information about the behavior of a program; that information is used to drive an analyzer that analyzes or predicts the behavior of a particular system component. That, in turn, provides feedback that is used to improve the design and implementation of everything from architectures to compilers to applications. Analyzers consume many kinds of trace information; for example, address traces are used for studies of memory hierarchies, opcode and operand usage for superscalar and pipelined processor design, instruction counts for optimization studies, operand values for memoizing studies, and branch behavior for branch prediction.
Article
Tracing tools are used widely to help analyze, design, and tune both hardware and software systems. This paper describes a tool called Shade which combines efficient instruction-set simulation with a flexible, extensible trace generation capability. Efficiency is achieved by dynamically compiling and caching code to simulate and trace the application program. The user may control the extent of tracing in a variety of ways; arbitrarily detailed application state information may be collected during the simulation, but tracing less translates directly into greater efficiency. Current Shade implementations run on SPARC systems and simulate the SPARC (Versions 8 and 9) and MIPS I instruction sets. This paper describes the capabilities, design, implementation, and performance of Shade, and discusses instruction set emulation in general.
Article
Programs spend most of their time in loops and procedures. Therefore, most program transformations and the necessary static analyses deal with these. It has been long recognized, that different execution contexts for procedures may induce different execution properties. There are well established techniques for interprocedural analysis like the call string approach. Loops have not received similar attention in the area of data flow analysis and abstract interpretation. All executions are treated in the same way, although typically the first and later executions may exhibit very different properties. In this paper a new technique is presented that allows the application of the well known and established interprocedural analysis theory to loops, It turns out that the call string approach has limited flexibility in its possibilities to group several calling contexts together for the analysis. An extension to overcome this problem is presented that relies on a similar approach but gives more useful results in practice. The classical and the new techniques are implemented in our Program Analyzer Generator FAG, which is used to demonstrate our findings by applying the techniques to several real world programs.
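A small illustration of why separate loop contexts pay off (the cycle numbers are invented and only serve the argument): if the first iteration of a loop runs with a cold cache and later iterations with a warm one, keeping a "first iteration" context apart from an "other iterations" context yields a much tighter estimate than merging all iterations into one worst-case context.

#include <cstdint>
#include <iostream>

enum class LoopContext { FirstIteration, OtherIterations };

// Cycles of one loop-body execution in each context (invented: the first
// iteration pays for cache misses).
uint64_t bodyCycles(LoopContext ctx) {
    return ctx == LoopContext::FirstIteration ? 120 : 15;
}

int main() {
    const uint64_t trips = 100;

    // Context-sensitive estimate: one cold iteration, the rest warm.
    uint64_t sensitive = bodyCycles(LoopContext::FirstIteration)
                       + (trips - 1) * bodyCycles(LoopContext::OtherIterations);

    // Context-insensitive estimate: every iteration charged the worst case.
    uint64_t merged = trips * bodyCycles(LoopContext::FirstIteration);

    std::cout << "context-sensitive: " << sensitive << " cycles\n"
              << "merged (worst case per iteration): " << merged << " cycles\n";
}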
Article
Shade is an instruction-set simulator and custom trace generator. Application programs are executed and traced under the control of a user-supplied trace analyzer. To reduce communication costs, Shade and the analyzer are run in the same address space. To further improve performance, code which simulates and traces the application is dynamically generated and cached for reuse. Current implementations run on SPARC systems and, to varying degrees, simulate the SPARC (Versions 8 and 9) and MIPS I instruction sets. This paper describes the capabilities, design, implementation, and performance of Shade, and discusses instruction set emulation in general. Shade improves on its predecessors by providing their various tracing capabilities together in a single tool. Shade is also fast: Running on a SPARC and simulating a SPARC, SPEC 89 benchmarks run about 2.3 times slower for floating-point programs and 6.2 times slower for integer programs. Saving trace data costs more, but Shade pr...
The LLVM IR Executor: lli. October 2018. https://llvm.org/docs/CommandGuide/lli.html
Hercules RM57Lx Development Kit. October 2018. http://www.ti.com/tool/tmdxrm57lhdk
Automated, retargetable back-annotation for host compiled performance and power modeling
  • Suhas Chakravarty
  • Zhuoran Zhao
  • Andreas Gerstlauer