Conference Paper

CrashTest: A Fast High-Fidelity FPGA-Based Resiliency Analysis Framework

Univ. of Michigan, Ann Arbor, MI
DOI: 10.1109/ICCD.2008.4751886 Conference: Computer Design, 2008. ICCD 2008. IEEE International Conference on
Source: DBLP


Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.
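As a rough illustration of what a gate-level fault injection campaign involves, the following minimal software sketch injects stuck-at faults into a toy full-adder netlist and counts how often each fault corrupts the primary outputs. It is not CrashTest's implementation, which runs on an FPGA emulation platform. The netlist, the net names, and the `simulate` and `run_campaign` helpers are all hypothetical and exist only for this example.

```python
from itertools import product

# Toy gate-level netlist of a full adder (hypothetical example design).
# Each entry is (output_net, gate_type, input_nets), in topological order.
NETLIST = [
    ("x1",   "XOR", ("a", "b")),
    ("sum",  "XOR", ("x1", "cin")),
    ("a1",   "AND", ("a", "b")),
    ("a2",   "AND", ("x1", "cin")),
    ("cout", "OR",  ("a1", "a2")),
]
PRIMARY_INPUTS = ("a", "b", "cin")
PRIMARY_OUTPUTS = ("sum", "cout")

GATES = {
    "AND": lambda x, y: x & y,
    "OR":  lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def simulate(inputs, fault=None):
    """Evaluate the netlist for one input vector.

    fault is None for the fault-free (golden) run, or a pair
    (net, stuck_value) that forces that net to 0 or 1.
    """
    nets = dict(inputs)
    if fault and fault[0] in nets:
        # Stuck-at fault on a primary input.
        nets[fault[0]] = fault[1]
    for out, kind, ins in NETLIST:
        value = GATES[kind](*(nets[i] for i in ins))
        if fault and fault[0] == out:
            # Stuck-at fault on a gate output overrides the computed value.
            value = fault[1]
        nets[out] = value
    return tuple(nets[o] for o in PRIMARY_OUTPUTS)


def run_campaign():
    """Inject every single stuck-at fault and count output mismatches.

    Returns a dict mapping (net, stuck_value) to the number of input
    vectors whose primary outputs differ from the golden run.
    """
    all_nets = list(PRIMARY_INPUTS) + [gate[0] for gate in NETLIST]
    vectors = [dict(zip(PRIMARY_INPUTS, bits))
               for bits in product((0, 1), repeat=len(PRIMARY_INPUTS))]
    report = {}
    for net in all_nets:
        for stuck in (0, 1):
            report[(net, stuck)] = sum(
                simulate(v, (net, stuck)) != simulate(v) for v in vectors
            )
    return report


if __name__ == "__main__":
    for (net, stuck), mismatches in sorted(run_campaign().items()):
        print(f"{net} stuck-at-{stuck}: {mismatches}/8 vectors corrupted")
```

A fault with a mismatch count of zero would be fully masked for the tested inputs. A real campaign would replace the exhaustive input vectors with the software workload running on the design, and the toy evaluator with simulation or emulation of the actual netlist.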



Available from: Andrea Pellegrini,
  • Source
    • "Fault injection in LEON processors is used in several works. In [8], crash test, which is a fast FPGA-based framework, evaluates the effects of SEUs and some permanent models. It uses LEON3 as a case study. "
    ABSTRACT: This paper analyzes the effects and propagation of different faults, namely Single Event Transients (SETs), Multiple Event Transients (METs), Single Event Upsets (SEUs) and Multiple Bit Upsets (MBUs), using simulation-based fault injection into the Aeroflex Gaisler LEON3 processor, a 32-bit synthesizable processor based on the SPARC V8 architecture. LEON3 is designed for ground-based applications. The investigation injects nearly 11,200 transient faults into different components of LEON3, including flip-flops, registers, the register file and cache memories, and reports the behavior of the processor under the injected faults. Nearly 52.83% of SEUs are overwritten, 31.74% are latent and 15.43% result in failures, while 44.74% of MBUs are overwritten, 38.42% are latent and 16.84% result in failures. For SETs, 98.03% are overwritten, 0.6% are latent and 1.36% result in failures; for METs, 96.71% are overwritten, 1.15% are latent and 2.14% result in failures. Moreover, the integer unit and the multiplier unit are the components most susceptible to single and multiple faults, respectively.
    Computational Science and Engineering (CSE); 12/2012
  • Source
    • "This approach allows CrashTest to amortize design setup time among several experiments. CrashTest can accelerate resiliency analysis of industrial-size designs by up to six orders of magnitude compared to equivalent software-based fault injections [11]. Moreover, CrashTest does not alter the original design functionality, allowing it to execute a complete software stack, including the operating system and user applications. "
    ABSTRACT: Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated with microarchitecture-level simulations that symptom-based solutions can provide high fault coverage and a low Silent Data Corruption (SDC) rate. However, more accurate evaluations are needed to validate such solutions for hardware faults in real-world processor designs. In this paper, we evaluate SWAT's symptom-based detectors on gate-level faults using an FPGA-based, full-system prototype. With this platform, we performed a gate-level accurate fault injection campaign of 51,630 fault injections in the OpenSPARC T1 core logic across five SPECInt 2000 benchmarks. With an overall SDC rate of 0.79%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in real-world designs.
    03/2012; DOI:10.1109/DATE.2012.6176660
  • Source
    • "Moreover, approaches to analyze the achieved fault tolerance and reliability have been proposed, e.g. [11] [12] [13] [14] [15]. However, these approaches do not target dependability at ESL or focus on mixed fault simulation (more details are discussed in the related work section). "
    ABSTRACT: Raising the level of abstraction to design the next generation of embedded systems has become mandatory. This design methodology is commonly referred to as Electronic System Level (ESL) design. At the same time, the dependability of embedded systems is becoming a major concern. To satisfy these demands already at ESL, we present a dependability analysis approach that works directly at this level. The approach analyzes the effectiveness of dependability measures in SystemC-based virtual prototypes. Errors are injected into SystemC transactions using an XML-based configuration mechanism. This is combined with the specification of the expected behavior with respect to the injected errors. The developed analysis approach allows for validation of dependability measures as well as localization of missing or buggy measures. Experimental results for a complex image processing system, which determines the position of a game controller in video data, demonstrate the advantages of our approach.
    2011 Forum on Specification & Design Languages, FDL 2011, Oldenburg, Germany, September 13-15, 2011; 01/2011