Conference Paper

CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework

Univ. of Michigan, Ann Arbor, MI
DOI: 10.1109/ICCD.2008.4751886 Conference: Computer Design, 2008. ICCD 2008. IEEE International Conference on
Source: DBLP

ABSTRACT Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.

1 Bookmark
  • [Show abstract] [Hide abstract]
    ABSTRACT: The complexity of SoCs has been increasing enor-mously over the last years. This increases the effort for testing the SoCs against natural external influences and fault attacks. These tests require a huge amount of time because of the large fault scenario space. In this paper a novel method is presented on reduction of system test duration. This speed-up is reached by observing memory accesses during a golden model run to find security relevant regions in memories. Therefore, a novel monitor module has been designed and tested which stores the used memory addresses together with the access time stamps.
    25 th IEEE International SOC Conference (SOCC); 09/2012
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents an analysis of the effects and propagations of different faults such as Single Event Transient (SET), Multiple Event Transients (MET), Single Event Upset (SEU) and Multiple Bit Upsets (MBU) by simulation-based fault injection into Areoflex Gaisler LEON3 processor which is a 32 bit synthesizable processor based on SPARC V8 architecture. LEON3 is designed for ground-based applications. This investigation is done by injecting nearly 11200 transient faults into different components of LEON3 including flip-flops, registers, register-file and cache memories. The behavior of LEON3 processor against injected faults is reported. Besides, it is shown that nearly 52.83% of SEUs are overwritten; 31.74% of SEUs are latent and finally 15.43% of them are reported as failure while 44.74% of MBUs are overwritten; 38.42% of them are latent and 16.84 of these kind of faults are failed. Also, 98.03% of SETs are overwritten; 0.6% of them are latent and 1.36% of SETs are reported as failures. Finally, the effects of METs are as follows: 96.71% for overwritten faults; 1.15% for latent and 2.14% for failure. Moreover, integer unit and multiplier unit are the most susceptible components against single and multiple faults respectively.
    Computational Science and Engineering (CSE); 12/2012
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated with microarchitecture-level simulations that symptom-based solutions can provide high fault coverage and a low Silent Data Corruption (SDC) rate. However, more accurate evaluations are needed to validate such solutions for hardware faults in real-world processor designs. In this paper, we evaluate SWAT's symptom-based detectors on gate-level faults using an FPGA-based, full-system prototype. With this platform, we performed a gate-level accurate fault injection campaign of 51,630 fault injections in the OpenSPARC T1 core logic across five SPECInt 2000 benchmarks. With an overall SDC rate of 0.79%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in real-world designs.

Full-text (2 Sources)

Available from
May 21, 2014