Conference Paper

CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework

Univ. of Michigan, Ann Arbor, MI
DOI: 10.1109/ICCD.2008.4751886
Conference: IEEE International Conference on Computer Design (ICCD 2008)
Source: DBLP

ABSTRACT Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.

  • Source
    ABSTRACT: Continued device scaling is resulting in smaller devices that are increasingly vulnerable to errors from various sources, e.g., wear-out and high-energy particle strikes. As this reliability threat grows, future shipped hardware will likely fail due to in-the-field hardware faults. A comprehensive reliability solution should detect the fault, diagnose the source of it, and recover the correct execution. Traditional redundancy-based reliability solutions that handle these faults are too expensive for mainstream computing. A promising approach is using software-level symptoms to detect hardware faults. Specifically, the SWAT project [11] has proposed a set of always-on monitors that perform such detections at very low cost. In the rare event of a fault, a more expensive diagnosis mechanism is invoked alongside a checkpoint/replay-based recovery procedure. Previous studies, however, were in the context of single-threaded applications on uniprocessors, and their applicability in multicore systems is unclear. This thesis provides detection and diagnosis mechanisms for hardware faults in multicore systems running multithreaded applications. For detection, we augmented the SWAT symptoms with multicore counterparts. These resulted in a high coverage of 98.8% for permanent faults, with a low 0.8% silent data corruption (SDC) rate. We also show that these symptoms are effective for transient faults. These results demonstrate the applicability of symptom-based detection for faults in multicore systems running multithreaded workloads. Permanent faults require a diagnosis mechanism, unlike transient faults. In multicore systems, a fault in a core may escape to a fault-free core, and the latter may result in a symptom. This makes permanent fault diagnosis a challenge. We propose a novel mechanism that identifies the faulty core, with near-zero performance overhead in the fault-free case. Our diagnosis mechanism replays the execution from each core on two other cores and compares the executions. A mismatch in the executions results in identification of the faulty core. Our results show that the proposed diagnosis
  • Source
    ABSTRACT: Continuously shrinking feature sizes lead to an increasing vulnerability of digital circuits. Manufacturing failures and transient faults may tamper with the functionality. Thus, analyzing the fault tolerance of a given implementation becomes an important step in the design process. This can be done by simulation-based fault injection, which does not cover all potential scenarios, or by formal techniques, which conservatively cover any scenario and any potential fault. In this paper, we show how to use application-specific knowledge when formally analyzing an implementation. Constraints are extracted from an application to evaluate the design under more realistic conditions. The experimental results show that fault tolerance with respect to a certain application can be significantly higher than the fault tolerance without any restrictions.
  • Source
    ABSTRACT: Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated with microarchitecture-level simulations that symptom-based solutions can provide high fault coverage and a low Silent Data Corruption (SDC) rate. However, more accurate evaluations are needed to validate such solutions for hardware faults in real-world processor designs. In this paper, we evaluate SWAT's symptom-based detectors on gate-level faults using an FPGA-based, full-system prototype. With this platform, we performed a gate-level accurate fault injection campaign of 51,630 fault injections in the OpenSPARC T1 core logic across five SPECInt 2000 benchmarks. With an overall SDC rate of 0.79%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in real-world designs.
