Conference Paper

CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework

Univ. of Michigan, Ann Arbor, MI
DOI: 10.1109/ICCD.2008.4751886
Conference: IEEE International Conference on Computer Design (ICCD 2008)
Source: DBLP

ABSTRACT Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.
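
The idea of a gate-level fault injection campaign described in the abstract can be illustrated with a minimal software sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy netlist (a 1-bit full adder), the stuck-at fault model, and the names `NETLIST`, `simulate`, and `campaign` are all hypothetical. It also runs as plain software simulation, whereas CrashTest accelerates this kind of analysis on an FPGA emulation platform.

```python
from itertools import product

# Hypothetical toy netlist (not from the paper): a 1-bit full adder given as
# (output_net, gate_type, input_nets), listed in topological order.
NETLIST = [
    ("n1", "XOR", ("a", "b")),
    ("sum", "XOR", ("n1", "cin")),
    ("n2", "AND", ("a", "b")),
    ("n3", "AND", ("n1", "cin")),
    ("cout", "OR", ("n2", "n3")),
]
PRIMARY_INPUTS = ("a", "b", "cin")
PRIMARY_OUTPUTS = ("sum", "cout")

GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def simulate(inputs, fault=None):
    """Evaluate the netlist for one input vector.

    fault is None for the golden (fault-free) run, or a pair
    (net_name, stuck_value) modeling an assumed stuck-at fault.
    """
    nets = dict(inputs)

    def force(net):
        # Override the net with the stuck value if it is the faulty one.
        if fault is not None and fault[0] == net:
            nets[net] = fault[1]

    for pi in PRIMARY_INPUTS:
        force(pi)
    for out, kind, ins in NETLIST:
        nets[out] = GATES[kind](*(nets[i] for i in ins))
        force(out)
    return tuple(nets[o] for o in PRIMARY_OUTPUTS)


def campaign():
    """Inject every stuck-at-0/1 fault on every net and count, over all input
    vectors, how often the primary outputs differ from the golden run."""
    all_nets = list(PRIMARY_INPUTS) + [gate[0] for gate in NETLIST]
    vectors = [
        dict(zip(PRIMARY_INPUTS, bits))
        for bits in product((0, 1), repeat=len(PRIMARY_INPUTS))
    ]
    report = {}
    for net in all_nets:
        for stuck in (0, 1):
            fault = (net, stuck)
            report[fault] = sum(
                simulate(v) != simulate(v, fault) for v in vectors
            )
    return report


if __name__ == "__main__":
    for (net, stuck), mismatches in sorted(campaign().items()):
        print(f"{net} stuck-at-{stuck}: {mismatches}/8 vectors corrupted")
```

A fault whose mismatch count is zero would be masked at the primary outputs for every input vector. A real campaign would replace the exhaustive vector set with application workloads and the stuck-at model with whichever failure models are under study.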

  • ABSTRACT: This paper presents a framework for an in-depth analysis of transient faults in microprocessor-based embedded systems. The framework is based on a debug-like mechanism that supports interpreting and analyzing the system behavior from an application point of view, in terms of function execution flow and passed/returned parameters. The framework offers a highly customizable fault/error debug and classification approach, based on such application-level information, aimed at supporting the designer in evaluating and tuning the dependability-related properties of the system. We present an implementation of the proposed framework within a state-of-the-art fault injection environment for SystemC transaction-level multiprocessor specifications, and we show that the approach can also be ported to other environments. An experimental session considering an embedded system based on a processor highlights the benefits of the proposed approach.
    2011 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2011, Vancouver, BC, Canada, October 3-5, 2011; 01/2011
  • ABSTRACT: This paper presents an analysis of the effects and propagation of different faults, namely Single Event Transients (SET), Multiple Event Transients (MET), Single Event Upsets (SEU) and Multiple Bit Upsets (MBU), by simulation-based fault injection into the Aeroflex Gaisler LEON3 processor, a 32-bit synthesizable processor based on the SPARC V8 architecture. LEON3 is designed for ground-based applications. The investigation injects nearly 11200 transient faults into different components of LEON3, including flip-flops, registers, the register file and cache memories. The behavior of the LEON3 processor under the injected faults is reported. It is shown that nearly 52.83% of SEUs are overwritten, 31.74% are latent and 15.43% are reported as failures, while 44.74% of MBUs are overwritten, 38.42% are latent and 16.84% result in failures. Also, 98.03% of SETs are overwritten, 0.6% are latent and 1.36% are reported as failures. Finally, the effects of METs are as follows: 96.71% overwritten, 1.15% latent and 2.14% failures. Moreover, the integer unit and the multiplier unit are the components most susceptible to single and multiple faults, respectively.
    Computational Science and Engineering (CSE); 12/2012
  • ABSTRACT: Continued technology scaling is resulting in systems with billions of devices. Unfortunately, these devices are prone to failures from various sources, resulting in even commodity systems being affected by the growing reliability threat. Thus, traditional solutions involving high redundancy or piecemeal solutions targeting specific failure modes will no longer be viable owing to their high overheads. Recent reliability solutions have explored using low-cost monitors that watch for anomalous software behavior as a symptom of hardware faults. We previously proposed the SWAT system that uses such low-cost detectors to detect hardware faults, and a higher cost mechanism for diagnosis. However, all of the prior work in this context, including SWAT, assumes single-threaded applications and has not been demonstrated for multithreaded applications running on multicore systems. This paper presents mSWAT, the first work to apply symptom-based detection and diagnosis for faults in multicore architectures running multithreaded software. For detection, we extend the symptom-based detectors in SWAT and show that they result in a very low silent data corruption (SDC) rate for both permanent and transient hardware faults. For diagnosis, the multicore environment poses significant new challenges. First, deterministic replay required for SWAT's single-threaded diagnosis incurs higher overheads for multithreaded workloads. Second, the fault may propagate to fault-free cores resulting in symptoms from fault-free cores and no available known-good core, breaking fundamental assumptions of SWAT's diagnosis algorithm. We propose a novel permanent fault diagnosis algorithm for multithreaded applications running on multicore systems that uses a lightweight isolated deterministic replay to diagnose the faulty core with no prior knowledge of a known good core. Our results show that this technique successfully diagnoses over 95% of the detected permanent faults while incurring low hardware overheads. mSWAT thus offers an affordable solution to protect future multicore systems from hardware faults.
    Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on; 01/2010
