Microprocessor sensitivity to failures: Control vs. execution and combinational vs. sequential logic
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
DOI: 10.1109/DSN.2005.63 Conference: Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
The goal of this study is to characterize the impact of soft errors on embedded processors. We focus on control versus speculation logic on one hand, and combinational versus sequential logic on the other. The target system is a gate-level implementation of a DLX-like processor. The synthesized design is simulated, and transients are injected to stress the processor while it is executing selected applications. Analysis of the collected data shows that fault sensitivity of the combinational logic (4.2% for a fault duration of one clock cycle) is not negligible, even though it is smaller than the fault sensitivity of flip-flops (10.4%). Detailed study of the error impact, measured at the application level, reveals that errors in speculation and control blocks collectively contribute to about 34% of crashes, 34% of fail-silent violations and 69% of application incomplete executions. These figures indicate the increasing need for processor-level detection techniques over generic methods, such as ECC and parity, to prevent such errors from propagating beyond the processor boundaries.
Available from: Karthik Pattabiraman
- "We also assume that the processor's control logic is error-free. This is also reasonable because the control logic constitutes a very small portion of the processor."
ABSTRACT: Intermittent hardware faults are bursts of errors that last from a few CPU cycles to a few seconds. Recent studies have shown that intermittent fault rates are increasing due to technology scaling and are likely to be a significant concern in future systems. We study the impact of intermittent hardware faults in programs. A simulation-based fault-injection campaign shows that the majority of the intermittent faults lead to program crashes. We build a crash model and a program model that represents the data dependencies in a fault-free execution of the program. We then use this model to glean information about when the program crashes and the extent of fault propagation. Empirical validation of our model using fault-injection experiments shows that it predicts almost all actual crash-causing intermittent faults, and in 93% of the considered faults the prediction is accurate within 100 instructions. Further, the model is found to be more than two orders of magnitude faster than equivalent fault-injection experiments performed with a microprocessor simulator.
Available from: Andrea Pellegrini
- "Software-Based Resiliency Analysis: Often, software-based fault injection is preferred to hardware-based solutions due to its low cost, less complex development cycle, flexibility of customization, or simply because no low-level hardware model of the design is available. There are several software-based resiliency analysis frameworks presented in literature. Although they have many advantages, the speed of low level (e.g., gate-level) simulations does not make these solutions feasible for analyzing complex designs or complete systems running software applications."
ABSTRACT: Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early-transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliable mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework, when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.
Available from: psu.edu
- "Two main mechanisms lead to this result: 1) many bits of micro-architectural state are dead, in that they will be written before being referenced, and 2) some micro-architectural state affects performance but not correctness, with predictor state being the obvious example. Previous work has characterized the relative error vulnerability of micro-architectural structures by observing what fraction of their bits are necessary for an architecturally correct execution (ACE, where architecturally-visible state never contains an incorrect value) and by injecting micro-architectural faults and observing which fraction lead to program crashes and incorrect program outputs. There are discrepancies in the numbers computed by these two approaches because an architecturally correct execution is not explicitly required to compute the correct program output without program crashes; some derating can occur in the software itself."
ABSTRACT: In this work, we characterize a significant source of software derating that we call instruction-level derating. Instruction-level derating encompasses the mechanisms by which computation on incorrect values can result in correct computation. We characterize the instruction-level derating that occurs in the SPEC CPU2000 INT benchmarks, classifying it (by source) into six categories: value comparison, sub-word operations, logical operations, overflow/precision, lucky loads, and dynamically-dead values. We also characterize the temporal nature of this derating, demonstrating that the effects of a fault persist in architectural state long after the last time they are referenced. Finally, we demonstrate how this characterization can be used to avoid unnecessary error recoveries (when a fault will be masked by software anyway) in the context of a dual modular redundant (DMR) architecture.