Microprocessor sensitivity to failures: Control vs. execution and combinational vs. sequential logic
ABSTRACT The goal of this study is to characterize the impact of soft errors on embedded processors. We focus on control versus speculation logic on one hand, and combinational versus sequential logic on the other. The target system is a gate-level implementation of a DLX-like processor. The synthesized design is simulated, and transients are injected to stress the processor while it is executing selected applications. Analysis of the collected data shows that fault sensitivity of the combinational logic (4.2% for a fault duration of one clock cycle) is not negligible, even though it is smaller than the fault sensitivity of flip-flops (10.4%). Detailed study of the error impact, measured at the application level, reveals that errors in speculation and control blocks collectively contribute to about 34% of crashes, 34% of fail-silent violations and 69% of application incomplete executions. These figures indicate the increasing need for processor-level detection techniques over generic methods, such as ECC and parity, to prevent such errors from propagating beyond the processor boundaries.
- Source available from: Karthik Pattabiraman
Conference Paper: Processor-Level Selective Replication
ABSTRACT: We propose a processor-level technique called selective replication, by which the application can choose where in its application stream and to what degree it requires replication. Recent work on static analysis and fault-injection-based experiments on applications reveals that certain variables in the application are critical to its crash- and hang-free execution. If it can be ensured that only the computation of these variables is error-free, then a high degree of crash/hang coverage can be achieved at a low performance overhead to the application. The selective replication technique provides an ideal platform for validating this claim. The technique is compared against complete duplication as provided in current architecture-level techniques. The results show that with about 59% less overhead than full duplication, selective replication detects 97% of the data errors and 87% of the instruction errors that were covered by full duplication. It also reduces the detection of errors benign to the final outcome of the application by 17.8% as compared to full duplication.
Dependable Systems and Networks, 2007. DSN '07. 37th Annual IEEE/IFIP International Conference on; 07/2007
Conference Paper: ERSA: Error Resilient System Architecture for Probabilistic Applications
ABSTRACT: There is a growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a low-cost robust system architecture for emerging killer probabilistic applications such as Recognition, Mining and Synthesis (RMS) applications. While resilience of such applications to errors in low-order bits of data is well-known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high error resilience to high-order bit errors and control errors (in addition to low-order bit errors) using a judicious combination of 3 key ideas: (1) asymmetric reliability in many-core architectures, (2) error-resilient algorithms at the core of probabilistic applications, and (3) intelligent software optimizations. Error injection experiments on a multi-core ERSA hardware prototype demonstrate that, even at very high error rates of 20,000 errors/second/core or 2×10⁻⁴ error/cycle/core (with errors injected in architecturally-visible registers), ERSA maintains 90% or better accuracy of output results, together with minimal impact on execution time, for probabilistic applications such as K-Means clustering, LDPC decoding and Bayesian networks. Moreover, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges such as Vccmin problems and erratic bit errors. Using the concept of configurable reliability, ERSA platforms may also be adapted for general-purpose applications that are less resilient to errors (but at higher costs).
Design, Automation and Test in Europe, DATE 2010, Dresden, Germany, March 8-12, 2010; 03/2010