-
Proceedings of the 47th Design Automation Conference, DAC 2010, Anaheim, California, USA, July 13-18, 2010; 01/2010
-
26th International Conference on Computer Design, ICCD 2008, 12-15 October 2008, Lake Tahoe, CA, USA, Proceedings; 01/2008
-
2007 Design, Automation and Test in Europe Conference and Exposition (DATE 2007), April 16-20, 2007, Nice, France; 01/2007
-
TACO. 01/2007; 4.
-
[show abstract]
[hide abstract]
ABSTRACT: Recently, there has been a growing concern that, in relation to process technology scaling, the soft-error rate will become a major challenge in designing reliable systems. In this work, we introduce a high-fidelity, high-performance simu-lation infrastructure for quantifying the derating effects on soft-error rates while considering microarchitectural, tim-ing and logic-related masking, using realistic workloads on a CMP switch design. We use a gate-level model for the CMP switch design, enabling us to inject faults into blocks of combinational logic. We are then able to track logic-related and time-related fault masking, as well as microarchitectural-related fault masking, at the architecture level. We find out that for complex designs, logic-and time-related fault mask-ing account for more than 50% of the masked faults. We also observe that only 3-4% of the injected faults propagate an error at the design's output and cause an error in the application's execution, resulting in a derating factor of 30. From our experiments, we also demonstrate that soft-error derating effects highly depend on the design's characteristics and utilization.
-
[show abstract]
[hide abstract]
ABSTRACT: Extreme transistor scaling trends in silicon technology are soon to reach a point where manufactured systems will suffer from limited device reliability and severely reduced life-time, due to early transistor failures, gate oxide wear-out, manufacturing defects, and radiation-induced soft errors (SER). In this paper we present a low-cost technique to harden a microprocessor pipeline and caches against these reliability threats. Our approach utilizes online built-in self-test (BIST) and microarchitectural checkpointing to detect, diagnose and recover the computation impaired by silicon defects or SER events. The approach works by periodically testing the processor to determine if the system is broken. If so, we reconfigure the processor to avoid using the broken component. A similar mechanism is used to detect SER, faults, with the difference that recovery is implemented by re-execution. By utilizing low-cost techniques to address defects and SER, we keep protection costs significantly lower than traditional fault-tolerance approaches while providing high levels of coverage for a wide range of faults. Using detailed gate-level simulation, we find that our approach provides 95% and 99% coverage for silicon defects and SER events, respectively, with only a 14% area overhead.