-
ABSTRACT: We investigate the correlation between low-level faults in the control logic of a modern microprocessor and their instruction-level impact on the execution of typical workloads. Such information can prove immensely useful in accurately assessing and prioritizing faults with regard to their criticality, as well as commensurately allocating resources to enhance online testability and error/fault resilience through concurrent error detection/correction methods. To this end, we developed an extensive fault simulation infrastructure which allows injection of stuck-at faults and transient errors of arbitrary starting time and duration, as well as cost-effective simulation and classification of their repercussions into various instruction-level error types. As a test vehicle for our study, we employ a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. Extensive fault injection campaigns in control modules of this microprocessor facilitate valuable observations regarding the distribution of low-level faults into the instruction-level error types that they cause. Experimentation with both Register Transfer (RT-) and Gate-Level faults, as well as with both stuck-at faults and transient errors, confirms the validity and corroborates the utility of these observations.
IEEE Transactions on Computers 10/2011; · 1.10 Impact Factor
-
ABSTRACT: We present a Concurrent Error Detection (CED) scheme for the Scheduler of a modern microprocessor. The proposed CED scheme is based on monitoring a set of invariances imposed through added hardware, violation of which signifies the occurrence of an error. The novelty of our solution stems from the workload-cognizant way in which these invariances are selected so that they leverage the application-level error masking inherent in program execution. Specifically, in order to ensure cost-effectiveness of the hardware employed to construct these invariances, we make use of information regarding the type and frequency of errors affecting the typical workload of the microprocessor. Thereby, we identify the most susceptible aspects of instruction execution and we accordingly distribute CED resources to protect them. Our approach is demonstrated on the Scheduler of an Alpha-like superscalar microprocessor with dynamic scheduling, hybrid branch prediction and out-of-order execution capabilities. Using an extensive fault-simulation infrastructure that we developed around this microprocessor, we profile the impact of Scheduler faults across a variety of different SPEC2000 benchmarks. Based on the results, we construct a CED scheme which monitors the time and location of instruction execution, the executed operation, the utilized resources, as well as the executed and retired sequence of instructions. At a hardware cost of only 32 percent of the Scheduler, the corresponding CED scheme detects over 85 percent of its faults that affect the architectural state of the microprocessor. Furthermore, over 99.5 percent of these faults are detected before they corrupt the architectural state, while the average detection latency for the remaining faults is in the order of a few clock cycles, implying that efficient recovery methods can be developed.
IEEE Transactions on Computers 10/2011; · 1.10 Impact Factor
-
ABSTRACT: Towards improving performance, modern microprocessors incorporate a variety of architectural features, such as branch prediction and speculative execution, which are not critical to the correctness of their operation. While faults in the corresponding hardware may not necessarily affect functional correctness, they may, nevertheless, adversely impact performance. In this paper, we investigate quantitatively the performance impact of such faults using a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. We provide extensive fault simulation-based experimental results and we discuss how this information may guide the inclusion of additional hardware for performance loss recovery and yield enhancement.
Computer Design, 2009. ICCD 2009. IEEE International Conference on; 11/2009
-
ABSTRACT: We investigate the correlation between register transfer-level faults in the control logic of a modern microprocessor and their instruction-level impact on the execution flow of typical programs. Such information can prove immensely useful in accurately assessing and prioritizing faults with regard to their criticality, as well as commensurately allocating resources to enhance testability, diagnosability, manufacturability and reliability. To this end, we developed an extensive infrastructure which allows injection of stuck-at faults and transient errors of arbitrary starting point and duration, as well as cost-effective simulation and classification of their repercussions into various instruction-level error types. As a test vehicle for our study, we employ a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. Extensive experimentation with faults injected in control logic modules of this microprocessor reveals interesting trends and results, corroborating the utility of this simulation infrastructure and motivating its further development and application to various tasks related to robust design.
Test Conference, 2008. ITC 2008. IEEE International; 11/2008
-
ABSTRACT: This paper proposes an offline test strategy for finding the largest fault-free connected sub-structure of a mesh-based NoC. Faulty switch ports are found by flooding the NoC with test packets. Then, NoC routers are reconfigured according to the degraded NoC structure to route incoming packets.
Test Conference, 2008. ITC 2008. IEEE International; 11/2008
-
ABSTRACT: Moving towards reconfigurability is an approach to increasing fault tolerance in System-on-Chip design. In this paper, we propose a self-reconfigurable NoC architecture utilizing a robust rerouting method. First, an offline test strategy for locating system-level faults in NoC switch ports is applied. Using the information obtained in the test phase, every switch reconfigures itself to avoid routing packets through faulty links by utilizing our local rerouting method. The proposed rerouting method is evaluated using a Transaction-Level platform. Experimental results show that our proposed rerouting method delivers all the packets in a faulty NoC successfully and incurs less communication overhead than a pure flooding method.
Design & Test Symposium (EWDTS), 2008 East-West; 11/2008
-
ABSTRACT: This paper presents an efficient high-level synthesis (HLS) approach to improve RT-level concurrent testing. The proposed method is used for both fault detection and fault location. First, the available resources are used in their dead intervals to test active resources for fault detection; then, some changes are applied to the RT-level controller to locate the faults. The fault detection step is based on a genetic algorithm (GA) search technique. This genetic algorithm is applied to the design after the high-level synthesis process to explore the test map. The proposed method has been evaluated based on dependability enhancement and the area/latency overhead imposed on different benchmarks after applying our algorithm. Dependability has been considered in terms of fault coverage. The experimental results show that, with our algorithm applied, the associated area overhead and performance penalty are negligible while the online fault coverage improvement is considerable.
On-Line Testing Symposium, 2008. IOLTS '08. 14th IEEE International; 08/2008
-
ABSTRACT: This paper presents an efficient method for online testing of NoC switches. This method deals with control faults of NoC switches, i.e., the routing faults which cause NoC packets to be sent to unintended output ports. A high-level fault model is proposed in this paper to model switch routing faults. The proposed method is evaluated by fault simulation based on our high-level fault model. The simulation and evaluation environment is modeled at the transaction level in VHDL.
Defect and Fault-Tolerance in VLSI Systems, 2007. DFT '07. 22nd IEEE International Symposium on; 10/2007
-
ABSTRACT: This paper presents a concurrent error detection technique for the control logic of a modern microprocessor. Our method is based on execution time prediction for each instruction executing in the processor. To evaluate the proposed method, we use a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks, and we consider the coverage and the detection latency for faults in the scheduler module of the microprocessor controller. Experimental results show that, through this method, a large percentage of control logic faults can be detected with low latency during normal operation of the processor.