ABSTRACT: We investigate the correlation between low-level faults in the control logic of a modern microprocessor and their instruction-level impact on the execution of typical workloads. Such information can prove immensely useful in accurately assessing and prioritizing faults with regard to their criticality, as well as commensurately allocating resources to enhance online testability and error/fault resilience through concurrent error detection/correction methods. To this end, we developed an extensive fault simulation infrastructure which allows injection of stuck-at faults and transient errors of arbitrary starting time and duration, as well as cost-effective simulation and classification of their repercussions into various instruction-level error types. As a test vehicle for our study, we employ a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. Extensive fault injection campaigns in control modules of this microprocessor facilitate valuable observations regarding the distribution of low-level faults into the instruction-level error types that they cause. Experimentation with both Register Transfer (RT-) and Gate-Level faults, as well as with both stuck-at faults and transient errors, confirms the validity and corroborates the utility of these observations.
IEEE Transactions on Computers 10/2011; · 1.38 Impact Factor
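The two fault models named in the abstract above (stuck-at faults, and transient errors with an arbitrary starting time and duration) can be illustrated with a minimal sketch on a single one-bit signal. This is not the authors' infrastructure; the `Fault` class, the `apply_fault` and `faulty_trace` helpers, and the list-of-bits trace are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault descriptor for a single one-bit signal."""
    kind: str          # "stuck0", "stuck1", or "transient"
    start: int = 0     # first affected cycle (transient only)
    duration: int = 0  # number of affected cycles (transient only)


def apply_fault(fault, cycle, value):
    """Return the value the signal takes at `cycle` under `fault`."""
    if fault.kind == "stuck0":
        return 0                      # permanently forced low
    if fault.kind == "stuck1":
        return 1                      # permanently forced high
    if fault.kind == "transient":
        # Invert the bit only inside the window [start, start + duration).
        if fault.start <= cycle < fault.start + fault.duration:
            return value ^ 1
        return value
    raise ValueError(f"unknown fault kind: {fault.kind}")


def faulty_trace(fault, golden):
    """Apply `fault` to a fault-free (golden) per-cycle trace of bits."""
    return [apply_fault(fault, c, v) for c, v in enumerate(golden)]


golden = [0, 1, 1, 0, 1]
stuck = faulty_trace(Fault("stuck1"), golden)
glitch = faulty_trace(Fault("transient", start=1, duration=2), golden)
```

Comparing a faulty trace against the golden one is the starting point for any classification of the fault's effect; how the authors map such differences to instruction-level error types is not described in this listing.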
ABSTRACT: We present a Concurrent Error Detection (CED) scheme for the Scheduler of a modern microprocessor. The proposed CED scheme is based on monitoring a set of invariances imposed through added hardware, violation of which signifies the occurrence of an error. The novelty of our solution stems from the workload-cognizant way in which these invariances are selected so that they leverage the application-level error masking inherent in program execution. Specifically, in order to ensure cost-effectiveness of the hardware employed to construct these invariances, we make use of information regarding the type and frequency of errors affecting the typical workload of the microprocessor. Thereby, we identify the most susceptible aspects of instruction execution and we accordingly distribute CED resources to protect them. Our approach is demonstrated on the Scheduler of an Alpha-like superscalar microprocessor with dynamic scheduling, hybrid branch prediction and out-of-order execution capabilities. Using an extensive fault-simulation infrastructure that we developed around this microprocessor, we profile the impact of Scheduler faults across a variety of different SPEC2000 benchmarks. Based on the results, we construct a CED scheme which monitors the time and location of instruction execution, the executed operation, the utilized resources, as well as the executed and retired sequence of instructions. At a hardware cost of only 32 percent of the Scheduler, the corresponding CED scheme detects over 85 percent of its faults that affect the architectural state of the microprocessor. Furthermore, over 99.5 percent of these faults are detected before they corrupt the architectural state, while the average detection latency for the remaining faults is on the order of a few clock cycles, implying that efficient recovery methods can be developed.
IEEE Transactions on Computers 10/2011; · 1.38 Impact Factor
ABSTRACT: The notion of Architectural Vulnerability Factor (AVF) has been extensively used by designers to evaluate various aspects of design robustness. While AVF is a very accurate way of assessing element resiliency, its calculation requires rigorous and extremely time-consuming experiments. In response, designers have introduced various methodologies that allow AVF calculation within reasonable time, at the cost of some loss of accuracy. In this paper, we present a method for calculating the AVF of design elements, using Statistical Fault Injection (SFI), with equal accuracy but several orders of magnitude faster than traditional SFI techniques. Our method partitions the design into various hierarchical levels and systematically performs incremental fault injections to generate the AVF numbers. The presented method has been applied on an Intel microprocessor, where experimental results corroborate its ability to achieve great speed-up while maintaining perfect accuracy in calculating AVF.
European Test Symposium (ETS), 2011 16th IEEE; 06/2011
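For readers unfamiliar with Statistical Fault Injection, the following minimal sketch shows the plain (non-hierarchical) baseline that the abstract above compares against: sample random (bit, cycle) injection points, count how many produce a visible failure, and report the fraction with a normal-approximation 95% margin. It is not the hierarchical method of the paper. The `estimate_avf` function, the `toy_inject` oracle, and the toy design in which only 4 of 16 bits are live on even cycles are all assumptions made for illustration.

```python
import math
import random


def estimate_avf(inject, num_bits, num_cycles, samples, seed=0):
    """Estimate AVF as the fraction of random single-bit injections that fail.

    `inject(bit, cycle)` is a caller-supplied oracle returning True when the
    injected fault corrupts the architecturally visible outcome.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(samples):
        bit = rng.randrange(num_bits)      # uniformly chosen state element
        cycle = rng.randrange(num_cycles)  # uniformly chosen injection time
        if inject(bit, cycle):
            failures += 1
    avf = failures / samples
    # 95% confidence margin under the normal approximation to the binomial.
    margin = 1.96 * math.sqrt(avf * (1.0 - avf) / samples)
    return avf, margin


def toy_inject(bit, cycle):
    """Hypothetical oracle: bits 0..3 matter, and only on even cycles."""
    return bit < 4 and cycle % 2 == 0


avf, margin = estimate_avf(toy_inject, num_bits=16, num_cycles=100, samples=20000)
```

In this toy design the exact value is (4/16) × (1/2) = 0.125, so the estimate should land close to it. In a real flow each call to the oracle is a full simulation, which is why reducing the number or cost of injections matters.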
ABSTRACT: We present a non-intrusive concurrent error detection (CED) method for protecting the control logic of a contemporary floating point unit (FPU). The proposed method is based on the observation that control logic errors lead to extensive datapath corruption and affect, with high probability, the exponent part of the IEEE 754 floating point representation. Thus, exponent monitoring can be utilized to detect errors in the control logic of the FPU. Predicting the exponent involves relatively simple operations, therefore our method incurs significantly lower overhead than the classical approach of duplicating the control logic of the FPU. Indeed, experimental results on the OpenSPARC T1 processor show that, as compared to control logic duplication, which incurs an area overhead of 17.9% of the FPU size, our method incurs an area overhead of only 5.8% yet still achieves detection of over 95% of transient errors in the FPU control logic. Moreover, the proposed method offers the ancillary benefit of also detecting 98.1% of datapath errors that affect the exponent, which cannot be detected via duplication of control logic. Finally, when combined with a classical residue code-based method for the fraction, our method leads to a complete CED solution for the entire FPU which provides a coverage of 94.4% of all errors at an area cost of 16.32% of the FPU size.
In this study, we propose an alternative method to protect the control logic of an FPU by monitoring the exponent part of the floating point representation. Our method is based on the conjecture that a control logic error will incorrectly guide the datapath and, by extension, severely alter the expected outcome of the performed operation. As a result, it is highly likely that a control logic error will modify the value of the exponent portion of the floating point output. Given that it is relatively straightforward to calculate the correct exponent through simple operations, monitoring exponent correctness leads to an inexpensive yet very efficient CED method for the FPU control logic. Furthermore, it provides the ancillary benefit of detecting errors in the exponent part of the representation and, when combined with a residue code-based error detection method for the fraction, it results in a very low-cost CED solution for the entire FPU.
The rest of the paper is organized as follows: Section II briefly describes existing techniques for the protection of FPUs. Section III describes the proposed exponent monitoring-based CED method, followed by Section IV where the development of the simulation-based experimental infrastructure and the actual CED implementation is presented. The merit figures of the proposed method, namely the attained coverage and incurred overhead, are assessed in Section V, followed by conclusions in Section VI.
29th IEEE VLSI Test Symposium, VTS 2011, May 1-5, 2011, Dana Point, California, USA; 01/2011
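The exponent-monitoring idea in the entry above can be illustrated in software for one operation, multiplication of double-precision values. This is a sketch under stated assumptions, not the hardware checker evaluated in the paper: it handles only finite, nonzero, normal operands whose product neither overflows nor underflows, and the helper names `predicted_exponents`, `exponent_ok`, and `flip_bit` are hypothetical.

```python
import math
import struct


def predicted_exponents(a, b):
    """Exponents the product a*b may legally have, from the operands alone.

    math.frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1, so the
    product of two mantissas lies in [0.25, 1) and the result exponent is
    either ea + eb or ea + eb - 1.
    """
    ea = math.frexp(a)[1]
    eb = math.frexp(b)[1]
    return {ea + eb - 1, ea + eb}


def exponent_ok(a, b, result):
    """Check only the exponent of `result`; the fraction is not examined."""
    return math.frexp(result)[1] in predicted_exponents(a, b)


def flip_bit(x, bit):
    """Flip one bit of the IEEE 754 binary64 encoding of x (bit 0 = LSB)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return y


a, b = 3.5, -2.25
good = a * b                              # fault-free product
bad_exponent = flip_bit(good, 54)         # bit inside the exponent field (52..62)
bad_fraction = flip_bit(good, 10)         # bit inside the fraction field (0..51)
```

Here `exponent_ok(a, b, bad_exponent)` is false while `exponent_ok(a, b, bad_fraction)` is true: the exponent check alone does not see fraction errors, which is consistent with the abstract pairing it with a residue code-based method for the fraction.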
ABSTRACT: We present a method for selective hardening of control state elements against soft errors in modern microprocessors. In order to effectively allocate resources, our method seeks to rank the control state elements based on their susceptibility, taking into account the high degree of architectural masking inherent in modern microprocessors. The novelty of our method lies in the way this ranking is computed. Unlike methods that compute the architectural vulnerability of registers based on high-level simulations on performance models, our method operates at the Register Transfer (RT-) Level and is, therefore, more accurate. In contrast to previous RT-Level methods, however, it does not rely on extensive transient fault injection campaigns and lengthy executions of workloads to completion, which may make such analysis prohibitive. Instead, it monitors the behavior of key global microprocessor signals in response to a progressive stuck-at fault injection method during partial workload execution. Experimentation with the Scheduler module of an Alpha-like microprocessor corroborates that our method generates a near-optimal ranking, yet is several orders of magnitude faster.
ABSTRACT: Towards improving performance, modern microprocessors incorporate a variety of architectural features, such as branch prediction and speculative execution, which are not critical to the correctness of their operation. While faults in the corresponding hardware may not necessarily affect functional correctness, they may, nevertheless, adversely impact performance. In this paper, we investigate quantitatively the performance impact of such faults using a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. We provide extensive fault simulation-based experimental results and we discuss how this information may guide the inclusion of additional hardware for performance loss recovery and yield enhancement.
Computer Design, 2009. ICCD 2009. IEEE International Conference on; 11/2009
ABSTRACT: We discuss the results of an extensive fault simulation study involving the control logic of a modern Alpha-like microprocessor. In this comparative study, faults are injected in both the RT- and the Gate-Level description of the design and are simulated under actual workload of the microprocessor, which is executing SPEC2000 benchmarks. The objective of this study is to analyze and contrast the impact of RT- and gate-level faults on the instruction execution flow of the microprocessor. The key observation is a pronounced consistency in the type and frequency of instruction-level errors (ILEs) arising due to RT- vs. gate-level faults. The motivation for this work stems from the need to understand the relative importance of low-level faults based on their instruction-level impact, in order to appropriately allocate error detection and/or correction resources. Hence, the consistency revealed through this study implies that such decisions can be made equally effectively based on RT-level fault simulation results, as with their far more computationally-expensive gate-level equivalents.
27th IEEE VLSI Test Symposium, VTS 2009, May 3-7, 2009, Santa Cruz, California, USA; 01/2009
ABSTRACT: We investigate the correlation between register transfer-level faults in the control logic of a modern microprocessor and their instruction-level impact on the execution flow of typical programs. Such information can prove immensely useful in accurately assessing and prioritizing faults with regard to their criticality, as well as commensurately allocating resources to enhance testability, diagnosability, manufacturability and reliability. To this end, we developed an extensive infrastructure which allows injection of stuck-at faults and transient errors of arbitrary starting point and duration, as well as cost-effective simulation and classification of their repercussions into various instruction-level error types. As a test vehicle for our study, we employ a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks. Extensive experimentation with faults injected in control logic modules of this microprocessor reveals interesting trends and results, corroborating the utility of this simulation infrastructure and motivating its further development and application to various tasks related to robust design.
Test Conference, 2008. ITC 2008. IEEE International; 11/2008
ABSTRACT: This paper presents a concurrent error detection technique for the control logic of a modern microprocessor. Our method is based on execution time prediction for each instruction executing in the processor. To evaluate the proposed method, we use a superscalar, dynamically-scheduled, out-of-order, Alpha-like microprocessor, on which we execute SPEC2000 integer benchmarks, and we consider the coverage and the detection latency for faults in the scheduler module of the microprocessor controller. Experimental results show that, through this method, a large percentage of control logic faults can be detected with low latency during normal operation of the processor.
ABSTRACT: Field-programmable gate arrays (FPGAs) are becoming increasingly popular due to low design times, easy testing and implementation procedures, and low costs. FPGA placement and routing are NP-complete problems that modern tools handle well using heuristic algorithms. As modern FPGAs increase in size and new capabilities, such as Run-Time Reconfiguration (RTR), are introduced, the complexity of these problems greatly increases. In this paper we approach both problems using a modified version of the Kohonen Self-Organizing Map. The algorithm, consisting of four phases, takes into consideration constraints that may apply to the FPGA design (such as I/O pins and resource constraints like the global clock). The modified algorithm yields a good topological map of the design to be placed, minimizing the average distance between connecting logic blocks. Index Terms: FPGA, self-organizing feature map, placement, routing, constraints
20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2008), November 3-5, 2008, Dayton, Ohio, USA, Volume 2; 01/2008