Conference Paper

Quantitative Analysis of Long-Latency Failures in System Software.

DOI: 10.1109/PRDC.2009.13 Conference: 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2009, Shanghai, China, 16-18 November 2009
Source: DBLP


This paper presents a study on long latency failures using accelerated fault injection. The data collected from the experiments are used to analyze the significance, causes, and characteristics of long latency failures caused by soft errors in the processor and the memory. The results indicate that a non-negligible portion of soft errors in the code and data memory lead to long latency failures. The long latency failures are caused by errors with long fault activation times and errors causing failures only under certain runtime conditions. On the other hand, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds. This is due to a strong temporal locality of the register values. The study shows also that the obtained insight can be used to guide design and placement (in the application code and/or system) of application-specific error detectors.

1 Follower
8 Reads
  • [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents a measurement-based analysis of the fault and error sensitivities of dynamic memory. We extend a software-implemented fault injector to support data-type-aware fault injection into dynamic memory. The results indicate that dynamic memory exhibits about 18 times higher fault sensitivity than static memory, mainly because of the higher activation rate. Furthermore, we show that errors in a large portion of static and dynamic memory space are recoverable by simple software techniques (e.g., reloading data from a disk). The recoverable data include pages filled with identical values (e.g., `0') and pages loaded from files unmodified during the computation. Consequently, the selection of targets for protection should be based on knowledge of recoverability rather than on error sensitivity alone.
    Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2010, Chicago, IL, USA, June 28 - July 1 2010; 01/2010
  • [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents an efficient fault tolerance sys- tem for heterogeneous many-core processors. The efficiencies and coverage of the presented fault tolerance are optimized by customizing the techniques for different types of components in the highest layers of system abstractions and codesigning the techniques in a way that separates algorithms and mechanisms.
    25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, Alaska, USA, 16-20 May 2011 - Workshop Proceedings; 01/2011
  • [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents a fault-tolerant, programmable voter architecture for software-implemented N-tuple modular redundant (NMR) computer systems. Software NMR is a cost-efficient solution for high-performance, mission-critical computer systems because this can be built on top of commercial off-the-shelf (COTS) devices. Due to the large volume and randomness of voting data, software NMR system requires a programmable voter. Our experiment shows that voting software that executes on a processor has the time-of-check-to-time-of-use (TOCTTOU) vulnerabilities and is unable to tolerate long duration faults. In order to address these two problems, we present a special-purpose voter processor and its embedded software architecture. The processor has a set of new instructions and hardware modules that are used by the software in order to accelerate the voting software execution and address the identified two reliability problems. We have implemented the presented system on an FPGA platform. Our evaluation result shows that using the presented system reduces the execution time of error detection codes (commonly used in voting software) by 14% and their code size by 56%. Our fault injection experiments validate that the presented system removes the TOCTTOU vulnerabilities and recovers under both transient and long duration faults. This is achieved by using 0.7% extra hardware in a baseline processor.
    IEEE Aerospace Conference Proceedings 01/2012; DOI:10.1109/AERO.2012.6187253
Show more