Quantitative Analysis of Long-Latency Failures in System Software.
DOI: 10.1109/PRDC.2009.13 Conference: 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2009, Shanghai, China, 16-18 November 2009
This paper presents a study on long latency failures using accelerated fault injection. The data collected from the experiments are used to analyze the significance, causes, and characteristics of long latency failures caused by soft errors in the processor and the memory. The results indicate that a non-negligible portion of soft errors in the code and data memory lead to long latency failures. The long latency failures are caused by errors with long fault activation times and errors causing failures only under certain runtime conditions. On the other hand, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds. This is due to a strong temporal locality of the register values. The study also shows that the obtained insight can be used to guide design and placement (in the application code and/or system) of application-specific error detectors.
- "Finally silent means that the fault is repaired in any way, normally through redundancies implemented across the circuit. This problem has been also addressed in different contexts like large software systems. Latent faults are extremely difficult to understand because they remain inside the circuit, until the application of a proper sequence of inputs that allows their propagation to primary outputs. A good case study is the Embedded System, normally based on a microprocessor execution, where processes are continuously running."
ABSTRACT: Fault detection and diagnosis of safety and mission-critical embedded systems is a constant concern of the aerospace community. There exists a number of fault detection and diagnosis techniques, which are usually based on generating complicated models of the system, or increasing its redundancy. The present paper shows a new approach, based on hash libraries, which allows for fault detection and diagnosis at circuit level, without previous modeling of the system or adding redundancy. Additionally, the proposed technique can be used to detect faults in real-time in a final application, where no comparison with a golden chip is available.
ABSTRACT: This paper presents a measurement-based analysis of the fault and error sensitivities of dynamic memory. We extend a software-implemented fault injector to support data-type-aware fault injection into dynamic memory. The results indicate that dynamic memory exhibits about 18 times higher fault sensitivity than static memory, mainly because of the higher activation rate. Furthermore, we show that errors in a large portion of static and dynamic memory space are recoverable by simple software techniques (e.g., reloading data from a disk). The recoverable data include pages filled with identical values (e.g., `0') and pages loaded from files unmodified during the computation. Consequently, the selection of targets for protection should be based on knowledge of recoverability rather than on error sensitivity alone.
ABSTRACT: This paper presents an efficient fault tolerance system for heterogeneous many-core processors. The efficiencies and coverage of the presented fault tolerance are optimized by customizing the techniques for different types of components in the highest layers of system abstractions and codesigning the techniques in a way that separates algorithms and mechanisms.