Conference Paper

Quantitative Analysis of Long-Latency Failures in System Software.

DOI: 10.1109/PRDC.2009.13
Conference: 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2009, Shanghai, China, 16-18 November 2009
Source: DBLP


This paper presents a study of long-latency failures using accelerated fault injection. Data collected from the experiments are used to analyze the significance, causes, and characteristics of long-latency failures caused by soft errors in the processor and memory. The results indicate that a non-negligible portion of soft errors in code and data memory lead to long-latency failures. These failures are caused by errors with long fault activation times and by errors that cause failures only under certain runtime conditions. In contrast, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds, owing to the strong temporal locality of register values. The study also shows that the insight obtained can guide the design and placement (in the application code and/or the system) of application-specific error detectors.

    • "Finally silent means that the fault is repaired in any way, normally through redundancies implemented across the circuit. This problem has been also addressed in different contexts like large software systems [7] Latent faults are extremely difficult to understand because they remain inside the circuit, until the application of a proper sequence of inputs that allows their propagation to primary outputs. A good case study is the Embedded System, normally based on a microprocessor execution, where processes are continuously running. "
    ABSTRACT: Fault detection and diagnosis of safety and mission-critical embedded systems is a constant concern of the aerospace community. There exists a number of fault detection and diagnosis techniques, which are usually based on generating complicated models of the system, or increasing its redundancy. The present paper shows a new approach, based on hash libraries, which allows for fault detection and diagnosis at circuit level, without previous modeling of the system or adding redundancy. Additionally, the proposed technique can be used to detect faults in real-time in a final application, where no comparison with a golden chip is available.
    Article · Sep 2011
  • ABSTRACT: This paper presents a measurement-based analysis of the fault and error sensitivities of dynamic memory. We extend a software-implemented fault injector to support data-type-aware fault injection into dynamic memory. The results indicate that dynamic memory exhibits about 18 times higher fault sensitivity than static memory, mainly because of the higher activation rate. Furthermore, we show that errors in a large portion of static and dynamic memory space are recoverable by simple software techniques (e.g., reloading data from a disk). The recoverable data include pages filled with identical values (e.g., '0') and pages loaded from files unmodified during the computation. Consequently, the selection of targets for protection should be based on knowledge of recoverability rather than on error sensitivity alone.
    Conference Paper · Jan 2010
  • ABSTRACT: This paper presents an efficient fault tolerance system for heterogeneous many-core processors. The efficiencies and coverage of the presented fault tolerance are optimized by customizing the techniques for different types of components in the highest layers of system abstractions and codesigning the techniques in a way that separates algorithms and mechanisms.
    Conference Paper · Jan 2011