Quantitative Analysis of Long-Latency Failures in System Software.
ABSTRACT: This paper presents a study of long-latency failures using accelerated fault injection. The data collected from the experiments are used to analyze the significance, causes, and characteristics of long-latency failures caused by soft errors in the processor and the memory. The results indicate that a non-negligible portion of soft errors in the code and data memory lead to long-latency failures. These failures are caused by errors with long fault activation times and by errors that cause failures only under certain runtime conditions. In contrast, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds. This is due to the strong temporal locality of the register values. The study also shows that the obtained insight can be used to guide the design and placement (in the application code and/or the system) of application-specific error detectors.
- ABSTRACT: As the size of the SRAM cache and DRAM memory grows in servers and workstations, cosmic-ray errors are becoming a major concern for systems designers and end users. Several techniques exist to detect and mitigate the occurrence of cosmic-ray upset, such as error detection, error correction, cache scrubbing, and array interleaving. This paper covers the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability. In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets.
IEEE Transactions on Device and Materials Reliability; 10/2005 · 1.52 Impact Factor
Conference Proceeding: Error sensitivity of the Linux kernel executing on PowerPC G4 and Pentium 4 processors
ABSTRACT: The goals of this study are: (i) to compare Linux kernel (2.4.22) behavior under a broad range of errors on two target processors - the Intel Pentium 4 (P4) running RedHat Linux 9.0 and the Motorola PowerPC (G4) running YellowDog Linux 3.0 - and (ii) to understand how architectural characteristics of the target processors impact the error sensitivity of the operating system. Extensive error injection experiments involving over 115,000 faults/errors are conducted targeting the kernel code, data, stack, and CPU system registers. Analysis of the obtained data indicates significant differences between the two platforms in how errors manifest and how they are detected in the hardware and the operating system. In addition to quantifying the observed differences and similarities, the paper provides several examples to support the insights gained from this research.
Dependable Systems and Networks, 2004 International Conference on; 01/2004
- ABSTRACT: Many fault injection tools are available for dependability assessment. Although these tools are good at injecting a single fault model into a single system, they suffer from two main limitations for use in distributed systems: (1) no single tool is sufficient for injecting all necessary fault models; (2) it is difficult to port these tools to new systems. NFTAPE, a tool for composing automated fault injection experiments from available lightweight fault injectors, triggers, monitors, and other components, helps to solve these problems. We have conducted experiments using NFTAPE with several types of lightweight fault injectors, including driver-based, debugger-based, target-specific, simulation-based, hardware-based, and performance-fault injections. Two example experiments are described in this paper. The first uses a hardware fault injector with a Myrinet LAN; the other uses a Software Implemented Fault Injection (SWIFI) fault injector to target a space imaging application. Keywords...
02/2000