Quantitative Analysis of Long-Latency Failures in System Software.
ABSTRACT This paper presents a study of long-latency failures using accelerated fault injection. The data collected from the experiments are used to analyze the significance, causes, and characteristics of long-latency failures caused by soft errors in the processor and the memory. The results indicate that a non-negligible portion of soft errors in the code and data memory lead to long-latency failures. These failures are caused by errors with long fault activation times and by errors that cause failures only under certain runtime conditions. On the other hand, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds. This is due to the strong temporal locality of the register values. The study also shows that the obtained insight can be used to guide the design and placement (in the application code and/or system) of application-specific error detectors.
ABSTRACT: In this paper we report the synthesis, structure and Li ion conductivity of a new tetragonal garnet phase, Nd3Zr2Li7O12. In line with other tetragonal garnet systems, the Li is shown to be ordered in the tetrahedral and distorted octahedral sites, and the Li ion conductivity is consequently low. In an effort to improve the ionic conductivity of the parent material, we have also investigated Al doping to reduce the Li content, giving Nd3Zr2Li5.5Al0.5O12, and hence introduce disorder on the Li sublattice. This was successful, leading to a change in the unit cell symmetry from tetragonal to cubic and an enhanced Li ion conductivity. Neutron diffraction studies showed that the Al was introduced onto the ideal tetrahedral garnet site, a site preference also supported by the results of computer modelling studies. The effect of moisture on the conductivity of these systems was also examined, showing significant changes at low temperatures consistent with a protonic contribution in humid atmospheres. In line with these observations, computational modelling suggests a favourable exchange energy for the Li+/H+ exchange process. 10/2013; 1(44). DOI:10.1039/C3TA13252H
Conference Paper: Pluggable Watchdog: Transparent Failure Detection for MPI Programs
ABSTRACT: This paper presents a framework and techniques that can detect various types of runtime errors and failures in MPI programs. The framework offloads its detection techniques to an external device (e.g., an extension card). By learning the normal behavioral and semantic execution patterns of the monitored parallel threads, the external error detectors can accurately and quickly detect errors and failures. This architecture allows us to use powerful detectors without directly using the computing power of the monitored system. Separating the hardware of the monitored and monitoring systems offers an extra advantage in terms of system reliability. We have prototyped our system on a parallel computer system by using an FPGA-based PCI extension card as a monitoring device. We have conducted a fault injection experiment to evaluate the presented techniques using eight MPI-based parallel programs. The techniques cover ~98.5% of faults, on average. The average performance overhead is 1.8% for techniques that detect crash and hang failures and 6.6% for techniques that detect SDC failures. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on; 01/2013
ABSTRACT: This paper presents a fault-tolerant, programmable voter architecture for software-implemented N-tuple modular redundant (NMR) computer systems. Software NMR is a cost-efficient solution for high-performance, mission-critical computer systems because it can be built on top of commercial off-the-shelf (COTS) devices. Due to the large volume and randomness of voting data, a software NMR system requires a programmable voter. Our experiment shows that voting software that executes on a processor has time-of-check-to-time-of-use (TOCTTOU) vulnerabilities and is unable to tolerate long duration faults. In order to address these two problems, we present a special-purpose voter processor and its embedded software architecture. The processor has a set of new instructions and hardware modules that are used by the software in order to accelerate the voting software execution and address the two identified reliability problems. We have implemented the presented system on an FPGA platform. Our evaluation shows that using the presented system reduces the execution time of error detection codes (commonly used in voting software) by 14% and their code size by 56%. Our fault injection experiments validate that the presented system removes the TOCTTOU vulnerabilities and recovers under both transient and long duration faults. This is achieved by using 0.7% extra hardware in a baseline processor. IEEE Aerospace Conference Proceedings 01/2012; DOI:10.1109/AERO.2012.6187253
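As a rough illustration of what a software NMR voter does, the following minimal Python sketch performs strict-majority voting over the outputs of N replicas. It is not the voter processor or the embedded software described in the abstract; the function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions made for this example only.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(replica_outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value agreed on by a strict majority of replicas.

    Hypothetical helper for illustration: returns None when the input is
    empty or when no value is produced by more than half of the replicas.
    """
    if not replica_outputs:
        return None
    # Copy the inputs once so the check and the use below see the same
    # values (a software-level nod to the TOCTTOU concern, not a fix for it).
    snapshot = tuple(replica_outputs)
    value, count = Counter(snapshot).most_common(1)[0]
    # Require a strict majority; a mere plurality counts as a voting failure.
    return value if count > len(snapshot) // 2 else None
```

With three replicas, one faulty output is masked: `majority_vote([7, 7, 9])` yields `7`, while three mutually disagreeing outputs yield `None`, which a caller could treat as a signal to trigger recovery.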