Software implemented transient fault detection in space computer

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Aerospace Science and Technology (Impact Factor: 0.94). 03/2007; 11(2-3):245-252. DOI: 10.1016/j.ast.2006.06.006


Computer systems operating in the space environment are subject to various radiation phenomena, whose effects are often called soft errors. Generally, these systems employ hardware techniques to address soft errors; however, software techniques can provide a lower-cost and more flexible alternative. This paper presents a novel, software-only transient-fault-detection technique based on a new control-flow checking scheme combined with software redundancy. The distinctive advantage of our approach over other fault tolerance techniques is lower performance overhead with higher fault coverage. It can cope with transient faults affecting both data and program control flow. We apply the proposed technique to several benchmark applications and evaluate its error detection capabilities through several fault injection campaigns. Experimental results show that the proposed approach can detect more than 98% of the injected bit-flip faults with a mean execution time increase of 153%.
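
As a rough illustration of the two ingredients named in the abstract (software redundancy and control-flow checking), the sketch below duplicates a running sum and tracks a per-block signature. It is not the scheme from the paper: the signature values, the `LEGAL_PREDECESSORS` table, and the names `enter_block`, `checked_sum`, and `FaultDetected` are all hypothetical, and the bit flip is a simulated fault rather than a real radiation effect.

```python
class FaultDetected(Exception):
    """Raised when a redundancy or control-flow check fails."""


# Hypothetical static signatures, one per basic block (values are arbitrary).
SIG_INIT, SIG_LOOP, SIG_EXIT = 0x11, 0x22, 0x33

# For each block, the set of blocks that may legally precede it.
LEGAL_PREDECESSORS = {
    SIG_LOOP: {SIG_INIT, SIG_LOOP},
    SIG_EXIT: {SIG_INIT, SIG_LOOP},
}


def enter_block(runtime_sig, block_sig):
    """Check that the previous block is a legal predecessor, then update."""
    if runtime_sig not in LEGAL_PREDECESSORS[block_sig]:
        raise FaultDetected("illegal control-flow transfer")
    return block_sig


def checked_sum(values, flip_bit_at=None):
    """Sum `values` while keeping a duplicate copy of the accumulator.

    If `flip_bit_at` is an index, bit 0 of the primary copy is flipped
    after that iteration to simulate a single-bit transient fault.
    """
    sig = SIG_INIT
    total, total_dup = 0, 0
    for i, v in enumerate(values):
        sig = enter_block(sig, SIG_LOOP)
        total += v
        total_dup += v          # redundant computation on the shadow copy
        if i == flip_bit_at:
            total ^= 1          # simulated bit-flip in the primary copy
        if total != total_dup:  # consistency check between the two copies
            raise FaultDetected("data copies disagree")
    sig = enter_block(sig, SIG_EXIT)
    return total
```

Calling `checked_sum([1, 2, 3])` returns the ordinary sum, while `checked_sum([1, 2, 3], flip_bit_at=1)` raises `FaultDetected` at the comparison. The duplicated work on every iteration also hints at why purely software approaches pay a noticeable execution-time cost.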

  • Source
    • "We will address a more complex case (hardware-software interaction). A literature review revealed the shortcomings of the current conventional quantitative evaluation methods (RBD [6]; reliability networks [7]; FTA [8]…), which now appear insufficient for properly taking into account certain failure modes (common-cause failure [9]; latent failure modes [10]; self-test problems, spurious shutdowns…). We have built an analytical evaluation method, in contrast to numerical approaches based on simulation (Petri nets, Monte Carlo…). "
    ABSTRACT: Currently, efforts are focused on integrating tools for the design of complex systems, while neglecting safety aspects. Because dependability is a system-level concern, each discipline must integrate it into its own practice, but the progress made by each can only be validated globally. The work carried out by the A3SI-CRAN team aims to define a design methodology for a complex programmable system dedicated to a mechatronic application [1], integrating dependability aspects from the earliest phases of the development cycle [2]. Such a methodology should make it possible to cope with a number of constraints specific to the field of smart sensors (specification requirements, compliance with the legal standards in force).
  • Source
    • "Most of the studies, however, have focused on fault coverage and error latency of hardware fault-tolerant mechanisms in digital systems as dependability measures [4]. In recent years, it has been reported that environmental transient faults can be masked by software alone, without hardware error masking mechanisms [5]. Thus, a substantial number of faults do not affect the program results, for several reasons: faults whose errors are neutralized by subsequent instructions, faults affecting the execution of instructions that do not contribute to the benchmark results, and faults whose errors are tolerated by the semantics of the running benchmark. "
    ABSTRACT: Fault tolerance is an essential requirement for critical programming systems, due to the potentially catastrophic consequences of faults. Several approaches to evaluating system reliability parameters exist today; however, they are based on the assumption that hardware and software failures happen independently. The challenge in this field is to take hardware-software interactions into account in the evaluation of the model. As a continuation of the CETIM project (Belhadaoui et al. 2007), whose principal objective is to define an integrated design of dependable mechatronic systems, this work evaluates important reliability parameters of an embedded application in a stack processor architecture using two dynamic models. The first one (stack processor emulator (Jallouli et al. 2007)) allows the study of dynamic performance and the evaluation of a fault-tolerant technique. The second one (information flow approach (Hamidi et al. 2005)) evaluates the failure probability for each assembler instruction and for some program loops. The main objective is to estimate the failure probability of the whole application. Hierarchical modelling with the information flow approach makes it possible to evaluate the efficiency of protection program loops. These loops implement the fault tolerance policy by recovering from imminent failures and allow the application to run successfully thanks to a permanent software recovery mechanism: when an error is detected but not corrected, the system returns to the last fault-free state. This work is useful because it allows the architecture to be adjusted and shows the advantages of accounting for hardware-software interactions during the co-design phase, before the hardware implementation. It identifies the critical points in terms of reliability through the scenarios of critical failure paths in the processor architecture.
  • ABSTRACT: Soft errors are a growing concern for computer reliability. To mitigate the effects of soft errors, a variety of software-based fault tolerance methodologies have been proposed because of their low cost. Data duplication techniques have the advantage of flexible and general implementation with strong capacity for error detection. However, the trade-off between reliability, performance and memory overhead should be carefully considered before employing data duplication techniques. In this paper, we first introduce an analytical model, named PRASE (Program Reliability Analysis with Soft Errors), which is able to assess the impact of soft errors on the reliability of a program. Furthermore, the analytical result of PRASE points to a data reliability weight factor, which measures the criticality of data for the overall reliability of the target program. Based on PRASE, we propose a novel data duplication approach, called ODD, which can provide the optimum error coverage under system performance constraints. To illustrate the effectiveness of our method, we perform several fault injection experiments and performance evaluations on a set of simple benchmark programs using the SimpleScalar tool set.
    15th Asia-Pacific Software Engineering Conference (APSEC 2008), 3-5 December 2008, Beijing, China; 01/2008