Software implemented transient fault detection in space computer

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Aerospace Science and Technology 01/2007; DOI: 10.1016/j.ast.2006.06.006

ABSTRACT Computer systems operating in the space environment are subject to various radiation phenomena, whose effects are often called “soft errors”. These systems generally rely on hardware techniques to address soft errors; however, software techniques can provide a lower-cost and more flexible alternative. This paper presents a novel, software-only transient-fault-detection technique based on a new control flow checking scheme combined with software redundancy. The distinctive advantage of our approach over other fault tolerance techniques is lower performance overhead together with higher fault coverage. It copes with transient faults affecting both data and program control flow. We apply the proposed technique to several benchmark applications and evaluate its error detection capabilities by means of several fault injection campaigns. Experimental results show that the proposed approach detects more than 98% of the injected bit-flip faults with a mean execution time increase of 153%.
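
The software-redundancy and fault-injection ideas named in the abstract can be illustrated with a minimal sketch. This is not the paper's scheme: the names `checked_sum`, `flip_bit`, and `run_campaign` are hypothetical, and the example only assumes the general pattern of keeping a duplicated copy of a value, comparing the copies, and injecting single bit flips to measure how many are caught.

```python
import random


class FaultDetected(Exception):
    """Raised when the two redundant copies of a value disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Return `value` with one bit inverted, modelling a single bit-flip fault."""
    return value ^ (1 << bit)


def checked_sum(values, inject_at=None, bit=0):
    """Sum `values` in two independent accumulators and compare them each step.

    If `inject_at` is an index, a bit flip is injected into the primary
    accumulator right after that element has been added.
    """
    primary = 0
    shadow = 0  # redundant copy, updated by the same operations
    for i, v in enumerate(values):
        primary += v
        shadow += v
        if i == inject_at:
            primary = flip_bit(primary, bit)  # hypothetical injected fault
        if primary != shadow:
            raise FaultDetected(f"mismatch after element {i}")
    return primary


def run_campaign(values, trials=1000, seed=0):
    """Inject one random bit flip per run and return the detected fraction."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        idx = rng.randrange(len(values))
        bit = rng.randrange(32)
        try:
            checked_sum(values, inject_at=idx, bit=bit)
        except FaultDetected:
            detected += 1
    return detected / trials
```

In this toy model every fault lands in only one of the two copies, so every injected flip is detected. Real campaigns also hit addresses, instructions, and both copies at once, which is why measured coverage, such as the figure reported above, stays below 100%.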

  • ABSTRACT: Electronic equipment operating in harsh environments such as space is subjected to a range of threats. The most important of these is radiation, which gives rise to permanent and transient errors in microelectronic components. The occurrence rate of transient errors is significantly higher than that of permanent errors. Transient errors, or soft errors, take two forms: control flow errors (CFEs) and data errors. Valuable research results on mitigating them have already appeared in the literature at both the hardware and software levels. However, these works rest on the basic assumption that the operating system is reliable, and they focus on other system levels. In this paper, we investigate the effects of soft errors on operating system components and compare their vulnerability with that of application level components. Results show that soft errors in operating system components affect both operating system and application level components. Therefore, making operating system level components tolerant of soft errors gives both operating system and application level components tolerance.
    The Scientific World Journal 01/2014; 2014:506105. · 1.73 Impact Factor
  • ABSTRACT: In the space radiation environment, large numbers of cosmic rays often cause transient faults in on-board computers. These transient faults lead to data flow errors or control flow errors while a program is running. For the control flow errors caused by transient faults, this paper proposes CFCCB, a control flow checking method based on classifying basic blocks. CFCCB first classifies the basic blocks based on a control flow graph into which abstract blocks have been inserted. CFCCB then designs formatted signatures for the basic blocks and inserts instructions for comparing and updating signatures into every basic block in order to check for inter-block, intra-block, and inter-procedure control flow errors. Compared to existing algorithms, CFCCB not only has high label expressiveness but can also be configured flexibly. Fault injection experiments on CFCCB and other similar algorithms show that the average failure rate of programs with CFCCB decreases to 19.9%, at the cost of increasing execution time by 34% and memory overhead by 41.5% on average. CFCCB has lower performance and memory overhead and the highest reliability among the similar algorithms.
    Parallel Architectures, Algorithms and Programming, International Symposium on. 12/2010;
  • ABSTRACT: Fault-tolerant control (FTC) for space-borne equipment is very important in engineering design. This paper presents a two-layer intelligent FTC approach to handle the speed stability problem in a swing-arm system suffering from various faults in space. This approach provides reliable FTC at the performance level and improves control flow error detection capability at the code level. Faults that degrade system performance are detected by a performance-based fault detection mechanism. The detected faults are categorized as anticipated or unanticipated faults by the fault bank. A neural network is used as an on-line estimator to approximate the unanticipated faults. Compensation control and intelligent integral sliding mode control are employed to accommodate the two types of faults at the performance level, respectively. To guarantee the reliability of the FTC at the code level, the key parts of the program code are modified by control flow checking by software signatures (CFCSS) to detect control flow errors caused by single event upsets. Meanwhile, some of the control flow errors that go undetected at the code level can be detected by the FTC at the performance level. The FTC for anticipated and unanticipated faults is verified in Synopsys Saber, and the detection of control flow errors is tested on the DSP controller. Simulation results demonstrate the efficiency of the novel FTC approach.
    Acta Astronautica 04/2012; 73:67–75. · 0.70 Impact Factor
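
The CFCSS technique mentioned in the last abstract can be sketched in a few lines. This is a simplified illustration, not the code used in any of the listed papers: the block names, signature values, and the `run` helper are hypothetical, and the sketch assumes every basic block has at most one predecessor, so it omits the run-time adjusting signature that CFCSS needs for blocks with several predecessors.

```python
class ControlFlowError(Exception):
    """Raised when the run-time signature does not match the block entered."""


# Hypothetical control-flow graph: each block has at most one predecessor.
SIGNATURES = {"init": 0b0001, "compute": 0b0010, "output": 0b0100, "abort": 0b1000}
PREDECESSOR = {"init": None, "compute": "init", "output": "compute", "abort": "init"}


def signature_differences(signatures, predecessor):
    """Precompute d_i = s_pred XOR s_i for every block (done at compile time)."""
    diffs = {}
    for block, pred in predecessor.items():
        if pred is None:
            # The run-time signature starts at 0, so 0 XOR s_i = s_i.
            diffs[block] = signatures[block]
        else:
            diffs[block] = signatures[pred] ^ signatures[block]
    return diffs


def run(path, signatures=SIGNATURES, predecessor=PREDECESSOR):
    """Walk `path`, updating the run-time signature G on entry to each block."""
    diffs = signature_differences(signatures, predecessor)
    g = 0
    for block in path:
        g ^= diffs[block]            # G = G XOR d_i
        if g != signatures[block]:   # legal entry leaves G equal to s_i
            raise ControlFlowError(f"illegal jump into {block!r}")
    return g
```

Calling `run(["init", "compute", "output"])` follows legal edges and passes every check, while `run(["init", "output"])` skips a block and raises `ControlFlowError`, because the XOR update leaves a signature that does not belong to the block entered.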