Radiation tolerance in FPGAs is an important field of research particularly for reliable computation in electronics used in aerospace and satellite missions. The motivation behind this research is the degradation of reliability in FPGA hardware due to single-event effects caused by radiation particles. Redundancy is a commonly used technique to enhance the fault-tolerance capability of radiation-sensitive applications. However, redundancy comes with an overhead in terms of excessive area consumption, latency, and power dissipation. Moreover, the redundant circuit implementations vary in structure and resource usage with the redundancy insertion algorithms as well as number of used redundant stages. The radiation environment varies during the operation time span of the mission depending on the orbit and space weather conditions. Therefore, the overheads due to redundancy should also be optimized at run-time with respect to the current radiation level. In this paper, we propose a technique called Dynamic Reliability Management (DRM) that utilizes the radiation data, interprets it, selects a suitable redundancy level, and performs the run-time reconfiguration, thus varying the reliability levels of the target computation modules. DRM is composed of two parts. The design-time tool flow of DRM generates a library of various redundant implementations of the circuit with different magnitudes of performance factors. The run-time tool flow, while utilizing the radiation/error-rate data, selects a required redundancy level and reconfigures the computation module with the corresponding redundant implementation. Both parts of DRM have been verified by experimentation on various benchmarks. The most significant finding we have from this experimentation is that the performance can be scaled multiple times by using partial reconfiguration feature of DRM, e.g., 7.7 and 3.7 times better performance results obtained for our data sorter and matrix multiplier case studies compared with static reliability management techniques. Therefore, DRM allows for maintaining a suitable trade-off between computation reliability and performance overhead during run-time of an application.
1. Introduction
The advancement of semiconductor technology to nanometer dimensions has made the design of digital circuits challenging. Future shrinking of device parameters is inevitable due to the increasing need for high computing power and more computational blocks on a single system-on-chip. Besides the conventional performance, area, and power constraints, today’s circuits have to conform to reliability requirements. The reliability of the digital circuits however degrades due to probabilistic nature of errors appearing in nanodevices. These errors, being permanent or transient in nature, are mainly caused by process variations, thermal fluctuations, quantum effects, power supply noise, and capacitive/inductive coupling, as few examples [1–4]. These errors are treated for fault tolerance at device, logic, or network layers depending on the feasibility of mitigation at each of these layers. However, when the nanodevice-based circuits are used in high-radiation environments (consisting alpha particles and cosmic rays), an additional source of error emerges which is known as radiation-induced errors. Although the reliability studies for electronics include well-defined mitigation strategies for different sources of errors, they propose redundancy as the most efficient way of countering radiation-induced errors.
The future of space computing is largely dominated by Field Programmable Gate Arrays (FPGAs) due to their capability of run-time modification of functional implementation on hardware. The reconfiguration feature of FPGAs allows them to perform various tasks during different phases of a mission [5]. Moreover, increase in on-board processing requirements on space missions for various image-processing applications goes well with the highly parallel architecture of FPGAs [6]. Most importantly, the space, weight, and power for a satellite payload design can be minimized with the usage of FPGAs due to their ability to perform multiple tasks without having a dedicated hardware for each of these tasks.
The advantages of using FPGAs in space computing brought the attention of researchers and FPGA companies to ensure the fault-tolerant operation of FPGAs in high-radiation space environments. When FPGAs are exposed to high solar or cosmic radiation (involving high-energy electrons, alpha particles, and heavy ions), errors in the form of logic reversals appear in the digital circuit elements which could be as disastrous as causing a system-level failure or as moderate as internally masked errors. For long-term space missions, the accumulation of ionizing dose, measured as Total Ionizing Dose (TID), is an important design parameter which indicates how long the FPGA can withstand the radiation before its transistors begin to degrade [7]. For rather short-term space missions, Single-Event Effects (SEEs) cause temporary errors to appear in the circuit whose mitigation strategies vary from internal masking (using redundancy) to system-reset requirement. Generally, SEEs, which are measureable changes in the state of a microelectronic device [8], are classified into four domains. Single-Event Transient (SET) is a voltage spike causing a glitch in a combinational element; Single-Event Upset (SEU) is a soft error caused by the radiation particle in memory contents, particularly SRAM cells; Single-Event Latchup (SEL) is the high-current state in a device caused by the passage of a single energetic particle through sensitive regions of the device structure or Single-Event Functional Interrupt (SEFI) that cause the component to reset, lockup, or otherwise malfunction in a detectable way. SETs are typically short-term and ineffective unless they are latched into a sequential element, thus behaving as an SEU. SELs can be corrected by power cycling, while SEFIs require resetting the component [7–9]. The remaining category, i.e., SEU, is the major concern for the reliable FPGA operation due to its higher appearance rate compared with other SEEs as well as its accumulation effects. SRAM-based FPGAs are highly susceptible to SEUs appearing in their configuration and application memory elements.
The alternative to SRAM-based FPGAs are radiation-hardened FPGAs which include the foremost Actel (currently known as Microsemi) RTAX antifuse FPGAs [10]. These devices are one-time programmable, and the development of permanent interconnections after configuration makes them immune to SEUs. However, being nonreprogrammable, they lose their charm for utilization in multiple design modification scenarios. The second popular category consists of flash-based FPGAs [11], which offer full reconfiguration though lack partial reconfiguration [6]. Moreover, these devices have typically lower TID rating than SRAM or antifuse FPGAs [12]. However, an additional category of radiation-tolerant FPGAs consists of space-grade versions which utilize the performance of SRAM FPGAs while having built-in radiation-tolerance features, e.g., Xilinx Virtex-5QV [13].
In contrast to inherent radiation-tolerant capabilities of FPGAs, fault-tolerant computation approaches in hardware and software are utilized as well. Redundancy [14, 15] and scrubbing [16, 17] are two most popular techniques for tolerating SEUs and avoiding their accumulation. A classical and widely used form of redundancy is triple modular redundancy (TMR). In principle, TMR instantiates three identical copies of a circuit and places a voter module at the end to take a majority decision for each output. Hence, TMR does not depend on error detection but mitigates the error by passage through the voter. In contrast, scrubbing involves continuously configuring configuration memory to prevent accumulation of errors. The combination of redundancy and scrubbing is considered a widespread optimal fault-tolerant solution in hardware. Additional approaches in the literature include duplication with comparison (DWC) [18], error checking and correcting codes (ECAC) [19], and algorithm-based fault tolerance (ABFT) [20]. However, in this paper, we focus solely on redundancy and its different variations in hardware. Hardware redundancy techniques for FPGAs are more involved than basic TMR with respect to partitioning a circuit into submodules, deciding on how many voters to insert, and where to place the voters in the FPGA design. Tools for automating redundancy insertion in FPGA designs are available, including the TMR tool of Xilinx [21], Precision Hi-Rel software [22], and the BYU-LANL TMR tool [23]. Fault-tolerance mechanisms, particularly modular redundancy, come with an overhead in terms of excessive area consumption as well as latency and power dissipation. Therefore, while providing fault tolerance, the design of a mission critical system also has to limit these overheads to given constraints.
The radiation environment during the operation time of the satellite is variant, especially the radiation particle strike rate increases enormously above Earth’s magnetosphere [6, 24]. The fault-tolerance techniques, particularly redundancy, have a constant overhead in performance factors of area, latency, and power dissipation. However, in general, higher stages of redundancy provide more reliability at the cost of increasing overheads in performance factors. Therefore, the realization of reliability-performance trade-off is mandatory before designing a redundant system. In order to avoid a constant degradation in system performance due to a fixed overhead in performance factors, the system should be self-adaptive in a way to optimize the trade-off between reliability and performance factors, at run-time, based on the radiation strength of the environment. We implement this concept named as “Dynamic Reliability Management (DRM).” DRM, based on the partial reconfiguration of FPGAs, is beneficial as it allows for the optimization of the performance overheads and, thus, can save cost and power or free hardware resources for other tasks when feasible.
The contribution of this paper lies around providing a complete approach for using SRAM-based FPGAs in space missions whereby using real-time radiation scenarios for our analysis and experimentation. Most importantly, we focus on visualizing the trade-off of reliability with the three main performance parameters, i.e., latency, area, and power consumption. This is in contrast to the research studies which work on the same research line but fail to show the complete picture of reliability versus three performance factors and how to prefer one redundant structure over another based on the performance constraints. This paper provides such solution where Pareto optimization can be used to filter out the suitable redundant structure based on one or more performance factors. Our developed tool flows are unique and more extensive than previous research works which are comprehensively verified in this paper as well. Last but not the least, we also provide the possible extensions to our experimentation platform, which can be used as future research directions in this vast and emerging research area. We have tried our best to be generic as well as having a bigger scope for our research so that the maximum research community in space computing can benefit from it.
This paper is organized as follows. In the next section, we provide the foundation for understanding the adaptive reliability management concept by describing implementation strategies of redundancy in FPGAs, varying radiation environments, corresponding decision mechanisms, and commonly used reliability theories. Afterwards, we discuss and analyze major research works in self-adaptive reliability management and contrast them to our approach of Dynamic Reliability Management. Section 3 describes the two portions of our DRM approach as design and run-time parts, while verification of each of these parts has been provided with detailed experimentation in Section 4. Section 5 concludes the paper.
2. Background and Related Work
In this section, we describe the basic implementation strategy of TMR in FPGAs followed by an insight into different forms of TMR and its cascaded version. To understand the varying radiation environment of space, we give real radiation scenarios as examples. The reliability computation of FPGA-based circuits can be done by conventional and probabilistic theories, which will be briefly described as well. Finally, we summarize and analyze the major research works similar to our approach of Dynamic Reliability Management.
2.1. Triple Modular Redundancy in FPGAs
The concept of triple modular redundancy (TMR) is straightforward as it triplicates a logic design and takes the final output of the design from a voter placed at the outputs of the redundant modules [14]. The function of the voter is to sample three logic outputs and forward the majority result. The limitation of this architecture is the single point of failure, i.e., an error occurring in the voter renders the TMR technique useless. To avoid the single point of failure, we can create a more reliable architecture involving triplicated voters in addition to triplicated logic modules. This architecture runs the three branches in parallel unless they are to be interfaced with the outside world of the FPGA, where they can be converged using reducing voters or could be interfaced in the form of three outputs.
The TMR technique can be used in an FPGA by simply triplicating the inputs, outputs, and logic modules, inserting buffers, and connecting the outputs of logic modules to the triplicated voter. This straightforward implementation is not suitable due to some practical considerations. Firstly, TMR is able to counter one error among the three redundant branches, and a larger length of each branch increases its probability of being erroneous more than once. To deal with this issue, there is a need to break the logic of the branch at regular intervals and place the triplicated voters in intermediate stages of the circuit as shown in Figure 1. Thus, an error occurring in one partition of the logic will not be forwarded to the next partition due to the error-mitigation effect of the triplicated voter. However, the minimum size of the logic partition, or granularity level, could be limited to a single component on an FPGA, e.g., a lookup table and multiplexer. In addition, there are certain locations on an FPGA called illegal-cut locations which should not be triplicated due to the FPGA architecture, e.g., dedicated route connections in a slice [25]. Moreover, voters should not be placed on high-speed carry chains in order to not deteriorate the timing performance of the design. Most importantly, voters should always be added in the feedback paths to avoid data corruption at the outputs of sequential elements being forwarded into the feedback paths [14, 25]. These voters are commonly denoted as synchronization voters.