Thesis

Fault-Tolerant Satellite Computing with Modern Semiconductors

Abstract and Figures

Miniaturized satellites enable a variety of space missions that were in the past infeasible, impractical, or uneconomical with traditionally designed, heavier spacecraft. CubeSats in particular can be manufactured and launched rapidly at low cost from commercial components, even in academic environments. However, due to their low reliability and brief lifetime, they are usually not considered suitable for life- and safety-critical services, complex multi-phased solar-system-exploration missions, or missions with a longer duration. Commercial electronics are key to satellite miniaturization, but are also responsible for their low reliability: until 2019, there existed no reliable or fault-tolerant computer architectures suitable for very small satellites. To overcome this deficit, a novel on-board-computer architecture is described in this thesis. Robustness is assured not by resorting to radiation hardening, but through software measures implemented within a robust-by-design multiprocessor system-on-chip. This fault-tolerant architecture is component-wise simple and can dynamically adapt to changing performance requirements throughout a mission. It can support graceful aging by exploiting FPGA reconfiguration and mixed criticality. Experimentally, we achieve 1.94 W power consumption at 300 MHz with a Xilinx Kintex Ultrascale+ proof-of-concept, which is well within the power budget range of current 2U CubeSats. To our knowledge, this is the first COTS-based, reproducible on-board-computer architecture that can offer strong fault coverage even for small CubeSats.
... Recent research work (Ph.D. thesis of C. Fuchs, December 2019 [19]) proposes a novel on-board-computer architecture for very small satellites (<100kg) capable of achieving high reliability without using radiation-hardened semiconductors, through the combined use of hardware and software-implemented fault tolerance techniques [19]. However, in spite of this promising research result from C. Fuchs, to the best of the author's knowledge, there are no fault-tolerant boards available for CubeSats, especially boards that can cope with transient faults that affect the processor, which are the major threat to the reliability of CubeSats. ...
... On average, when the faults are injected in the 16 least significant bits of the registers, the target system deviates from the expected behavior (i.e., failure modes showing abnormal behavior) for about 20% of the faults. On the other hand, when the faults were injected in the first group of 8 most significant bits (bits 17 to 24), we observed that, on average, 87.55% of the faults had no impact on the target system. A similar result was observed for faults injected in the last group of 8 most significant bits (bits 25 to 32), as 91.68% of the faults had no impact on the target system behavior. ...
Thesis
Full-text available
CubeSats are small satellites built from up to 12 units, each a cube with a 10 cm edge, with a maximum weight of 10 kg, and they represent an emerging trend in the space industry. These satellites use commercial off-the-shelf (COTS) components to reduce cost and take advantage of the superior performance-to-power-consumption ratio of COTS, which is an order of magnitude better than that of equivalent radiation-hardened space-grade components. Unfortunately, COTS components are susceptible to Single Event Upsets (SEU), which are transient errors caused by space radiation. SEUs make the study of the impact of faults caused by space radiation a mandatory step in the development of CubeSat software, in order to carefully evaluate weak points that must be strengthened through specific software fault tolerance techniques. Because the impact of faults depends strongly on the software running on the COTS hardware, this study must be repeated every time the CubeSat software undergoes a major change, or even a minor update. This thesis presents CubeSatFI, a fault injection platform for CubeSats meant to facilitate the incorporation of this extra step in the Verification and Validation of CubeSat software. CubeSatFI allows the easy definition of fault injection campaigns that emulate the effects of space radiation. SEUs are emulated realistically through bit-flip faults injected in the processor registers and in other locations of the CubeSat boards that can be reached by boundary scan, which is available in CubeSat boards through the JTAG Test Access Port.
The execution of the fault injection campaigns is controlled by the CubeSatFI platform in a fully automated mode. The effectiveness of CubeSatFI is demonstrated with the EDC (Environment Data Collection), a payload system that will be used in a constellation of satellites from the Brazilian National Institute for Space Research (Instituto Nacional de Pesquisas Espaciais - INPE), providing realistic insight into the impact of faults on the EDC software.
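The bit-flip fault model mentioned in this abstract can be illustrated with a minimal sketch. It only shows how a single-bit upset changes a register value in software; it is not CubeSatFI's implementation, which the abstract says injects faults through the JTAG boundary scan. The 32-bit register width and the names `flip_bit` and `inject_random_seu` are assumptions made for illustration.

```python
import random

REGISTER_WIDTH = 32  # assumption: 32-bit processor registers


def flip_bit(value: int, bit: int, width: int = REGISTER_WIDTH) -> int:
    """Return `value` with one bit inverted; bit 0 is the least significant."""
    if not 0 <= bit < width:
        raise ValueError(f"bit index {bit} outside 0..{width - 1}")
    mask = (1 << width) - 1
    # XOR with a one-hot mask inverts exactly the selected bit.
    return (value ^ (1 << bit)) & mask


def inject_random_seu(value: int, rng: random.Random,
                      width: int = REGISTER_WIDTH) -> tuple[int, int]:
    """Emulate one SEU: flip a uniformly chosen bit and report which one."""
    bit = rng.randrange(width)
    return flip_bit(value, bit, width), bit


if __name__ == "__main__":
    rng = random.Random(2019)  # fixed seed so a campaign can be repeated
    original = 0x0000_1234
    faulty, bit = inject_random_seu(original, rng)
    print(f"bit {bit} flipped: {original:#010x} -> {faulty:#010x}")
```

Seeding the random generator keeps an emulated campaign repeatable, which is useful when the same faults have to be re-injected after a software update.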
... In this section, we will provide a brief summary of the base concept that we have now developed into a fully fledged hardware prototype. This section is meant only to provide a brief introduction to the concept; considerably more in-depth documentation, as well as testing and validation results, can be found in [1] and [14]. ...
... To test our implementation, we conducted fault injection at different levels, which we documented in [16] and [14]. Next, we developed a multi-core model of our MPSoC, also in ArchC/SystemC, to conduct further fault injection close to hardware, which is further described in [17]. ...
... Next, we developed a multi-core model of our MPSoC, also in ArchC/SystemC, to conduct further fault injection close to hardware, which is further described in [17]. A more detailed description of these validation steps is given in [14]. To obtain worst-case performance estimates, we measured the worst-case performance cost of this approach, which is described further in [18]. ...
Conference Paper
Full-text available
In this contribution, we present practical experiences from realizing a prototype of the first truly fault-tolerant and autonomously operating avionics suite for miniaturized satellites down to the size of a 2U CubeSat. Our initial demonstrator setup consists of a mix of COTS parts and FPGA development boards, which we gradually expanded in scope and capabilities. After four iterations of PCB development and manufacturing, we have condensed this design to a fully integrated custom PCB-based prototype. Our fourth architecture iteration is stackable and is designed to fit on an 80x80 mm PCB footprint. It is furthermore capable of operating as a generic satellite subsystem node, functioning in a distributed, fault-tolerant, interconnected manner together with other subsystems. Each node is fully replaceable by two or more neighboring subsystem nodes. As a consequence, we achieve a satellite bus setup that is similar in spirit to integrated modular avionics and modern fault-tolerant avionics network architectures used in other fields. We realize this setup through a high-speed chip-to-chip network in a compact CubeSat form factor.
Conference Paper
Full-text available
A common rootfs option for Linux mobile phones is the XIP-modified CramFS which, because of its ability to eXecute-In-Place, can save lots of RAM, but requires extra Flash memory. Another option, SquashFS, saves Flash by compressing files but requires more RAM, or it delivers lower performance. By combining the best attributes of both with some original ideas, we've created a compelling new option in the Advanced XIP File System (AXFS). This paper will discuss the architecture of AXFS. It will also review benchmark results that show how AXFS can make Linux-based mobile devices cheaper, faster, and less power-hungry. Finally, it will explore how the smallest and largest of Linux systems benefit from the changes made in the kernel for AXFS.
Conference Paper
Full-text available
We present the implementation of a fault-tolerant MPSoC for very small satellites (<100kg) based upon commercial components and library IP. This MPSoC is the result of a co-design process and is designed as an ideal platform for software-implemented fault-tolerance measures. It enforces strong isolation between processors, and combines fault-tolerance measures across the embedded stack within an FPGA. This allows us to assure robustness for a satellite on-board computer consisting of modern semiconductors manufactured in fine technology nodes, for which traditional fault-tolerance concepts are ineffective. We successfully implemented this design on several Xilinx Ultrascale and Ultrascale+ FPGAs with modest utilization. We show that a 4-core implementation is possible with just 1.93 W total power consumption, which for the first time enables true fault tolerance for very small spacecraft such as CubeSats. For critical space missions aboard heavier satellites, we implemented an MPSoC variant for the space-grade XQRKU060 part together with the Xilinx Radiation Testing Consortium. The MPSoC was developed for a 4-year ESA project. It can satisfy the high performance requirements of future scientific and commercial space missions at low cost, while offering the strong fault coverage necessary for platform control in missions with a long duration.
Conference Paper
Full-text available
In this contribution, we present a CubeSat-compatible on-board computer (OBC) architecture that offers strong fault tolerance to enable the use of such spacecraft in critical and long-term missions. We describe in detail the design of our OBC's breadboard setup, and document its composition from the component level all the way down to the software level. Fault tolerance in this OBC is achieved without resorting to radiation hardening, purely through intelligent use of software. The OBC ages gracefully, and makes use of FPGA reconfiguration and mixed criticality. It can dynamically adapt to changing performance requirements throughout a space mission. We developed a proof-of-concept with several Xilinx Ultrascale and Ultrascale+ FPGAs. With the smallest Kintex Ultrascale+ KU3P device, we achieve 1.94 W total power consumption at 300 MHz, well within the power budget range of current 2U CubeSats. To our knowledge, this is the first scalable, COTS-based, and widely reproducible OBC solution that can offer strong fault coverage even for small CubeSats. To reproduce this OBC architecture, no custom-written, proprietary, or protected IP is needed, and the required design tools are available free of charge to academics. All COTS components required to construct this architecture can be purchased on the open market, and are affordable even for academic and scientific CubeSat developers.
Article
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we don't observe any indication that newer generations of DIMMs have worse error behavior.
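To give the reported rate a sense of scale, the short sketch below converts "errors per billion device hours per Mbit" into an average number of errors per year for one module. It is a plain unit conversion, not an analysis from the paper. The 1 GiB module capacity and the function name `errors_per_year` are assumptions chosen for illustration, and a fleet-wide mean is not a typical per-DIMM value, since the abstract reports that only about 8% of DIMMs see errors in a year.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def errors_per_year(rate_per_1e9_hours_per_mbit: float,
                    capacity_mbit: float) -> float:
    """Convert errors per 10^9 device hours per Mbit into errors per year
    for a module of the given capacity (in Mbit)."""
    errors_per_hour = rate_per_1e9_hours_per_mbit * capacity_mbit / 1e9
    return errors_per_hour * HOURS_PER_YEAR


if __name__ == "__main__":
    one_gib_in_mbit = 8 * 1024  # assumption: a 1 GiB module, i.e. 8192 Mbit
    for rate in (25_000, 70_000):  # range quoted in the abstract
        per_year = errors_per_year(rate, one_gib_in_mbit)
        print(f"{rate} per 1e9 h per Mbit -> about {per_year:.0f} errors/year")
```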
Chapter
The principle of equitable access was introduced by the International Telecommunication Union to ensure that every nation, spacefaring or not, would have the possibility, at any time, to access space and the spectrum necessary to communicate to and from satellites, without causing interference to others or receiving it from them. The principle applies specifically to the orbital slot and spectrum allocation procedures for the geosynchronous orbit belt, from where broadcasting has been conducted since the 1960s. The principle is not, however, directly enforced in the allocation procedures for satellites in non-geosynchronous orbits. Given today's increasing number of satellites proposed and launched, in particular as part of (mega-)constellations, and given the increasing concerns related to overcrowded low Earth orbits, this should be the right time to raise the issue of starting to enforce the principle in all orbits, before not-yet-spacefaring nations find themselves severely thwarted in launching one or more satellites, let alone competing fairly with space powers and spacefaring corporations. Opening the debate might be worth the effort, even just as a reminder that space is the province of all mankind.