Chapter

Abstract

Hardware redundancy impacts size, weight, power consumption, and cost of a system. In some applications, it is preferable to use extra time rather than extra hardware to tolerate faults. In this chapter, we describe time redundancy techniques for detection and correction of transient faults. We also show how time redundancy can be combined with some encoding scheme to handle permanent faults. We consider four encoding schemes: alternating logic, recomputing with shifted operands, recomputing with swapped operands, and recomputing with duplication with comparison.
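As a rough illustration of how time redundancy combined with an encoding scheme can expose a permanent fault, the following sketch (in Python, with an invented adder model, word width, and fault-injection hook) recomputes an addition on operands shifted left by one bit and compares the unshifted second result with the first; this is only a software analogy of recomputing with shifted operands, not the chapter's exact hardware formulation.

# Sketch of recomputing with shifted operands (RESO) as a time redundancy check.
# The operation runs twice: on the original operands and on operands shifted
# left by one bit; the second result is shifted back and compared with the first.
WIDTH = 16
MASK = (1 << WIDTH) - 1

def adder(a, b, stuck_bit=None):
    # Adder model; stuck_bit optionally forces one result bit to 0
    # to emulate a permanent stuck-at-0 fault in the hardware.
    s = (a + b) & MASK
    if stuck_bit is not None:
        s &= MASK & ~(1 << stuck_bit)
    return s

def reso_add(a, b, stuck_bit=None):
    r1 = adder(a, b, stuck_bit)                              # first computation
    r2 = adder((a << 1) & MASK, (b << 1) & MASK, stuck_bit)  # recomputation on shifted operands
    if (r2 >> 1) != r1:                                      # results must agree after unshifting
        raise RuntimeError("fault detected by RESO")
    return r1

print(reso_add(1200, 345))             # fault-free: 1545
try:
    reso_add(1200, 345, stuck_bit=3)   # a stuck-at fault hits different result bits in the two runs
except RuntimeError as e:
    print(e)

A plain repetition of the same computation would catch only transient faults; the shift is what makes the permanent fault visible, because it corrupts different bit positions of the two results.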


... The primary objective of fault tolerance is to increase system dependability. To fulfill this aim, fault-tolerant systems must uphold specified service delivery, even amidst component faults [1]. Novel technologies elevate various facets of our quality of life while concurrently bolstering societal productivity and efficiency. ...
... • Program Crashes and Abnormal Terminations: One of the primary consequences of faults in program execution is program crashes and abnormal terminations. Faults such as hardware failures, memory corruption, or unhandled exceptions can cause the program to terminate abruptly or enter an undefined state, resulting in system instability and potential data loss [1]. Researchers have proposed various techniques for detecting and recovering from program crashes, including CFC methods that verify the integrity of the program execution path. • Incorrect Outputs and Results: Faults can lead to incorrect outputs and results, affecting the reliability and accuracy of embedded systems. ...
... For instance, a system that requires high availability, such as a telecommunications network, would typically opt for hardware-based fault tolerance methods. Conversely, a less critical system like a word processing application may utilize software-based fault tolerance methods [1], [31]. ...
Article
Full-text available
Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and mitigation techniques in embedded systems, with a focus on both software and hardware faults. Emphasis is placed on real-time embedded systems, considering their resource constraints and the increasing interconnectivity of computing systems in commercial and industrial applications. The survey covers various fault tolerance methods, including hardware, software, and hybrid redundancy. Particular attention is given to software faults, acknowledging their significance as a leading cause of system failures, while also addressing hardware faults and their mitigation. Moreover, the paper explores the challenges posed by soft errors in modern computing systems. The survey concludes by emphasizing the need for continued research and development in fault tolerance methods, specifically in the context of real-time embedded systems, and highlights the potential for extending fault tolerance approaches to diverse computing environments.
... To evaluate the possible fault coverage, faults are assumed to behave according to some fault model. A fault model attempts to describe the effects of the faults that can occur [28]. Faults are grouped into two categories: permanent and temporary [29]. ...
... The most common gate-level fault model is the multiple stuck-at faults. A single stuck-at fault is a fault which results in a line in a logic circuit being permanently stuck at a logic one or zero [28]. It is assumed that the basic functionality of the circuit is not changed by the fault, i.e., an AND gate does not become an OR gate. ...
... In a multiple stuck-at fault model, multiple lines in a logic circuit are stuck-at some logic values (the same or different). A circuit with k lines can have 2k different single stuck-at faults and 3^k − 1 different multiple stuck-at faults [28]. Therefore, testing for all possible multiple stuck-at faults is infeasible for large circuits. ...
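To make these counts concrete (simple arithmetic on the model above; the chosen values of k are arbitrary):

# Each of the k lines can be fault-free, stuck-at-0, or stuck-at-1, giving
# 2k single stuck-at faults and 3^k - 1 multiple stuck-at faults
# (the all-fault-free combination is excluded).
for k in (10, 20, 40):
    print(k, 2 * k, 3 ** k - 1)
# k = 10:  20 single faults,             59,048 multiple faults
# k = 20:  40 single faults,      3,486,784,400 multiple faults
# k = 40:  80 single faults, about 1.2e19 multiple faults -> exhaustive testing is infeasible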
Article
Full-text available
In this paper, fault-tolerant and error-correcting 4-bit S-boxes for cryptography applications with multiple error detection and correction are presented. Here, we consider three applicable 4-bit S-boxes, which are used in the lightweight block ciphers PRESENT and PRINCE and the lightweight hash function SPONGENT, as basic circuits for the error-correcting method. The proposed design does not require two-rail checkers for detecting the error or a redundant S-box for repairing the S-box. This reduces the overall area consumption of the proposed design. In the proposed approach, the error-correcting part of the circuit is implemented concurrently with the main circuit of the S-box. Therefore, the four output bits of the S-box are tested individually to improve the efficiency of fault diagnosis. The proposed fault-tolerant S-box method can detect and repair transient and permanent faults simultaneously. In other words, the structure can detect and repair single, double, triple, and quadruple faults at a time. The comparison with well-known fault-tolerant and error-correcting methods shows that the ability of the proposed method to create error-correcting 4-bit S-boxes is acceptable. The performance of S-boxes with errors, and with our error-correcting method, has been investigated in image encryption. The analyses show that the proposed method yields the desired results. Also, the area and timing results, in 180 nm CMOS technology, show that the proposed structures are comparable in terms of area and delay overheads with those of the other methods.
... They need to tolerate both transient and permanent faults [3][4] [5]. Most of the faults are transient and short-lived in nature and manifest themselves in the form of soft errors [4] [6]. However, a permanent fault impacts a specific component (e.g., a single core or memory) in the system, making it unavailable until it is replaced or repaired [5] [6]. ...
... Most of the faults are transient and short-lived in nature and manifest themselves in the form of soft errors [4] [6]. However, a permanent fault impacts a specific component (e.g., a single core or memory) in the system, making it unavailable until it is replaced or repaired [5] [6]. Re-execution of the faulty tasks is one of the approaches most widely used to recover from transient faults [4] [5]. ...
... Re-execution of the faulty tasks is one of the approaches most widely used to recover from transient faults [4] [5]. However, permanent faults usually require hardware redundancy to solve the malfunctioning of the system [4] [6]. To this end, N-Modular Redundancy (NMR), Standby-Sparing (SS), and Primary/Backup (PB) techniques, as prevalent and practical real-world fault-tolerance techniques, are exploited in embedded systems to tolerate both transient and permanent faults [6][7] [8]. ...
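A minimal sketch of how re-execution (against transient faults) can be combined with a standby spare (against permanent faults) is given below; the Core model, the 5% transient fault rate, the retry limit, and the inverse-operation acceptance test are invented for illustration and are not taken from the cited schemes.

import random

class Core:
    # Models an adder core that may suffer transient or permanent faults.
    def __init__(self, name, permanent_fault=False):
        self.name = name
        self.permanent_fault = permanent_fault

    def add(self, a, b):
        faulty = self.permanent_fault or random.random() < 0.05
        return (a + b) ^ 0x4 if faulty else a + b    # a flipped bit models the error

def execute(a, b, primary, spare, retries=2):
    for core in (primary, spare):                # standby-sparing: fail over if needed
        for _ in range(retries + 1):             # re-execution rides out transient faults
            r = core.add(a, b)
            if r - a == b:                       # acceptance test (assumed fault-free)
                return r
        print(core.name, "keeps failing: treated as permanently faulty, switching")
    raise RuntimeError("task failed on both primary and spare")

print(execute(7, 8, Core("primary", permanent_fault=True), Core("spare")))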
Article
One of the essential requirements of embedded systems is a guaranteed level of reliability. In this regard, fault-tolerance techniques are broadly applied to these systems to enhance reliability. However, fault-tolerance techniques may increase power consumption due to their inherent redundancy. For this purpose, power management techniques are applied, along with fault-tolerance techniques, which generally prolong the system lifespan by decreasing the temperature and leading to an aging rate reduction. Yet, some power management techniques, such as Dynamic Voltage and Frequency Scaling (DVFS), increase the transient fault rate and timing error. For this reason, heterogeneous multicore platforms have received much attention due to their ability to make a trade-off between power consumption and performance. Still, it is more complicated to map and schedule tasks in a heterogeneous multicore system. In this paper, for the first time, we propose a power management method for a heterogeneous multicore system that reduces power consumption and tolerates both transient and permanent faults through primary/backup technique while considering core-level power constraint, real-time constraint, and aging effect. Experimental evaluations demonstrate the efficiency of our proposed method in terms of reducing power consumption compared to the state-of-the-art schemes, together with guaranteeing reliability and considering the aging effect.
... The design of resilient electronic systems is critical for devices operating in harsh environments, such as space. The resilience of a computer system refers to the ability of the system to be reasonably relied upon for the services it provides [EDU13], i.e. to maintain the dependability of the system against performance degradation or functional impairment due to various faults or anomalies. In other words, a system's dependability is determined by how much users believe in the accuracy of its outputs. ...
... Faults, errors, and failures are three main threats to system dependability. In general, a fault occurs at the physical level caused by some failure mechanism, an error is an effect on a subsystem level, and a failure can produce systemic effects [EDU13]. The details of these threats are introduced as follows: ...
... As described in Section 2.3.1, faults are the source of errors and failures. Therefore, fault mitigation is the critical means to achieve system dependability, and the general methods are described as follows [EDU13]: ...
Thesis
Full-text available
As a result of CMOS scaling, radiation-induced Single-Event Effects (SEEs) in electronic circuits became a critical reliability issue for modern Integrated Circuits (ICs) operating under harsh radiation conditions. SEEs can be triggered in combinational or sequential logic by the impact of high-energy particles, leading to destructive or non-destructive faults, resulting in data corruption or even system failure. Typically, the SEE mitigation methods are deployed statically in processing architectures based on the worst-case radiation conditions, which is most of the time unnecessary and results in a resource overhead. Moreover, the space radiation conditions are dynamically changing, especially during Solar Particle Events (SPEs). The intensity of space radiation can differ over five orders of magnitude within a few hours or days, resulting in several orders of magnitude fault probability variation in ICs during SPEs. This thesis introduces a comprehensive approach for designing a self-adaptive fault resilient multiprocessing system to overcome the static mitigation overhead issue. This work mainly addresses the following topics: (1) Design of an on-chip radiation particle monitor for real-time radiation environment detection, (2) Investigation of space environment predictor, as support for solar particle events forecast, (3) Dynamic mode configuration in the resilient multiprocessing system. Therefore, according to detected and predicted in-flight space radiation conditions, the target system can be configured to use no mitigation or low-overhead mitigation during non-critical periods of time. The redundant resources can be used to improve system performance or save power. On the other hand, during increased radiation activity periods, such as SPEs, the mitigation methods can be dynamically configured appropriately depending on the real-time space radiation environment, resulting in higher system reliability. Thus, a dynamic trade-off in the target system between reliability, performance and power consumption in real-time can be achieved. All results of this work are evaluated in a highly reliable quad-core multiprocessing system that allows the self-adaptive setting of optimal radiation mitigation mechanisms during run-time. Proposed methods can serve as a basis for establishing a comprehensive self-adaptive resilient system design process. Successful implementation of the proposed design in the quad-core multiprocessor shows its application perspective also in the other designs.
... Hardware redundancy is the most common technique, which is the addition of extra hardware components for detecting or tolerating faults [5,6]. For example, instead of using a single core/processor, more cores/processors can be exploited so that each application is executed on each core/processor; then, the fault can be detected or even corrected. ...
... These techniques are referred to as M-of-N systems, which means that the system consists of N components, and the correct operation of this system is achieved when at least M components correctly work. The TMR system is a 2-of-3 system with M = 2 and N = 3, which is realized by three components performing the same action, and the result is voted on [5,6]. ...
... In DWC, two identical hardware components perform the exact computation in parallel, and their output is compared. Therefore, the DWC technique can only detect faults but cannot tolerate them because the faulty component cannot be determined [5,6]. In standby-sparing, one module is operational, and one or more modules are standby or spares. ...
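The following sketch shows the M-of-N idea at the level of module outputs, with TMR as the 2-of-3 case, next to duplication with comparison; the function names and values are illustrative only.

from collections import Counter

def m_of_n_vote(results, m):
    # Accept the output when at least m of the redundant results agree.
    value, count = Counter(results).most_common(1)[0]
    if count >= m:
        return value                       # fault masked by the majority
    raise RuntimeError("no M-of-N agreement")

def dwc_check(r1, r2):
    # Duplication with comparison: a mismatch is detectable but not correctable,
    # because the faulty copy cannot be identified.
    if r1 != r2:
        raise RuntimeError("DWC mismatch: fault detected")
    return r1

print(m_of_n_vote([42, 42, 17], m=2))      # TMR masks the single faulty module -> 42
print(dwc_check(42, 42))                   # agreement -> 42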
Article
Full-text available
A common requirement of embedded software in charge of safety tasks is to guarantee the identification of random hardware failures (RHFs) that can affect digital components. RHFs are unavoidable. For this reason, the functional safety standard devoted to automotive applications requires embedded software designs able to detect and, if possible, mitigate them. For this purpose, various software-based error detection techniques have been proposed over the years, focusing mainly on detecting control flow errors. Many control flow checking (CFC) algorithms have been proposed to accomplish this task. However, applying these approaches can be difficult because their respective literature gives little guidance on their practical implementation in high-level programming languages, and they have to be implemented in low-level code, e.g., assembly. Moreover, the current trend in the automotive industry is to adopt the so-called model-based software design approach, where an executable algorithm model is automatically translated into C or C++ source code. This paper presents two novelties: firstly, the compliance of the experimental data on the capabilities of control flow checking (CFC) algorithms with the ISO 26262 automotive functional safety standard; secondly, by implementing the CFC algorithm in the application behavioral model, the off-the-shelf code generator seamlessly produces the hardened source code of the application. The assessment was performed using a novel fault injection environment targeting a RISC-V (RV32I) microcontroller.
... Reliability is threatened by different types of faults: (i) transient faults and (ii) permanent faults [1][5] [6]. Transient faults are the most common fault type and are short-lived [6] [10]. Typically, task re-execution can recover from transient faults [6] [10]. ...
... Transient faults are the most common fault type and are short-lived [6] [10]. Typically, task re-execution can recover from transient faults [6] [10]. However, permanent faults damage the system components and require hardware redundancy to be tolerated [6] [10]. ...
... Typically, task re-execution can recover from transient faults [6] [10]. However, permanent faults damage the system components and require hardware redundancy to be tolerated [6] [10]. The Standby-Sparing (SS) and Primary/Backup (PB) techniques are common techniques for dealing with both transient and permanent faults [6][10] [13]. ...
Article
In addition to meeting the real-time constraint, power/energy efficiency and high reliability are two vital objectives for real-time embedded systems. Recently, heterogeneous multicore systems have been considered an appropriate solution for achieving joint power/energy efficiency and high reliability. However, power/energy and reliability are two conflicting requirements due to the inherent redundancy of fault-tolerance techniques. Also, because of the heterogeneity of the system, the execution of tasks, especially real-time tasks, in a heterogeneous system is more complicated than in a homogeneous system. The proposed method in this paper employs a passive primary/backup technique to preserve the reliability requirement of the system at a satisfactory level and reduces power/energy consumption in heterogeneous multicore systems by considering real-time and peak power constraints. The proposed method attempts to map the primary and backup tasks in a mixed manner to benefit from the execution of the tasks on different core types and schedules the backup tasks after the primary tasks finish to remove the overlap between the execution of the primary and backup tasks. Compared to the existing state-of-the-art methods, experimental results demonstrate our proposed method's power efficiency and effectiveness in terms of schedulability.
... Therefore, future states depend only on the current state and the transition rates from the current state to other possible states. Transition rates are the λ parameters of the exponential distribution, which is the distribution commonly used for modeling component failures in system analysis [3]. ...
... Development and usage of medical devices is undoubtedly no exception [9], [10]. Because of the non-deterministic dynamics of modern systems, we can currently witness intensive research into extending the concepts of dependability theory [11], [12], [3], in which CTMCs constitute one of the primary roots [13]. CTMCs are currently also one of the methods used for quantitative analysis of fault-tolerant systems prior to development. ...
... The three properties of dependability most commonly referred to in the literature are safety, availability and reliability [16]. This paper emphasizes reliability [3]. ...
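As a small worked example of this CTMC view, consider the textbook two-state model of a single repairable component with constant failure rate \lambda and repair rate \mu (not the specific architecture analyzed in the cited paper). The steady-state balance equation gives the availability:

\lambda P_{\mathrm{up}} = \mu P_{\mathrm{down}}, \qquad P_{\mathrm{up}} + P_{\mathrm{down}} = 1
\;\Rightarrow\; A = P_{\mathrm{up}} = \frac{\mu}{\lambda + \mu} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}},

since MTTF = 1/\lambda and MTTR = 1/\mu for exponentially distributed lifetimes and repair times.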
Conference Paper
Full-text available
The online pupillometry-based feedback system is intended as a cognitive training and rehabilitation system developed at Mälardalen University. The purpose of the system is to engage a person in a cognitive computer task whose difficulty is adjusted in real time depending on the person's cognitive load. Previous research has uncovered a significant correlation between cognitive load and pupil dilation, suggesting that electroencephalogram usage for estimating cognitive load can be eliminated. The online pupillometry-based feedback system is measuring the pupil-diameter in real time to classify cognitive load using a neural network. The classification of cognitive load is used to modulate the difficulty level of the cognitive task, with the purpose of challenging the participant and to optimize the cognitive training. At the current state the system is fully integrated, but possesses no fault-tolerant features to produce a long-term reliable service. This paper proposes a fault-tolerant architecture for the online pupillometry-based feedback system, for which internal repairs and failure rates are modeled using continuous-time Markov chains. The results show adequacy of the extended architecture, assuming slightly optimistic failure rates. Even though the system is specific, the reliability approach presented can be applied on other medical devices and systems.
... Originally, fault tolerance techniques were used to cope with the physical defects of individual hardware components. Designers of early computing systems employed redundant structures with voting to eliminate the effect of failed components, error-detecting or error-correcting codes to detect or correct information errors, diagnostic techniques to locate failed components, and automatic switchover to replace them [3]. ...
... Fault tolerance is the ability of a system to continue performing its needed function in spite of faults [3]. In a broad sense, fault tolerance is associated with reliability, with no fault operation, and with the absence of system failure. ...
... We need fault tolerance structures because it is practically impossible to manufacture a perfect system that works with no faults all the time. The fundamental problem is that as the complexity of a system increases, its reliability drastically deteriorates, unless compensatory measures are taken [3]. For example, if the reliability of one non-redundant module is 99.99%, then the reliability of a system with a hundred copies of this module is 99.01%, whereas the reliability of a large system consisting of 10,000 copies of the non-redundant module is just 36.79%. ...
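These figures follow from the series-system reliability formula R_system = R_module^n (all modules must work); a quick check, assuming independent module failures:

r = 0.9999                               # reliability of one non-redundant module
for n in (1, 100, 10_000):
    print(f"{n:6d} modules: {r ** n:.4%}")
# prints 99.9900%, 99.0049% and 36.7861%, matching the figures above up to rounding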
Article
Full-text available
The growth of electronic systems increases their complexity and the risk of failure. Fault tolerance structures are one of the useful ideas for resolving this problem. In this paper, a fault-tolerant approach to digital electronic modules is introduced, using hardware redundancy to make those modules fault tolerant. The proposed structure of hardware redundancy has a voter unit that can mask and detect faults at the same time. Such a voter unit is achieved by using both majority voter units and minority voter units in its structure. The proposed voter unit has a three-layer structure: one minority voter, three majority voters, and another minority voter respectively form these three layers. These layers can mask and correct a fault and simultaneously detect the faulty module by producing a unique fault-detecting code. The code works properly as long as the majority of the modules work properly. Then, the result of the voter unit (the code) is transmitted to a relevant switching unit in order to replace the faulty module with a spare module. A redundant system based on the proposed redundancy structure with M spare modules can switch out M faulty modules and tolerate (M + 1) faults.
... Fault mitigation comes in the form of redundancy, either hardware- or software-implemented [3], by replicating components that mask faults and prevent them from propagating. From a safety standpoint, either technique can be employed, but from a cost point-of-view, software-implemented techniques are considered preferable, since high safety oriented hardware components are typically more expensive than software development costs [4]. ...
... Following the nomenclature provided by Dubrova et al. [3], a fault is a physical defect, imperfection, or flaw that occurs in some hardware or software component. Resulting from a fault, an error is a deviation from the expected computational value. ...
Article
Full-text available
Simulation-based Fault Injection (FI) is crucial for validating system behaviour in safety-critical applications, such as the automotive industry. The ISO 26262 standard’s Part 11 extension provides failure modes for digital components, driving the development of new fault models to assess software-implemented mechanisms against random hardware failures (RHF). This paper proposes a Fault Injection framework, QEFIRA, and shows its ability to achieve the failure modes proposed by Part 11 of the ISO 26262 standard and estimate relevant metrics for safety mechanisms. QEFIRA uses QEMU to inject permanent and transient faults during runtime, whilst logging the system state and providing automatic post-execution analysis. Complemented with a confusion matrix, it allows us to gather standard-compliant metrics to characterise and evaluate different designs in the early stages of development. Compared to the native QEMU implementation, the tool only shows a slowdown of 1.4× for real-time microcontroller-based applications.
... Definition 2: We define λ as the failure rate of the system. During the lifetime phase, it tends to remain constant [21]. ...
... Definition 7: MTTR, Mean Time to Repair is the average time required to isolate, repair, and test a fault condition for the system [21]. ...
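A small numeric illustration of these definitions, with invented values for the failure and repair parameters:

failure_rate = 2e-6                      # lambda, failures per hour (constant in useful life)
mttf = 1 / failure_rate                  # mean time to failure of an exponential lifetime
mttr = 4.0                               # mean time to repair, hours
availability = mttf / (mttf + mttr)      # steady-state availability
print(f"MTTF = {mttf:.0f} h, availability = {availability:.6f}")
# MTTF = 500000 h, availability = 0.999992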
Conference Paper
Full-text available
Processing capacity distribution has become widespread in the fog computing era. End-user services have multiplied, from consumer products to Industry 5.0. In this scenario, the services must have a very high reliability level. But in a system with hardware so widely distributed, the reliability of the service necessarily depends on the hardware design. Devices shall have high quality, but they shall also efficiently support fault management. Hardware design must take into account all fault management functions and participate in creating a fault management policy to ensure that the ultimate goal of fault management is fulfilled, namely to increase a system's reliability, efficiently and sustainably, both in the system's performance and the product's cost. This paper analyzes the hardware design techniques that efficiently contribute to the realization of fault management and, consequently, guarantee a high level of reliability and availability for the services offered to the end customer. We describe hardware requirements and how they affect the choice of devices in the hardware design of networking systems.
... Since the failure rate is constant during the useful life phase of the system (λ(t) = const.), it can be safely assumed that the random variable T (system lifetime) has an exponential distribution with parameter λ over this interval [47]. Therefore, the PDF of T would be f(t) = λe^{−λt} and Equation 2.4 can be rewritten as ...
... Integrating by parts results in MTTF = 1/λ. MTTF (Mean Time To Failure) is a widely used dependability measure defined as the expected time of the occurrence of the first system failure [47]. For repairable systems, the measure Mean Time To Repair (MTTR) is commonly used to specify the average downtime, that is, the time period from the moment of failure to the restoration of the correct service. ...
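For reference, the integration-by-parts step for an exponential lifetime reads

\mathrm{MTTF} = E[T] = \int_0^{\infty} t\,\lambda e^{-\lambda t}\,dt
= \Big[-t e^{-\lambda t}\Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda t}\,dt = \frac{1}{\lambda}.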
Thesis
This dissertation proposes a cross-layer framework able to synergistically optimize resilience and power consumption of processor-based systems. It is composed of three building blocks: SWIELD multimodal flip-flop (FF), System Operation Management Unit (SOMU) and Framework Function Library (FFL). Implementation of the building blocks is performed at circuit, architecture and software layer of the system stack respectively. The SWIELD FF can be configured to operate as a regular flip-flop or as an enhanced flip-flop for protection against timing/radiation-induced faults. It is necessary to perform replacement of selected timing-critical flip-flops in a system with SWIELD FFs during design time. When the system is active, the SWIELD FFs operation mode is dynamically managed by the SOMU controller according to the current requirements. Finally, the FFL contains a set of software procedures that greatly simplify framework utilization. By relying on the framework, a system can intelligently interchange techniques such as Adaptive Voltage/Frequency Scaling, selective Triple Modular Redundancy and clock gating during operation. Additionally, a simple and convenient strategy for integration of the framework in processor-based systems is also presented. A key feature of the proposed strategy is to determine the number of SWIELD FFs to be inserted in a system. Using this strategy, the framework was successfully embedded in instances of both single- and multicore systems. Various experiments were conducted to evaluate the framework influence on the target systems with respect to resilience and power consumption. At expense of about 1% area overhead, the framework is able to preserve performance and to reduce power consumption up to 15%, depending on the number of SWIELD FFs in the system. Furthermore, it was also shown that under certain conditions, the framework can provide failure-free system operation.
... The enormous importance of this research field is expressed by Rouissi and Hoblos [40]; they underline that the ability of a system to accommodate faults has to be achieved by employing a conscious FTD. A considerable number of FTD approaches, especially in the field of electronics engineering, are compiled in [41]. ...
... This amount of over-actuation is also inevitable for allowing the compensation of possible faults, e.g., a slippery surface. For the general design of such systems, which may only function if a nearly perfectly working control system is present, a large amount of fault-tolerance is mandatory, also because some potential user mistakes cannot be predicted [41]. ...
Article
Full-text available
Researchers around the globe have contributed for many years to the research field of fault-tolerant control; the importance of this field is ever increasing as a consequence of the rising complexity of technical systems, the enlarging importance of electronics and software as well as the widening share of interconnected and cloud solutions. This field was supplemented in recent years by fault-tolerant design. Two main goals of fault-tolerant design can be distinguished. The first main goal is the improvement of the controllability and diagnosability of technical systems through intelligent design. The second goal is the enhancement of the fault-tolerance of technical systems by means of inherently fault-tolerant design characteristics. Inherently fault-tolerant design characteristics are, for instance, redundancy or over-actuation. This paper describes algorithms, methods and tools of fault-tolerant design and an application of the concept to an automated guided vehicle (AGV). This application took place on different levels ranging from conscious requirements management to redundant elements, which were consciously chosen, on the most concrete level of a technical system, i.e., the product geometry. The main scientific contribution of the paper is a methodical framework for fault-tolerant design, as well as certain algorithms and methods within this framework. The underlying motivation is to support engineers in design and control through product development process transparency and appropriate algorithms and methods.
... Therefore, the SCRES is designed as a hardware and software system with a fault-tolerant structure [5,6]. It consists of different subsystems, each of which performs its own objective function. ...
... The safety of the SCRES is determined by its fault-tolerant structure and robust behavior, as well as by its functional behavior [5]. Functional behavior is defined by an algorithm that specifies the conditions and sequence of actions of the subsystems and modules of the SCRES when performing its functions. ...
Conference Paper
Full-text available
This paper addresses the problem of designing a behavior algorithm. As the behavior algorithm is developed for a radio-electronic system, it should provide a high level of safety of its functioning. The radio-electronic system, as part of a complex technical system, performs the task of organizing control actions. Examples of such complex technical systems are a nuclear power plant or a missile launch system, where a radio-electronic system is the control system. The safety of radio-electronic safety-critical systems is traditionally ensured by introducing structural redundancy. This paper shows an example of the synthesis of safe behavior algorithms for the target detection radio-electronic complex system on the basis of minimization of increased values of the safety characteristic.
... Several CED approaches have been proposed in the literature. One conventional approach is the use of Double Modular Redundancy (DMR) [2]. However, it can be shown that latent faults can disturb the DMR scheme [3]. ...
... Consider a scenario where there exists a permanent latent fault in one of the replicas; this condition disrupts the detection mechanism of DMR. To overcome the permanent latent fault issue in one of the replicas, one must use N-Modular Redundancy (NMR) or other similar techniques [2], [4], [5], which have significant area overhead. To compensate for the huge area overhead of the concurrent NMR scheme, several encoding schemes have been proposed in the literature, such as weight-based codes [6], Berger codes [7], Bose-Lin codes [8], etc. ...
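As an illustration of one such encoding, a Berger code appends a check symbol equal to the number of zeros in the information word, which detects any unidirectional error; the sketch below is a minimal software rendering (word width and function names are assumptions).

def berger_encode(word, width=8):
    # Check symbol = number of zeros in the information word.
    zeros = width - bin(word & ((1 << width) - 1)).count("1")
    return word, zeros

def berger_check(word, check, width=8):
    return berger_encode(word, width)[1] == check

data, chk = berger_encode(0b10110010)        # 4 zeros -> check symbol 4
print(chk)                                   # 4
print(berger_check(0b10110110, chk))         # unidirectional 0->1 error: False (detected)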
Article
Full-text available
Reversible logic has 100% fault observability, meaning that a fault in any circuit node propagates to the output stage. In other words, reversible circuits are latent-fault-free. Our motivation is to incorporate this unique feature of reversible logic to design CMOS circuits having perfect or 100% Concurrent Error Detection (CED) capability. For this purpose, we propose a new fault-preservative reversible gate library called Even Target Mixed Polarity Multiple Control Toffoli (ET-MPMCT). By using ET-MPMCT, we ensure that the evenness/oddness of the applied 1’s at the input is preserved at all levels of a circuit, including the output level, unless there is a faulty node. A single fault always destroys the parity of the input at the output. Our design strategy has two steps for a given function: 1) implement the function with our proposed reversible ET-MPMCT gate library; and 2) apply reversible-to-CMOS gate conversion. For the first step, we propose two approaches. For our first approach in step 1, we first need to have a reversible form of the given function if it is irreversible. Then, we synthesize the reversible function using the Mixed Polarity Multiple Control Toffoli (MPMCT) gate library by conventional reversible logic synthesis techniques. Finally, the synthesized circuit is converted to an ET-MPMCT constructed circuit, which is fault preservative. Our second approach is an ESOP (Exclusive Sum of Products)-based synthesis approach modified for our proposed fault-preservative ET-MPMCT gate library. It works with both irreversible and reversible functions. We synthesize our circuits with both approaches in step 1 and choose the circuit with the lower number of reversible gates to be fed to step 2 of our design strategy. In the second step of our design strategy, we convert our fault-preservative reversible circuits into their CMOS counterparts. The performance of our designs is compared with other CED schemes in the literature in terms of area, power consumption, delay and detection rate. Simulations are done with the Cadence Genus tool using TSMC 40nm technology. The results are clearly in favor of our proposed techniques.
... Design changes can start from the early design stage at the functional level to eliminate or reduce identified risks. By modifying elements with higher risk priority numbers and adding new fault tolerance mechanisms, the reliability of the component can be improved [33], and the key fault modes of elements with higher risk priority numbers can be eliminated. This can reduce the effect on the main functions of the system, and the modified fault tolerance mechanism needs to be updated in a timely manner in the SysML model. ...
Article
Full-text available
As embedded systems become increasingly complex, traditional reliability analysis methods based on text alone are no longer adequate for meeting the requirements of rapid and accurate quantitative analysis of system reliability. This article proposes a method for automatically generating and quantitatively analyzing dynamic fault trees based on an improved system model with consideration for temporal characteristics and redundancy. Firstly, an “anti-semantic” approach is employed to automatically explore the generation of fault modes and effects analysis (FMEA) from SysML models. The evaluation results are used to promptly modify the system design to meet requirements. Secondly, the Profile extension mechanism is used to expand the SysML block definition diagram, enabling it to describe fault semantics. This is combined with SysML activity diagrams to generate dynamic fault trees using traversal algorithms. Subsequently, parametric diagrams are employed to represent the operational rules of logic gates in the fault tree. The quantitative analysis of dynamic fault trees based on probabilistic models is conducted within the internal block diagram of SysML. Finally, through the design and simulation of the power battery management system, the failure probability of the top event was obtained to be 0.11981. This verifies that the design of the battery management system meets safety requirements and demonstrates the feasibility of the method.
... An established method of enhancing hardware fault tolerance in machines is through redundancy, where critical hardware components are duplicated to mitigate the risk of failure (Guiochet et al., 2017;Visinsky et al., 1994Visinsky et al., , 1991. However, this approach has significant drawbacks, including increased machine size, weight, power consumption, and financial costs (Dubrova, 2013). Moreover, retrofitting existing machines with redundant components is of-ten impossible. ...
Preprint
Full-text available
Industry is rapidly moving towards fully autonomous and interconnected systems that can detect and adapt to changing conditions, including machine hardware faults. Traditional methods for adding hardware fault tolerance to machines involve duplicating components and algorithmically reconfiguring a machine's processes when a fault occurs. However, the growing interest in reinforcement learning-based robotic control offers a new perspective on achieving hardware fault tolerance. Yet, limited research has explored the potential of these approaches for hardware fault tolerance in machines. This paper investigates the potential of two state-of-the-art reinforcement learning algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), to enhance hardware fault tolerance in machines. We assess the performance of these algorithms in two OpenAI Gym simulated environments, Ant-v2 and FetchReach-v1. Robot models in these environments are subjected to six simulated hardware faults. Additionally, we conduct an ablation study to determine the optimal method for transferring an agent's knowledge, acquired through learning in a normal (pre-fault) environment, to a (post-)fault environment in a continual learning setting. Our results demonstrate that reinforcement learning-based approaches can enhance hardware fault tolerance in simulated machines, with adaptation occurring within minutes. Specifically, PPO exhibits the fastest adaptation when retaining the knowledge within its models, while SAC performs best when discarding all acquired knowledge. Overall, this study highlights the potential of reinforcement learning-based approaches, such as PPO and SAC, for hardware fault tolerance in machines. These findings pave the way for the development of robust and adaptive machines capable of effectively operating in real-world scenarios.
... Therefore, the next generation of code smell datasets is supposed to consider factors related to the application domain and development context, mainly the system criticalness level [127], deployment and run-time environments, and stakeholders. For instance, when voting between the result of tools, the application domain can be used to weigh each tool according to the application [126]. ...
Article
Full-text available
The accuracy reported for code smell-detecting tools varies depending on the dataset used to evaluate the tools. Our survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells highly depends on relevant properties such as the size, severity level, project types, number of each type of smell, number of smells, and the ratio of smelly to non-smelly samples in the dataset. Most existing datasets support God Class, Long Method, and Feature Envy while six smells in Fowler and Beck's catalog are not supported by any datasets. We conclude that existing datasets suffer from imbalanced samples, lack of supporting severity level, and restriction to Java language.
... The results of the duplication will then be compared with the data that has been previously stored. If the data differ, an error has occurred in the system [18]. ...
Article
Reading data from sensors in the context of the Internet of Things (IoT) is one of the main parameters of a high-reliability system. Data reading by sensors is prone to errors due to various disturbances that the system cannot always accommodate. One method to overcome this problem is the time redundancy algorithm, in which the sensor takes readings several times and selects the best value. In this study, the sensors used to calculate the water levels are ultrasonic and resistance sensors. At the river monitoring point, data is sent using the UDP (User Datagram Protocol) protocol that previous studies have used. This research has been done as a prototype with a maximum distance between sensor nodes of one meter. Sending data using UDP was successful, with an average delay of 3293 microseconds, and the ultrasonic sensor tested for transient faults with time redundancy achieved an accuracy of 94.5% at a sampling delay of 1000 ms.
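A minimal sketch of this kind of time-redundant sensor reading is shown below; the sensor stub, transient fault rate, and sample count are invented for illustration and do not reproduce the paper's setup.

import random, statistics, time

def read_distance_cm():
    value = 87.0                              # true distance to the water surface
    if random.random() < 0.1:                 # occasional transient fault
        value += random.choice([-40, 40])     # corrupted echo
    return value + random.gauss(0, 0.3)       # measurement noise

def redundant_read(samples=5, delay_s=0.01):
    readings = []
    for _ in range(samples):
        readings.append(read_distance_cm())
        time.sleep(delay_s)                   # spread the samples in time
    return statistics.median(readings)        # transient outliers are voted out

print(round(redundant_read(), 1))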
... Furthermore, connectivity, mobility, and power source also play an important role. The ability of a system to continue working despite faults is referred to as fault tolerance [162]. The concept of fault-tolerant computing, in turn, refers to using systems that can perform correctly in the presence of errors [163]. ...
Article
Full-text available
An inflection point in the computing industry is occurring with the implementation of the Internet of Things and 5G communications, which has pushed centralized cloud computing toward edge computing, resulting in a paradigm shift in computing. The purpose of edge computing is to provide computing, network control, and storage to the network edge to accommodate computationally intense and latency-critical applications at resource-limited endpoints. Edge computing allows edge devices to offload their overflowing computing tasks to edge servers. This procedure may fully exploit the edge server’s computational and storage capabilities and efficiently execute computing operations. However, transferring all the overflowing computing tasks to an edge server leads to long processing delays and surprisingly high energy consumption for numerous computing tasks. Aside from this, unused edge devices and powerful cloud centers may lead to resource waste. Thus, employing a collaborative scheduling approach based on task properties, optimization targets, and system status with edge servers, cloud centers, and edge devices is critical for the successful operation of edge computing. This paper briefly summarizes the edge computing architecture for information and task processing. Meanwhile, the collaborative scheduling scenarios are examined. Resource scheduling techniques are then discussed and compared based on four collaboration modes. As part of our survey, we present a thorough overview of the various task offloading schemes proposed by researchers for edge computing. Additionally, according to the literature surveyed, we briefly looked at the fairness and load balancing indicators in scheduling. Finally, edge computing resource scheduling issues, challenges, and future directions are discussed.
... Fault masking is a fault-tolerant technique that is traditionally used to allow the correct functioning of a circuit in the presence of faults. The most popular fault masking scheme is the triple modular redundancy (TMR) [74]. In TMR, a critical module is triplicated and the outputs of the three modules are given to a majority voter. ...
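At the gate level, the majority function used by the TMR voter is maj(a, b, c) = ab + ac + bc per output bit; a small sketch with integers standing in for bit-vectors:

def tmr_vote(a, b, c):
    return (a & b) | (a & c) | (b & c)       # bitwise 2-of-3 majority

fault_free = 0b10110101
faulty = fault_free ^ 0b00001000             # one replica suffers a single bit flip
print(bin(tmr_vote(fault_free, faulty, fault_free)))   # 0b10110101, fault masked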
Article
Full-text available
Hardware obfuscation is a well-known countermeasure against reverse engineering. For FPGA designs, obfuscation can be implemented with a small overhead by using underutilised logic cells; however, its effectiveness depends on the stealthiness of the added redundancy. In this paper, we show that it is possible to deobfuscate an SRAM FPGA design by ensuring the full controllability of each instantiated look-up table input via iterative bitstream modification. The presented algorithm works directly on bitstream and does not require the possession of a flattened netlist. The feasibility of our approach is verified on the example of an obfuscated SNOW 3G design implemented on a Xilinx 7-series FPGA.
... It is unavoidable that faults will occur during the operation of electrical or electronic (E/E) systems. A fault is defined here as "a physical defect, imperfection, or flaw that occurs in some hardware or software component" (Dubrova, 2013). Faults can cause system malfunction that, depending on the operational scenario, could produce a hazard where there is risk of harm to humans. ...
Research
Full-text available
It is plausible that biological organisms can provide insights into designing software for fail-operational automotive systems. Here the concept of the imbalance organism is described to bridge the gap from knowledge about biological organisms to the automotive software domain. The useful function for a target system is specified by the engineer by defining an identity. This identity marks the environment in which the system as organism must function during its product life, using the definition of equilibrium-action cycles to specify how that what is relevant to the identity leads to imbalance, followed by appropriate action to change the relation between the organism and the environment such as to restore balance within the organism. By sufficient specification of the identity, the result is a system functioning as imbalance organism that exhibits behaviour, without stopping until product death, that matches in its sense to that of self-maintenance of the identity. The end is simply a product that performs the intended function and thus purpose what it was designed for, yet, in a way that parallels the biological organism, providing a basis for fail-operational design beyond the current high-level approaches. A simple inverted-pendulum vehicle is used to elucidate the concepts of identity, imbalance organism, imbalance network and equilibrium-action cycle. Finally, the plausibility of imbalance network design is evaluated against Classic AUTOSAR.
... Established methodologies exist for modeling fault tolerance in localized systems (some system which is not part of a MAS) that consider the fault status of the local system as a state machine [48,56,57]. The specific notation and emphasis vary between researchers but the fundamental concepts are the same. ...
Article
Full-text available
Driven by the ever-growing diversity of software and hardware agents available on the market, Internet-of-Things (IoT) systems, functioning as heterogeneous multi-agent systems (MASs), are increasingly required to provide a level of reliability and fault tolerance. In this paper, we develop an approach to generalized quantifiable modeling of fault-tolerant and reliable MAS. We propose a novel software architectural model, the Intelligence Transfer Model (ITM), by which intelligence can be transferred between agents in a heterogeneous MAS. In the ITM, we propose a novel mechanism, the latent acceptable state, which enables it to achieve improved levels of fault tolerance and reliability in task-based redundancy systems, as used in the ITM, in comparison with existing agent-based redundancy approaches. We demonstrate these improvements through experimental testing of the ITM using an open-source candidate implementation of the model, developed in Python, and through an open-source simulator that tested the behavior of ITM-based MASs at scale. The results of these experiments demonstrated improvements in fault tolerance and reliability across all MAS configurations we tested. Fault tolerance was observed to improve by a factor of between 1.27 and 6.34 in comparison with the control group, depending on the ITM configuration tested. Similarly, reliability was observed to improve by a factor of between 1.00 and 4.73. Our proposed model has broad applicability to various IoT applications and generally in MASs that have fault tolerance or reliability requirements, such as in cloud computing and autonomous vehicles.
... Though integrating effective mitigation measures might reduce the damage from risks or hazards, failure of these mitigations may lead to a system fault. A fault refers to any physical defect or imperfection in some hardware or software component [63], or is defined as an abnormal condition that prompts component failure. An error, which arises due to a fault, is a discrepancy or deviation in the accuracy of the computation of measurements, perception, cognition, or decision-making. ...
Article
Full-text available
Smart mobility is an imperative facet of smart cities, and the transition of conventional automotive systems to connected and automated vehicles (CAVs) is envisioned as one of the emerging technologies on urban roads. The existing AV mobility environment is perhaps centered around road users and infrastructure, but it does not support future CAV implementation due to its proximity with distinct modules nested in the cyber layer. Therefore, this paper conceptualizes a more sustainable CAV-enabled mobility framework that accommodates all cyber-based entities. Further, the key to a thriving autonomous system relies on accurate decision making in real-time, but cyberattacks on these entities can disrupt decision-making capabilities, leading to complicated CAV accidents. Due to the incompetence of the existing accident investigation frameworks to comprehend and handle these accidents, this paper proposes a 5Ws and 1H-based investigation approach to deal with cyberattack-related accidents. Further, this paper develops STRIDE threat modeling to analyze potential threats endured by the cyber-physical system (CPS) of a CAV ecosystem. Also, a stochastic anomaly detection system is proposed to identify the anomalies, abnormal activities, and unusual operations of the automated driving system (ADS) functions during a crash analysis.
... HR consists of installing multiple sets of sensors on the aircraft that provide redundant measurements of the states. A voting logic is designed to monitor the outputs from all sensors, which isolates the sensor(s) with faults and decides the correct state value using the remaining sensor(s) [4,5,6]. However, one issue with the HR-based FDC scheme is the cost and weight penalty (due to redundant sensor installations). ...
Preprint
Compared with traditional model-based fault detection and classification (FDC) methods, deep neural networks (DNN) prove to be effective for aerospace sensor FDC problems. However, the time consumed in training the DNN is excessive, and explainability analysis for the FDC neural network is still underwhelming. A concept known as imagefication-based intelligent FDC has been studied in recent years. This concept advocates stacking the sensor measurement data into an image format; the sensor FDC issue is then transformed into an abnormal-region detection problem on the stacked image, which may well borrow the recent advances in the machine vision realm. Although promising results have been claimed in imagefication-based intelligent FDC research, due to the small size of the stacked image, small convolutional kernels and shallow DNN layers were used, which hinders the FDC performance. In this paper, we first propose a data augmentation method which inflates the stacked image to a larger size (corresponding to the VGG16 net developed in the machine vision realm). The FDC neural network is then trained via fine-tuning the VGG16 directly. To truncate and compress the FDC net size (hence its running time), we perform model pruning on the fine-tuned net. The class activation mapping (CAM) method is also adopted for explainability analysis of the FDC net to verify its internal operations. Via data augmentation, fine-tuning from VGG16, and model pruning, the FDC net developed in this paper achieves an FDC accuracy of 98.90% across 4 aircraft at 5 flight conditions (running time 26 ms). The CAM results also verify the FDC net w.r.t. its internal operations.
... When this step is performed correctly, the fault tolerance method will be more efficient. Due to the short duration of transient faults, it is assumed that they are mitigated during sequential executions of time redundancy-based approaches [29]. Further, time redundancy-based fault tolerance approaches diversify the operations through simple modifications to enhance their detection capability when confronting permanent faults. ...
Article
Full-text available
This paper presents a fault-tolerant ALU (“FT-EALU”) based on time redundancy and reward/punishment-based learning approaches for real-time embedded systems that face limitations in hardware and power consumption budgets. In this method, operations are diversified into three versions in order to correct permanent faults along with the transient ones. The diversity of the versions considered in FT-EALU is provided by lightweight modifications that differentiate them and clear the effect of permanent faults. Selecting lightweight modifications such as shift and swap avoids high timing overhead in computation while providing the significant differences which are necessary for fault detection. Next, the replicated versions are executed serially in time, and their corresponding results are voted on based on the derived learned weights. The proposed weighted voting module generates the final output based on the results and their weights. In the proposed weighted voting module, a reward/punishment strategy is employed to provide the weight of each version of execution, indicating its effectiveness in the final output. To this aim, a weight is defined for each version of execution according to its correction capability when confronting several faulty scenarios. This weight thus reflects the reliability of the temporal results as well as their effect on the final result. The final result is generated bit by bit based on the weight of each version of execution and its computed result. Based on the proposed learning scheme, positive or negative weights are assigned to execution versions. These weights are derived at the bit level based on the capability of the execution versions in mitigating permanent faults in several fault injection scenarios. Thus, our proposed method is low cost and more efficient compared to related research, which is mainly based on information and hardware redundancy, since it employs time redundancy and a static learning approach to correct permanent faults. Several experiments are performed to reveal the efficiency of our proposed approach, based on which FT-EALU is capable of correcting about 84.93% and 69.71% of injected permanent faults on single and double bits of input data, respectively.
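A rough sketch of bit-level weighted voting over three execution results is shown below; the weight values are invented and the reward/punishment learning that FT-EALU uses to derive them is omitted, so this illustrates only the voting step, not the authors' full method.

WIDTH = 8

def weighted_bit_vote(results, weights):
    # results: three integers; weights[v][i]: learned trust in bit i of version v.
    out = 0
    for i in range(WIDTH):
        score = sum(w[i] if (r >> i) & 1 else -w[i] for r, w in zip(results, weights))
        if score > 0:                        # positive evidence -> output bit set to 1
            out |= 1 << i
    return out

results = [0b01101010, 0b01101011, 0b00101010]   # versions 2 and 3 each disagree in one bit
weights = [[1.0] * WIDTH, [0.6] * WIDTH, [0.8] * WIDTH]
print(bin(weighted_bit_vote(results, weights)))  # 0b1101010, the consensus value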
... where p_j is the probability to recover a message bit j from a single trace and N is odd [Dub13, p. 64]. ...
Conference Paper
Full-text available
In this paper, we present a side-channel attack on a first-order masked implementation of IND-CCA secure Saber KEM. We show how to recover both the session key and the long-term secret key from 24 traces using a deep neural network created at the profiling stage. The proposed message recovery approach learns a higher-order model directly, without explicitly extracting random masks at each execution. This eliminates the need for a fully controllable profiling device which is required in previous attacks on masked implementations of LWE/LWR-based PKEs/KEMs. We also present a new secret key recovery approach based on maps from error-correcting codes that can compensate for some errors in the recovered message. In addition, we discovered a previously unknown leakage point in the primitive for masked logical shifting on arithmetic shares.
... Time-redundancy architectures execute the same operation to provide multiple copies of the hashes at different times. Finally, information redundancy adds check data to verify hash correctness before using it and, in some cases, even to allow the correction of an erroneous hash [12][13][14]. Each architecture implementing some kind of redundancy presents advantages and disadvantages concerning power consumption, amount of hardware resources, and performance (throughput and efficiency). ...
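The following small Python sketch (illustrative only, not from the cited architecture) contrasts two of these redundancy styles for a hash value: time redundancy recomputes the digest and compares, while information redundancy attaches a check word so that a stored digest can be verified before use.

import hashlib, zlib

def hash_time_redundant(data: bytes) -> bytes:
    # Time redundancy: compute the digest twice at different times and compare;
    # a mismatch indicates a transient fault during one of the computations.
    d1 = hashlib.sha256(data).digest()
    d2 = hashlib.sha256(data).digest()
    if d1 != d2:
        raise RuntimeError("transient fault detected in hash computation")
    return d1

def protect_with_checkword(digest: bytes) -> bytes:
    # Information redundancy: append a CRC32 check word so that a stored or
    # transmitted digest can be verified before it is used.
    return digest + zlib.crc32(digest).to_bytes(4, "big")

def verify_checkword(protected: bytes) -> bytes:
    digest, crc = protected[:-4], int.from_bytes(protected[-4:], "big")
    if zlib.crc32(digest) != crc:
        raise RuntimeError("corrupted hash value detected")
    return digest

digest = hash_time_redundant(b"message")
stored = protect_with_checkword(digest)
assert verify_checkword(stored) == digest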
Article
Full-text available
In emergent technologies, data integrity is critical for message-passing communications, where security measures and validations must be considered to prevent the entrance of invalid data, detect errors in transmissions, and prevent data loss. The SHA-256 algorithm is used to tackle these requirements. Current hardware architectures present issues in balancing processing, efficiency, and cost in real time, because some of them introduce significant critical paths. Moreover, the SHA-256 algorithm itself provides no verification mechanism for internal calculations or failure prevention. Hardware implementations can be affected by diverse problems, ranging from physical phenomena to interference or faults inherent to data spectra. Previous works have mainly addressed this problem through three kinds of redundancy: information, hardware, or time. To the best of our knowledge, pipelining has not previously been used to perform different hash calculations for redundancy purposes. Therefore, in this work, we present a novel hybrid architecture, implemented on a 3-stage pipeline structure, which is traditionally used to improve performance by simultaneously processing several blocks; instead, we propose using the pipeline technique to implement hardware and time redundancies, analyzing hardware resources and performance to balance the critical path. We have improved performance at a given clock speed by defining a data flow transformation in several sequential phases. Our architecture achieves a throughput of 441.72 Mbps using 2255 LUTs, with an efficiency of 195.8 Kbps/LUT.
... Outputs from all the sensors are continuously monitored by a voting logic, which detects (and isolates) the defective sensor. The correct measurement is then reported using the remaining sensors [4,5,6]. ...
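A minimal Python sketch of such a voting logic for three replicated sensors is given below (the tolerance value and structure are illustrative): the median masks a single faulty reading, and the reading that deviates from the median by more than the tolerance identifies the sensor to isolate.

def vote_sensors(readings, tolerance=0.5):
    # readings: list of three sensor values.
    # Returns (voted_value, index of the isolated sensor or None).
    voted = sorted(readings)[1]                      # median of three masks one bad reading
    deviations = [abs(r - voted) for r in readings]
    worst = max(range(3), key=lambda i: deviations[i])
    faulty = worst if deviations[worst] > tolerance else None
    return voted, faulty

print(vote_sensors([101.2, 101.4, 250.0]))   # (101.4, 2): sensor 2 detected and isolated
print(vote_sensors([101.2, 101.4, 101.3]))   # (101.3, None): all sensors consistent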
Preprint
In this paper, a novel data-driven approach named Augmented Imagefication for Fault Detection (FD) of aircraft air data sensors (ADS) is proposed. Exemplifying the FD problem of aircraft air data sensors, an online FD scheme on an edge device based on a deep neural network (DNN) is developed. First, the aircraft inertial reference unit measurements are adopted as equivalent inputs, which is scalable to different aircraft/flight cases. Data associated with 6 different aircraft/flight conditions are collected to provide diversity (scalability) in the training/testing database. Then Augmented Imagefication is proposed for the DNN-based prediction of flying conditions. The raw data are reshaped as a grayscale image for convolutional operation, and the necessity of augmentation is analyzed and pointed out. Different kinds of augmentation methods, i.e., Flip, Repeat, Tile, and their combinations, are discussed; the results show that the All Repeat operation along both axes of the image matrix leads to the best performance of the DNN. The interpretability of the DNN is studied based on Grad-CAM, which provides a better understanding and further solidifies the robustness of the DNN. Next, the DNN model, VGG-16 with augmented imagefication data, is optimized for mobile hardware deployment. After pruning of the DNN, a lightweight model (98.79% smaller than the original VGG-16) with high accuracy (slightly up by 0.27%) and fast speed (time delay reduced by 87.54%) is obtained. The hyperparameter optimization of the DNN based on TPE is implemented, and the best combination of hyperparameters is determined (a learning rate of 0.001, 600 training epochs, and a batch size of 100 yield the highest accuracy of 0.987). Finally, an online FD deployment based on an edge device, the Jetson Nano, is developed, and real-time monitoring of the aircraft is achieved. We believe that this method is instructive for addressing FD problems in other similar fields.
... Research in fault-tolerant cloud computing systems spans a wide range of applications, from general-purpose computer systems to highly available computer, space, transportation, and military systems [33], [36], [37]. We list some applications and discuss them briefly in order to illustrate the differences between them. ...
Article
Full-text available
Fault-tolerance methods are required to ensure high availability and high reliability in cloud computing environments. In this survey, we address fault-tolerance in the scope of cloud computing. Recently, cloud computing-based environments have presented new challenges to support fault-tolerance and opened new paths to develop novel strategies, architectures, and standards. We provide a detailed background of cloud computing to establish a comprehensive understanding of the subject, from basic to advanced. We then highlight fault-tolerance components and system-level metrics and identify the needs and applications of fault-tolerance in cloud computing. Furthermore, we discuss state-of-the-art proactive and reactive approaches to cloud computing fault-tolerance. We further structure and discuss current research efforts on cloud computing fault-tolerance architectures and frameworks. Finally, we conclude by enumerating future research directions specific to cloud computing fault-tolerance development.
... Outputs from all the sensors are continuously monitored by a voting logic, which detects (and isolates) the erroneous sensor. The correct measurement is then reported using the remaining sensors [4][5][6]. ...
Article
Full-text available
Fault detection (FD) is important for health monitoring and safe operation of dynamical systems. Previous studies use model-based approaches which are sensitive to system specifics, attenuating the robustness. Data-driven methods have claimed accurate performances which scale well to different cases, but the algorithmic structures and enclosed operations are “black,” jeopardizing its robustness. To address these issues, exemplifying the FD problem of aircraft air data sensors, we explore to develop a robust (accurate, scalable, explainable, and interpretable) FD scheme using a typical data-driven method, i.e., deep neural networks (DNN). To guarantee the scalability, aircraft inertial reference unit measurements are adopted as equivalent inputs to the DNN, and a database associated with 6 different aircraft/flight conditions is constructed. Convolutional neural networks (CNN) and long-short time memory (LSTM) blocks are used in the DNN scheme for accurate FD performances. To enhance robustness of the DNN, we also develop two new concepts: “large structure” which corresponds to the parameters that can be objectively optimized (e.g., CNN kernel size) via certain metrics (e.g., accuracy) and “small structure” that conveys subjective understanding of humans (e.g., class activation mapping in CNN) within a certain context (e.g., object detection). We illustrate the optimization process we adopted in devising the DNN large structure, which yields accurate (90%) and scalable (24 diverse cases) performances. We also interpret the DNN small structure via class activation mapping, which yields promising results and solidifies the robustness of DNN. Lessons and experiences we learned are also summarized in the paper, which we believe is instructive for addressing the FD problems in other similar fields.
... Redundancy can also be classified as homogeneous or heterogeneous, depending on the type of redundant modules used. In homogeneous redundancy, the same technology is replicated to perform the same function, mitigating only random failures [7]. On the other hand, the heterogeneous approach uses different technologies to perform the same function, allowing the system to recover from systematic failures due to a given technology's inherent limitations. ...
Conference Paper
Full-text available
Highly reliable systems achieve a low failure probability during their operational lifetime with the help of redundancy. This technique ensures functionality by replicating components or modules, in both software and hardware. The addition of redundancy and the further architectural decisions that arise from its usage result in increased system complexity. The resultant complexity hinders analytical approaches to evaluate competing architectural designs, as the time and effort spent on this type of evaluation may significantly delay development. A way to avoid time spent on this type of analysis is to submit the designed architecture to simulation, both for validation and evaluation. In this paper, we propose the usage of a simulation tool, specifically QEMU, to assist reliable system development and simulation. Based on this tool, extensions were developed, aiming for a simulation environment that covers the redundancy use case, allows the validation of the complex interactions under redundant architectures, and supports reliability estimations to compare architecturally redundant designs.
... According to [Dub13], redundancy enables the following approaches for dealing with faults: • Fault masking ensures selecting a fault-free service or result out of multiple redundant services or results that might be affected by faults. ...
Thesis
The advent of new technology trends like the Internet of things and autonomous driving have made embedded systems more pervasive in the everyday life and has brought them to applications with strict non-functional requirements such as performance and power consumption. However, the continuous shrinkage in semiconductor devices has made the electronic components used in these systems increasingly susceptible to failure and degradation mechanisms like negative-bias temperature instability and gate oxide breakdown. This renders reliability a major concern in the design of embedded systems, and necessitates the automatic analysis and optimization of system reliability alongside other design objectives. To this end, this dissertation is based upon a system-level design methodology to overcome challenges and leverage opportunities in the design of reliable, yet efficient system implementations. Due to variations in manufacturing, environment, and usage conditions, the reliability of modern electronic components is associated with various forms of uncertainty. The uncertainty in component reliability can propagate through the system, causing system reliability to be uncertain as well. Moreover, destructive effects such as high temperature have simultaneous impacts on several adjacent components, giving rise to correlation in their uncertainties. To consider uncertain characteristics and their correlations in system reliability analysis, this dissertation proposes to incorporate existing techniques, such as binary decision diagrams, and a Monte Carlo simulation into an uncertainty-aware reliability analysis. It models the reliability of each component using a reliability function with parameters characterized by probability distributions, and derives the probability distribution of system reliability. This necessitates the design space exploration to allow for the comparison of candidate system implementations with design objectives represented by probability distributions instead of single values. To this end, this dissertation introduces novel statistical and probabilistic comparison operators that enable to differentiate the quality of system implementations in the presence of uncertainty at low execution time overhead. The fabrication of many identical components into a single integrated circuit to achieve massive parallelism or redundancy can bring about vulnerabilities to common-cause failures which are multiple simultaneous component failures resulting from a shared root cause and a coupling mechanism. Reducing the effects of redundancy, common-cause failures have a significant contribution to system unreliability. The manifestation of common-cause failures upon the occurrence of the root cause is often probabilistic, i.e., once the root cause occurs, the affected components fail with different probabilities. The consideration of probabilistic common-cause failures renders system reliability analysis a more demanding task. This dissertation proposes various approaches to incorporate probabilistic common-cause failures into an automatic system-level reliability analysis efficiently. These approaches are based on (i) the decomposition of system reliability model into mutually exclusive success paths, and (ii) the explicit and implicit considerations of probabilistic common-cause failures in the analysis of individual paths. System-level reliability optimization is usually carried out through redundancy allocation and component hardening. 
Excessive use of these means may degrade other design objectives like performance and monetary costs. To optimize reliability in a multi-objective design space exploration, one has to rely on (i) optimization techniques such as evolutionary algorithms that typically explore new implementations by randomly changing the best found implementations, and (ii) reliability analysis techniques that evaluate the reliability of each implementation as a whole, without identifying reliability bottlenecks. This dissertation applies the concept of component importance in a low-overhead heuristic to estimate the contribution of components to the system (un)reliability and to rank them accordingly. This information is used in a local search to efficiently improve the reliability of system implementations found in each generation of an evolutionary algorithm through selective component hardening.
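To illustrate the uncertainty-aware analysis sketched in this abstract, the following Python fragment treats each component failure rate as a random variable and uses Monte Carlo sampling to obtain a distribution of system reliability instead of a point value; the series-parallel structure and the lognormal parameters are illustrative assumptions, not taken from the dissertation.

import math, random, statistics

def system_reliability(lams, t):
    # Illustrative structure: components 1 and 2 are redundant (parallel),
    # in series with component 0. Each component follows R_i(t) = exp(-lambda_i * t).
    r = [math.exp(-lam * t) for lam in lams]
    return r[0] * (1.0 - (1.0 - r[1]) * (1.0 - r[2]))

def monte_carlo(t=1000.0, samples=10000):
    results = []
    for _ in range(samples):
        # Each failure rate is uncertain: sampled from a lognormal distribution.
        lams = [random.lognormvariate(math.log(1e-4), 0.3) for _ in range(3)]
        results.append(system_reliability(lams, t))
    return statistics.mean(results), statistics.stdev(results)

print(monte_carlo())   # e.g. mean around 0.90, with a spread reflecting the parameter uncertainty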
Conference Paper
Full-text available
Fault tree analysis is a quantitative and qualitative procedure for evaluating system malfunction hazards. The method is well-known and widely used, especially in the safety systems domain, where it is a mandatory integral part of the so-called "Hazard Evaluation" documentation. This paper proposes an alternative or complementary deductive fault analysis method: it uses the system topology to build a hypergraph representation of the system in order to identify component criticality and to support the evaluation of loss-of-functionality probability. Once automated, the proposed method seems promising when system engineers explore different architectures: they may obtain an indication of an architecture's reliability without continuous feedback from the system safety team. The system safety team must check the solution once the engineers select the final architecture. They can also use the proposed method to validate the correctness of the fault tree analysis.
Thesis
The aim of this work is to compare various state-of-the-art techniques for solving the problem of fault isolation on an aircraft. Starting from this main aim, the work has also been extended to compare techniques that correctly estimate the extent of the fault that has occurred. The comparison is carried out using specific metrics defined within this thesis. The ultimate aim of this work is to improve the already known techniques by using a Bayesian filter and to show the performance increase that this tool makes possible.
Article
Field Programmable Gate Arrays (FPGAs) are often used in space, military, and commercial applications due to their re-programmable feature. FPGAs are semiconductor components susceptible to soft errors due to radiation effects. Fault tolerance is a critical feature for improving the reliability of electronic and computational components in high-safety applications. Triple Modular Redundancy (TMR) is the most commonly used fault-tolerant technique in electronic systems. TMR is reliable and efficient for recovering from single-event upsets. However, the limitation of this technique is its area overhead. Prior work has proposed many conventional fault-tolerant approaches that have been unable to avoid area overhead. This paper introduces a novel error analysis-based technique. The technique works with an error percentage and a preferential algorithm, which is also proposed here, to reduce the hardware complexity found in existing works. The technique can be applied to various types of arithmetic circuits. The proposed technique is applied to an adder circuit to verify the hardware usage, power consumption, and delay; it has been implemented on the Proasic3e 3000 FPGA. The simulated results show 39.89% fewer IO cells, 47.10% fewer core cells, and 5.32% less power compared to the TMR-based adder.
Chapter
The undeniable need for energy efficiency in today’s devices is leading to the adoption of innovative computing paradigms, such as Approximate Computing. As this paradigm gains increasing interest, important challenges, as well as opportunities, arise concerning the dependability of those systems. This chapter will focus on test and reliability issues related to approximate hardware systems. It will cover problems and solutions concerning the impact of the approximation on hardware defect classification, test generation, and test application. Moreover, the impact of the approximation on fault tolerance will be discussed, along with related design solutions to mitigate it. Keywords: Approximate computing, Test, Reliability, Approximate circuits, Fault classification, Test pattern generation, Approximation-aware test application methodology, Fault injection, Error analysis, Error detection and correction, Fault masking, Triple modular redundancy, Duplication with comparison, Approximate fault tolerance
Book
This textbook intends to be a comprehensive and substantially self-contained two-volume book covering performance, reliability, and availability evaluation subjects. The volumes focus on computing systems, although the methods may also be applied to other systems. The first volume covers Chapter 1 to Chapter 14 and is subtitled "Performance Modeling and Background". The second volume encompasses Chapter 15 to Chapter 25 and is subtitled "Reliability and Availability Modeling, Measuring and Workload, and Lifetime Data Analysis". This text is helpful for computer performance professionals in supporting the planning, design, configuration, and tuning of the performance, reliability, and availability of computing systems. Such professionals may use these volumes to get acquainted with specific subjects by looking at the particular chapters. Many examples in the textbook on computing systems will help them understand the concepts covered in each chapter. The text may also be helpful for instructors who teach performance, reliability, and availability evaluation subjects. Many possible threads could be configured according to the interest of the audience and the duration of the course. Chapter 1 presents a good number of possible course programs that could be organized using this text. Volume II is composed of the last two parts. Part III examines reliability and availability modeling, covering a set of fundamental notions, definitions, redundancy procedures, and modeling methods such as Reliability Block Diagrams (RBD) and Fault Trees (FT) with their respective evaluation methods, and adopts Markov chains, Stochastic Petri nets, and even hierarchical and heterogeneous modeling to represent more complex systems. Part IV discusses performance measurements and reliability data analysis. It first depicts some basic measuring mechanisms applied in computer systems and then discusses workload generation. Afterwards, we examine failure monitoring and fault injection, and finally, we discuss a set of techniques for reliability and maintainability data analysis.
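As a small, self-contained example of the Reliability Block Diagram evaluation rules mentioned in Part III, the Python sketch below composes series and parallel blocks; the component reliabilities and the structure are illustrative, not taken from the book.

from functools import reduce

def series(*rs):
    # Series blocks: the system works only if every block works.
    return reduce(lambda acc, r: acc * r, rs, 1.0)

def parallel(*rs):
    # Parallel blocks: the system fails only if every block fails.
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), rs, 1.0)

# Illustrative component reliabilities.
r_cpu, r_mem, r_psu = 0.99, 0.995, 0.98
# Duplicated power supplies in parallel, in series with CPU and memory.
print(series(r_cpu, r_mem, parallel(r_psu, r_psu)))   # about 0.9847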
Book
This textbook intends to be a comprehensive and substantially self-contained two-volume book covering performance, reliability, and availability evaluation subjects. The volumes focus on computing systems, although the methods may also be applied to other systems. The first volume covers Chapter 1 to Chapter 14 and is subtitled "Performance Modeling and Background". The second volume encompasses Chapter 15 to Chapter 25 and is subtitled "Reliability and Availability Modeling, Measuring and Workload, and Lifetime Data Analysis". This text is helpful for computer performance professionals in supporting the planning, design, configuration, and tuning of the performance, reliability, and availability of computing systems. Such professionals may use these volumes to get acquainted with specific subjects by looking at the particular chapters. Many examples in the textbook on computing systems will help them understand the concepts covered in each chapter. The text may also be helpful for instructors who teach performance, reliability, and availability evaluation subjects. Many possible threads could be configured according to the interest of the audience and the duration of the course. Chapter 1 presents a good number of possible course programs that could be organized using this text. Volume I is composed of the first two parts, besides Chapter 1. Part I gives the knowledge required for the subsequent parts of the text. This part includes six chapters. It covers an introduction to probability, descriptive statistics and exploratory data analysis, random variables, moments, covariance, some helpful discrete and continuous random variables, Taylor series, inference methods, distribution fitting, regression, interpolation, data scaling, distance measures, and some clustering methods. Part II presents methods for performance evaluation modeling, such as operational analysis, Discrete-Time Markov Chains (DTMC), Continuous-Time Markov Chains (CTMC), Markovian queues, Stochastic Petri nets (SPN), and discrete event simulation.
Article
A concise tutorial on fault-tolerant computing concepts is presented. It illustrates these concepts with descriptions of several state-of-the-art, fault-tolerant computing systems. The discussion is limited to microprocessor- and microcomputer-based systems, particularly those which utilize commercially available processors. The concepts and definitions presented here are based on those of Siewiorek and Swarz, Nelson and Carroll, and Avizienis. Causes of faults and their effects are introduced, and three major techniques to maintain a system's normal performance or to attempt to improve it are discussed. They are fault avoidance, fault masking, and fault tolerance.
Article
A new method of concurrent error detection in the Arithmetic and Logic Units (ALU's) is proposed. This method, called "Recomputing with Shifted Operands" (RESO), can detect errors in both the arithmetic and logic operations. RESO uses the principle of time redundancy in detecting the errors and achieves its error detection capability through the use of the already existing replicated hardware in the form of identical bit slices. It is shown that for most practical ALU implementations, including the carry-lookahead adders, the RESO technique will detect all errors caused by faults in a bit-slice or a specific subcircuit of the bit slice. The fault model used is more general than the commonly assumed stuck-at fault model. Our fault model assumes that the faults are confined to a small area of the circuit and that the precise nature of the faults is not known. This model is very appropriate for the VLSI circuits.
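A toy Python model of the RESO idea (illustrative only; the datapath is widened by two bits so the one-bit shift cannot overflow, which is not necessarily how the paper's design handles it) shows how recomputing with shifted operands exposes a permanent fault that an identical re-execution would miss:

WIDTH = 8                         # operand width
DP_MASK = (1 << (WIDTH + 2)) - 1  # datapath two bits wider, so a 1-bit shift cannot overflow

def adder(a, b, stuck_bit=None):
    # Adder model; optionally forces one datapath output bit to 0 (permanent stuck-at-0).
    s = (a + b) & DP_MASK
    if stuck_bit is not None:
        s &= DP_MASK ^ (1 << stuck_bit)
    return s

def add_reso_checked(a, b, stuck_bit=None):
    # First run: plain operands. Second run: operands shifted left by one,
    # result shifted back. A fault fixed to one physical bit slice corrupts
    # different logical result bits in the two runs, so the comparison detects it.
    r_plain = adder(a, b, stuck_bit)
    r_shift = adder(a << 1, b << 1, stuck_bit) >> 1
    return r_plain, r_plain != r_shift   # (result, fault detected?)

print(add_reso_checked(100, 57))               # (157, False): fault-free, both runs agree
print(add_reso_checked(100, 57, stuck_bit=0))  # (156, True): permanent stuck-at-0 detected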
Article
This paper details the fault detection capability of a design technique named "alternating logic design." The technique achieves its fault detection capability by utilizing redundancy in time instead of the conventional redundancy in space and is based on the successive execution of a required function and its dual. In combinational networks the method involves the utilization of a self-dual function to represent the required function and the realization of the self-dual function in a network with structural properties which are sufficient to guarantee the detection of all single faults. One network structure with sufficient structural properties to detect all single stuck-line faults is the standard AND/OR or OR/AND two-level network [1]. However, other more general combinational logic structures also possess sufficient structural properties. Necessary and sufficient structural properties for any alternating network to be capable of detecting all single faults are derived.
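A small Python sketch of the alternating-logic idea follows (illustrative; a full adder is used here because both of its outputs are self-dual functions): the function is executed a second time with complemented inputs, and the outputs of the two executions must be complements of each other, otherwise a fault is reported.

def full_adder(a, b, cin):
    # One-bit full adder: sum is the parity of the inputs, carry is the majority.
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def full_adder_checked(a, b, cin):
    s1, c1 = full_adder(a, b, cin)                # first execution with true inputs
    s2, c2 = full_adder(1 - a, 1 - b, 1 - cin)    # re-execution with complemented inputs
    if (s2, c2) != (1 - s1, 1 - c1):              # self-duality violated -> fault detected
        raise RuntimeError("fault detected by alternating logic check")
    return s1, c1

print(full_adder_checked(1, 0, 1))   # (0, 1)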
Article
The adder is intended to be used as a building block in the design of more complex circuits and systems using very large scale integration (VLSI). An efficient approach to error detection has been selected through extensive comparisons of several methods that use hardware, time, and hybrid redundancy. Simulation and analysis results are presented to illustrate the adder's timing characteristics, hardware requirements, and error-detection capabilities. One novel feature of the analysis is the introduction of error latency as a means of comparing the error-detection capabilities of several alternative approaches