Conference Paper

Simple Testing Can Prevent Most Critical Failures --- An Analysis of Production Failures in Distributed Data-intensive Systems


Abstract

Large, production-quality distributed systems still fail periodically, and sometimes do so catastrophically, where most or all users experience an outage or data loss. We present the results of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that, from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures, and the order between them is important. Finally, we found that the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and reproduction of the production failures, often with unit tests. We found that the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code, the last line of defense, even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
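To make the idea of simple rules on error-handling code concrete, below is a minimal Java sketch with hypothetical class and method names, not code from the studied systems. It shows the kinds of handler patterns described in the citing work further down (empty handlers, wrong actions in over-broad handlers, and handlers whose comments admit they are unfinished). A unit test that simply forces each handler to run, for example by passing a task that throws, would expose these bugs without any knowledge of the overall design.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical examples of error-handler anti-patterns; none of this is
// code from Cassandra, HBase, HDFS, MapReduce, or Redis.
public class HandlerAntiPatterns {
    private static final Logger LOG = Logger.getLogger("HandlerAntiPatterns");

    // Anti-pattern: the handler is empty, so a failed flush goes unnoticed
    // until data is lost.
    static void emptyHandler(Runnable flushTask) {
        try {
            flushTask.run();
        } catch (RuntimeException e) {
            // silently swallowed
        }
    }

    // Anti-pattern: an over-broad catch aborts the whole process for a
    // recoverable, non-fatal error.
    static void overCatchAndAbort(Runnable handleRequest) {
        try {
            handleRequest.run();
        } catch (Throwable t) {          // catches far more than intended
            LOG.log(Level.SEVERE, "unexpected error", t);
            System.exit(1);              // one bad request takes down the node
        }
    }

    // Anti-pattern: the handler admits it is unfinished.
    static void todoHandler(AutoCloseable resource) {
        try {
            resource.close();
        } catch (Exception e) {
            // TODO: handle close failure properly
        }
    }
}
```

A test as small as `emptyHandler(() -> { throw new RuntimeException("disk full"); })` exercises the error path directly, which is the "simple testing" the abstract refers to.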
... Description. Developers usually rely on logs for error diagnostics when exceptions occur [47]. However, we find that in Table 2, in the ParamProcessWorker class in CloudStack, the try block contains two catch blocks whose log messages are identical. ...
... In all of the 11 cases, we find that one log would record the stack trace for Exception, and the duplicate log would only record the type of the exception that occurred (e.g., by calling e.getMessage()) for a more specific exception. The rationale may be that generic exceptions, once they occur, are often not expected by developers [47], so it is important that developers record more error-diagnostic information. ...
... Similar to what we observed in IE, we find that for 9/90 instances, the log level for a more generic exception is usually more severe (e.g., error level for the generic Java Exception and info level for an application-specific exception). Generic exceptions might be unexpected to developers [47], so developers may use a higher log level (e.g., error) to record exception messages. ...
Preprint
In this paper, we focus on studying duplicate logging statements, which are logging statements that have the same static text message. We manually studied over 4K duplicate logging statements and their surrounding code in five large-scale open source systems. We uncovered five patterns of duplicate logging code smells. For each instance of the duplicate logging code smell, we further manually identify the potentially problematic and justifiable cases. Then, we contact developers to verify our manual study result. We integrated our manual study result and the feedback of developers into our automated static analysis tool, DLFinder, which automatically detects problematic duplicate logging code smells. We evaluated DLFinder on the five manually studied systems and three additional systems. In total, combining the results of DLFinder and our manual analysis, we reported 91 problematic duplicate logging code smell instances to developers and all of them have been fixed. We further study the relationship between duplicate logging statements, including the problematic instances of duplicate logging code smells, and code clones. We find that 83% of the duplicate logging code smell instances reside in cloned code, but 17% of them reside in micro-clones that are difficult to detect using automated clone detection tools. We also find that more than half of the duplicate logging statements reside in cloned code snippets, and a large portion of them reside in very short code blocks which may not be effectively detected by existing code clone detection tools. Our study shows that, in addition to general source code that implements the business logic, code clones may also result in bad logging practices that could increase maintenance difficulties.
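As an illustration of the smell described in the snippets above, here is a small Java sketch; the class, messages, and exceptions are made up and are not the actual CloudStack code. Two catch blocks of the same try share an identical static log message, so the resulting log cannot tell the two error paths apart; one possible fix gives each path a distinct message and records the stack trace for the generic, unexpected exception.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch of the duplicate-logging smell; names are invented.
public class DuplicateLogExample {
    private static final Logger LOG = Logger.getLogger("DuplicateLogExample");

    // Smell: both catch blocks emit the same static text.
    void processParamsSmelly(String raw) {
        try {
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            LOG.warning("Unable to process parameter");
        } catch (RuntimeException e) {
            LOG.warning("Unable to process parameter");
        }
    }

    // Possible fix: distinct messages, and a stack trace for the generic
    // (unexpected) exception.
    void processParamsFixed(String raw) {
        try {
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            LOG.warning("Parameter is not a valid number: " + e.getMessage());
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "Unexpected error while processing parameter", e);
        }
    }
}
```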
... In this paper, we study the usefulness of logs in bug reports and the challenges that developers may encounter when analyzing such logs. We conduct our study on 10 open-source systems (i.e., ActiveMQ, AspectJ, Hadoop Common, HDFS, MapReduce, YARN, Hive, PDE, Storm, and Zookeeper), which are commonly used in prior log-related studies (Chen and Jiang 2017b; Yuan et al. 2014; Li et al. 2019, 2020a). In particular, we seek to answer the following research questions: RQ1) Are bug reports with logs resolved faster than bug reports without logs? ...
... The size of the studied systems ranges from 144K to 1.7M lines of code. These studied systems are widely used in prior log-related studies and have high-quality logs (Chen and Jiang 2017b; Li et al. 2019; Yuan et al. 2014). The studied systems also cover different domains, varying from virtual machine deployment systems to data warehousing solutions. ...
... To collect the bug reports, we built a web crawler that sends REST API calls to the Jira repositories. We select the bug reports based on the criteria that are used in prior bug report studies (Chen et al. 2014, 2017b; Yuan et al. 2014). Namely, we select bug reports of the type "Bug", whose status is "Closed" or "Resolved", with the resolution "Fixed" and priority marked as "Major" or above. ...
Article
Full-text available
Logs in bug reports provide important debugging information for developers. During the debugging process, developers need to study the bug report and examine user-provided logs to understand the system executions that lead to the problem. Intuitively, user-provided logs illustrate the problems that users encounter and may help developers with the debugging process. However, some logs may be incomplete or inaccurate, which can cause difficulty for developers to diagnose the bug, and thus, delay the bug fixing process. In this paper, we conduct an empirical study on the challenges that developers may encounter when analyzing the user-provided logs and their benefits. In particular, we study both log snippets and exception stack traces in bug reports. We conduct our study on 10 large-scale open-source systems with a total of 1,561 bug reports with logs (BRWL) and 7,287 bug reports without logs (BRNL). Our findings show that: 1) BRWL takes longer time (median ranges from 3 to 91 days) to resolve compared to BRNL (median ranges from 1 to 25 days). We also find that reporters may not attach accurate or sufficient logs (i.e., developers often ask for additional logs in the Comments section of a bug report), which extends the bug resolution time. 2) Logs often provide a good indication of where a bug is located. Most bug reports (73%) have overlaps between the classes that generate the logs and their corresponding fixed classes. However, there is still a large number of bug reports where there is no overlap between the logged and fixed classes. 3) Our manual study finds that there is often missing system execution information in the logs. Many logs only show the point of failure (e.g., exception) and do not provide a direct hint on the actual root cause. In fact, through call graph analysis, we find that 28% of the studied bug reports have the fixed classes reachable from the logged classes, while they are not visible in the logs attached in bug reports. In addition, some logging statements are removed in the source code as the system evolves, which may cause further challenges in analyzing the logs. In short, our findings highlight possible future research directions to better help practitioners attach or analyze logs in bug reports.
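The call-graph observation above boils down to a reachability question: starting from the classes that emitted the logs, can the eventually fixed classes be reached? The toy Java sketch below, with a hypothetical call graph and class names, shows that check; the study itself of course builds call graphs from the real systems.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy reachability check: can any fixed class be reached from the classes
// that produced the logs? Graph and class names are hypothetical.
public class LogToFixReachability {

    static boolean fixedClassReachable(Map<String, Set<String>> callGraph,
                                       Set<String> loggedClasses,
                                       Set<String> fixedClasses) {
        Deque<String> work = new ArrayDeque<>(loggedClasses);
        Set<String> seen = new HashSet<>(loggedClasses);
        while (!work.isEmpty()) {
            String cls = work.poll();
            if (fixedClasses.contains(cls)) {
                return true;            // the fix is reachable from the logs
            }
            for (String callee : callGraph.getOrDefault(cls, Set.of())) {
                if (seen.add(callee)) {
                    work.add(callee);
                }
            }
        }
        return false;                   // logs give no path to the fixed code
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = Map.of(
                "RpcServer", Set.of("RequestHandler"),
                "RequestHandler", Set.of("BlockManager"));
        System.out.println(fixedClassReachable(
                graph, Set.of("RpcServer"), Set.of("BlockManager"))); // true
    }
}
```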
... Several studies have identified and characterised EH code anomalies, referring to them using their own nomenclature such as bug patterns [21], inappropriate coding patterns [22], bad smells [23,24], and anti-patterns [14]. Barbosa et al. [19] identify software failures caused by such kind of code anomalies. ...
... Yuan et al. [21] found in their study that approximately 92% of catastrophic failures come from non-fatal errors signalled by the software itself. In addition, 35% of such failures were caused by empty handlers, wrong actions in generic handlers, and handlers with comments suggesting the need for further implementation and correction (e.g., "FIXME" and "TODO"). ...
... Inadequate documentation of EH is already a known issue [12]. It impacts the developers' understanding of the consequences of not offering adequate EH [21]. It also has implications for their comprehension of the types of exceptions thrown by a method [14]. ...
Article
Full-text available
Exception handling is a well-known technique used to improve software robustness. However, recent studies report that developers, mostly novice ones, typically neglect exception handling. We believe the quality of exception handling code in a software project is directly affected (i) by the absence, or lack of awareness, of an explicit exception handling policy and guidelines and (ii) by a silent rise of exception handling anti-patterns. In this paper, we investigate this phenomenon in a case study of a long-lived large-scale Java Web system in a Public Education Institution, trying to better understand the relationship between (i) and (ii), and the impact of developers' turnover, skills, and guidance on (ii). Our case study takes into account both technical and human aspects. As a first step, we surveyed 21 developers regarding their perception of exception handling in the system's institution. Next, we analysed the evolution of exception handling anti-patterns across 15 releases of the target system. We conducted a semi-structured interview with three senior software engineers, representatives of the development team, to present partial results of the case and raise possible causes for the problems found. The interviewed professionals and a second analysis of the code identified the high team turnover as the source of this phenomenon, since the public procurement process for hiring new developers has mostly attracted novice ones. These findings suggest that the absence of an explicit exception handling policy negatively impacts the developers' perception and implementation of exception handling. Furthermore, the absence of such a policy has been leading developers to replicate existing anti-patterns and spread them through new features added during system evolution. We also observed that most developers have low skills regarding exception handling in general and low knowledge regarding the design and implementation of exception handling in the system. The system maintainer now has a diagnosis of the major causes of the quality problems in the exception handling code and was able to lead the required measures to repair this technical debt.
... Some of the research in this thread overlaps with logging practices and logging code progression, as some of the logging issues are uncovered during the examination of logging practices and their evolution. Yuan et al. [110] presented a characteristic study on real-world failures in distributed systems, and observed that the majority of failures print explicit failure-related log messages which can be used to replay (i.e., recreate) the failures. However, recorded log messages are noisy, which makes the analysis of logs tedious. ...
... Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems [110]. ...
Preprint
Full-text available
Logs are widely used to record runtime information of software systems, such as the timestamp and the importance of an event, the unique ID of the source of the log, and a part of the state of a task's execution. The rich information in logs enables system developers (and operators) to monitor the runtime behaviors of their systems, track down system problems, and perform analysis on log data in production settings. However, prior research on utilizing logs is scattered, which limits the ability of new researchers in this field to quickly get up to speed and hampers currently active researchers from advancing the field further. Therefore, this paper surveys and provides a systematic literature review of contemporary logging practices and of log mining and monitoring techniques and their applications, such as system failure detection and diagnosis. We study a large number of conference and journal papers that appeared in top-level peer-reviewed venues. Additionally, we draw high-level trends of ongoing research and categorize publications into subdivisions. In the end, and based on our holistic observations during this survey, we provide a set of challenges and opportunities that will lead researchers in academia and industry in moving the field forward.
... However, an observation known as the 'small scope hypothesis' [12] states that analyzing small system instances suffices in practice since a high proportion of bugs can be found by verifying a system for all inputs within some (usually small) scope. A plethora of empirical studies [1,20,25] support this hypothesis. For example, Yuan et al. [25] analyzed production failures in distributed data-intensive systems and showed that simple testing can prevent most critical failures. ...
... A plethora of empirical studies [1,20,25] support this hypothesis. For example, Yuan et al. [25] analyzed production failures in distributed data-intensive systems and showed that simple testing can prevent most critical failures. In particular, the aforementioned study showed that out of the 198 bug reports that were analyzed for several distributed systems, 98% of those bugs could be triggered in a verification setting of three or fewer processes. ...
Article
Full-text available
Novel computing paradigms, e.g., the Cloud, introduce new requirements with regard to access control, such as the utilization of historical information and continuity of decision. However, these concepts may introduce an additional level of complexity to the underpinning model, rendering its definition and verification a cumbersome and error-prone process. Using a formal language to specify a model and formally verify it may lead to a rigorous definition of the interactions amongst its components, and the provision of formal guarantees for its correctness. In this paper, we consider a case study where we specify a formal model in TLA+ for both a policy-neutral and a policy-specific UseCON usage control model. Through that, we anticipate shedding light on the analysis and verification of usage control models and policies by sharing our experience with TLA+-specific tools.
... They provide an understanding of different types of failures and their causes and impacts on cloud services. Yuan et al. [49] study user-reported failures in five popular distributed data-analytic and storage systems. The goal of their study is to identify failure event sequences to improve the availability and resilience of the data-analytic and storage systems. ...
Preprint
Full-text available
Cloud computing is the backbone of the digital society. Digital banking, media, communication, gaming, and many others depend on cloud services. Unfortunately, cloud services may fail, leading to damaged services, unhappy users, and perhaps millions of dollars lost for companies. Understanding a cloud service failure requires a detailed report on why and how the service failed. Previous work studies how cloud services fail using logs published by cloud operators. However, information is lacking on how users perceive and experience cloud failures. Therefore, we collect and characterize data for user-reported cloud failures from Down Detector for three cloud service providers over three years. We count and analyze time patterns in the user reports, derive failures from those user reports, and characterize their duration and interarrival time. We characterize provider-reported cloud failures and compare the results with the characterization of user-reported failures. The comparison reveals how users perceive failures and how many of the failures are reported by cloud service providers. Overall, this work provides a characterization of user- and provider-reported cloud failures and compares them with each other.
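One simple way to derive failures from streams of user reports, sketched below in Java, is to merge reports whose timestamps fall within a gap threshold into a single failure window and then read off durations (interarrival times follow from consecutive window start times). The threshold, timestamps, and merging rule here are illustrative assumptions, not necessarily the exact procedure used in the study above.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: reports closer together than a gap threshold are merged
// into one failure window. Data and threshold are illustrative only.
public class UserReportFailures {

    record Failure(long start, long end) {
        long durationMinutes() { return end - start; }
    }

    static List<Failure> deriveFailures(List<Long> reportMinutes, long maxGap) {
        List<Failure> failures = new ArrayList<>();
        long start = -1, last = -1;
        for (long t : reportMinutes) {               // assumed sorted ascending
            if (start < 0) {
                start = last = t;
            } else if (t - last <= maxGap) {
                last = t;                             // same ongoing failure
            } else {
                failures.add(new Failure(start, last));
                start = last = t;                     // a new failure begins
            }
        }
        if (start >= 0) {
            failures.add(new Failure(start, last));
        }
        return failures;
    }

    public static void main(String[] args) {
        // Report times in minutes; gaps larger than 30 minutes split failures.
        List<Failure> f = deriveFailures(List.of(0L, 5L, 12L, 90L, 95L), 30);
        System.out.println(f.size() + " failures, first lasted "
                + f.get(0).durationMinutes() + " minutes");
    }
}
```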
... Owing to the complexity and inevitable weaknesses in the software and hardware, the systems are prone to failures. Several studies showed that such failures lead to decreased reliability and high financial costs, and can impact critical applications [8,24,26]. Therefore, loss of control cannot be tolerated for any system or infrastructure, as the quality of service (QoS) is of high importance. ...
Preprint
Full-text available
Artificial Intelligence for IT Operations (AIOps) is an emerging interdisciplinary field arising at the intersection of the research areas of machine learning, big data, streaming analytics, and the management of IT operations. AIOps, as a field, is a candidate to produce the future standard for IT operation management. To that end, AIOps has several challenges. First, it needs to combine separate research branches from other research fields like software reliability engineering. Second, novel modelling techniques are needed to understand the dynamics of different systems. Furthermore, it requires laying out the basis for assessing time horizons and uncertainty for imminent SLA violations, the early detection of emerging problems, autonomous remediation, decision making, and the support of various optimization objectives. Moreover, a good understanding and interpretability of these aiding models are important for building trust between the employed tools and the domain experts. Finally, all this will result in faster adoption of AIOps, further increase the interest in this research field, and contribute to bridging the gap towards fully autonomous operating IT systems. The main aim of the AIOPS workshop is to bring together researchers from both academia and industry to present their experiences, results, and work in progress in this field. The workshop aims to strengthen the community and unite it towards the goal of joining efforts to solve the main challenges the field is currently facing. A consensus on and adoption of the principles of openness and reproducibility will boost the research in this emerging area significantly.
... We believe that a modern software fault injection tool has to be able to modify the fault model for the following reasons. First, a typical necessity in industry, which arises when a critical failure occurs, is to introduce regression tests against the fault that caused the failure, to assure that the same failure cannot occur again [15]. Second, to preserve the efficiency of the fault injection campaign, it is important to avoid injecting bugs that are unlikely to affect a system; e.g., some classes of faults may be prevented by testing and static analysis policies adopted by the company [16]. ...
Preprint
Full-text available
In this paper, we present a new fault injection tool (ProFIPy) for Python software. The tool is designed to be programmable, in order to enable users to specify their software fault model, using a domain-specific language (DSL) for fault injection. Moreover, to achieve better usability, ProFIPy is provided as software-as-a-service and supports the user through the configuration of the faultload and workload, failure data analysis, and full automation of the experiments using container-based virtualization and parallelization.
... As many bug reports show, these systems often contain software bugs related to handling nondeterminism. Previous studies reported such bugs in MySQL [45,11], PostgreSQL [44], NoSQL systems [39,69], and database-backed applications [8], and showed that the bugs can cause crashes, unresponsiveness, and data corruption. It is, therefore, crucial to identify and fix these bugs as early as possible. ...
Preprint
Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root-cause of an application's intermittent failure and (2) generate an explanation of how the root-cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated to the failure. It then utilizes temporal properties of the predicates to (over)-approximate their causal relationships. Finally, it uses a sequence of runtime interventions of the predicates to discover their true causal relationships. This enables AID to identify the true root-cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with three real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root-cause and explain how the root-cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root-causes and confirm that the benefits also hold for them.
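The group-testing component can be pictured with a simplified Java sketch: suppress half of the candidate predicates, re-run, and keep the half whose suppression makes the failure disappear. This assumes a single root-cause predicate and a deterministic re-run oracle, and is only an illustration of the general idea, not AID's actual algorithm; predicate names and the oracle are invented.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Simplified group testing over candidate root-cause predicates.
public class GroupTestingDebugger {

    /**
     * @param candidates candidate root-cause predicates (by name)
     * @param failsWhen  re-runs the program with the given predicates
     *                   suppressed and returns true if the failure recurs
     */
    static String findRootCause(List<String> candidates,
                                Predicate<Set<String>> failsWhen) {
        List<String> remaining = candidates;
        while (remaining.size() > 1) {
            List<String> firstHalf = remaining.subList(0, remaining.size() / 2);
            if (!failsWhen.test(new HashSet<>(firstHalf))) {
                remaining = firstHalf;     // suppressing this half hid the failure
            } else {
                remaining = remaining.subList(remaining.size() / 2, remaining.size());
            }
        }
        return remaining.get(0);
    }

    public static void main(String[] args) {
        // Toy "program": it fails unless the predicate "unsyncedRead" is suppressed.
        Predicate<Set<String>> failsWhen =
                suppressed -> !suppressed.contains("unsyncedRead");
        System.out.println(findRootCause(
                List.of("lateCallback", "unsyncedRead", "retryStorm", "staleCache"),
                failsWhen));               // prints unsyncedRead
    }
}
```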
Article
Full-text available
Multithreaded programs are hard to get right. A key reason is that the contract between developers and runtimes grants exponentially many schedules to the runtimes. We present Parrot, a simple, practical runtime with a new contract to developers. By default, it orders thread synchronizations in the well-defined round-robin order, vastly reducing schedules to provide determinism (more precisely, deterministic synchronizations) and stability (i.e., robustness against input or code perturbations, a more useful property than determinism). When default schedules are slow, it allows developers to write intuitive performance hints in their code to switch or add schedules for speed. We believe this "meet in the middle" contract eases writing correct, efficient programs. We further present an ecosystem formed by integrating Parrot with a model checker called dbug. This ecosystem is more effective than either system alone: dbug checks the schedules that matter to Parrot, and Parrot greatly increases the coverage of dbug. Results on a diverse set of 108 programs, roughly 10× more than any prior evaluation, show that Parrot is easy to use (averaging 1.2 lines of hints per program); achieves low overhead (6.9% for 55 real-world programs and 12.7% for all 108 programs), 10× better than two prior systems; scales well to the maximum allowed cores on a 24-core server and to different scales/types of workloads; and increases dbug's coverage by 10⁶–10¹⁹⁷³⁴ for 56 programs. Parrot's source code, entire benchmark suite, and raw results are available at github.com/columbia/smt-mc.
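The core idea of ordering synchronizations round-robin can be illustrated with a few lines of Java: each thread waits for a global turn before its critical section, so every run produces the same interleaving. This is only a sketch of the concept under that simplification; it is not Parrot's actual runtime, which also supports performance hints and far more.

```java
// Deterministic round-robin turns: each thread blocks until it is its turn,
// so the interleaving of the critical sections is the same on every run.
public class RoundRobinTurns {
    private final int numThreads;
    private int turn = 0;

    RoundRobinTurns(int numThreads) { this.numThreads = numThreads; }

    synchronized void enter(int threadId) throws InterruptedException {
        while (turn != threadId) {
            wait();                     // block until it is this thread's turn
        }
    }

    synchronized void exit() {
        turn = (turn + 1) % numThreads; // hand the turn to the next thread
        notifyAll();
    }

    public static void main(String[] args) {
        RoundRobinTurns turns = new RoundRobinTurns(3);
        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    for (int k = 0; k < 2; k++) {
                        turns.enter(id);
                        System.out.println("thread " + id + " step " + k);
                        turns.exit();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        // Output order is 0,1,2,0,1,2 on every run.
    }
}
```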
Conference Paper
Full-text available
As the cloud era begins and failures become commonplace, failure recovery becomes a critical factor in the availability, reliability and performance of cloud services. Unfortunately, recovery problems still take place, causing downtimes, data loss, and many other problems. We propose a new testing framework for cloud recovery: FATE (Failure Testing Service) and DESTINI (Declarative Testing Specifications). With FATE, recovery is systematically tested in the face of multiple failures. With DESTINI, correct recovery is specified clearly, concisely, and precisely. We have integrated our framework with several cloud systems (e.g., HDFS [33]), explored over 40,000 failure scenarios, written 74 specifications, found 16 new bugs, and reproduced 51 old bugs.
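A heavily simplified Java sketch of the multiple-failure idea: enumerate combinations of failure points (here just pairs), run the workload with each combination injected, and check a recovery specification afterwards. The failure-point names and the toy specification are assumptions made for illustration; FATE and DESTINI do this systematically against real systems and declarative specifications.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Pairwise exploration of injected failures against a toy recovery spec.
public class PairwiseFailureExplorer {

    static int explore(List<String> failurePoints,
                       Predicate<Set<String>> recoveryHolds) {
        int violations = 0;
        for (int i = 0; i < failurePoints.size(); i++) {
            for (int j = i + 1; j < failurePoints.size(); j++) {
                Set<String> injected =
                        Set.of(failurePoints.get(i), failurePoints.get(j));
                if (!recoveryHolds.test(injected)) {
                    System.out.println("recovery violated when injecting " + injected);
                    violations++;
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<String> points = List.of("crash-before-ack", "disk-full-on-log",
                                      "network-drop-to-standby");
        // Toy spec: recovery is assumed to break only when both the log write
        // and the standby link fail in the same run.
        Predicate<Set<String>> spec = injected ->
                !(injected.contains("disk-full-on-log")
                  && injected.contains("network-drop-to-standby"));
        System.out.println(explore(points, spec) + " violating scenario(s)");
    }
}
```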
Article
Full-text available
Diagnosis and correction of performance issues in modern, large-scale distributed systems can be a daunting task, since a single developer is unlikely to be familiar with the entire system and it is hard to characterize the behavior of a software system without completely understanding its internal components. This paper describes DISTALYZER, an automated tool to support developer investigation of performance issues in distributed systems. We aim to leverage the vast log data available from large scale systems, while reducing the level of knowledge required for a developer to use our tool. Specifically, given two sets of logs, one with good and one with bad performance, DISTALYZER uses machine learning techniques to compare system behaviors extracted from the logs and automatically infer the strongest associations between system components and performance. The tool outputs a set of inter-related event occurrences and variable values that exhibit the largest divergence across the logs sets and most directly affect the overall performance of the system. These patterns are presented to the developer for inspection, to help them understand which system component(s) likely contain the root cause of the observed performance issue, thus alleviating the need for many human hours of manual inspection. We demonstrate the generality and effectiveness of DISTALYZER on three real distributed systems by showing how it discovers and highlights the root cause of six performance issues across the systems. DISTALYZER has broad applicability to other systems since it is dependent only on the logs for input, and not on the source code.
Article
Similar to software bugs, configuration errors are also one of the major causes of today's system failures. Many configuration issues manifest themselves in ways similar to software bugs such as crashes, hangs, silent failures. This leaves users clueless and forced to report to developers for technical support, wasting not only users' but also developers' precious time and effort. Unfortunately, unlike software bugs, many software developers take a much less active, responsible role in handling configuration errors because "they are users' faults." This paper advocates the importance of software developers taking an active role in handling misconfigurations. It also makes a concrete first step towards this goal by providing tooling support to help developers improve their configuration design, and harden their systems against configuration errors. Specifically, we build a tool, called Spex, to automatically infer configuration requirements (referred to as constraints) from software source code, and then use the inferred constraints to: (1) expose misconfiguration vulnerabilities (i.e., bad system reactions to configuration errors such as crashes, hangs, silent failures); and (2) detect certain types of error-prone configuration design and handling. We evaluate Spex with one commercial storage system and six open-source server applications. Spex automatically infers a total of 3800 constraints for more than 2500 configuration parameters. Based on these constraints, Spex further detects 743 various misconfiguration vulnerabilities and at least 112 error-prone constraints in the latest versions of the evaluated systems. To this day, 364 vulnerabilities and 80 inconsistent constraints have been confirmed or fixed by developers after we reported them. Our results have influenced the Squid Web proxy project to improve its configuration parsing library towards a more user-friendly design.
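The constraint-inference idea can be illustrated with a deliberately naive Java sketch: a source-level guard that rejects small values for a parameter implies a lower-bound constraint on that parameter. The regex, parameter name, and code lines below are made up for illustration; Spex itself performs real static analysis over the system's source code.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration: a guard like "if (conf.getInt("p") < N) throw ..."
// implies the constraint p >= N.
public class ToyConstraintInference {
    private static final Pattern GUARD = Pattern.compile(
            "if\\s*\\(\\s*conf\\.getInt\\(\"([\\w.]+)\"\\)\\s*<\\s*(\\d+)\\s*\\)\\s*throw");

    public static void main(String[] args) {
        List<String> sourceLines = List.of(
                "int retries = conf.getInt(\"client.retries\");",
                "if (conf.getInt(\"client.retries\") < 1) throw new IllegalArgumentException();");
        for (String line : sourceLines) {
            Matcher m = GUARD.matcher(line);
            if (m.find()) {
                System.out.println("inferred constraint: "
                        + m.group(1) + " >= " + m.group(2));
            }
        }
    }
}
```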
Conference Paper
When systems fail in the field, logged error or warning messages are frequently the only evidence available for assessing and diagnosing the underlying cause. Consequently, the efficacy of such logging (how often and how well error causes can be determined via postmortem log messages) is a matter of significant practical importance. However, there is little empirical data about how well existing logging practices work and how they can yet be improved. We describe a comprehensive study characterizing the efficacy of logging practices across five large and widely used software systems. Across 250 randomly sampled reported failures, we first identify that more than half of the failures could not be diagnosed well using existing log data. Surprisingly, we find that the majority of these unreported failures are manifested via a common set of generic error patterns (e.g., system call return errors) that, if logged, can significantly ease the diagnosis of these unreported failure cases. We further mechanize this knowledge in a tool called Errlog, which proactively adds appropriate logging statements into source code while adding only 1.4% performance overhead. A controlled user study suggests that Errlog can reduce diagnosis time by 60.7%.
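The generic error pattern mentioned above is easy to picture in Java: an error return value that is silently ignored leaves no postmortem evidence, while a checked-and-logged version records the failed operation. The example below is illustrative only; it mirrors the kind of statement a tool like Errlog would add rather than showing Errlog's actual output.

```java
import java.io.File;
import java.util.logging.Logger;

// Sketch of an unlogged error return and a logged alternative; the file
// names and messages are illustrative.
public class LogOnErrorReturn {
    private static final Logger LOG = Logger.getLogger("LogOnErrorReturn");

    // Before: the failed delete leaves no trace in the logs.
    static void cleanupSilently(File tmp) {
        tmp.delete();
    }

    // After: the error return is checked and logged, so a postmortem log
    // points at the failed operation.
    static void cleanupLogged(File tmp) {
        if (!tmp.delete()) {
            LOG.warning("failed to delete temporary file " + tmp.getAbsolutePath());
        }
    }
}
```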
Article
We conduct a comprehensive study of file-system code evolution. By analyzing eight years of Linux file-system changes across 5079 patches, we derive numerous new (and sometimes surprising) insights into the file-system development process; our results should be useful for both the development of file systems themselves as well as the improvement of bug-finding tools.
Conference Paper
Logging system behavior is a staple development practice. Numerous powerful model inference algorithms have been proposed to aid developers in log analysis and system understanding. Unfortunately, existing algorithms are difficult to understand, extend, and compare. This paper presents InvariMint, an approach to specify model inference algorithms declaratively. We applied InvariMint to two model inference algorithms and present evaluation results to illustrate that InvariMint (1) leads to new fundamental insights and better understanding of existing algorithms, (2) simplifies creation of new algorithms, including hybrids that extend existing algorithms, and (3) makes it easy to compare and contrast previously published algorithms. Finally, algorithms specified with InvariMint can outperform their procedural versions.
Article
This article describes an examination of a sample of several hundred support tickets for the Hadoop ecosystem, a widely used group of big data storage and processing systems; a taxonomy of errors and how they are addressed by supporters; and the misconfigurations that are the dominant cause of failures. Some design "antipatterns" and missing platform features contribute to these problems. Developers can use various methods to build more robust distributed systems, thereby helping users and administrators prevent some of these rough edges.
Article
Modern software model checkers find safety violations: breaches where the system enters some bad state. However, we argue that checking liveness properties offers both a richer and more natural way to search for errors, particularly in complex concurrent and distributed systems. Liveness properties specify desirable system behaviors which must be satisfied eventually, but are not always satisfied, perhaps as a result of failure or during system initialization. Existing software model checkers cannot verify liveness because doing so requires finding an infinite execution that does not satisfy a liveness property. We present heuristics to find a large class of liveness violations and the critical transition of the execution. The critical transition is the step in an execution that moves the system from a state that does not currently satisfy some liveness property, but where recovery is possible in the future, to a dead state that can never achieve the liveness property. Our software model checker, MaceMC, isolates complex liveness errors in our implementations of Pastry, Chord, a reliable transport protocol, and an overlay tree.
Article
Our society is faced with an ever-increasing dependence on computing systems, which leads us to question ourselves about the limits of their dependability, and about the challenges raised by those limits. The first question is thus "What are the limits of dependability?". The question which comes next is "What are the challenges which we are faced with, as a result of these limits, and in order to overcome them?". Responses to these questions need to be formulated within a conceptual and terminological framework, which in turn is influenced by the analysis of the limits in dependability and by the challenges raised by dependability. Such a framework can hardly be found in the many standardization efforts: as a consequence of their specialization (telecommunications, avionics, rail transportation, nuclear plant control, etc.), they usually do not consider all possible sources of failures which can affect computing systems, nor do they consider all attributes of dependability. In order to respond to these questions, a global conceptual and terminological framework is needed, and it is given first. The limits and challenges in dependability are then addressed, from technical and financial viewpoints. The recognition that design faults are the major limiting factor leads to recommending the extension of fault tolerance from products to their production process.