Conference Paper

Development of a benchmark to measure system robustness: experiences and lessons learned


Abstract

Performance benchmarks are used to help answer the question: "which system is faster?" With the increased use of computers in critical systems, more and more resources are being applied to improving system quality. However, there are no benchmarks that can be used to compare the dependability and robustness of systems in order to answer the question: "which system is more reliable?" The authors present an attempt at developing a benchmark that gauges a system's robustness, as measured by its ability to tolerate errors. The initial effort produced four primitive benchmark programs, covering the file management system, memory access, user applications, and C library functions. Each primitive benchmark targets one system functionality and measures its behavior given erroneous inputs. The authors present the motivation and experimental results for one of these primitive benchmarks in detail, followed by an analysis of the results. A methodology is presented for combining the primitive benchmarks into an overall robustness figure.
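The abstract does not spell out how the primitive benchmarks are combined. As a hedged illustration only, one plausible scheme is a weighted average of per-benchmark pass rates; the weights, the averaging rule, and all numbers below are assumptions for the sketch, not the paper's actual methodology:

```python
# Hypothetical sketch: combining primitive benchmark results into a single
# robustness figure. Equal weighting and the sample counts are invented
# for illustration; the paper's actual combination rule may differ.

def robustness_figure(results, weights=None):
    """results: dict mapping benchmark name -> (correct responses, inputs tried)."""
    names = list(results)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}  # equal weighting
    score = 0.0
    for name in names:
        passed, total = results[name]
        score += weights[name] * (passed / total)
    return score

primitives = {
    "file_management":  (180, 200),  # correct responses / erroneous inputs tried
    "memory_access":    (150, 200),
    "user_application": (190, 200),
    "c_library":        (160, 200),
}
print(round(robustness_figure(primitives), 3))  # 0.85
```

A weighted scheme also allows an evaluator to emphasize the subsystem that matters most for a given deployment.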


... The overall setup used is illustrated in Figure 1. Cotroneo et al. [46,47], Feng and Shin [60], Iannillo et al. [78], Liu et al. [118], Maji et al. [126] UNIX-like Acharya et al. [1], Albinet et al. [3], Cong et al. [41], Jarboui et al. [82], Kanoun et al. [96], Koopman and DeVale [101,102], Koopman et al. [103], Kropp et al. [104], Miller et al. [139-141], Montrucchio et al. [142,143], Shelton et al. [155], Siewiorek et al. [187], Suh et al. [190], Velasco et al. [196], Xiang et al. [211] ...
... There are clear challenges associated with the evaluation of robustness of new types of systems, among which we identify the selection of the technique (e.g., model-based, experimental), the target of the evaluation (e.g., an API, a message, message fields), the selection of the faults (e.g., timing faults, boundary values), and finally how to classify behavior (i.e., the selection/adaptation of a failure mode scale and how to retrieve the necessary information from the system to allow classification). It became clear that, at the time of writing, there are types of systems for which robustness evaluation techniques are either unknown or only now emerging. ...
... Suh et al. present a robustness benchmark suite, composed of four primitive robustness benchmarks targeting specific operating system functionalities, such as the file management system, memory access, user application, and C library functions [190]. The authors tested the file management system using random inputs; the memory access functionality was tested by reading from and writing to random addresses, and by attempting to write past the boundaries of a static string; the user application was tested with stuck-at memory faults (i.e., forcing one or more bits to the value 1); and the C library functions were tested using random input values. ...
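The stuck-at fault model mentioned in the excerpt (forcing bits to 1) fits in a few lines. The byte-buffer representation below is an assumption made for the sketch; a real SWIFI tool would patch a process's memory image rather than a Python bytearray:

```python
# Illustrative stuck-at-1 fault injection on a byte buffer: force a chosen
# bit to 1 and let downstream code show whether it tolerates the corruption.
# The buffer contents and bit positions are invented for the example.

def inject_stuck_at_1(data: bytearray, byte_index: int, bit: int) -> bytearray:
    corrupted = bytearray(data)       # leave the original intact
    corrupted[byte_index] |= (1 << bit)  # this bit is now stuck at 1
    return corrupted

original = bytearray(b"\x00\x41\x7f")
faulty = inject_stuck_at_1(original, 0, 7)  # flip the top bit of byte 0
print(faulty.hex())  # 80417f
```

Sweeping `byte_index` and `bit` over a program's data region, one fault per run, reproduces the exhaustive single-fault campaigns described elsewhere on this page.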
Article
Full-text available
Robustness is the degree to which a certain system or component can operate correctly in the presence of invalid inputs or stressful environmental conditions. With the increasing complexity and widespread use of computer systems, obtaining assurances regarding their robustness has become of vital importance. This survey discusses the state of the art on software robustness assessment, with emphasis on key aspects like types of systems being evaluated, assessment techniques used, the target of the techniques, the types of faults used, and how system behavior is classified. The survey concludes with the identification of gaps and open challenges related with robustness assessment.
... The FSM model was constructed by surveying previous fault injection studies. Four studies of fault manifestations formed the basis for the model [1,2,4,5]. The fault manifestations are aligned in Table 1, where similar categories between the studies are horizontally aligned. ...
... Suh et al. [4] were concerned with creating a benchmark to measure system robustness, rather than a theoretical approach to defining fault manifestations, as was the focus of Cristian's work. In the work of Koopman and colleagues [1], Ballista automatically generates tests to measure the robustness of COTS operating systems. ...
Article
Full-text available
While there are an enormous number of faults that can occur in the hardware and software of a computing system, several research studies have indicated that the number of fault manifestations is usually very small, often resulting in less than a dozen different states. The paper surveys studies of both naturally occurring and artificially induced faults. A composite set of faulty states is proposed, as well as transition probabilities between states. The transitions with the largest range of values are identified. Future work will simulate the model and do a sensitivity analysis on it to determine the critical transitions. The model will be used to study the placement of fault detection monitors that will predict failure in a computer system.
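The kind of state-transition model the abstract describes can be exercised as a simple Markov chain. The state names and transition probabilities below are invented for illustration, not taken from the surveyed studies:

```python
# Hypothetical fault-manifestation model: a handful of system states with
# transition probabilities, simulated as a Markov chain. All numbers are
# assumptions for the sketch.
import random

transitions = {
    "normal":         [("normal", 0.95), ("error_detected", 0.03), ("crash", 0.02)],
    "error_detected": [("normal", 0.70), ("hang", 0.10), ("crash", 0.20)],
    "crash":          [("crash", 1.0)],   # absorbing state
    "hang":           [("hang", 1.0)],    # absorbing state
}

def step(state, rng):
    """Sample the next state from the cumulative transition distribution."""
    r, acc = rng.random(), 0.0
    for nxt, p in transitions[state]:
        acc += p
        if r < acc:
            return nxt
    return state

rng = random.Random(0)
state = "normal"
for _ in range(100):
    state = step(state, rng)
print(state)
```

Running many such trajectories estimates how often the system ends in each absorbing state, which is the sensitivity analysis the abstract proposes.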
... Siewiorek et al. [127] proposed a set of primitive benchmarks, each targeting a specific subsystem, to quantify system robustness. Overall robustness is simply defined as the average of the fraction of correct responses returned from the system across the various benchmarks. ...
... Perhaps, most similar to our work is [7], which outlines a methodology for benchmarking systems' availability. Other works have proposed robustness [32] and reliability benchmarks [36] that quantify the degradation of system performance under faults. ...
Article
In this paper, we propose a two-phase methodology for systematically evaluating the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to characterize the service's behavior in the presence of faults. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the service's performability. Using this model, evaluators can study the service's sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of a multitier Internet service. In particular, we evaluate the performance and availability of three soft state maintenance strategies for an online bookstore service in the presence of seven classes of faults. Among other interesting results, we clearly isolate the effect of different faults, showing that the tier of Web servers is responsible for an often dominant fraction of the service unavailability. Our results also demonstrate that storing the soft state in a database achieves better performability than storing it in main memory (even when the state is efficiently replicated) when we weight performance and availability equally. Based on our results, we conclude that service designers may want an unbalanced system in which they heavily load highly available components and leave more spare capacity for components that are likely to fail more often.
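The two-phase idea above (measured performance per fault class, combined with expected fault rates through a model) reduces, in its simplest form, to an expected-value computation. The scenario names, throughputs, and time fractions below are invented for illustration, and the plain weighted sum is an assumption, not the paper's actual analytical model:

```python
# Toy performability calculation: weight the throughput measured in each
# fault scenario (phase 1) by the expected fraction of time the service
# spends in it (phase 2 model input). All numbers are invented.

scenarios = [
    # (name, expected fraction of time, measured throughput in req/s)
    ("fault_free",     0.990, 1000.0),
    ("web_tier_fault", 0.007,  400.0),
    ("db_fault",       0.003,  100.0),
]

performability = sum(frac * tput for _, frac, tput in scenarios)
print(performability)  # expected long-run throughput
```

Changing a scenario's time fraction (e.g., faster operator response shrinks the fault windows) immediately shows its effect on the overall figure, which is the sensitivity study the methodology enables.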
... Software-implemented fault injection (SWIFI) is a common technique in the fault-tolerant community [12]. Software fault injection can be used for many purposes, such as comparing the robustness of different systems [24,29,16], understanding how systems behave during a fault [8,4,15], and validating fault-tolerant mechanisms [3,11,21,6]. However, there are very few case studies that use fault injection for all three of these purposes to guide the design and implementation of a fault-tolerant system. ...
Article
Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, fault injection is rarely used for all these purposes to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to guide the design and implementation of a fault-tolerant system. Our system design goal is to build a write-back file cache on Intel PCs that is as reliable as a write-through file cache. We follow an iterative approach to improve robustness in the presence of operating system errors. In each iteration, we measure the reliability of the system, analyze the fault symptoms that lead to data corruption, and apply fault-tolerant mechanisms that address the fault symptoms. Our initial system is 13 times less reliable than a write-through file cache. The result of several iterations is a design that is both more reliable (1.9% vs. 3.1% corruption rate) a...
... The fault manager is thereby forced to react to this triggered fault. As will be defined later, fault coverage (FC) is, roughly speaking, the percentage of successful fault handlings on T. Indeed, FC as well as other coverage estimators [21,18,14] do indicate the capability of FM to some extent. Simply using such metrics is, however, insufficient to infer the correctness of FM. ...
Article
Full-text available
The need for fault tolerant software has grown significantly with the need for providing computer-based continuous service in a variety of areas that include telecommunications, air and ground transportation, and defense. TAMER (a Testing, Analysis, and Measurement Environment for Robustness) is a tool designed to assess the dependability of such systems. Three key ideas make TAMER different from several existing tools aimed at dependability assessment of distributed fault tolerant systems. These three ideas are incorporated in: (a) a two-dimensional criterion for dependability assessment, (b) interface fault injection, and (c) a scheme for partitioning the system under assessment into subsystems that could be analyzed "off-line". The interactive nature of TAMER allows an assessor to identify portions of software that may need attention for additional testing, redesign, or recoding. Such identification becomes possible with the help of code and fault coverage information der...
... Works in the past have proposed robustness [25] and reliability benchmarks [29] that quantify the degradation of system performance under failures. Previous work has noted that different cluster organizations have different availability impacts [11]. ...
Article
We propose a two-phase methodology for quantifying the performability (performance + availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to measure the impact of faults on the server's performance. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the server's performability. Using this model, evaluators can study the server's sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of 4 versions of the PRESS Web server against 5 classes of faults. We use Mendosus, a new fault-injection and network emulation infrastructure, to effect phase 1 of our methodology. We then use our model to quantify the performability of the different versions of PRESS. We also use the model to study the impact of reducing live operator support and adding RAIDs on PRESS's performability. 1
... Perhaps more similar to our work is that of [6], which outlines a methodology for benchmarking systems' availability. Other works have proposed robustness [29] and reliability benchmarks [34] that quantify the degradation of system performance under faults. Our work here differs from these previous studies in that we focus on cluster-based servers. ...
Article
We propose a two-phase methodology for quantifying the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to measure the impact of faults on the server's performance. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the server's performability. Using this model, evaluators can study the server's sensitivity to different design decisions, fault rates, and environmental factors. To demonstrate our methodology, we study the performability of 4 versions of the PRESS Web server against 5 classes of faults, quantifying the effects of different design decisions on performance and availability. Finally, to further show the utility of our model, we also quantify the impact of two hypothetical changes, reduced human operator response time and the use of RAIDs.
Article
Context: With the increased use of software for running key functions in modern society, it is of utmost importance to understand software robustness and how to support it. Although there have been many contributions to the field, there is a lack of a coherent, summary view. Objective: To address this issue, we have conducted a literature review in the field of robustness. Method: This review was conducted by following guidelines for systematic literature reviews. Systematic reviews are used to find and classify all existing and available literature in a certain field. Results: From 9193 initial papers found in three well-known research databases, 144 relevant papers were extracted through a multi-step filtering process with independent validation in each step. These papers were then further analyzed and categorized based on their development phase, domain, research, contribution, and evaluation type. The results indicate that most existing work on software robustness focuses on verification and validation of commercial off-the-shelf (COTS) software or operating systems, or proposes design solutions for robustness, while there is a lack of results on how to elicit and specify robustness requirements. The research typically consists of solution proposals with little to no evaluation, and when there is some evaluation it is primarily done with small, toy/academic example systems. Conclusion: We conclude that there is a need for more software robustness research on real-world, industrial systems and on software development phases other than testing and design, in particular on requirements engineering.
Chapter
As computer applications extend to areas which require extreme dependability, their designs mandate the ability to operate in the presence of faults. The problem of assuring that the design goals are achieved requires the observation and measurement of fault behavior parameters under various input conditions. One means to characterize systems is fault injection, but injection of internal faults is difficult due to the complexity and level of integration of contemporary VLSI implementations. This chapter explores the effects of gate-level faults on system operation as a basis for fault models at the program level. A new fault model for processors based on a register-transfer-level (RTL) description is presented. This model addresses time, cost, and accuracy limitations imposed by current fault-injection techniques. It is designed to be used with existing software-implemented fault-injection (SWIFI) tools, but the error patterns it generates are designed to be more representative of actual transient hardware faults than the ad-hoc patterns currently injected via most SWIFI experiments.
Article
Software is being used for building applications requiring extreme dependability. In many cases, systems must have high availability and fault tolerance. With the increasing complexity of software, testing becomes difficult and expensive. This report summarizes the goals and research issues involved in the area of testing distributed systems. It describes and analyzes prior work done in the area of fault injection testing. It describes sources of errors and failures in distributed systems that are compliant with distributed object architectures like CORBA and COM. A methodology is proposed for testing distributed software. Index terms: CORBA, COM, distributed systems, errors, failures, fault injection, test adequacy. 1 Introduction Computer systems are being used for applications that require high dependability. They are used in areas where failures could be catastrophic, such as controlling nuclear plants, air traffic, space exploration, and defense. Systems have to possess a hig...
Article
Issues in testing distributed component-based systems are discussed. Differences in testing such systems and other systems are identified. Several limitations and shortcomings of the existing test methodologies are also identified and a new methodology proposed. Keywords: CORBA, DCOM, component-based distributed systems, fault-tolerance, Java RMI, test adequacy, test methodology. 1 Introduction Testing software systems is a complex problem in itself. With the increasing trend in using distributed software, the task of testing becomes even more complicated. The scalability of testing methodologies and development of testing tools need to keep up with new technologies such as CORBA, DCOM and Java RMI. The process of testing is further complicated by the use of COTS components in the systems. Testers need to test the behavior of such components in systems even if the components have been tested before. Sometimes the components that are reused may not have been designed for systems ...
Article
Full-text available
A benchmark program may be used to measure CPU performance for a variety of machines. An example benchmark for measuring the processor power of scientific computers is presented and compared with other methods of assessing computer power. The program provides a measure of computer speed of a similar level of usefulness as the Gibson mix, etc., but taking account of the actual compilers available on the machine. It is particularly suitable on machines with unusual architecture where the Gibson mix is difficult to interpret (e.g., KDF9, Burroughs).
Article
Full-text available
This project started as a simple experiment to try to better understand an observed phenomenon: that of programs crashing when a noisy dial-up line is used. As a result of testing a comprehensive list of utility programs on several versions of Unix, it appears that this is not an isolated problem. Thus, this paper supplies a list of bug reports to fix the utilities that we were able to crash. This should also improve the quality and reliability of Unix utilities. This paper also supplies a simple but effective test method (and tools).
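The test method the abstract describes, feeding random bytes to utilities and recording which ones crash, is simple to sketch. The target command (`cat`), the payload sizes, and the three-way outcome classification below are assumptions for the illustration, not the original tool:

```python
# Minimal fuzz harness in the spirit of the abstract: pipe random bytes to a
# utility's stdin and classify the outcome by exit status. With subprocess,
# a negative returncode means the process died on a signal (i.e., crashed).
import random
import subprocess

def fuzz_once(cmd, rng, max_len=1024, timeout=5):
    payload = bytes(rng.randrange(256) for _ in range(rng.randrange(1, max_len)))
    try:
        proc = subprocess.run(cmd, input=payload,
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang"
    return "crash" if proc.returncode < 0 else "ok"

rng = random.Random(1234)
outcomes = [fuzz_once(["cat"], rng) for _ in range(5)]
# cat copies stdin to stdout and exits 0, so these should all be "ok"
print(outcomes)
```

Swapping in other utilities and logging any "crash" or "hang" results reproduces, in miniature, the campaign that produced the paper's bug reports.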
Article
This note compares the performance of different computer systems while solving dense systems of linear equations using the LINPACK software in a Fortran environment. About 100 computers, ranging from a CRAY X-MP to 68000-based systems such as the Apollo and SUN workstations to IBM PCs, are compared.
Article
A computer performance test that measures a realistic floating-point performance range for Fortran applications is described. A variety of computer performance analyses may be easily carried out using this small central processing unit (cpu) test that would be infeasible or too costly using complete applications as benchmarks, particularly in the developmental phase of an immature computer system. The problem of benchmarking numerical applications sufficiently, especially on new supercomputers, is analyzed to identify several useful roles for the Livermore Fortran Kernel (LFK) test. The 24 LFK contain enough samples of Fortran practice to expose many specific inefficiencies in the formulation of the Fortran source, in the quality of compiled cpu code, and in the capability of the instruction architecture. Examples show how the LFK may be used to study compiled Fortran code efficiency, to test the ability of compilers to vectorize Fortran, to simulate mature coding of Fortran on new computers, and to estimate the effective subrange of supercomputer performance for Fortran applications. Cpu performance measurements of several Fortran benchmarks and numerical applications that correlate well with the cpu performance range measured by the LFK test are presented. The numerical performance metric Mflops, first introduced in 1970 in this cpu test to quantify the cpu performance range of numerical applications, is discussed. Analyses of the LFK performance results argue against reducing the cpu performance range of supercomputers to a single number. The 24 LFK measured rates show a realistic variance in Fortran cpu performance that is essential data for circumspect computer evaluations. Cpu performance data measured by the LFK test on a number of recent computer systems are tabulated for reference.
Article
Reflecting current data on the use of programming language constructs in systems programming, a synthetic benchmark is constructed based on the distribution appearing in the data. The benchmark executes 100 Ada statements that are balanced in terms of the distribution of statement types, data types, and data locality. Pascal and C versions of the benchmark are discussed.
Article
The controversy surrounding single number performance reduction is examined and solutions are suggested through a comparison of measures.
Article
The results of several experiments conducted using the fault-injection-based automated testing (FIAT) system are presented. FIAT is capable of emulating a variety of distributed system architectures, and it provides the capabilities to monitor system behavior and inject faults for the purpose of experimental characterization and validation of a system's dependability. The experiments consist of exhaustively injecting three separate fault types into various locations, encompassing both the code and data portions of memory images, of two distinct applications executed with several different data values and sizes. Fault types are variations of memory bit faults. The results show that there are a limited number of system-level fault manifestations. These manifestations follow a normal distribution for each fault type. Error detection latencies are found to be normally distributed. The methodology can be used to predict the system-level fault responses during the system design stage.
Article
We propose a small number of basic concepts that can be used to explain the architecture of fault-tolerant distributed systems, and we discuss a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, we discuss their relative merits, and we give examples of systems which adopt one approach or the other. The aim is to introduce some order in the complex discipline of designing and understanding fault-tolerant distributed systems. 1 Introduction Computing systems consist of a multitude of hardware and software components that are bound to fail eventually. In many systems, such component failures can lead to unanticipated, potentially disruptive failure behavior and to service unavailability. Some systems are designed to be fault-tolerant: they either exhibit a well-defined failure behavior when components fail or mask component failures to users, that is, continue t...
Development of a Robustness Benchmark
  • B. H. Suh
  • J. Hudak
  • D. P. Siewiorek
  • Z. Segall
B. H. Suh, J. Hudak, D. P. Siewiorek, Z. Segall, "Development of a Robustness Benchmark," Carnegie Mellon University, Center for Dependable Systems, Technical Report (to be published), 1992.

Dhrystone: A Synthetic Systems Programming Benchmark
  • R. P. Weicker
R. P. Weicker, "Dhrystone: A Synthetic Systems Programming Benchmark," Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030.