Preprint

Test Adequacy for Metamorphic Testing: Criteria, Measurement, and Implication


Abstract

Metamorphic testing (MT) is a simple yet effective technique for alleviating the oracle problem in software testing. The underlying idea of MT is to test a software system by checking whether metamorphic relations (MRs) hold among multiple test inputs (including source and follow-up inputs) and the actual outputs of their executions. Since MRs and source inputs are two essential components of MT, considerable effort has been devoted to the systematic identification of MRs and the effective generation of source inputs, which has greatly enriched the fundamental theory of MT since its invention. However, few studies have investigated the test adequacy of MT, which hinders the objective measurement of MT's test quality as well as the effective construction of test suites. Although traditional software testing offers a number of test adequacy criteria, each specifying testing requirements that constitute an adequate test from a particular perspective, these criteria are not in line with the focus of MT, which is to test the software under test (SUT) against its necessary properties. In this paper, we propose a new set of criteria that specifies testing requirements from the perspective of the necessary properties satisfied by the SUT, and design a test adequacy measurement that evaluates the degree of adequacy based on both MRs and source inputs. The experimental results show that the proposed measurement effectively indicates the fault detection effectiveness of test suites: test suites with higher test adequacy usually exhibit higher fault detection effectiveness. Our work assesses the test adequacy of MT from a new perspective; the proposed criteria and measurement offer a new way to evaluate the test quality of MT and provide guidelines for constructing effective MT test suites.
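To make the source/follow-up terminology in the abstract concrete, the sketch below checks one textbook metamorphic relation of the sine function (sin(x) = sin(pi - x)) over random source inputs; the program and MR are illustrative examples of ours, not the criteria or measurement proposed in the paper.

```python
import math
import random

def mr_sine_supplement(program, source_input):
    """Illustrative MR for sine: sin(x) == sin(pi - x).
    Returns True if the relation holds for this source input."""
    follow_up_input = math.pi - source_input      # derive the follow-up input
    source_output = program(source_input)         # execute the source test
    follow_up_output = program(follow_up_input)   # execute the follow-up test
    return math.isclose(source_output, follow_up_output, abs_tol=1e-9)

if __name__ == "__main__":
    program_under_test = math.sin                 # stands in for the SUT
    source_inputs = [random.uniform(-10, 10) for _ in range(100)]
    violations = [x for x in source_inputs
                  if not mr_sine_supplement(program_under_test, x)]
    print(f"{len(violations)} MR violations out of {len(source_inputs)} source inputs")
```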

References

Article
Due to its effectiveness and efficiency in detecting defects caused by interactions of multiple factors, Combinatorial Testing (CT) has received considerable scholarly attention in recent decades. Despite numerous practical test case generation techniques being developed, there remains a paucity of studies addressing the automated oracle generation problem, which holds back the overall automation of CT. As a consequence, much human intervention is inevitable, which is time-consuming and error-prone. This costly manual task also restricts the application of higher testing strengths, inhibiting the full exploitation of CT in industrial practice. To bridge the gap between test designs and fully automated test flows, and to extend the applicability of CT, this paper presents a novel CT methodology, named COMER, which enhances traditional CT by accounting for Metamorphic Relations (MRs). COMER puts a high priority on generating pairs of test cases that match the input rules of MRs, i.e., metamorphic groups (MGs), so that correctness can be determined automatically by verifying whether the outputs of these test cases violate their MRs. As a result, COMER can not only satisfy t-way coverage as CT does, but also serve as an automated test oracle by detecting as many MR violations as possible. Empirical studies conducted on 31 real-world software projects show that COMER increased the number of metamorphic groups by an average factor of 75.9 and the failure detection rate by an average factor of 11.3 compared with CT, while the overall number of test cases generated by COMER barely increased.
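The abstract summarizes COMER only at a high level; the rough sketch below illustrates the core idea of pairing test cases into metamorphic groups whose outputs can be cross-checked without expected values. The SUT, the parameter model, and the MR are invented for illustration, and no combinatorial generation algorithm is shown.

```python
from itertools import product

# Hypothetical SUT: a price quote depending on three configuration factors.
def quote(region: str, tier: str, units: int) -> float:
    base = {"basic": 10.0, "pro": 25.0}[tier]
    factor = {"EU": 1.2, "US": 1.0}[region]
    return base * factor * units

# Hypothetical MR: doubling 'units' should double the quote, for any
# combination of the remaining factors (the MR's input rule).
def build_metamorphic_groups(test_cases):
    groups = []
    for region, tier, units in test_cases:
        groups.append(((region, tier, units), (region, tier, 2 * units)))
    return groups

# Toy exhaustive suite standing in for a combinatorial covering array.
test_cases = list(product(["EU", "US"], ["basic", "pro"], [1, 3]))
for source, follow_up in build_metamorphic_groups(test_cases):
    # Oracle check via the MR: no expected outputs are needed.
    assert abs(quote(*follow_up) - 2 * quote(*source)) < 1e-9
print("all metamorphic groups satisfied the MR")
```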
Article
Context: Microservices as a lightweight and decentralized architectural style with fine-grained services promise several beneficial characteristics for sustainable long-term software evolution. Success stories from early adopters like Netflix, Amazon, or Spotify have demonstrated that it is possible to achieve a high degree of flexibility and evolvability with these systems. However, the described advantageous characteristics offer no concrete guidance and little is known about evolvability assurance processes for microservices in industry as well as challenges in this area. Insights into the current state of practice are a very important prerequisite for relevant research in this field. Objective: We therefore wanted to explore how practitioners structure the evolvability assurance processes for microservices, what tools, metrics, and patterns they use, and what challenges they perceive for the evolvability of their systems. Method: We first conducted 17 semi-structured interviews and discussed 14 different microservice-based systems and their assurance processes with software professionals from 10 companies. Afterwards, we performed a systematic grey literature review (GLR) and used the created interview coding system to analyze 295 practitioner online resources. Results: The combined analysis revealed the importance of finding a sensible balance between decentralization and standardization. Guidelines like architectural principles were seen as valuable to ensure a base consistency for evolvability and specialized test automation was a prevalent theme. Source code quality was the primary target for the usage of tools and metrics for our interview participants, while testing tools and productivity metrics were the focus of our GLR resources. In both studies, practitioners did not mention architectural or service-oriented tools and metrics, even though the most crucial challenges like Service Cutting or Microservices Integration were of an architectural nature. Conclusions: Practitioners relied on guidelines, standardization, or patterns like Event-Driven Messaging to partially address some reported evolvability challenges. However, specialized techniques, tools, and metrics are needed to support industry with the continuous evaluation of service granularity and dependencies. Future microservices research in the areas of maintenance, evolution, and technical debt should take our findings and the reported industry sentiments into account.
Article
Metamorphic testing is well known for its ability to alleviate the oracle problem in software testing. The main idea of metamorphic testing is to test a software system by checking whether each identified metamorphic relation (MR) holds among several executions. In this regard, identifying MRs is an essential task in metamorphic testing. In view of the importance of this identification task, METRIC (METamorphic Relation Identification based on Category-choice framework) was developed to help software testers identify MRs from a given set of complete test frames. However, during MR identification, METRIC primarily focuses on the input domain without sufficient attention given to the output domain, thereby hindering the effectiveness of METRIC. Inspired by this problem, we have extended METRIC into METRIC+ by incorporating the information derived from the output domain for MR identification. A tool implementing METRIC+ has also been developed. Two rounds of experiments, involving four real-life specifications, have been conducted to evaluate the effectiveness and efficiency of METRIC+. The results have confirmed that METRIC+ is highly effective and efficient in MR identification. Additional experiments have been performed to compare the fault detection capability of the MRs generated by METRIC+ and those by mMT (another MR identification technique). The comparison results have confirmed that the MRs generated by METRIC+ are highly effective in fault detection.
Article
We propose a robustness testing approach for software systems that process large amounts of data. Our method uses metamorphic relations to check software output for erroneous input in the absence of a tangible test oracle. We use this technique to test two major citation database systems: Scopus and the Web of Science. We report a surprising finding that the inclusion of hyphens in paper titles impedes citation counts, and that this is a result of the lack of robustness of the citation database systems in handling hyphenated paper titles. Our results are valid for the entire literature as well as for individual fields such as chemistry. We further find a strong and significant negative correlation between the journal impact factor (JIF) of IEEE Transactions on Software Engineering (TSE) and the percentage of hyphenated paper titles published in TSE. Similar results are found for ACM Transactions on Software Engineering and Methodology. A software engineering field-wide study reveals that the higher JIF-ranked journals are publishing a lower percentage of papers with hyphenated titles. Our results challenge the common belief that citation counts and JIFs are reliable measures of the impact of papers and journals, as they can be distorted simply by the presence of hyphens in paper titles.
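The hyphenation finding rests on a simple robustness MR: a hyphenated title and its de-hyphenated variant should report the same citation count. A minimal sketch of that check follows; citation_count is a hypothetical lookup function standing in for a real citation-database API, and the toy lookup used here is robust by construction, so no violation is reported.

```python
def violates_hyphen_robustness(citation_count, title: str) -> bool:
    """Illustrative robustness MR: a hyphenated title and its de-hyphenated
    variant should report the same citation count."""
    source = citation_count(title)                       # source query
    follow_up = citation_count(title.replace("-", " "))  # follow-up query
    return source != follow_up

# Toy stand-in database keyed by normalized titles (hypothetical data).
_toy_db = {"metamorphic testing of machine learning systems": 42}

def toy_citation_count(title: str) -> int:
    return _toy_db.get(title.replace("-", " ").lower(), 0)

print(violates_hyphen_robustness(toy_citation_count,
                                 "Metamorphic Testing of Machine-Learning Systems"))
```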
Article
Modern information technology paradigms, such as online services and off-the-shelf products, often involve a wide variety of users with different or even conflicting objectives. Every software output may satisfy some users, but may also fail to satisfy others. Furthermore, users often do not know the internal working mechanisms of the systems. This situation is quite different from bespoke software, where developers and users usually know each other. This paper proposes an approach to help users to better understand the software that they use, and thereby more easily achieve their objectives—even when they do not fully understand how the system is implemented. Our approach borrows the concept of metamorphic relations from the field of metamorphic testing (MT), using it in an innovative way that extends beyond MT. We also propose a "symmetry" metamorphic relation pattern and a "change direction" metamorphic relation input pattern that can be used to derive multiple concrete metamorphic relations. Empirical studies reveal previously unknown failures in some of the most popular applications in the world, and show how our approach can help users to better understand and better use the systems. The empirical results provide strong evidence of the simplicity, applicability, and effectiveness of our methodology.
Conference Paper
Metamorphic testing is a well known approach to tackle the oracle problem in software testing. This technique requires the use of source test cases that serve as seeds for the generation of follow-up test cases. Systematic design of test cases is crucial for the test quality. Thus, source test case generation strategy can make a big impact on the fault detection effectiveness of metamorphic testing. Most of the previous studies on metamorphic testing have used either random test data or existing test cases as source test cases. There has been limited research done on systematic source test case generation for metamorphic testing. This paper provides a comprehensive evaluation on the impact of source test case generation techniques on the fault finding effectiveness of metamorphic testing. We evaluated the effectiveness of line coverage, branch coverage, weak mutation and random test generation strategies for source test case generation. The experiments are conducted with 77 methods from 4 open source code repositories. Our results show that by systematically creating source test cases, we can significantly increase the fault finding effectiveness of metamorphic testing. Further, in this paper we introduce a simple metamorphic testing tool called "METtester" that we use to conduct metamorphic testing on these methods.
Article
Metamorphic testing is an approach to both test case generation and test result verification. A central element is a set of metamorphic relations, which are necessary properties of the target function or algorithm in relation to multiple inputs and their expected outputs. Since its first publication, we have witnessed a rapidly increasing body of work examining metamorphic testing from various perspectives, including metamorphic relation identification, test case generation, integration with other software engineering techniques, and the validation and evaluation of software systems. In this article, we review the current research of metamorphic testing and discuss the challenges yet to be addressed. We also present visions for further improvement of metamorphic testing and highlight opportunities for new research.
Article
The terms “Oracle Problem” and “Non-testable system” interchangeably refer to programs in which the application of test oracles is infeasible. Test oracles are an integral part of conventional testing techniques; thus, such techniques are inoperable in these programs. The prevalence of the oracle problem has inspired the research community to develop several automated testing techniques that can detect functional software faults in such programs. These techniques include N-Version testing, Metamorphic Testing, Assertions, Machine Learning Oracles, and Statistical Hypothesis Testing. This paper presents a Mapping Study that covers these techniques. The Mapping Study presents a series of discussions about each technique, from different perspectives, e.g. effectiveness, efficiency, and usability. It also presents a comparative analysis of these techniques in terms of these perspectives. Finally, potential research opportunities within the non-testable systems problem domain are highlighted within the Mapping Study. We believe that the aforementioned discussions and comparative analysis will be invaluable for new researchers that are attempting to familiarise themselves with the field, and be a useful resource for practitioners that are in the process of selecting an appropriate technique for their context, or deciding how to apply their selected technique. We also believe that our own insights, which are embedded throughout these discussions and the comparative analysis, will be useful for researchers that are already accustomed to the field. It is our hope that the potential research opportunities that have been highlighted by the Mapping Study will steer the direction of future research endeavours.
Article
While the relation between code coverage measures and fault detection is actively studied, only a few works have investigated the correlation between measures of coverage and of reliability. In this work, we introduce a novel approach to measuring code coverage, called operational coverage, that takes into account how much the program's entities are exercised, so as to reflect the usage profile in the coverage measure. Operational coverage is proposed as (i) an adequacy criterion, i.e., to assess the thoroughness of a black-box test suite derived from the operational profile, and as (ii) a selection criterion, i.e., to select test cases for operational profile-based testing. Our empirical evaluation showed that operational coverage is better correlated than traditional coverage with the probability that the next test case derived according to the user's profile will not fail. This result suggests that our approach could provide a good stopping rule for operational profile-based testing. With respect to test case selection, our investigations revealed that operational coverage outperformed the traditional one in terms of test suite size and fault detection capability when looking at the average results.
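The abstract does not give the exact definition of operational coverage; one plausible formalization, weighting each covered entity by its probability of being exercised under the operational profile, is sketched below with made-up profile and coverage data.

```python
def operational_coverage(profile: dict, covered: set) -> float:
    """Weight each program entity by its probability of being exercised under
    the operational profile (a plausible reading of the abstract, not
    necessarily the authors' exact definition)."""
    total = sum(profile.values())
    return sum(p for entity, p in profile.items() if entity in covered) / total

# Hypothetical profile: branch -> relative execution frequency in the field.
profile = {"b1": 0.60, "b2": 0.25, "b3": 0.10, "b4": 0.05}
covered_by_suite = {"b1", "b3"}
print(f"plain branch coverage: {len(covered_by_suite) / len(profile):.2f}")              # 0.50
print(f"operational coverage:  {operational_coverage(profile, covered_by_suite):.2f}")   # 0.70
```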
Article
Diversity has been widely studied in software testing as a guidance towards effective sampling of test inputs in the vast space of possible program behaviors. However, diversity has received relatively little attention in mutation testing. The traditional mutation adequacy criterion is a one-dimensional measure of the total number of killed mutants. We propose a novel, diversity-aware mutation adequacy criterion called the distinguishing mutation adequacy criterion, which is fully satisfied when each of the considered mutants can be identified by the set of tests that kill it, thereby encouraging the inclusion of a more diverse range of tests. This paper presents the formal definition of distinguishing mutation adequacy and its score. Subsequently, an empirical study investigates the relationship among the distinguishing mutation score, fault detection capability, and test suite size. The results show that the distinguishing mutation adequacy criterion detects 1.33 times more unseen faults than the traditional mutation adequacy criterion, at the cost of a 1.56 times increase in test suite size, for adequate test suites that fully satisfy the criteria. The results show a better picture for inadequate test suites; on average, 8.63 times more unseen faults are detected at the cost of a 3.14 times increase in test suite size.
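As a rough illustration of the idea (not the paper's exact formula), the sketch below treats the set of tests that kill a mutant as that mutant's signature and scores a suite by the fraction of distinct signatures, so two mutants killed by exactly the same tests count as indistinguishable.

```python
from collections import defaultdict

def distinguishing_score(kills):
    """kills maps mutant -> frozenset of tests that kill it. One plausible
    reading of the criterion: fully satisfied when every mutant has a unique
    kill signature; score = fraction of distinct signatures."""
    by_signature = defaultdict(list)
    for mutant, killing_tests in kills.items():
        by_signature[killing_tests].append(mutant)
    return len(by_signature) / len(kills)

kills = {
    "m1": frozenset({"t1", "t2"}),
    "m2": frozenset({"t1", "t2"}),   # killed, but indistinguishable from m1
    "m3": frozenset({"t3"}),
}
traditional = sum(1 for tests in kills.values() if tests) / len(kills)
print(f"traditional mutation score:    {traditional:.2f}")                    # 1.00
print(f"distinguishing mutation score: {distinguishing_score(kills):.2f}")    # 0.67
```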
Article
One of the grand challenges in adequately testing complex software is the oracle problem. Metamorphic Testing (MT) is a promising technique to alleviate the oracle problem through using one or multiple Metamorphic Relations (MRs) as test oracles. MT checks the satisfaction of every MR among the outputs of the MR-related tests instead of the correctness of individual test outputs. In practice, it is fairly easy to find MRs for testing any program, but it is very difficult to develop “good” MRs and evaluate their adequacy. A systematic approach for developing MRs and evaluating their adequacy in MT remains to be developed. In this paper, we propose a framework for evaluating MT and iteratively developing adequate MRs monitored by MT adequacy evaluation. MT adequacy is measured by program coverage, mutation testing, and testing MRs with mutation tests. The MT evaluation results are used to guide the iterative development of MRs, generate tests, and analyze test outputs. We explain the framework through a testing example on an image processing program that is used for building the 3-dimensional structure of a biological cell based on its confocal image sections. To demonstrate the effectiveness of the proposed framework, we report a case study of testing a complex scientific program: a Monte Carlo modeling program that simulates photon propagation in turbid tissue phantoms for accurate and efficient generation of reflectance images from biological tissues. The case study has shown the effectiveness of the proposed MT framework for testing scientific software in general and the necessity of enhancing MT through the development of adequate MRs. The case study results can be easily adapted for testing other software.
Article
A test oracle determines whether a test execution reveals a fault, often by comparing the observed program output to the expected output. This is not always practical, for example when a program's input-output relation is complex and difficult to capture formally. Metamorphic testing provides an alternative, where correctness is not determined by checking an individual concrete output, but by applying a transformation to a test input and observing how the program output 'morphs' into a different one as a result. Since the introduction of such metamorphic relations in 1998, many contributions on metamorphic testing have been made, and the technique has seen successful applications in a variety of domains, ranging from web services to computer graphics. This article provides a comprehensive survey on metamorphic testing: It summarises the research results and application areas, and analyses common practice in empirical studies of metamorphic testing as well as the main open challenges.
Article
Metamorphic testing is a promising technique for testing software systems when the oracle problem exists, and has been successfully applied to various application domains and paradigms. An important and essential task in metamorphic testing is the identification of metamorphic relations, which, due to the absence of a systematic and specification-based methodology, has often been done in an ad hoc manner, something which has hindered the applicability and effectiveness of metamorphic testing. To address this, a systematic methodology for identifying metamorphic relations based on the category-choice framework, called METRIC, is introduced in this paper. A tool implementing this methodology has been developed and examined in an experiment to determine the viability and effectiveness of METRIC, with the results of the experiment confirming that METRIC is both effective and efficient at identifying metamorphic relations.
Conference Paper
Mutation testing is a valuable experimental research technique that has been used in many studies. It has been experimentally compared with other test criteria, and also used to support experimental comparisons of other test criteria, by using mutants as a method to create faults. In effect, mutation is often used as a "gold standard" for experimental evaluations of test methods. Although mutation testing is powerful, it is a complicated and computationally expensive testing method. Therefore, automated tool support is indispensable for conducting mutation testing. This demo presents a publicly available mutation system for Java that supports both method-level mutants and class-level mutants. MuJava can be freely downloaded and installed with relative ease under both Unix and Windows. MuJava is offered as a free service to the community and we hope that it will promote the use of mutation analysis for experimental research in software testing.
Article
Testing involves examining the behaviour of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behavior is called the “test oracle problem”. Test oracle automation is important to remove a current bottleneck that inhibits greater overall test automation. Without test oracle automation, the human has to determine whether observed behaviour is correct. The literature on test oracles has introduced techniques for oracle automation, including modelling, specifications, contract-driven development and metamorphic testing. When none of these is completely adequate, the final source of test oracle information remains the human, who may be aware of informal specifications, expectations, norms and domain specific information that provide informal oracle guidance. All forms of test oracles, even the humble human, involve challenges of reducing cost and increasing benefit. This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.
Article
A test data adequacy criterion is a set of rules used to determine whether or not sufficient testing has been performed. A general axiomatic theory of test data adequacy is developed, and five previously proposed adequacy criteria are examined to see which of the axioms are satisfied. It is shown that the axioms are consistent, but that only two of the criteria satisfy all of the axioms.
Article
Concurrent programs are normally composed of multiple concurrent threads sharing memory space. These threads are often interleaved, which may lead to non-determinism in execution results, even for the same program input. This poses huge challenges to the testing of concurrent programs, especially for test result verification, that is, the prevalent oracle problem. In this paper, we investigate the application of metamorphic testing, a mainstream technique for addressing the oracle problem, to the testing of concurrent programs. Based on the unique features of interleaved executions in concurrent programming, we propose an extended notion of metamorphic relations, the core part of metamorphic testing, particularly designed for the testing of concurrent programs. A comprehensive testing approach, namely ConMT, is thus developed, and a tool is built to automate its implementation on concurrent programs written in Java. Empirical studies have been conducted to evaluate the performance of ConMT, and the experimental results show that, in addition to addressing the oracle problem, ConMT outperforms baseline traditional testing techniques with respect to degree of automation, bug-detection capability, and testing time. It is clear that ConMT can significantly improve the cost-effectiveness of testing concurrent programs and thus advances the state of the art in the field. The study also brings novelty into metamorphic testing, hence promoting fundamental research in software testing.
Article
Metamorphic testing addresses the oracle problem by comparing the transformed results of multiple test executions. The relationship that governs the output transformation is called a metamorphic relation. Metamorphic relations require expert knowledge, and their generation is considered a time-consuming task. Researchers have proposed various techniques to automate metamorphic relation generation and selection. Although there are several research articles on this issue, there is a lack of an overview of the state of the art in metamorphic relation automation. As such, we performed a systematic literature review (SLR) to collect, extract, and synthesize the required data. Based on our research questions, the literature was categorized and summarized into different categories. We found that the automation of metamorphic relations is most effective in mathematical and scientific applications. We concluded that some approaches involve analysis of different forms of software-related information, such as control flow graphs and program dependence graphs, as well as an initial set of metamorphic relations, whereas other methods involve analysis of executions of the software functions with random and specific inputs. The results show that this field is still in its infancy, with opportunities for novel work, especially in methods utilizing machine learning.
Article
Over the past decade, metamorphic testing has gained rapidly increasing attention from both academia and industry, particularly thanks to its high efficacy on revealing real-life software faults in a wide variety of application domains. On the basis of a set of metamorphic relations among multiple software inputs and their expected outputs, metamorphic testing not only provides a test case generation strategy by constructing new (or follow-up) test cases from some original (or source) test cases, but also a test result verification mechanism through checking the relationship between the outputs of source and follow-up test cases. Many efforts have been made to further improve the cost-effectiveness of metamorphic testing from different perspectives. Some studies attempted to identify “good” metamorphic relations, while other studies were focused on applying effective test case generation strategies especially for source test cases. In this paper, we propose improving the cost-effectiveness of metamorphic testing by leveraging the feedback information obtained in the test execution process. Consequently, we develop a new approach, namely feedback-directed metamorphic testing, which makes use of test execution information to dynamically adjust the selection of metamorphic relations and selection of source test cases. We conduct an empirical study to evaluate the proposed approach based on four laboratory programs, one GNU program, and one industry program. The empirical results show that feedback-directed metamorphic testing can use fewer test cases and take less time than the traditional metamorphic testing for detecting the same number of faults. It is clearly demonstrated that the use of feedback information about test execution does help enhance the cost-effectiveness of metamorphic testing. Our work provides a new perspective to improve the efficacy and applicability of metamorphic testing as well as many other software testing techniques.
Article
The prosperous trend of deploying deep neural network (DNN) models to diverse hardware platforms has boosted the development of deep learning (DL) compilers. DL compilers take the high-level DNN model specifications as input and generate optimized DNN executables for diverse hardware architectures like CPUs, GPUs, and various hardware accelerators. Compiling DNN models into high-efficiency executables is not easy: the compilation procedure often involves converting high-level model specifications into several different intermediate representations (IR), e.g., graph IR and operator IR, and performing rule-based or learning-based optimizations from both platform-independent and platform-dependent perspectives. Despite the prosperous adoption of DL compilers in real-world scenarios, principled and systematic understanding toward the correctness of DL compilers does not yet exist. To fill this critical gap, this paper introduces MT-DLComp, a metamorphic testing framework specifically designed for DL compilers to effectively uncover erroneous compilations. Our approach leverages deliberately-designed metamorphic relations (MRs) to launch semantics-preserving mutations toward DNN models to generate their variants. This way, DL compilers can be automatically examined for compilation correctness utilizing DNN models and their variants without requiring manual intervention. We also develop a set of practical techniques to realize an effective workflow and localize identified error-revealing inputs. Real-world DL compilers exhibit a high level of engineering quality. Nevertheless, we detected over 435 inputs that can result in erroneous compilations in four popular DL compilers, all of which are industry-strength products maintained by Amazon, Facebook, Microsoft, and Google. While the discovered error-triggering inputs do not cause the DL compilers to crash directly, they can lead to the generation of incorrect DNN executables. With substantial manual effort and help from the DL compiler developers, we uncovered four bugs in these DL compilers by debugging them using the error-triggering inputs. Our proposed testing frameworks and findings can be used to guide developers in their efforts to improve DL compilers.
Article
Metamorphic testing is a technique that makes use of some necessary properties of the software under test, termed metamorphic relations, to construct new test cases, namely follow-up test cases, based on some existing test cases, namely source test cases. Because it can verify testing results without the need for test oracles, it has been widely used in many application domains and has detected many real-life faults. Numerous investigations have been conducted to further improve the effectiveness of metamorphic testing, most of which focused on the identification and selection of “good” metamorphic relations. Recently, a few studies emerged on the research direction of how to generate and select source test cases that are effective in fault detection. In this paper, we propose a novel approach to generating source test cases based on their associated path constraints, which are obtained through symbolic execution. The path distance among test cases is leveraged to guide the prioritization of source test cases, which further improves efficiency. A tool has been developed to automate the proposed approach as much as possible. Empirical studies have also been conducted to evaluate the fault-detection effectiveness of the approach. The results show that this approach enhances both the performance and the automation of metamorphic testing. It also highlights interesting research directions for further improving metamorphic testing.
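The abstract does not spell out the distance measure or the prioritization algorithm; the sketch below assumes each source test case is summarized by the set of branch conditions on its symbolic-execution path, uses a Jaccard-style distance, and orders tests greedily by farthest-first traversal.

```python
def path_distance(p1: set, p2: set) -> float:
    """Jaccard-style distance between two path constraints, each summarized
    as a set of branch conditions (an assumption made for illustration)."""
    union = p1 | p2
    return 1.0 if not union else 1.0 - len(p1 & p2) / len(union)

def prioritize(paths: dict) -> list:
    """Greedy farthest-first ordering: repeatedly pick the test whose path is
    farthest (on average) from the paths of tests already selected."""
    remaining = dict(paths)
    first = next(iter(remaining))
    order = [first]
    del remaining[first]
    while remaining:
        best = max(remaining, key=lambda t: sum(path_distance(remaining[t], paths[s])
                                                for s in order) / len(order))
        order.append(best)
        del remaining[best]
    return order

paths = {"t1": {"a>0", "b<5"}, "t2": {"a>0", "b>=5"}, "t3": {"a<=0", "c==1"}}
print(prioritize(paths))   # t3 is picked right after t1: it shares no conditions with it
```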
Article
Context: Recently, advances in Deep Learning (DL) have promoted the development of DL-driven image recognition systems in various fields, such as medical treatment and face detection, almost achieving the same level of performance as the human brain. Nevertheless, using DL-driven image recognition systems in these safety-critical domains requires ensuring the accuracy and stability of these systems. Recent research in this direction mainly focuses on using transformations of the overall image to detect inconsistencies in image recognition systems. However, the influence of the image background region (i.e., the region of the image other than the target object) on the recognition result and the robustness evaluation of the systems are not considered. Objective: To evaluate the robustness of DL-driven image recognition systems to image background changes, this paper introduces DeepBackground, a novel metamorphic testing method for DL-driven image recognition systems. Method: First, we define a new metric, termed Background-Relevance (BRC), to assess the degree of influence of the image background region on the recognition result of the image recognition systems. DeepBackground defines a series of domain-specific metamorphic relations (MRs) combined with BRC and automatically generates many follow-up test images based on these MRs. Finally, DeepBackground detects inconsistencies in these systems and evaluates their robustness to image background changes according to BRC. Results: Our empirical validation on 3 commercial image recognition services and 6 popular convolutional neural network (CNN) models shows that DeepBackground can not only evaluate the robustness of these image recognition systems to image background changes according to BRC, but also detect their inconsistent behaviors. Conclusion: DeepBackground is capable of automatically generating high-quality test input images to detect inconsistencies in image recognition systems and of evaluating the robustness of these systems to image background changes according to BRC.
Article
Software testing depends on effective oracles. Implicit oracles, such as checks for program crashes, are widely applicable but narrow in scope. Oracles based on formal specifications can reveal application-specific failures, but specifications are expensive to obtain and maintain. Metamorphic oracles are somewhere in-between. They test equivalence among different procedures to detect semantic failures. Until now, the identification of metamorphic relations has been a manual and expensive process, except for few specific domains where automation is possible. We present MeMo, a technique and a tool to automatically derive metamorphic equivalence relations from natural language documentation, and we use such metamorphic relations as oracles in automatically generated test cases. Our experimental evaluation demonstrates that 1) MeMo can effectively and precisely infer equivalence metamorphic relations, 2) MeMo complements existing state-of-the-art techniques that are based on dynamic program analysis, and 3) metamorphic relations discovered with MeMo effectively detect defects when used as test oracles in automatically-generated or manually-written test cases.
Article
Metamorphic Relations (MRs) play a key role in determining the fault detection capability of Metamorphic Testing (MT). As human judgement is required for MR identification, systematic MR generation has long been an important research area in MT. Additionally, due to the extra program executions required for follow-up test cases, some concerns have been raised about MT cost-effectiveness. Consequently, the reduction in testing costs associated with MT has become another important issue to be addressed. MR composition can address both of these problems. This technique can automatically generate new MRs by composing existing ones, thereby reducing the number of follow-up test cases. Despite this advantage, previous studies on MR composition have empirically shown that some composite MRs have lower fault detection capability than their corresponding component MRs. To investigate this issue, we performed theoretical and empirical analyses to identify what characteristics component MRs should possess so that their corresponding composite MR has at least the same fault detection capability as the component MRs do. We have also derived a convenient, but effective guideline so that the fault detection capability of MT will most likely not be reduced after composition.
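Composition itself is described only abstractly above; a tiny worked example with two textbook MRs of the sine function (ours, not the paper's) shows how chaining the input transformations yields a composite MR that needs only one follow-up execution.

```python
import math

x = 0.7  # arbitrary source input

# Component MR1: sin(x) == sin(pi - x)   (input transform: x -> pi - x)
# Component MR2: sin(-y) == -sin(y)      (input transform: y -> -y)
# Composing MR1 then MR2 chains the transforms: x -> pi - x -> -(pi - x) = x - pi,
# and the composite output relation becomes sin(x - pi) == -sin(x).
assert math.isclose(math.sin(math.pi - x), math.sin(x))                  # MR1
assert math.isclose(math.sin(-(math.pi - x)), -math.sin(math.pi - x))    # MR2 applied to pi - x
assert math.isclose(math.sin(x - math.pi), -math.sin(x))                 # composite MR

# The composite needs only one follow-up execution (at x - pi) instead of two,
# which is the cost saving that motivates composition.
print("composite MR holds")
```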
Article
Metamorphic Testing is a software testing paradigm which aims at using necessary properties of a system under test, called metamorphic relations, either to check its expected outputs or to generate new test cases. Metamorphic Testing has been successful in testing programs for which a full oracle is not available, or for which there are uncertainties about the expected outputs, such as learning systems. In this article, we propose Adaptive Metamorphic Testing as a generalization of a simple yet powerful reinforcement learning technique, namely contextual bandits, to select one of the multiple metamorphic relations available for a program. By using contextual bandits, Adaptive Metamorphic Testing learns which metamorphic relations are likely to transform a source test case such that it has a higher chance of discovering faults. We present experimental results over two major case studies in machine learning, namely image classification and object detection, and identify weaknesses and robustness boundaries. Adaptive Metamorphic Testing efficiently identifies weaknesses of the tested systems in the context of the source test case.
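The full approach conditions MR selection on features of the source test case via contextual bandits; the stripped-down, non-contextual epsilon-greedy sketch below conveys only the underlying bandit idea, with an invented faulty SUT and two invented MRs.

```python
import math
import random

random.seed(0)

# Hypothetical faulty SUT: wrong behaviour for negative inputs only.
def buggy_cos(x):
    return math.cos(x) + (0.1 if x < 0 else 0.0)

# Two candidate MRs; only the second one is sensitive to the seeded fault.
MRS = {
    "periodicity": lambda f, x: math.isclose(f(x), f(x + 2 * math.pi), abs_tol=1e-6),
    "evenness":    lambda f, x: math.isclose(f(x), f(-x), abs_tol=1e-6),
}

violations = {name: 0 for name in MRS}
trials = {name: 1 for name in MRS}   # start at 1 to avoid division by zero
EPSILON = 0.2

for _ in range(200):
    if random.random() < EPSILON:                       # explore
        name = random.choice(list(MRS))
    else:                                               # exploit the most-violated MR so far
        name = max(MRS, key=lambda n: violations[n] / trials[n])
    x = random.uniform(0.1, 5.0)                        # positive source input
    trials[name] += 1
    if not MRS[name](buggy_cos, x):
        violations[name] += 1

print({n: round(violations[n] / trials[n], 2) for n in MRS})  # "evenness" dominates
```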
Conference Paper
Coverage-guided kernel fuzzing is a widely-used technique that has helped kernel developers and testers discover numerous vulnerabilities. However, due to the high complexity of the application and hardware environment, there has been little study of deploying fuzzing on enterprise-level Linux kernels. In this paper, collaborating with the enterprise developers, we present an industry practice of deploying kernel fuzzing on four different enterprise Linux distributions that are responsible for the internal business and external services of the company. We addressed the following outstanding challenges when deploying a popular kernel fuzzer, syzkaller, to these enterprise Linux distributions: absent coverage support, inconsistent kernel configurations, bugs in shallow paths, and the complexity of continuous fuzzing. This led to the detection of 41 reproducible, previously unknown bugs in these enterprise Linux kernels and 6 bugs with CVE IDs in the U.S. National Vulnerability Database, including flaws that cause general protection faults, deadlocks, and use-after-free errors.
Conference Paper
Recent advances in Deep Neural Networks (DNNs) have led to the development of DNN-driven autonomous cars that, using sensors like cameras, LiDAR, etc., can drive without any human intervention. Most major manufacturers, including Tesla, GM, Ford, BMW, and Waymo/Google, are working on building and testing different types of autonomous vehicles. The lawmakers of several US states, including California, Texas, and New York, have passed new legislation to fast-track the process of testing and deploying autonomous vehicles on their roads. However, despite their spectacular progress, DNNs, just like traditional software, often demonstrate incorrect or unexpected corner-case behaviors that can lead to potentially fatal collisions. Several such real-world accidents involving autonomous cars have already happened, including one that resulted in a fatality. Most existing testing techniques for DNN-driven vehicles are heavily dependent on the manual collection of test data under different driving conditions, which becomes prohibitively expensive as the number of test conditions increases. In this paper, we design, implement, and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, our tool automatically generates test cases leveraging real-world changes in driving conditions like rain, fog, and lighting conditions. DeepTest systematically explores different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, fog), many of which lead to potentially fatal crashes, in three top-performing DNNs in the Udacity self-driving car challenge.
Conference Paper
Though mutation testing has been widely studied for more than thirty years, the prevalence and properties of equivalent mutants remain largely unknown. We report on the causes and prevalence of equivalent mutants and their relationship to stubborn mutants (those that remain undetected by a high quality test suite, yet are non-equivalent). Our results, based on manual analysis of 1,230 mutants from 18 programs, reveal a highly uneven distribution of equivalence and stubbornness. For example, the ABS class and half UOI class generate many equivalent and almost no stubborn mutants, while the LCR class generates many stubborn and few equivalent mutants. We conclude that previous test effectiveness studies based on fault seeding could be skewed, while developers of mutation testing tools should prioritise those operators that we found generate disproportionately many stubborn (and few equivalent) mutants.
Conference Paper
Metamorphic testing uses domain-specific properties about a program's intended behaviour to alleviate the oracle problem. From a given set of source test inputs, a set of follow-up test inputs are generated which have some relation to the source inputs, and their outputs are compared to outputs from the source tests, using metamorphic relations. We evaluate the use of an automated test input generation technique called dynamic symbolic execution (DSE) to generate the source test inputs for metamorphic testing. We investigate whether DSE increases source-code coverage and fault finding effectiveness of metamorphic testing compared to the use of random testing, and whether the use of metamorphic relations as a supportive technique improves the test inputs generated by DSE. Our results show that DSE improves the coverage and fault detection rate of metamorphic testing compared to random testing using significantly smaller test suites, and the use of metamorphic relations increases code coverage of both DSE and random tests considerably, but the improvement in the fault detection rate may be marginal and depends on the used metamorphic relations.
Conference Paper
Metamorphic Testing (MT) aims to alleviate the oracle problem. In MT, testers define metamorphic relations (MRs) which are used to generate new test cases (referred to as follow-up test cases) from the available test cases (referred to as source test cases). Both source and follow-up test cases are executed and their outputs are verified against the relevant MRs, of which any violation implies that the software under test is faulty. So far, the research on the effectiveness of MT has been focused on the selection of better MRs (that is, MRs that are more likely to be violated). In addition to MR selection, the source and follow-up test cases may also affect the effectiveness of MT. Since follow-up test cases are defined by the source test cases and MRs, selection of source test cases will then affect the effectiveness of MT. However, in existing MT studies, random testing is commonly adopted as the test case selection strategy for source test cases. This study aims to investigate the impact of source test cases on the effectiveness of MT. Since Adaptive Random Testing (ART) has been developed as an enhancement to Random Testing (RT), this study will focus on comparing the performance of RT and ART as source test case selection strategies on the effectiveness of MT. Experiment results show that ART outperforms RT on enhancing the effectiveness of MT.
Conference Paper
When figuring out the expected output for each test case is difficult, metamorphic testing can be applied to alleviate such situations. An involved key challenge is to derive metamorphic relations for the program under test. This paper proposes a data-mutation-directed metamorphic relation acquisition methodology called μMT. Experimental results on three case studies show that μMT is feasible for deriving metamorphic relations for numeric applications, and the derived metamorphic relations show reasonable fault detection effectiveness.
Article
Random testing (RT) has been widely used in the testing of various software and hardware systems. Adaptive random testing (ART) is a family of random testing techniques that aim to enhance the failure-detection effectiveness of RT by spreading random test cases evenly throughout the input domain. ART has been empirically shown to be effective on software with numeric inputs. However, there are two aspects of ART that need to be addressed to render its adoption more widespread - applicability to programs with non-numeric inputs, and the high computation overhead of many ART algorithms. We present a linear-order ART algorithm for software with non-numeric inputs. The key requirement for using ART with non-numeric inputs is an appropriate "distance" measure. We use the concepts of categories and choices from category-partition testing to formulate such a measure. We investigate the failure-detection effectiveness of our technique by performing an empirical study on 14 object programs, using two standard metrics - F-measure and P-measure. Our ART algorithm statistically significantly outperforms RT on 10 of the 14 programs studied, and exhibits performance similar to RT on three of the four remaining programs. The selection overhead of our ART algorithm is close to that of RT.
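A compact sketch of the fixed-candidate-set flavor of ART for such non-numeric inputs is shown below: each test case is a tuple of category choices, and distance is the number of categories on which two test cases differ (a Hamming-style measure in the spirit of the abstract; the input model is invented).

```python
import random

random.seed(1)

# Hypothetical non-numeric input model: one choice per category.
CATEGORIES = {
    "file_type": ["txt", "csv", "bin"],
    "permission": ["read", "write", "none"],
    "size": ["empty", "small", "huge"],
}

def random_test():
    return tuple(random.choice(choices) for choices in CATEGORIES.values())

def choice_distance(t1, t2):
    """Number of categories in which the two test cases pick different choices."""
    return sum(a != b for a, b in zip(t1, t2))

def art_next(executed, candidates=10):
    """Fixed-candidate-set ART: among random candidates, pick the one whose
    minimum distance to the already-executed tests is largest."""
    pool = [random_test() for _ in range(candidates)]
    return max(pool, key=lambda c: min(choice_distance(c, e) for e in executed))

suite = [random_test()]
for _ in range(4):
    suite.append(art_next(suite))
print(suite)
```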
Article
Comprehensive, automated software testing requires an oracle to check whether the output produced by a test case matches the expected behaviour of the programme. But the challenges in creating suitable oracles limit the ability to perform automated testing in some programmes, and especially in scientific software. Metamorphic testing is a method for automating the testing process for programmes without test oracles. This technique operates by checking whether the programme behaves according to properties called metamorphic relations. A metamorphic relation describes the change in output when the input is changed in a prescribed way. Unfortunately, finding the metamorphic relations satisfied by a programme or function remains a labour-intensive task, which is generally performed by a domain expert or a programmer. In this work, we propose a machine learning approach for predicting metamorphic relations that uses a graph-based representation of a programme to represent control flow and data dependency information. In earlier work, we found that simple features derived from such graphs provide good performance. An analysis of the features used in this earlier work led us to explore the effectiveness of several representations of those graphs using the machine learning framework of graph kernels, which provide various ways of measuring similarity between graphs. Our results show that a graph kernel that evaluates the contribution of all paths in the graph has the best accuracy and that control flow information is more useful than data dependency information. The data used in this study are available for download at http://www.cs.colostate.edu/saxs/MRpred/functions.tar.gz to help researchers in further development of metamorphic relation prediction methods.
Article
Metamorphic testing is very practical and effective for programs with oracle problems. Much research has been done in this field. Based upon existed methods of metamorphic testing and program path-analysis, the authors first present a set of metamorphic testing criteria for the test with binary metamorphic relations. These criteria define the adequacy of metamorphic test suites at several different levels. Then, three new testing algorithms are given to generate test suites that could satisfy the criteria above. Finally, these algorithms' performances are fully proved with the technique of mutation analysis. The experiment results show that testing effects are greatly decided by the selection of metamorphic relations and testing criteria, and the algorithm APCEMST could detect faults quickly and exactly with fewer test cases than traditional method.
Article
In recent years, a variety of encryption algorithms have been proposed to enhance the security of software and systems. Validating whether encryption algorithms are correctly implemented is a challenging issue. Software testing delivers an effective and practical solution, but it also faces the oracle problem (that is, under many practical situations, it is impossible or too computationally expensive to know whether the output for any given input is correct). In this paper, we propose a property-based approach to testing encryption programs in the absence of oracles. Our approach makes use of the so-called metamorphic properties of encryption algorithms to generate test cases and verify test results. Two case studies were conducted to illustrate the proposed approach and validate its effectiveness. Experimental results show that, even without oracles, the proposed approach can detect nearly 50% of the inserted faults with at most three metamorphic relations (MRs) and fifty test cases.
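The paper's concrete MRs are not given in the abstract; as one example of a metamorphic property an encryption routine can be tested against, textbook RSA is multiplicatively homomorphic, which yields the relation checked below (toy key sizes, for illustration only).

```python
import random

# Toy textbook-RSA parameters (illustration only; far too small to be secure).
p, q, e = 61, 53, 17
n = p * q   # 3233

def encrypt(m: int) -> int:
    return pow(m, e, n)

# MR: E(m1) * E(m2) mod n == E(m1 * m2 mod n)  (multiplicative homomorphism).
# No expected ciphertext is needed; only the relation between outputs is checked.
random.seed(0)
for _ in range(100):
    m1, m2 = random.randrange(1, n), random.randrange(1, n)
    assert (encrypt(m1) * encrypt(m2)) % n == encrypt((m1 * m2) % n)
print("multiplicative-homomorphism MR held for 100 random message pairs")
```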
Conference Paper
Metamorphic testing, which can effectively alleviate the oracle problem in software testing, fully exploits the information contained in successful test cases. One of the key factors affecting the results of metamorphic testing is the generation of test cases. In this paper, we propose a criterion called ECCEM (Equivalence-Class Coverage for Every Metamorphic Relation), which requires test cases to cover the equivalence classes of every metamorphic relation; the criterion can generate smaller test suites with a high fault detection rate. This paper also proposes a new measure of test cases, the Test Case Rate of utilization (TCR), which can comprehensively assess the generated test suite.
Article
In software testing, something which can verify the correctness of test case execution results is called an oracle. The oracle problem occurs when either an oracle does not exist, or exists but is too expensive to be used. Metamorphic testing is a testing approach which uses metamorphic relations, properties of the software under test represented in the form of relations among inputs and outputs of multiple executions, to help verify the correctness of a program. This paper presents new empirical evidence to support this approach, which has been used to alleviate the oracle problem in various applications and to enhance several software analysis and testing techniques. It has been observed that identification of a sufficient number of appropriate metamorphic relations for testing, even by inexperienced testers, was possible with a very small amount of training. Furthermore, the cost-effectiveness of the approach could be enhanced through the use of more diverse metamorphic relations. The empirical studies presented in this paper clearly show that a small number of diverse metamorphic relations, even those identified in an ad hoc manner, had a similar fault-detection capability to a test oracle, and could thus effectively help alleviate the oracle problem.
Article
Context: Because of its simplicity and effectiveness, Spectrum-Based Fault Localization (SBFL) has been one of the popular approaches to fault localization. It utilizes the execution result of failure or pass, and the corresponding coverage information (such as the program slice), to estimate the risk of being faulty for each program entity (such as a statement). However, all existing SBFL techniques assume the existence of a test oracle to determine the execution result of a test case. It is common that test oracles do not exist, and hence the applicability of SBFL has been severely restricted. Objective: We aim at developing a framework that can extend the application of SBFL to the common situations where test oracles do not exist. Method: Our approach uses a new concept of metamorphic slice resulting from the integration of metamorphic testing and program slicing. In SBFL, instead of using the program slice and the result of failure or pass for an individual test case, a metamorphic slice and the result of violation or non-violation of a metamorphic relation are used. Since we need not know the execution result for an individual test case, the existence of a test oracle is no longer a requirement to apply SBFL. Results: An experimental study involving nine programs and three risk evaluation formulas was conducted. The results show that our proposed solution delivers a performance comparable to that observed by existing SBFL techniques in situations where test oracles exist. Conclusion: With respect to the problem that SBFL is only applicable to programs with test oracles, we propose an innovative solution. Our solution is not only intuitively appealing and conceptually feasible, but also practically effective. Consequently, test oracles are no longer mandatory for SBFL, and hence the applicability of SBFL is significantly extended.
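A condensed sketch of the core substitution follows: a standard SBFL formula (Ochiai is used here as an example) is evaluated over metamorphic test groups, with MR violation/non-violation taking the role of fail/pass and the union of the statements covered by the source and follow-up executions taking the role of a single test's coverage; the coverage data is invented.

```python
import math

# Each metamorphic test group: (statements covered by source + follow-up runs, MR violated?)
groups = [
    ({"s1", "s2", "s3"}, True),
    ({"s1", "s3"},       False),
    ({"s2", "s3", "s4"}, True),
    ({"s3", "s4"},       False),
]

statements = set().union(*(cov for cov, _ in groups))
total_violated = sum(1 for _, violated in groups if violated)

def ochiai(stmt: str) -> float:
    """Ochiai suspiciousness, with 'violated MR' playing the role of 'failed test'."""
    violated_cov = sum(1 for cov, violated in groups if violated and stmt in cov)
    covered = sum(1 for cov, _ in groups if stmt in cov)
    denom = math.sqrt(total_violated * covered)
    return violated_cov / denom if denom else 0.0

for stmt in sorted(statements, key=ochiai, reverse=True):
    print(f"{stmt}: {ochiai(stmt):.2f}")   # s2 ranks highest in this toy example
```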