Abstract and Figures

Metamorphic testing is well known for its ability to alleviate the oracle problem in software testing. The main idea of metamorphic testing is to test a software system by checking whether each identified metamorphic relation (MR) holds among several executions. In this regard, identifying MRs is an essential task in metamorphic testing. In view of the importance of this identification task, METRIC (METamorphic Relation Identification based on Category-choice framework) was developed to help software testers identify MRs from a given set of complete test frames. However, during MR identification, METRIC primarily focuses on the input domain without sufficient attention given to the output domain, thereby hindering the effectiveness of METRIC. Inspired by this problem, we have extended METRIC into METRIC+ by incorporating the information derived from the output domain for MR identification. A tool implementing METRIC+ has also been developed. Two rounds of experiments, involving four real-life specifications, have been conducted to evaluate the effectiveness and efficiency of METRIC+. The results have confirmed that METRIC+ is highly effective and efficient in MR identification. Additional experiments have been performed to compare the fault detection capability of the MRs generated by METRIC+ and those by mMT (another MR identification technique). The comparison results have confirmed that the MRs generated by METRIC+ are highly effective in fault detection.
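The core idea in the abstract, checking that an MR holds across several executions instead of comparing each output against an oracle, can be illustrated with a minimal sketch. The sketch does not implement METRIC or METRIC+; the relation sin(x) = sin(π − x), the function names, and the seeded fault are all assumptions chosen only for illustration.

```python
import math
import random
from typing import Callable

def mr_holds(program: Callable[[float], float], x: float, tol: float = 1e-9) -> bool:
    """Check the illustrative MR program(x) == program(pi - x) for one source input."""
    source_output = program(x)               # execution on the source input
    follow_up_output = program(math.pi - x)  # execution on the follow-up input
    return math.isclose(source_output, follow_up_output, abs_tol=tol)

def count_violations(program: Callable[[float], float],
                     trials: int = 1000, seed: int = 0) -> int:
    """Count randomly drawn source inputs on which the MR is violated."""
    rng = random.Random(seed)
    return sum(not mr_holds(program, rng.uniform(-10.0, 10.0)) for _ in range(trials))

def faulty_sin(x: float) -> float:
    """Hypothetical faulty implementation: a small drift term breaks the symmetry."""
    return math.sin(x) + 0.001 * x

if __name__ == "__main__":
    print("violations, math.sin:  ", count_violations(math.sin))
    print("violations, faulty_sin:", count_violations(faulty_sin))
```

No expected value of sin(x) is needed for any individual input: a violation of the relation between the two executions is enough to signal a fault.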
... Our empirical studies involved seven subject programs, among which four implement real-life business workflows in different application domains (namely, PHONE, BAGGAGE, EXPENSE and MEAL), two are lexical analyzers (namely, print_tokens and print_tokens2), and the last one is a pattern matching engine (namely grep). Details of the subject programs, including the programming language, the lines of code, and a brief introduction, are shown in Table V. Interested readers can refer to these studies [17], [41] for the detailed functional specification of the subject programs. ...
... For each of the subject programs, we used METRIC+ [17] to identify the MRs from their corresponding program specification. The basic principle of METRIC+ is to identify an MR by combining a sub-relation on inputs and a sub-relation on outputs, which are deduced from a pair of input-and-output-based complete test frames (IO-CTFs). ...
... On the other hand, we would like to point out that a vast majority of existing studies only reported a small number of MRs for various kinds of software systems, which hardly provides a substantial data set that could give insight into the test adequacy assessment issue of MT. In view of these problems, we decided to reuse the subject systems reported in [17], which offered a relatively large number of MRs, and then we included three additional subject programs which have been widely adopted in existing studies [23], [54], [55]. Previous studies have demonstrated the appropriateness of these subject programs. ...
Preprint
Metamorphic testing (MT) is a simple yet effective technique to alleviate the oracle problem in software testing. The underlying idea of MT is to test a software system by checking whether metamorphic relations (MRs) hold among multiple test inputs (including source and follow-up inputs) and the actual outputs of their executions. Since MRs and source inputs are two essential components of MT, considerable efforts have been made to examine the systematic identification of MRs and the effective generation of source inputs, which has greatly enriched the fundamental theory of MT since its invention. However, few studies have investigated the test adequacy assessment issue of MT, which hinders the objective measurement of MT's test quality as well as the effective construction of test suites. Although in the context of traditional software testing there exist a number of test adequacy criteria that specify testing requirements to constitute an adequate test from various perspectives, they are not in line with MT's focus, which is to test the software under test (SUT) from the perspective of necessary properties. In this paper, we propose a new set of criteria that specifies testing requirements from the perspective of necessary properties satisfied by the SUT, and design a test adequacy measurement that evaluates the degree of adequacy based on both MRs and source inputs. The experimental results show that the proposed measurement can effectively indicate the fault detection effectiveness of test suites, i.e., test suites with increased test adequacy usually exhibit higher effectiveness in fault detection. Our work attempts to assess the test adequacy of MT from a new perspective, and our criteria and measurement provide a new approach to evaluating the test quality of MT as well as guidelines for constructing effective MT test suites.
... This problem has inspired a growing number of investigations into the systematic generation of MRs. For example, some studies focus on deriving new MRs based on existing ones [23,48,61]; others attempt to construct MRs by leveraging the basic ideas of some mainstream test case generation methods [15,73]; still others make use of machine learning techniques to obtain valid MRs [37][38][39]. MR generation has recently emerged as one of the most important and popular topics in MT research. ...
... Note that some papers were produced from collaborations among these researchers; for example, Tsong Yueh Chen, Zhi Quan Zhou and Dave Towey have co-authored three papers on MR patterns. It is also understandable that these top authors have collaborated with other active researchers, such as Antonio Ruiz-Cortés [80], Chang-Ai Sun [73], and Xiaoyuan Xie [15], in developing different types of MR generation techniques. ...
... Sun et al. [73] extended the original METRIC technique by considering both the input and output domains of the software under test, resulting in a new technique called METRIC+. In METRIC+, the "original" categories and choices solely defined based on the input domain were called I-categories and I-choices ("I" stands for "Input"), respectively. ...
Article
Metamorphic testing has become a mainstream technique to address the notorious oracle problem in software testing, thanks to its great successes in revealing real-life bugs in a wide variety of software systems. Metamorphic relations, the core component of metamorphic testing, have continuously attracted research interest from both academia and industry. In the last decade, a rapidly increasing number of studies have been conducted to systematically generate metamorphic relations from various sources and for different application domains. In this article, based on a systematic review of the state of the art in metamorphic relation generation, we summarize and highlight visions for further advancing the theory and techniques for identifying and constructing metamorphic relations, and discuss promising research directions in related areas.
... However, testing translation systems presents a challenge compared to other supervised tasks, such as classification, due to the complexity of the output format. To determine the correctness of a translation, metamorphic relations are typically used as the mainstream testing methodology because of their universal applicability and cost-effectiveness (Sun et al. 2021). Recent works (He et al. 2021, 2020; Gupta et al. 2020; Pesu et al. 2018; Wang et al. 2019; Zhou and Sun 2018; Xie et al. 2020, 2022) have explored the use of metamorphic testing approaches, where a token in the input sentence is mutated to test resulting translations. ...
Article
Neural Machine Translation (NMT) has experienced significant growth over the last decade. Despite these advancements, machine translation systems still face various issues. In response, metamorphic testing approaches have been introduced for testing machine translation systems. Such approaches involve token replacement, where a single token in the original source sentence is substituted to create mutants. By comparing the translations of mutants with the original translation, potential bugs in the translation systems can be detected. However, the selection of tokens for replacement in the original sentence remains an intriguing problem, deserving further exploration in testing approaches. To address this problem, we design two white-box approaches to identify vulnerable tokens in the source sentence, whose perturbation is most likely to induce translation bugs for a translation system. The first approach, named GRI, utilizes the GRadient Information to identify the vulnerable tokens for replacement, and our second approach, named WALI, uses Word ALignment Information to locate the vulnerable tokens. We evaluate the proposed approaches on a Transformer-based translation system with the News Commentary dataset and 200 English sentences extracted from CNN articles. The results show that both GRI and WALI can effectively generate high-quality test cases for revealing translation bugs. Specifically, our approaches can always outperform state-of-the-art automatic machine translation testing approaches from two aspects: (1) under a certain testing budget (i.e., number of executed test cases), both GRI and WALI can reveal a larger number of bugs than baseline approaches, and (2) when given a predefined testing goal (i.e., number of detected bugs), our approaches always require fewer testing resources (i.e., a reduced number of test cases to execute).
Article
Practical problems in scientific computation that solve differential equations rarely have explicit exact solutions. Therefore, verifying the correctness of such programs has long been a challenge due to the difficulty of producing expected outputs on test cases. In this paper, the principles of metamorphic testing are applied to verify programs that solve second‐order elliptic differential equations. We present a testing process specifically tailored for the verification testing of scientific computation programs and integrate it into the process of developing scientific software. Unlike existing approaches, we formally derive metamorphic relations from the numerical models of differential equations built in the development process of scientific computing programs. The experimental results clearly show that our approach is effective in detecting faults commonly found in scientific computing programs. It outperforms the fault-detecting ability of the trend method, which is a traditional testing method for scientific software.
Article
Metamorphic testing (MT) is effective in detecting software failures; it detects failures by examining the metamorphic relations (MRs) among source test cases (STCs), follow‐up test cases (FTCs) and their respective outputs. The STCs together with the corresponding FTCs, considered as a whole, are called metamorphic groups (MGs). MT performance relies heavily on the MRs and MGs. Previous studies have mainly focused on improving MT performance by identifying effective MRs, or through generation of MGs with high quality, but have somewhat neglected the selection of MRs and MGs from existing ones. In this paper, we address this issue by introducing a new metric for guiding the selection of effective MR‐MG pairs from a new perspective: The MR‐MG pair is chosen such that the MR makes the current MG as far away as possible from the executed MGs. We design an MR‐MG pair selection algorithm, named metamorphic relation and group selection based on adaptive random testing (MRGS‐ART), to implement our metric. The intuition behind MRGS‐ART is that we attempt to improve MT performance by achieving an even distribution of STCs and FTCs in their corresponding input domains for all the MRs used. Experimental results indicate that MRGS‐ART can enhance MT performance. We believe that this is the first comprehensive and systematic demonstration, from the perspective of both MRs and MGs, that making STCs and FTCs evenly distributed in their corresponding input domains can improve MT performance. Finally, by analysing the experimental results, we provide guidance on how to most effectively implement MRGS‐ART.
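The selection principle described above (prefer the candidate that lies farthest from what has already been executed) can be sketched in the style of adaptive random testing. This is not the MRGS‐ART algorithm itself: the abstract does not give its distance metric, so the sketch assumes test cases are numeric points, uses Euclidean distance, and applies a max-min rule; the name `select_farthest` is hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

def select_farthest(candidates: List[Point], executed: List[Point]) -> Point:
    """Return the candidate whose nearest executed point is farthest away.

    Assumption: each candidate stands for one metamorphic group reduced to
    a single numeric point; a real MG holds several test cases.
    """
    if not executed:
        # Nothing has run yet, so any candidate is equally "far".
        return candidates[0]
    return max(candidates,
               key=lambda c: min(math.dist(c, e) for e in executed))

if __name__ == "__main__":
    executed = [(0.0, 0.0), (0.4, 0.6)]
    candidates = [(0.1, 0.1), (0.9, 0.9), (0.5, 0.5)]
    print(select_farthest(candidates, executed))
```

Repeating this step and moving each selected point into `executed` spreads the executed test cases over the input domain, which is the intuition the abstract attributes to MRGS‐ART.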
Article
Although the security testing of Web systems can be automated by generating crafted inputs, solutions to automate the test oracle, i.e., vulnerability detection, remain difficult to apply in practice. Specifically, though previous work has demonstrated the potential of metamorphic testing (security failures can be determined by metamorphic relations that turn valid inputs into malicious inputs), metamorphic relations are typically executed on a large set of inputs, which is time-consuming and thus makes metamorphic testing impractical. We propose AIM, an approach that automatically selects inputs to reduce testing costs while preserving vulnerability detection capabilities. AIM includes a clustering-based black-box approach to identify similar inputs based on their security properties. It also relies on a novel genetic algorithm to efficiently select diverse inputs while minimizing their total cost. Further, it contains a problem-reduction component to reduce the search space and speed up the minimization process. We evaluated the effectiveness of AIM on two well-known Web systems, Jenkins and Joomla, with documented vulnerabilities. We compared AIM's results with four baselines involving standard search approaches. Overall, AIM reduced metamorphic testing time by 84% for Jenkins and 82% for Joomla, while preserving the same level of vulnerability detection. Furthermore, AIM significantly outperformed all the considered baselines regarding vulnerability coverage.
Article
The DHR architecture provides a revolutionary security defense structure for cyberspace. The multimode ruling in DHR is expected to alleviate the oracle problem, but it still suffers from the existence of common model vulnerability. In this work, we design a test segmentation method to transform multimode ruling into a metamorphic testing problem. The text test input that causes inconsistency among heterogeneous executors is converted to a condition set, and we extract subsets of conditions based on its syntax tree. The original test can exploit a specific vulnerability, and the follow‐up tests are composed of different subsets of conditions within the original test. We collect the execution matrix for the follow‐up tests to analyse the impact of each subset of conditions on the ruling decision. Metamorphic relations are extracted based on the localization of independent conditions, that is, the subsets of conditions that can impact the ruling decision independently. The executors in an inconsistent ruling should be examined with metamorphic testing methods, rather than the traditional majority voting mechanism. The proposed test segmentation and improved multimode ruling methods are evaluated on two DHR‐based cases, SQL injection in cyber‐range system and deserialization attack in ‐ project. The experimental results show that our test segmentation can help to locate malicious expressions and that the metamorphic testing‐based multimode ruling can generate more correct results than the majority voting mechanism, with an average 15.8% performance loss.
Article
Metamorphic testing (MT) is an effective testing technique having a broad range of applications. One key task for MT is the identification of metamorphic relations (MRs), which is a fundamental mechanism in MT and is critical to the automation of MT. Prior studies have proposed approaches for predicting MRs (PMR). One major idea behind these PMR approaches is to represent program source code information via manually designed code features and then to apply machine‐learning–based classifiers to automatically predict whether a specific MR can be applied on the target program. Nevertheless, the human‐involved procedure of selecting and extracting code features is costly, and it may not be easy to obtain sufficiently comprehensive features for representing source code. To overcome this limitation, in this study, we explore and evaluate the effectiveness of code representation learning techniques for PMR. By applying neural code representation models for automatically mapping program source code to code vectors, the PMR procedure can be boosted with learned code representations. We develop 32 PMR instances by, respectively, combining 8 code representation models with 4 typical classification models and conduct an extensive empirical study to investigate the effectiveness of code representation learning techniques in the context of MR prediction. Our findings reveal that code representation learning can positively contribute to the prediction of MRs and provide insights into the practical usage of code representation models in the context of MR prediction. Our findings could help researchers and practitioners to gain a deeper understanding of the strength of code representation learning for PMR and, hence, pave the way for future research in deriving or extracting MRs from program source code.
Article
We propose a robustness testing approach for software systems that process large amounts of data. Our method uses metamorphic relations to check software output for erroneous input in the absence of a tangible test oracle. We use this technique to test two major citation database systems: Scopus and the Web of Science. We report a surprising finding that the inclusion of hyphens in paper titles impedes citation counts, and that this is a result of the lack of robustness of the citation database systems in handling hyphenated paper titles. Our results are valid for the entire literature as well as for individual fields such as chemistry. We further find a strong and significant negative correlation between the journal impact factor (JIF) of IEEE Transactions on Software Engineering (TSE) and the percentage of hyphenated paper titles published in TSE. Similar results are found for ACM Transactions on Software Engineering and Methodology. A software engineering field-wide study reveals that the higher JIF-ranked journals are publishing a lower percentage of papers with hyphenated titles. Our results challenge the common belief that citation counts and JIFs are reliable measures of the impact of papers and journals, as they can be distorted simply by the presence of hyphens in paper titles.
Article
Metamorphic testing can test untestable software, detecting fatal errors in autonomous vehicles' onboard computer systems.
Article
Modern information technology paradigms, such as online services and off-the-shelf products, often involve a wide variety of users with different or even conflicting objectives. Every software output may satisfy some users, but may also fail to satisfy others. Furthermore, users often do not know the internal working mechanisms of the systems. This situation is quite different from bespoke software, where developers and users usually know each other. This paper proposes an approach to help users to better understand the software that they use, and thereby more easily achieve their objectives—even when they do not fully understand how the system is implemented. Our approach borrows the concept of metamorphic relations from the field of metamorphic testing (MT), using it in an innovative way that extends beyond MT. We also propose a "symmetry" metamorphic relation pattern and a "change direction" metamorphic relation input pattern that can be used to derive multiple concrete metamorphic relations. Empirical studies reveal previously unknown failures in some of the most popular applications in the world, and show how our approach can help users to better understand and better use the systems. The empirical results provide strong evidence of the simplicity, applicability, and effectiveness of our methodology.
Article
Cryptographic hash functions are security-critical algorithms with many practical applications, notably in digital signatures. Developing an approach to test them can be particularly difficult, and bugs can remain unnoticed for many years. We revisit the National Institute of Standards and Technology hash function competition, which was used to develop the SHA-3 standard, and apply a new testing strategy to all available reference implementations. Motivated by the cryptographic properties that a hash function should satisfy, we develop four tests. The Bit-Contribution Test checks if changes in the message affect the hash value, and the Bit-Exclusion Test checks that changes beyond the last message bit leave the hash value unchanged. We develop the Update Test to verify that messages are processed correctly in chunks, and then use combinatorial testing methods to reduce the test set size by several orders of magnitude while retaining the same fault-detection capability. Our tests detect bugs in 41 of the 86 reference implementations submitted to the SHA-3 competition, including the rediscovery of a bug in all submitted implementations of the SHA-3 finalist BLAKE. This bug remained undiscovered for seven years, and is particularly serious because it provides a simple strategy to modify the message without changing the hash value returned by the implementation. We detect these bugs using a fully automated testing approach.
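Two of the tests named above can be sketched against Python's standard `hashlib`. This is an illustrative sketch, not the authors' test harness: it targets `hashlib.sha3_256` rather than the competition reference implementations, the function names are hypothetical, and the Bit-Exclusion Test is omitted because `hashlib` accepts whole bytes only, so there are no bits beyond the last message bit to vary.

```python
import hashlib

def update_test(message: bytes, chunk_size: int) -> bool:
    """Update Test: hashing in chunks must equal hashing in one call."""
    one_shot = hashlib.sha3_256(message).digest()
    incremental = hashlib.sha3_256()
    for start in range(0, len(message), chunk_size):
        incremental.update(message[start:start + chunk_size])
    return incremental.digest() == one_shot

def bit_contribution_test(message: bytes) -> bool:
    """Bit-Contribution Test: flipping any single message bit must change the hash."""
    baseline = hashlib.sha3_256(message).digest()
    for bit in range(len(message) * 8):
        mutated = bytearray(message)
        mutated[bit // 8] ^= 1 << (bit % 8)  # flip exactly one bit
        if hashlib.sha3_256(bytes(mutated)).digest() == baseline:
            return False  # this bit did not contribute to the hash value
    return True

if __name__ == "__main__":
    msg = b"metamorphic testing" * 10
    print(all(update_test(msg, size) for size in (1, 7, 64, 1000)))
    print(bit_contribution_test(b"metamorphic"))
```

Both checks are metamorphic in spirit: neither needs the correct hash value of any message, only a relation between the outputs of related executions.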
Article
Metamorphic testing is an approach to both test case generation and test result verification. A central element is a set of metamorphic relations, which are necessary properties of the target function or algorithm in relation to multiple inputs and their expected outputs. Since its first publication, we have witnessed a rapidly increasing body of work examining metamorphic testing from various perspectives, including metamorphic relation identification, test case generation, integration with other software engineering techniques, and the validation and evaluation of software systems. In this article, we review the current research of metamorphic testing and discuss the challenges yet to be addressed. We also present visions for further improvement of metamorphic testing and highlight opportunities for new research.
Article
The terms “Oracle Problem” and “Non-testable system” interchangeably refer to programs in which the application of test oracles is infeasible. Test oracles are an integral part of conventional testing techniques; thus, such techniques are inoperable in these programs. The prevalence of the oracle problem has inspired the research community to develop several automated testing techniques that can detect functional software faults in such programs. These techniques include N-Version testing, Metamorphic Testing, Assertions, Machine Learning Oracles, and Statistical Hypothesis Testing. This paper presents a Mapping Study that covers these techniques. The Mapping Study presents a series of discussions about each technique, from different perspectives, e.g. effectiveness, efficiency, and usability. It also presents a comparative analysis of these techniques in terms of these perspectives. Finally, potential research opportunities within the non-testable systems problem domain are highlighted within the Mapping Study. We believe that the aforementioned discussions and comparative analysis will be invaluable for new researchers that are attempting to familiarise themselves with the field, and be a useful resource for practitioners that are in the process of selecting an appropriate technique for their context, or deciding how to apply their selected technique. We also believe that our own insights, which are embedded throughout these discussions and the comparative analysis, will be useful for researchers that are already accustomed to the field. It is our hope that the potential research opportunities that have been highlighted by the Mapping Study will steer the direction of future research endeavours.
Article
Web Application Programming Interfaces (APIs) allow systems to interact with each other over the network. Modern Web APIs often adhere to the REST architectural style, being referred to as RESTful Web APIs. RESTful Web APIs are decomposed into multiple resources (e.g., a video in the YouTube API) that clients can manipulate through HTTP interactions. Testing Web APIs is critical but challenging due to the difficulty of assessing the correctness of API responses, i.e., the oracle problem. Metamorphic testing alleviates the oracle problem by exploiting relations (so-called metamorphic relations) among multiple executions of the program under test. In this paper, we present a metamorphic testing approach for the detection of faults in RESTful Web APIs. We first propose six abstract relations that capture the shape of many of the metamorphic relations found in RESTful Web APIs, which we call Metamorphic Relation Output Patterns (MROPs). Each MROP can then be instantiated into one or more concrete metamorphic relations. The approach was evaluated using both automatically seeded and real faults in six subject Web APIs. Among other results, we identified 60 metamorphic relations (instances of the proposed MROPs) in the Web APIs of Spotify and YouTube. Each metamorphic relation was implemented using both random and manual test data, running over 4.7K automated tests. As a result, 11 issues were detected (3 in Spotify and 8 in YouTube), 10 of them confirmed by the API developers or reproduced by other users, supporting the effectiveness of the approach.
Conference Paper
Recent advances in Deep Neural Networks (DNNs) have led to the development of DNN-driven autonomous cars that, using sensors like camera, LiDAR, etc., can drive without any human intervention. Most major manufacturers including Tesla, GM, Ford, BMW, and Waymo/Google are working on building and testing different types of autonomous vehicles. The lawmakers of several US states including California, Texas, and New York have passed new legislation to fast-track the process of testing and deployment of autonomous vehicles on their roads. However, despite their spectacular progress, DNNs, just like traditional software, often demonstrate incorrect or unexpected corner-case behaviors that can lead to potentially fatal collisions. Several such real-world accidents involving autonomous cars have already happened, including one which resulted in a fatality. Most existing testing techniques for DNN-driven vehicles are heavily dependent on the manual collection of test data under different driving conditions, which becomes prohibitively expensive as the number of test conditions increases. In this paper, we design, implement, and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, our tool is designed to automatically generate test cases leveraging real-world changes in driving conditions like rain, fog, lighting conditions, etc. DeepTest systematically explores different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, fog, etc.), many of which lead to potentially fatal crashes in three top-performing DNNs in the Udacity self-driving car challenge.
Article
We present an automated technique for finding defects in compilers for graphics shading languages. A key challenge in compiler testing is the lack of an oracle that classifies an output as correct or incorrect; this is particularly pertinent in graphics shader compilers where the output is a rendered image that is typically under-specified. Our method builds on recent successful techniques for compiler validation based on metamorphic testing, and leverages existing high-value graphics shaders to create sets of transformed shaders that should be semantically equivalent. Rendering mismatches are then indicative of shader compilation bugs. Deviant shaders are automatically minimized to identify, in each case, a minimal change to an original high-value shader that induces a shader compiler bug. We have implemented the approach as a tool, GLFuzz, targeting the OpenGL shading language, GLSL. Our experiments over a set of 17 GPU and driver configurations, spanning the main 7 GPU designers, have led to us finding and reporting more than 60 distinct bugs, covering all tested configurations. As well as defective rendering, these issues identify security-critical vulnerabilities that affect WebGL, including a significant remote information leak security bug where a malicious web page can capture the contents of other browser tabs, and a bug whereby visiting a malicious web page can lead to a "blue screen of death" under Windows 10. Our findings show that shader compiler defects are prevalent, and that metamorphic testing provides an effective means for detecting them automatically.
Conference Paper
Automated test frameworks play a significant role in test-driven software development methodologies. The XUnit family of testing tools has been widely used in the industry. However, these tools are weak in supporting test case generation and test result checking. In this paper we propose a new kind of test automation framework that integrates data mutation testing and metamorphic testing methods. A tool for unit testing of Java classes, called JFuzz, is presented. Its uses are illustrated by examples.