Conference Paper

FAST approaches to scalable similarity-based test case prioritization

Authors:
  • Breno Miranda
  • Emilio Cruciani
  • Roberto Verdecchia (Vrije Universiteit Amsterdam (VU))
  • Antonia Bertolino

Abstract

Many test case prioritization criteria have been proposed for speeding up fault detection. Among them, similarity-based approaches give priority to the test cases that are the most dissimilar from those already selected. However, the proposed criteria do not scale up to the test suites of many thousands, or even millions, of test cases found in modern industrial systems, where simple heuristics are used instead. We introduce the FAST family of test case prioritization techniques, which radically changes this landscape by borrowing algorithms commonly exploited in the big data domain to find similar items. FAST techniques provide scalable similarity-based test case prioritization in both a white-box and a black-box fashion. The results from experimentation on real-world C and Java subjects show that the fastest members of the family outperform other black-box approaches in efficiency with no significant impact on effectiveness, and also outperform white-box approaches, including greedy ones, if preparation time is not counted. A simulation study of scalability shows that one FAST technique can prioritize a million test cases in less than 20 minutes.
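The "big data" machinery behind FAST is shingling, minhashing, and locality-sensitive hashing, as several citing excerpts below note. As a rough illustration of the core idea only, and not the authors' actual implementation, a minhash signature replaces each test with a compact sketch whose component-wise agreement estimates Jaccard similarity; prioritization then repeatedly schedules the test least similar to everything already scheduled. The parameter names (k, num_hashes) and the farthest-first loop are assumptions of this sketch:

```python
# Illustrative sketch: minhash-based similarity prioritization in the spirit
# of FAST. Not the paper's exact algorithm.
import random
import zlib

def shingles(text, k=5):
    """Hash every k-character window of a test's text (or coverage string)."""
    return {zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)} or {0}

def signature(shingle_set, hash_params, prime=(1 << 61) - 1):
    """One minimum per (a, b) pair approximates a random permutation's minimum."""
    return [min((a * s + b) % prime for s in shingle_set) for a, b in hash_params]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing components estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def prioritize(tests, num_hashes=64, seed=0):
    """tests: dict mapping test name -> text; returns a prioritized order."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, 1 << 32), rng.randrange(1 << 32))
              for _ in range(num_hashes)]
    sigs = {t: signature(shingles(body), params) for t, body in tests.items()}
    order = [next(iter(tests))]
    remaining = set(tests) - {order[0]}
    while remaining:
        # Pick the test whose most-similar scheduled test is least similar.
        nxt = min(remaining,
                  key=lambda t: max(estimated_jaccard(sigs[t], sigs[s]) for s in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The paper additionally buckets signatures with locality-sensitive hashing so that only likely-similar candidates are ever compared, which is what removes the quadratic pairwise-distance cost; the sketch keeps the comparison explicit for readability.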


... The test suite reduction problem [24,2,18,1,29] is the problem of reducing the size of a given test suite while satisfying a given test criterion. Typical criteria are the so-called coverage-based criteria, which ensure that the coverage of the reduced test suite is above a certain minimal threshold. ...
... Typical criteria are the so-called coverage-based criteria, which ensure that the coverage of the reduced test suite is above a certain minimal threshold. The test case selection problem [24,2,18,1,29] is the dual problem, in that it tries to determine the minimal number of tests to be added to a given test suite so that a given test criterion is attained. As most of these algorithms are targeted at the industrial setting, they assume severe time constraints on the test selection process. ...
... As most of these algorithms are targeted at the industrial setting, they assume severe time constraints on the test selection process. Hence, the vast majority of the proposed approaches for test suite reduction and selection are based on approximate algorithms, such as similarity-based algorithms [2,18], which are not guaranteed to find the optimal test suite even when given enough resources. In order to achieve a compromise between precision and scalability, the authors of [1] proposed a combination of standard ILP encodings and heuristic approaches. ...
Preprint
Full-text available
Computer Science course instructors routinely have to create comprehensive test suites to assess programming assignments. The creation of such test suites is typically not trivial as it involves selecting a limited number of tests from a set of (semi-)randomly generated ones. Manual strategies for test selection do not scale when considering large testing inputs needed, for instance, for the assessment of algorithms exercises. To facilitate this process, we present TestSelector, a new framework for automatic selection of optimal test suites for student projects. The key advantage of TestSelector over existing approaches is that it is easily extensible with arbitrarily complex code coverage measures, not requiring these measures to be encoded into the logic of an exact constraint solver. We demonstrate the flexibility of TestSelector by extending it with support for a range of classical code coverage measures and using it to select test suites for a number of real-world algorithms projects, further showing that the selected test suites outperform randomly selected ones in finding bugs in students' code.
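The coverage-based reduction and selection criteria described in the excerpts above, like TestSelector's coverage-driven selection, boil down to a set-cover-style greedy choice. A minimal sketch under assumed inputs (this is not TestSelector's implementation, which supports arbitrarily complex coverage measures):

```python
# Hypothetical greedy reduction: repeatedly keep the test that adds the most
# not-yet-covered elements until a coverage threshold is reached.
def reduce_suite(coverage, threshold=0.95):
    """coverage: dict test name -> set of covered elements.
    threshold: fraction of all coverable elements the reduced suite must keep."""
    universe = set().union(*coverage.values()) if coverage else set()
    needed = threshold * len(universe)
    covered, selected = set(), []
    while len(covered) < needed:
        best = max(coverage, key=lambda t: len(coverage[t] - covered))
        if not coverage[best] - covered:
            break                       # nothing left to gain; threshold unreachable
        selected.append(best)
        covered |= coverage[best]
    return selected
```

The greedy heuristic is approximate in exactly the sense the excerpt describes: it is fast, but not guaranteed to find the optimal (smallest) suite, which is why exact ILP encodings are proposed as a counterweight.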
... That is, considering the practical application of TCP, including the GA algorithm, both effectiveness and efficiency are important. However, existing TCP approaches, including the GA algorithm, suffer from the efficiency problem; e.g., previous work shows that most existing TCP approaches cannot deal with large-scale application scenarios [13], [15], [18]. Furthermore, some work [13], [15], [18] points out that the GA algorithm spends a dramatically long time on prioritization. ...
... However, existing TCP approaches, including the GA algorithm, suffer from the efficiency problem; e.g., previous work shows that most existing TCP approaches cannot deal with large-scale application scenarios [13], [15], [18]. Furthermore, some work [13], [15], [18] points out that the GA algorithm spends a dramatically long time on prioritization. Note that in the 20-year history of GA, no approach has been proposed to improve its efficiency while preserving its high effectiveness. ...
... We also empirically compared AGA with FAST [18], which focuses on the TCP efficiency problem. As FAST [18] targets a different problem, improving the time efficiency by sacrificing effectiveness, such a comparison in terms of efficiency may be a bit unfair to our AGA approach. ...
Preprint
Full-text available
In recent years, many test case prioritization (TCP) techniques have been proposed to speed up the process of fault detection. However, little work has taken the efficiency problem of these techniques into account. In this paper, we target the Greedy Additional (GA) algorithm, which has been widely recognized to be effective but less efficient, and try to improve its efficiency while preserving effectiveness. In our Accelerated GA (AGA) algorithm, we use some extra data structures to reduce redundant data accesses in the GA algorithm and thus the time complexity is reduced from $\mathcal{O}(m^2n)$ to $\mathcal{O}(kmn)$ when $n > m$, where $m$ is the number of test cases, $n$ is the number of program elements, and $k$ is the iteration number. Moreover, we observe the impact of iteration numbers on prioritization efficiency on our dataset and propose to use a specific iteration number in the AGA algorithm to further improve the efficiency. We conducted experiments on 55 open-source subjects. In particular, we implemented each TCP algorithm with two kinds of widely-used input formats, adjacency matrix and adjacency list. Since a TCP algorithm with adjacency matrix is less efficient than the algorithm with adjacency list, the result analysis is mainly conducted based on TCP algorithms with adjacency list. The results show that AGA achieves 5.95X speedup ratio over GA on average, while it achieves the same average effectiveness as GA in terms of Average Percentage of Fault Detected (APFD). Moreover, we conducted an industrial case study on 22 subjects, collected from Baidu, and find that the average speedup ratio of AGA over GA is 44.27X, which indicates the practical usage of AGA in real-world scenarios.
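For context, the baseline that AGA accelerates is the classic "additional greedy" loop, sketched below under simplifying assumptions (plain sets rather than the paper's adjacency structures). Each outer pick rescans every remaining test, which is where the $\mathcal{O}(m^2n)$ cost comes from:

```python
# Plain additional-greedy prioritization (the algorithm AGA speeds up).
# Each iteration rescans all remaining tests, giving O(m^2 * n) worst case
# for m tests and n program elements; AGA avoids these redundant accesses.
def additional_greedy(coverage):
    """coverage: dict test name -> set of covered program elements."""
    universe = set().union(*coverage.values()) if coverage else set()
    remaining = dict(coverage)
    uncovered = set(universe)
    order = []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] & uncovered))
        if not remaining[best] & uncovered:
            if uncovered == universe:       # remaining tests cover nothing at all
                order.extend(remaining)
                break
            uncovered = set(universe)       # all coverable units hit: reset
            continue
        order.append(best)
        uncovered -= remaining.pop(best)
    return order
```

The reset step (starting over once everything coverable is covered) is the standard convention for additional-greedy prioritization; AGA's contribution is bookkeeping that avoids recomputing the per-test gains from scratch in every iteration.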
... If the population is large, this can be an expensive process. To reduce this cost, we perform a selection procedure on a randomly-chosen subset of the population (lines 19-20), explained below: identify the best solution in the subset. ...
... Later research has proposed ways to speed up diversity calculations. One study used locality-sensitive hashing to speed up the diversity calculations [20]. Another study used the pair-wise distance values of all test cases as input to a dimensionality reduction algorithm so that a two-dimensional (2D) visual "map" of industrial test suites could be provided to software engineers [21]. ...
... Try random mutations until we see a better solution, or until we exhaust the number of tries. ...
Preprint
Full-text available
Unit testing is a stage of testing where the smallest segment of code that can be tested in isolation from the rest of the system - often a class - is tested. Unit tests are typically written as executable code, often in a format provided by a unit testing framework such as pytest for Python. Creating unit tests is a time and effort-intensive process with many repetitive, manual elements. To illustrate how AI can support unit testing, this chapter introduces the concept of search-based unit test generation. This technique frames the selection of test input as an optimization problem - we seek a set of test cases that meet some measurable goal of a tester - and unleashes powerful metaheuristic search algorithms to identify the best possible test cases within a restricted timeframe. This chapter introduces two algorithms that can generate pytest-formatted unit tests, tuned towards coverage of source code statements. The chapter concludes by discussing more advanced concepts and gives pointers to further reading for how artificial intelligence can support developers and testers when unit testing software.
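The excerpts above reference two building blocks of such search algorithms: tournament selection over a random subset of the population, and repeated random mutation with a bounded try budget. A generic sketch of both, where fitness() and mutate() are placeholder operators rather than this chapter's code:

```python
# Generic metaheuristic building blocks, assuming problem-specific
# fitness() and mutate() operators are supplied by the caller.
import random

def tournament_select(population, fitness, subset_size=10, rng=random):
    """Pick the best solution from a randomly chosen subset of the population."""
    subset = rng.sample(population, min(subset_size, len(population)))
    return max(subset, key=fitness)

def mutate_until_better(solution, fitness, mutate, max_tries=20, rng=random):
    """Try random mutations until one improves fitness or the budget runs out."""
    base = fitness(solution)
    for _ in range(max_tries):
        candidate = mutate(solution, rng)
        if fitness(candidate) > base:   # accept the first improvement found
            return candidate
    return solution                     # no improvement within the try budget
```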
... APFD continues to be one of the standard and most widely used (Miranda et al., 2018) metrics to evaluate prioritization effectiveness. This is defined as follows: ...
... The original definition of APFD specifies faults instead of failures. However, as fault knowledge is not available a priori, we followed the assumptions of (Miranda et al., 2018; Chen et al., 2018; Yu et al., 2019; Mondal and Nasre, 2019) by treating each failure as a different fault. ...
... Our choice was to disregard fault knowledge, although works with controlled experimentation using SIR subjects have extensively utilized this information. We observed that treating each failure as a different fault has already been leveraged in some existing works (Yang et al., 2011; Lidbury et al., 2015; Chen et al., 2018; Miranda et al., 2018; Yu et al., 2019; Peng et al., 2020; Lam et al., 2020). We followed this approach because of its wider scope, as software testing can be performed even when fault information is not available. ...
Article
Traditionally, given a test-suite and the underlying system-under-test, existing test-case prioritization heuristics report a permutation of the original test-suite that is seemingly best according to their criteria. However, we observe that a single heuristic does not perform optimally in all possible scenarios, given the diverse nature of software and its changes. Hence, multiple individual heuristics exhibit effectiveness differently. Interestingly, together, the heuristics bear the potential of improving the overall regression test selection across scenarios. In this paper, we pose the test-case prioritization as a rank aggregation problem from social choice theory. Our solution approach, named Hansie, is two-flavored: one involving priority-aware hybridization, and the other involving priority-blind computation of a consensus ordering from individual prioritizations. To speed-up test-execution, Hansie executes the aggregated test-case orderings in a parallel multi-processed manner leveraging regular windows in the absence of ties, and irregular windows in the presence of ties. We show the benefit of test-execution after prioritization and introduce a cost-cognizant metric (EPL) for quantifying overall timeline latency due to load-imbalance arising from uniform or non-uniform parallelization windows. We evaluate Hansie on 20 open-source subjects totaling 287,530 lines of source code, 69,305 test-cases, and with parallelization support of up to 40 logical CPUs.
... The output of this program is sorted into the following sets: "Select", "Medium", and "Discard". Input variables, as well as output variables, can take values between 1 and 10. In this case study, triangular membership functions are used for mapping random and flexible input sets during fuzzification, as well as for making dynamic output and complex sets during defuzzification. ...
... Consider a test suite T with several test cases, and let F be a set of m faults detected by T. TF_i is the position of the first test case in T' (one of T's orderings) that reveals fault i. The APFD of T' is then defined by the following equation [1,4,5,8,9]: ...
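The equation both excerpts refer to is the standard APFD definition (Rothermel, Elbaum et al.): for a test suite of $n$ test cases and $m$ detected faults, with $TF_i$ the position of the first test revealing fault $i$,

$$\mathrm{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n\,m} + \frac{1}{2n}$$

Higher values mean faults are revealed earlier in the ordering; treating each distinct failure as a distinct fault, as the excerpts above describe, simply changes what is counted in $m$ and $TF_i$.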
Chapter
Full-text available
Test case prioritization strategies order test cases so as to reduce the cost of regression testing and to improve a specific objective function; the most important test cases under given conditions are thereby executed earlier in the retesting process. Many strategies are available in the literature that focus on achieving various testing objectives and thus reducing cost. In practice, however, testers tend to pick from a few well-known prioritization strategies, largely because guidelines for the selection of TCP strategies are lacking. This part of the study therefore introduces a novel approach to TCP strategy selection that uses fuzzy concepts to support the effective choice of strategies for prioritizing test cases. This work is an extension of previously proposed selection schemes for test case prioritization.
... Further heuristics include topic modeling [64], or models of the system [37]. Miranda et al. [51] proposed fast methods to speed up the pair-wise distance computation, namely shingling and locality-sensitive hashing. Recently, Henard et al. [38] empirically compared many white-box and black-box prioritization techniques. ...
... This is an important aspect to investigate since a critical constraint in regression testing is that the cost of prioritizing test cases should be smaller than the time needed to run the test suite [68]. Therefore, fast approaches are fundamental from a practical point of view to enable rapid and continuous test iterations during SDC development [51]. ...
Preprint
Testing with simulation environments helps to identify critical failing scenarios in emerging autonomous systems such as self-driving cars (SDCs), and is safer than in-field operational tests. However, these tests are very expensive and too numerous to be run frequently within limited time constraints. In this paper, we investigate test case prioritization techniques to increase the ability to detect SDC regression faults with virtual tests earlier. Our approach, called SDC-Prioritizer, prioritizes virtual tests for SDCs according to static features of the roads used within the driving scenarios. These features are collected without running the tests and do not require past execution results. SDC-Prioritizer utilizes meta-heuristics to prioritize the test cases using diversity metrics (black-box heuristics) computed on these static features. Our empirical study conducted in the SDC domain shows that SDC-Prioritizer doubles the number of safety-critical failures that virtual tests can detect at the same level of execution time compared to baselines: random and greedy-based test case orderings. Furthermore, this meta-heuristic search performs statistically better than both baselines in terms of detecting safety-critical failures. SDC-Prioritizer effectively prioritizes test cases for SDCs with a large improvement in fault detection, while its overhead (up to 0.34% of the test execution cost) is negligible.
... Furthermore, approximate NNS (ANNS) should be able to significantly alleviate the computational overheads of distance calculations, especially in high dimensional input domains [42]. In software testing, NNS has been used to find the most similar test cases in regression testing [43], test case prioritization (TCP) [44], and model-based testing [45]. It has also been used to find the most diverse (opposite to similar) test cases in ART [46] and software product lines [47]. ...
... It has also been used to find the most diverse (opposite to similar) test cases in ART [46] and software product lines [47]. ANNS has also been successfully applied to enhance the efficiency in other areas of software testing, including TCP [44], test suite reduction [48] and prediction of test flakiness [49]. ...
Article
Full-text available
Adaptive random testing (ART) improves the failure-detection effectiveness of random testing by leveraging properties of the clustering of failure-causing inputs of most faulty programs: ART uses a sampling mechanism that evenly spreads test cases within a software’s input domain. The widely-used Fixed-Sized-Candidate-Set ART (FSCS-ART) sampling strategy faces a quadratic time cost, which worsens as the dimensionality of the software input domain increases. In this paper, we propose an approach based on small world graphs that can enhance the computational efficiency of FSCS-ART: SWFC-ART. To efficiently perform nearest neighbor queries for candidate test cases, SWFC-ART incrementally constructs a hierarchical navigable small world graph for previously executed, non-failure-causing test cases. Moreover, SWFC-ART has shown consistency in programs with high dimensional input domains. Our simulation and empirical studies show that SWFC-ART reduces the computational overhead of FSCS-ART from quadratic to log-linear order while maintaining the failure-detection effectiveness of FSCS-ART, and remaining consistent in high dimensional input domains. We recommend using SWFC-ART in practical software testing scenarios, where real-life programs often have high dimensional input domains and low failure rates.
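To make the quadratic cost concrete, a bare-bones FSCS-ART step is sketched below (illustrative only, with a Euclidean distance and a hypothetical sample_input generator). The inner min() over all executed tests is the nearest-neighbor query that SWFC-ART answers with a hierarchical navigable small world graph instead:

```python
# Minimal FSCS-ART step: among k random candidates, choose the one farthest
# from its nearest previously executed test. Scanning all executed tests per
# candidate is what makes classic FSCS-ART quadratic overall.
import random

def dist(a, b):
    """Euclidean distance between two numeric input tuples (illustrative)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fscs_art_next(executed, sample_input, k=10, rng=random):
    """executed: list of already-run inputs; sample_input: rng -> new input."""
    candidates = [sample_input(rng) for _ in range(k)]
    if not executed:
        return candidates[0]
    # Farthest candidate from its nearest executed neighbor.
    return max(candidates, key=lambda c: min(dist(c, e) for e in executed))
```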
... The FAST family of prioritization techniques has been described in [23]. The FAST techniques handle huge test suites by utilizing big data techniques to achieve scalability in TCP and to meet current industrial demands. ...
... Bertolino et al. [10] present the FAST family of TCP techniques, which radically changes this landscape by borrowing algorithms widely used in the big data domain for finding similar items. The FAST techniques provide scalable similarity-based TCP in both white-box and black-box fashion. ...
... Due to the relatively high computation cost of TCP algorithms, proposing TCP methods with lower computation costs for large-scale test suites has been investigated. Miranda et al. (2018) propose using hashing-based approaches to provide faster TCP algorithms. ...
Article
Full-text available
Regression testing activities greatly reduce the risk of faulty software releases. However, the size of the test suite grows throughout the development process, resulting in time-consuming execution of the test suite and delayed feedback to the software development team. This has created the need for approaches such as test case prioritization (TCP) and test-suite reduction to reach better results in case of limited resources. In this regard, approaches that use auxiliary sources of data, such as bug history, are attractive. We aim to propose an approach for TCP that takes into account test case coverage data, bug history, and test case diversification. To evaluate this approach, we study its performance on real-world open-source projects. The bug history is used to estimate the fault-proneness of source code areas. The diversification of test cases is preserved by incorporating fault-proneness into a clustering-based approach scheme. The proposed methods are evaluated on datasets collected from the development history of five real-world projects, including 357 versions in total. The experiments show that the proposed methods are superior to coverage-based TCP methods. The proposed approach shows that coverage-based and fault-proneness-based methods can be improved by combining diversification with fault-proneness incorporation.
... Despite the large body of research on coverage-based TCP [66,68,69,70], the total-greedy and additional-greedy strategies remain the most widely investigated prioritization strategies [7]. In addition to the above greedy-based strategies, researchers have also investigated other generic strategies [30,31]. ...
Preprint
Full-text available
Test case prioritization (TCP) aims to reorder the regression test suite with the goal of increasing the fault detection rate. Various TCP techniques have been proposed based on different prioritization strategies. Among them, greedy-based techniques are the most widely used. However, existing greedy-based techniques usually reorder all candidate test cases in each prioritization iteration, resulting in both efficiency and effectiveness problems. In this paper, we propose a generic partial attention mechanism, which adopts the previous priority values (i.e., the number of additionally-covered code units) to avoid considering all candidate test cases. Incorporating the mechanism with the additional-greedy strategy, we implement a novel coverage-based TCP technique based on partition ordering (OCP). OCP first groups the candidate test cases into different partitions and then updates the partitions in descending order. We conduct a comprehensive experiment on 19 versions of Java programs and 30 versions of C programs to compare the effectiveness and efficiency of OCP with six state-of-the-art TCP techniques: total-greedy, additional-greedy, lexicographical-greedy, unify-greedy, art-based, and search-based. The experimental results show that OCP achieves a better fault detection rate than the state-of-the-art techniques. Moreover, the time costs of OCP achieve an 85%-99% improvement over most of the state-of-the-art techniques.
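A sketch of the partial-attention idea, resting on the observation that a test's additional coverage can only shrink between iterations, so its previous value upper-bounds its current one. The data layout is hypothetical, not OCP's exact code:

```python
# One selection pass in the spirit of OCP's partial attention mechanism.
# Tests are bucketed by their previous additional-coverage value; buckets are
# visited in descending order, and scanning stops as soon as no remaining
# bucket can possibly beat the best gain seen so far.
def pick_next(buckets, uncovered, coverage):
    """buckets: dict prev_gain -> list of test names (gains from last round).
    uncovered: set of still-uncovered code units.
    coverage: dict test name -> set of covered code units."""
    best_test, best_gain = None, -1
    for prev_gain in sorted(buckets, reverse=True):
        if prev_gain <= best_gain:
            break                       # upper bound: no test here can win
        for test in buckets[prev_gain]:
            gain = len(coverage[test] & uncovered)
            if gain > best_gain:
                best_test, best_gain = test, gain
    return best_test, best_gain
```

Because most tests sit in buckets that never need to be scanned, each iteration touches far fewer candidates than the plain additional-greedy rescan.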
... There exist several benchmarks in the recent APR literature [62]-[64]. After searching the literature for benchmarks, we adopt Defects4J [15], as it has been continuously developed for a long time and has become the most widely studied dataset in APR studies [7], [8], [16], [37], and even in other software engineering research in general (e.g., fault localization [65]-[68] and test case prioritization [69], [70]). Defects4J consists of hundreds of known and reproducible real-world bugs from a collection of 16 real-world Java programs. ...
Preprint
Full-text available
Various automated program repair (APR) techniques have been proposed to fix bugs automatically in the last decade. Although recent research has made significant progress on effectiveness and efficiency, it is still unclear how APR techniques perform with human intervention in a real debugging scenario. To bridge this gap, we conduct an extensive study to compare three state-of-the-art APR tools with manual program repair, and further investigate whether the assistance of APR tools (i.e., repair reports) can improve manual program repair. To that end, we recruit 20 participants for a controlled experiment, resulting in a total of 160 manual repair tasks and a questionnaire survey. The experiment reveals several notable observations: (1) manual program repair may sometimes be influenced by the frequency of repair actions; (2) APR tools are more efficient in terms of debugging time, while manual program repair tends to generate a correct patch with fewer attempts; (3) APR tools can further improve manual program repair regarding the number of correctly-fixed bugs, although there is a negative impact on patch correctness; (4) participants tend to spend more time identifying incorrect patches, yet are still easily misguided; (5) participants are positive about the tools' repair performance, while they generally lack confidence about their usability in practice. Besides, we provide some guidelines for improving the usability of APR tools (e.g., regarding the misleading information in reports and the observation of feedback).
... Miranda et al. [10] presented the FAST family of TCP methods, which radically changes the landscape by borrowing algorithms widely used in big data to find similar items. The FAST techniques provide scalable similarity-based TCP in both white-box and black-box fashion. ...
... Table A4 shows the review protocol process. ...
Article
Full-text available
Software quality can be assured through the process of software testing. However, the software testing process involves many phases, which leads to greater resource and time consumption. To reduce these downsides, one approach is to adopt test case prioritization (TCP), and numerous works have indicated that TCP does improve overall software testing performance. TCP comprises several kinds of techniques, each with its own strengths and weaknesses. The main objective of this review paper is to examine machine learning (ML) techniques in depth, based on the research questions created. The research method for this paper was designed in parallel with the research questions. Consequently, 110 primary studies were selected, of which 58 were journal articles, 50 were conference papers, and 2 were considered other article types. Overall, ML techniques in TCP have been trending in recent years, yet some improvements are certainly welcome. Multiple ML techniques are available, each with its own potential value, advantages, and limitations. Notably, ML techniques have been considerably discussed as a TCP approach for software testing.
... Test report prioritization follows the idea of test case prioritization, which aims to rank test cases so as to reveal bugs earlier [18]. In evaluating the effectiveness of various test case prioritization techniques, the Average Percentage of Fault Detected (APFD) [12] is widely adopted to measure how rapidly a prioritized test suite detects defects when executing the test suite [19]. Therefore, we also employ APFD to evaluate the effectiveness of DivClass. ...
Article
Full-text available
In crowdsourced testing, crowd workers from different places help developers conduct testing and submit test reports for the observed abnormal behaviors. Developers manually inspect each test report and make an initial decision for the potential bug. However, due to the poor quality, test reports are handled extremely slowly. Meanwhile, due to the limitation of resources, some test reports are not handled at all. Therefore, some researchers attempt to resolve the problem of test report prioritization and have proposed many methods. However, these methods do not consider the impact of duplicate test reports. In this paper, we focus on the problem of test report prioritization and present a new method named DivClass by combining a diversity strategy and a classification strategy. First, we leverage Natural Language Processing (NLP) techniques to preprocess crowdsourced test reports. Then, we build a similarity matrix by introducing an asymmetric similarity computation strategy. Finally, we combine the diversity strategy and the classification strategy to determine the inspection order of test reports. To validate the effectiveness of DivClass, experiments are conducted on five crowdsourced test report datasets. Experimental results show that DivClass achieves 0.8887 in terms of APFD (Average Percentage of Fault Detected) and improves the state-of-the-art technique DivRisk by 14.12% on average. The asymmetric similarity computation strategy can improve DivClass by 4.82% in terms of APFD on average. In addition, empirical results show that DivClass can greatly reduce the number of inspected test reports.
... Graphical representation of the Dynamic Pairs Prioritization Approach example, where the nStartUp pairs (PR1, PR2, and PR3, shaded with diagonal lines) are placed into the DynamicPairs output during the first stage. The first iteration of the second stage is also represented, where the pair with the highest priority (PR6, emphasized with a dark gray shadow) is selected to be placed into the DynamicPairs output. ... These similarity measures have been successfully used as criteria to prioritize test cases both in general-purpose software systems (Fang et al., 2014; Miranda et al., 2018; Noor & Hemmati, 2015; Thomas et al., 2014) and in the context of product line engineering at the domain engineering level (Al-Hajjaji et al., 2014). The second reason is that, unlike output-based test similarity or the other quality metrics proposed in (Arrieta et al., 2018), input-based test similarity does not require tests to be executed beforehand. ...
Article
Full-text available
Product line testing is challenging due to the potentially huge number of configurations. Several approaches have tackled this challenge; most of them focused on reducing the number of tested products by selecting a representative subset. However, little attention has been paid to product line test optimization using test results, while tests are executed. This paper aims at optimizing the testing process of product lines by increasing the fault detection rate. To this end we propose a dynamic test prioritization approach. In contrast to traditional static test prioritization, our dynamic test prioritization leverages information of tests being executed in specific products. Processing this information, the initially prioritized tests are rearranged in order to find non-discovered faults. The proposed approach is valid for any kind of product lines, but we have adapted it to the context of configurable simulation models, an area where testing is especially time-consuming and optimization methods are paramount. The approach was empirically evaluated by employing two case studies. The results of this evaluation reveal that the proposed test prioritization approach improves both the static prioritization algorithm and the selected baseline technique. The results provide a basis for suggesting that the proposed dynamic test prioritization approach is appropriate to optimize the testing process of product lines.
... The first two account for reducing the expense of the testing process, by selecting a relevant subset and by minimizing the test suite to a subset satisfying the given coverage criteria, respectively. Prioritization organizes and ranks test cases in a way that aims to improve code coverage efficiency and thus supports early detection of faults (Miranda et al., 2018). Besides, it provides faster feedback, thereby allowing developers to debug as early as possible. ...
Article
Full-text available
Consistent regression testing (RT) is considered indispensable for assuring the quality of software systems, but it is expensive. To minimize the computational cost of RT, test case prioritization (TCP) is the most adopted methodology in the literature. The TCP process has been implemented using various hard clustering techniques, but fuzzy clustering, one of the most sought-after clustering techniques for selecting appropriate test cases, has not been explored on a wider scale. Therefore, the proposed work discusses a novel density-based fuzzy c-means (NDB-FCM) algorithm with a newly derived initial membership function for prioritizing test cases. It first generates the optimal number of clusters (Copt) using a density-based algorithm, which in turn minimizes the search needed to find Copt, especially in cases where a given data set does not follow the empirical rule. It then creates an initial fuzzy partition matrix based upon the newly suggested initial membership method. In addition, a novel multiobjective prioritization model (NDS-FCMPM) is proposed to achieve the performance goal of enhanced fault recognition. Initially, feature extraction is carried out by exploiting the dependencies between test cases; the test cases are then clustered using the proposed fuzzy clustering approach and finally prioritized using a newly developed prioritization algorithm. To validate the performance of the suggested fuzzy clustering algorithm, two performance measures, namely "Fuzzy Rand Index" and "Run Time", are exercised, and for the prioritization algorithm the "APFD" metric is analysed. The proposed model is assessed using Eclipse data extracted from a GitHub repository. The results indicate that NDB-FCM clustering provides more stable results in terms of classification accuracy, run time, and quick convergence when compared with other state-of-the-art techniques. It is also verified that NDS-FCMPM achieves an improved rate of fault identification at an early stage.
... heuristically driven to improve the rate of failure observation. The approximation of faults by failures is not new and has been followed by prior approaches [30], [31], [32]. We adopt this simplifying approximation (treating each failure as a different fault) for two reasons. ...
Article
Full-text available
The problem of test-case prioritization has been pursued for over three decades now and continues to be one of the active topics in software testing research. In this paper, we focus on a code-coverage based regression test-prioritization solution (Colosseum) that takes into account the position of changed (delta) code elements (basic-blocks) along the loop-free straight-line execution path of the regression test-cases. We propose a heuristic that logically associates each of these paths with three parameters: (i) the offset (displacement a) of the first delta from the starting basic-block, (ii) the offset (displacement c) of the last delta from the terminating basic-block, and (iii) the average scattering (displacement b) within all the intermediate basic-blocks. We hypothesize that a regression test-case path with a shorter overall displacement has a good chance of propagating the effects of the code-changes to the observable outputs in the program. Colosseum prioritizes test-cases with smaller overall displacements and executes them early in the regression test-execution cycle. The underlying intuition is that the probability of a test-case revealing a regression fault depends on the probability of the corresponding change propagation. The change in this context can potentially lead to an error. Extending this logic, delta displacement provides an approximation to failed error propagation. Evaluation on 20 open-source C projects from the Software-artifact Infrastructure Repository and GitHub (totaling 694,512 SLOC, 280 versions, and 69,305 test-cases) against four state-of-the-art prioritizations reveals that Colosseum outperforms the competitors with an overall 84.61% success rate in terms of 13 prioritization effectiveness metrics, the majority of which prefer to execute the top-k% prioritized test-cases.
... Proposed TCP approaches do not scale up to handle the test cases of large-size projects. There are industrial projects with test suites of millions of test cases, and simple heuristic TCP approaches are used instead [118]. ...
Article
Full-text available
Regression testing, as an important part of the software life cycle, ensures the validity of modified software. The focus of this research is on functional requirement-based 'Test Case Prioritization' (TCP), because requirement specifications help keep the software's correctness aligned with customers' perceived priorities. This research study aims to investigate requirement-based TCP approaches, regression testing aspects, applications used to validate proposed TCP approaches, the size of systems under regression testing, test case sizes and relevant revealed faults, TCP-related issues, and the types of primary studies. The researchers examined publications from between 2009 and 2019 within the seven most significant digital repositories. These repositories are popular and are mostly used for searching papers on topics in the software engineering domain. We performed a meticulous screening of research studies and selected 35 research papers through which to investigate the answers to the proposed research questions. The final outcome of this paper showed that functional requirement-based TCP approaches have been widely covered in primary studies. The results indicated that fault size and the number of test cases are the most discussed regression testing aspects within primary studies. This review also identified that the iTrust system is widely examined by researchers in primary studies. The paper's conclusion indicated that most of the primary studies have been demonstrated in real-world settings by the respective researchers. The findings of this "Systematic Literature Review" (SLR) reveal some suggestions to be undertaken in future research works, such as improving software quality and conducting evaluations of larger systems.
... There is no clear winner from these popular testing models and frameworks too [11]. The other factors which may affect the accuracy of prioritization techniques are the size of the software under test, the size of the test suites available for testing, the testing scenarios under these prioritization techniques, and the testing environment supporting these prioritization techniques [12,15,16]. The limitations of the previous frameworks for unit testing, together with these factors impacting the accuracy and usefulness of prioritization techniques, pose a challenge in the multiobjective and multicriterion test suite prioritization research space [17]. ...
Article
Full-text available
Validation of modified source code is done by regression testing. In regression testing, time and resources are limited, so we have to select a minimal set of test cases from the test suites to reduce execution time. The test case minimization process deals with optimizing regression testing by removing redundant test cases or prioritizing the test cases. This study proposes a test case prioritization approach based on multiobjective particle swarm optimization (MOPSO), considering minimum execution time, maximum fault detection ability, and maximum code coverage. The MOPSO algorithm is used for the prioritization of test cases with parameters including execution time, fault detection ability, and code coverage. Three datasets are selected to evaluate the proposed MOPSO technique: TreeDataStructure, JodaTime, and Triangle. The proposed MOPSO is compared with the no-ordering, reverse-ordering, and random-ordering techniques to evaluate its effectiveness. Higher values in the results represent the greater effectiveness and efficiency of the proposed MOPSO compared to the other approaches on the TreeDataStructure, JodaTime, and Triangle datasets. The results are presented in 100-index mode, ordered from low to high values; after that, test cases are prioritized. The experiment is conducted on three open-source Java applications and evaluated using the metrics of inclusiveness, precision, and size reduction of the test suite. The results revealed that all scenarios performed acceptably, and that the technique is 17% to 86% more effective in terms of inclusiveness, 33% to 85% more effective in terms of precision, and achieves 17% to 86% size reduction of the test suite.
... In previous studies [19], [20] we have shown that test code similarity can provide an effective instrument for test suite prioritization and reduction. Inspired by such studies, in this work we leverage test code similarity for identifying flaky tests. ...
Article
Full-text available
Context: Flaky tests plague regression testing in Continuous Integration environments by slowing down change releases and wasting testing time and effort. Despite the growing interest in mitigating the burden of test flakiness, how to efficiently and effectively detect flaky tests is still an open problem. Objective: In this study, we present and evaluate FLAST, an approach designed to statically predict test flakiness. FLAST leverages vector-space modeling, similarity search, dimensionality reduction, and k-Nearest Neighbor classification in order to timely and efficiently detect test flakiness. Method: In order to gain insights into the efficiency and effectiveness of FLAST, we conduct an empirical evaluation of the approach by considering 13 real-world projects, for a total of 1,383 flaky and 26,702 non-flaky tests. We carry out a quantitative comparison of FLAST with the state-of-the-art methods to detect test flakiness, by considering a balanced dataset comprising 1,402 real-world flaky and as many non-flaky tests. Results: From the results we observe that the effectiveness of FLAST is comparable with the state-of-the-art, while providing considerable gains in terms of efficiency. In addition, the results demonstrate how, by tuning the threshold of the approach, FLAST can be made more conservative, so as to reduce false positives at the cost of missing more potentially flaky tests. Conclusion: The collected results demonstrate that FLAST provides a fast, low-cost and reliable approach that can be used to guide test rerunning, or to gate the inclusion of new potentially flaky tests.
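The FLAST pipeline described above maps naturally onto off-the-shelf components. A rough sketch with scikit-learn, where the feature choice (token counts) and parameter values are assumptions rather than FLAST's published configuration:

```python
# Rough sketch of a FLAST-style static flakiness predictor: vectorize test
# source code, reduce dimensionality, and classify with k-nearest neighbors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def build_flakiness_predictor(test_sources, labels):
    """test_sources: list of test-code strings; labels: 1 = flaky, 0 = not."""
    model = make_pipeline(
        CountVectorizer(token_pattern=r"\w+"),      # vector-space model
        SparseRandomProjection(n_components=128),   # dimensionality reduction
        KNeighborsClassifier(n_neighbors=7),        # similarity search + kNN
    )
    model.fit(test_sources, labels)
    return model
```

Calling model.predict on the source of new tests then flags likely-flaky ones without ever running them, which is the static, low-cost property the abstract emphasizes.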
... Many studies have studied diversity in test generation [11,23,24,43]. For example, researchers have studied the diversity of test inputs and outputs [2]. ...
Preprint
There have been numerous studies on mining temporal specifications from execution traces. These approaches learn finite-state automata (FSA) from execution traces when running tests. To learn accurate specifications of a software system, many tests are required. Existing approaches generalize from a limited number of traces or use simple test generation strategies. Unfortunately, these strategies may not exercise uncommon usage patterns of a software system. To address this problem, we propose a new approach, adversarial specification mining, and develop a prototype, DICE (Diversity through Counter-Examples). DICE has two components: DICE-Tester and DICE-Miner. After mining Linear Temporal Logic specifications from an input test suite, DICE-Tester adversarially guides test generation, searching for counterexamples to these specifications to invalidate spurious properties. These counterexamples represent gaps in the diversity of the input test suite. This process produces execution traces of usage patterns that were unrepresented in the input test suite. Next, we propose a new specification inference algorithm, DICE-Miner, to infer FSAs using the traces, guided by the temporal specifications. We find that the inferred specifications are of higher quality than those produced by existing state-of-the-art specification miners. Finally, we use the FSAs in a fuzzer for servers of stateful protocols, increasing its coverage.
... Although Defects4J is widely used in SE experimentation [4,28] and has allowed us to know actual bugs and have more reliable BIC commits as compared to other solutions, we cannot guarantee that our results can be generalized to other contexts (e.g., Ray et al.'s dataset, industrial projects, non-Java projects, and so on) or to the universe of Java projects. When we planned our study, we had to reach a trade-off between the generalizability of results and the reliability of measures. ...
Chapter
Researchers have shown a growing interest in the affective states (i.e., emotions and moods) of developers while performing software engineering tasks. We investigate the association between developers’ sentiment polarity—i.e., negativity and positivity—and bug introduction. To pursue our research objective, we executed a case-control study in the Mining Software Repository (MSR) context. Our exposures are developers’ negativity and positivity captured, by using sentiment analysis, from commit comments of software repositories, while our “disease” is bug introduction—i.e., whether the changes of a commit introduce bugs. We found that developers’ negativity is associated with bug introduction, as is developers’ positivity. These findings seem to foster continuous monitoring of developers’ affective states so as to prevent the introduction of bugs or discover bugs as early as possible.
... A considerable amount of research has been conducted into regression testing techniques with a goal of improving the testing performance. This includes test case prioritization [1,50], reduction [51,52] and selection [53,54]. This Related Work section focuses on test case prioritization, which aims to detect faults as early as possible through the reordering of regression test cases [55,56]. ...
Article
Full-text available
Regression test case prioritization (RTCP) aims to improve the rate of fault detection by executing more important test cases as early as possible. Various RTCP techniques have been proposed based on different coverage criteria. Among them, a majority of techniques leverage code coverage information to guide the prioritization process, with code units being considered individually, and in isolation. In this paper, we propose a new coverage criterion, code combinations coverage, that combines the concepts of code coverage and combination coverage. We apply this coverage criterion to RTCP, as a new prioritization technique, code combinations coverage based prioritization (CCCP). We report on empirical studies conducted to compare the testing effectiveness and efficiency of CCCP with four popular RTCP techniques: total, additional, adaptive random, and search-based test prioritization. The experimental results show that even when the lowest combination strength is assigned, overall, the CCCP fault detection rates are greater than those of the other four prioritization techniques. The CCCP prioritization costs are also found to be comparable to the additional test prioritization technique. Moreover, our results also show that when the combination strength is increased, CCCP provides higher fault detection rates than the state-of-the-art, regardless of the levels of code coverage.
... A considerable amount of research has been conducted into regression testing techniques with a goal of improving the testing performance. This includes test case prioritization [1,50], reduction [51,52] and selection [53,54]. This Related Work section focuses on test case prioritization, which aims to detect faults as early as possible through the reordering of regression test cases [55,56]. ...
Preprint
Full-text available
Regression test case prioritization (RTCP) aims to improve the rate of fault detection by executing more important test cases as early as possible. Various RTCP techniques have been proposed based on different coverage criteria. Among them, a majority of techniques leverage code coverage information to guide the prioritization process, with code units being considered individually, and in isolation. In this paper, we propose a new coverage criterion, code combinations coverage, that combines the concepts of code coverage and combination coverage. We apply this coverage criterion to RTCP, as a new prioritization technique, code combinations coverage based prioritization (CCCP). We report on empirical studies conducted to compare the testing effectiveness and efficiency of CCCP with four popular RTCP techniques: total, additional, adaptive random, and search-based test prioritization. The experimental results show that even when the lowest combination strength is assigned, overall, the CCCP fault detection rates are greater than those of the other four prioritization techniques. The CCCP prioritization costs are also found to be comparable to the additional test prioritization technique. Moreover, our results also show that when the combination strength is increased, CCCP provides higher fault detection rates than the state-of-the-art, regardless of the levels of code coverage.
Article
Context: Software regression testing refers to rerunning test cases after the system under test is modified, ascertaining that the changes have not (re-)introduced failures. Not all researchers’ approaches consider applicability and scalability concerns, and not many have produced an impact in practice. Objective: One goal is to investigate industrial relevance and applicability of proposed approaches. Another is providing a live review, open to continuous updates by the community. Method: A systematic review of regression testing studies that are clearly motivated by or validated against industrial relevance and applicability is conducted. It is complemented by follow-up surveys with authors of the selected papers and 23 practitioners. Results: A set of 79 primary studies published between 2016-2022 is collected and classified according to approaches and metrics. Aspects relative to their relevance and impact are discussed, also based on their authors’ feedback. All the data are made available from the live repository that accompanies the study. Conclusions: While widely motivated by industrial relevance and applicability, not many approaches are evaluated in industrial or large-scale open-source systems, and even fewer approaches have been adopted in practice. Some challenges hindering the implementation of relevant approaches are synthesized, also based on the practitioners’ feedback.
Article
Test case prioritization (TCP) has been widely studied in regression testing, which aims to optimize the execution order of test cases so as to detect more faults earlier. TCP has been divided into white-box test case prioritization (WTCP) and black-box test case prioritization (BTCP). WTCP can achieve better prioritization effectiveness by utilizing source code information, but is not applicable in many practical scenarios (where source code is unavailable, e.g., outsourced testing). BTCP has the benefit of not relying on source code information, but tends to be less effective than WTCP. That is, both WTCP and BTCP suffer from limitations in practical use. To improve the practicability of TCP, we aim to explore better BTCP, significantly bridging the effectiveness gap between BTCP and WTCP. In this work, instead of statically analyzing test cases themselves as in existing BTCP techniques, we conduct the first study to explore whether this goal can be achieved via log analysis. Specifically, we propose to mine test logs produced during test execution to more sufficiently reflect test behaviors, and design a new BTCP framework (called LogTCP), including log pre-processing, log representation, and test case prioritization components. Based on the LogTCP framework, we instantiate seven log-based BTCP techniques by combining different log representation strategies with different prioritization strategies. We conduct an empirical study to explore the effectiveness of LogTCP. Based on 10 diverse open-source Java projects from GitHub, we compared LogTCP with three representative BTCP techniques and four representative WTCP techniques. Our results show that all of our LogTCP techniques largely perform better than all the BTCP techniques in average fault detection, to the extent that they become competitive with the WTCP techniques. This demonstrates the great potential of logs in practical TCP.
Chapter
Computer Science course instructors routinely have to create comprehensive test suites to assess programming assignments. The creation of such test suites is typically not trivial as it involves selecting a limited number of tests from a set of (semi-)randomly generated ones. Manual strategies for test selection do not scale when considering large testing inputs needed, for instance, for the assessment of algorithms exercises. To facilitate this process, we present TestSelector, a new framework for automatic selection of optimal test suites for student projects. The key advantage of TestSelector over existing approaches is that it is easily extensible with arbitrarily complex code coverage measures, not requiring these measures to be encoded into the logic of an exact constraint solver. We demonstrate the flexibility of TestSelector by extending it with support for a range of classical code coverage measures and using it to select test suites for a number of real-world algorithms projects, further showing that the selected test suites outperform randomly selected ones in finding bugs in students’ code.
Chapter
Chapter 3 focuses on transformation, vectorization, and optimization.
Article
Test case prioritization (TCP) aims to reorder the regression test suite with the goal of increasing the fault detection rate. Various TCP techniques have been proposed based on different prioritization strategies. Among them, greedy-based techniques are the most widely used. However, existing greedy-based techniques usually reorder all candidate test cases in each prioritization iteration, resulting in both efficiency and effectiveness problems. In this paper, we propose a generic partial attention mechanism, which adopts the previous priority values (i.e., the number of additionally-covered code units) to avoid considering all candidate test cases. Incorporating the mechanism with the additional-greedy strategy, we implement a novel coverage-based TCP technique based on partition ordering (OCP). OCP first groups the candidate test cases into different partitions and then updates the partitions in descending order. We conduct a comprehensive experiment on 19 versions of Java programs and 30 versions of C programs to compare the effectiveness and efficiency of OCP with six state-of-the-art TCP techniques: total-greedy, additional-greedy, lexicographical-greedy, unify-greedy, art-based, and search-based. The experimental results show that OCP achieves a better fault detection rate than the state-of-the-art techniques. Moreover, the time costs of OCP achieve an 85%-99% improvement over most of the state-of-the-art techniques.
Article
However, these tests are very expensive and too numerous to be run frequently within limited time constraints. In this paper, we investigate test case prioritization techniques to increase the ability to detect SDC regression faults with virtual tests earlier. Our empirical study conducted in the SDC domain shows that SDC-Prioritizer doubles the number of safety-critical failures that virtual tests can detect at the same level of execution time compared to baselines: random and greedy-based test case orderings.
Article
Full-text available
Software Test Case Prioritization (TCP) is an effective approach for regression testing to tackle time and budget constraints. The major benefit of TCP is to save time through the prioritization of important test cases first. Existing TCP techniques can be categorized as value-neutral and value-based approaches. In a value-based fashion, the cost of test cases and the severity of faults are considered, whereas in a value-neutral fashion they are not. The value-neutral fashion dominates the value-based fashion, and it assumes that all test cases have equal cost and all software faults have equal severity. But this assumption rarely holds in practice. Therefore, value-neutral TCP techniques are prone to produce unsatisfactory results. To overcome this research gap, a paradigm shift is required from value-neutral to value-based TCP techniques. Currently, very limited work is done in a value-based fashion and, to the best of the authors' knowledge, no comprehensive review of value-based cost-cognizant TCP techniques is available in the literature. To address this problem, a systematic literature review (SLR) of value-based cost-cognizant TCP techniques is presented in this paper. The core objective of this study is to consolidate the overall knowledge related to value-based cost-cognizant TCP techniques and to highlight some open research problems of this domain. Initially, 165 papers were reviewed from the prominent research repositories. Among these 165 papers, 21 papers were selected by using defined inclusion/exclusion criteria and quality assessment procedures. The established questions are answered through a thorough analysis of the selected papers by comparing their research contributions in terms of the algorithm used, the performance evaluation metric, and the results validation method used. In total, 12 papers used an algorithm for their technique, while 9 papers did not use any algorithm; the Particle Swarm Optimization (PSO) algorithm is the most used. For results validation, 4 methods are used: empirical study, experiment, case study, and industrial case study, with the experiment method the most common. In total, 6 performance evaluation metrics are used, with the APFDc metric the most common. This SLR finds that value-orientation and cost cognition are vital for the TCP process to achieve its intended goals, and that there is great research potential in this domain.
Article
Microservice-based applications consist of multiple services that can evolve independently. When a service must be updated, it is first tested with in-house regression test suites. However, the test suites that are executed are usually designed without exact knowledge about how the services will be accessed and used in the field; therefore, they may easily miss relevant test scenarios, failing to prevent the deployment of faulty services. To address this problem, we introduce ExVivoMicroTest, an approach that analyzes the execution of deployed services at run-time in the field, in order to generate test cases for future versions of the same services. ExVivoMicroTest implements lightweight monitoring and tracing capabilities, to inexpensively record executions that can be later turned into regression test cases that capture how services are used in the field. To prevent accumulating an excessive number of test cases, ExVivoMicroTest uses a test coverage model that can discriminate the recorded executions between the ones that are worth to be turned into test cases and the ones that should be discarded. The resulting test cases use a mocked environment that fully isolates the service under test from the rest of the system to faithfully replay interactions. We assessed ExVivoMicroTest with the PiggyMetrics and Train Ticket open source microservice applications and studied how different configurations of the monitoring and tracing logic impact the capability to generate test cases.
Article
Although regression testing is important to guarantee software quality during software evolution, it suffers from a widely known cost problem. To address this problem, researchers have made dedicated efforts on test prioritization, which optimizes the execution order of tests to detect faults earlier, while practitioners in industry have leveraged more computing resources to reduce the time cost of regression testing. Combining these two orthogonal solutions, in this article we define the problem of parallel test prioritization, which is to conduct test prioritization in the scenario of parallel test execution to reduce the cost of regression testing. Different from traditional sequential test prioritization, parallel test prioritization aims at generating a set of test sequences, each of which is allocated to an individual computing resource and executed in parallel. In particular, we propose eight parallel test prioritization techniques by adapting four existing sequential test prioritization techniques, both including and excluding testing time in prioritization. To investigate the performance of the eight parallel test prioritization techniques, we conducted an extensive study on 54 open-source projects and a case study on 16 commercial projects from Baidu, a search service provider with 600M monthly active users. According to the two studies, parallel test prioritization does improve the efficiency of regression testing, and the cost-aware additional parallel test prioritization technique significantly outperforms the others, indicating that it is a good choice for practical parallel testing. We also investigated the influence of two external factors, the number of computing resources and the time allowed for parallel testing, and found that more computing resources indeed improve the performance of parallel test prioritization. In addition, we investigated the influence of two further factors, test granularity and coverage criterion, and found that parallel test prioritization still accelerates regression testing in the parallel scenario. Moreover, we investigated the benefit of parallel test prioritization for the regression testing process of continuous integration, considering both the cumulative acceleration performance and the overhead of prioritization techniques; the results demonstrate the superiority of parallel test prioritization.
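As a concrete illustration of the allocation step (our own minimal sketch, not one of the eight techniques from the article; `prioritized_tests` and `cost` are hypothetical names), a sequentially prioritized order can be split across k computing resources by always handing the next test to the least-loaded worker:

```python
# Illustrative sketch: split an already-prioritized test order across
# k workers, always giving the next test to the least-loaded worker,
# so high-priority tests still start as early as possible.
import heapq

def parallelize(prioritized_tests, cost, k):
    """prioritized_tests: tests in sequential priority order.
    cost: dict test -> execution time (hypothetical). Returns k sequences."""
    sequences = [[] for _ in range(k)]
    heap = [(0.0, i) for i in range(k)]  # (accumulated time, worker index)
    heapq.heapify(heap)
    for test in prioritized_tests:
        load, i = heapq.heappop(heap)
        sequences[i].append(test)
        heapq.heappush(heap, (load + cost[test], i))
    return sequences
```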
Article
Test case prioritization (TCP) aims at scheduling test case execution so that more important test cases are executed as early as possible. Many TCP techniques have been proposed according to different concepts and principles, with dissimilarity-based TCP (DTCP) prioritizing tests based on the concept of test case dissimilarity: DTCP chooses the next test case from a set of candidates such that the chosen test case is farther away from previously selected test cases than the other candidates. DTCP techniques typically use only one aspect or granularity of the information or features from test cases to support the prioritization process. In this article, we adopt the concept of data fusion to propose a new family of DTCP techniques, data-fusion-driven DTCP (DDTCP), which uses different information granularities for prioritizing test cases by dissimilarity. We performed an empirical study involving 30 versions of five subject programs, investigating testing effectiveness and efficiency by comparing DDTCP against DTCP techniques that use a single dissimilarity granularity. The experimental results show that DDTCP not only has better fault-detection rates than single-granularity DTCP techniques, but also incurs only similar prioritization costs. The results also show that DDTCP remains robust over multiple system releases.
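To make the dissimilarity idea concrete, here is a minimal farthest-first sketch with naive data fusion by averaging per-granularity Jaccard distances; this is an illustration under our own simplifying assumptions, not the DDTCP algorithm itself:

```python
# Farthest-first prioritization with naive data fusion: each test carries
# one feature set per granularity (e.g. statement, branch, method), and
# per-granularity Jaccard distances are fused by averaging.

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def fused_distance(feats1, feats2):
    # feats1, feats2: lists of feature sets, one set per granularity.
    return sum(jaccard_distance(a, b)
               for a, b in zip(feats1, feats2)) / len(feats1)

def prioritize(tests):
    # tests: dict test name -> list of per-granularity feature sets.
    remaining = set(tests)
    order = [remaining.pop()]  # seed with an arbitrary test
    while remaining:
        # pick the candidate maximizing its minimum distance to the selected
        best = max(remaining, key=lambda c: min(
            fused_distance(tests[c], tests[s]) for s in order))
        order.append(best)
        remaining.remove(best)
    return order
```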
Article
Full-text available
End-to-end tests present many challenges in industry. Their long running times make it hard to apply research work on test case prioritization or test case selection to them, as most works on these two problems are based on datasets of unit tests, which are fast to run and for which time is usually not a considered criterion. One reason is that no dataset of end-to-end tests exists, due to the infrastructure needed to run this kind of test, the complexity of the setup, and the lack of proper characterization of the faults and their fixes. Running end-to-end tests for research is therefore hard and time-consuming, and the availability of a dataset containing regression bugs, documentation, and logs for such tests might foster the use of end-to-end tests in research. This paper presents a) a dataset for this kind of test, including six well-documented, manually injected regression bugs and their corresponding fixes in three web applications built using Java and the Spring framework; b) tools for easing the execution of these tests regardless of the infrastructure; and c) a comparative study with two well-known datasets of unit tests. The comparative study shows that there are important differences between end-to-end and unit tests, such as their execution time and the amount of resources they consume, which are much higher for end-to-end tests. End-to-end testing deserves more attention from researchers. Our dataset is a first effort toward easing the use of end-to-end tests in research.
Article
Full-text available
Test case prioritization (TCP) is widely held to improve testing efficiency, especially in regression testing, where retesting everything is costly. TCP schedules the test case execution order to detect bugs faster. For this benefit, test case prioritization has been intensively studied. This paper reviews the development of TCP for regression testing, covering 48 papers from 2017 to 2020. We present four critical surveys: first, the development of approaches and techniques in regression TCP studies; second, the identification of variations in the software under test (SUT) used in TCP studies; third, the trend of metrics used to measure the effectiveness of TCP studies; and fourth, the state of the art of requirements-based TCP. Furthermore, we discuss development opportunities and potential future directions for regression TCP. Our review provides evidence that TCP attracts increasing interest. We also found that utilizing requirements information would help prepare test cases earlier and improve TCP effectiveness.
Thesis
Full-text available
Architectural technical debt (ATD) in a software-intensive system is the sum of all design choices that may have been suitable or even optimal at the time they were made, but which today significantly impede progress: structure, framework, technology, languages, etc. Unlike code-level technical debt, which can be readily detected by static analysers and can often be refactored with minimal or only incremental effort, architectural debt is hard to detect, and its remediation is wide-ranging, daunting, and often avoided. The objective of this thesis is to develop a better understanding of architectural technical debt and to determine what strategies can be used to identify and manage it. To do so, we adopt a wide range of research techniques, including literature reviews, case studies, interviews with practitioners, and grounded theory. The result of our investigation, deeply grounded in empirical data, advances the field not only by providing novel insights into ATD-related phenomena, but also by presenting approaches to pro-actively identify ATD instances, leading to their eventual management and resolution.
Article
Context: User interface testing validates the correctness of an application through visual cues and interactive events emitted in real-world usage. Performing user interface tests is a time-consuming process, and thus many studies have focused on prioritizing test cases to help maintain the effectiveness of testing while reducing the need for a full execution. Objective: This paper describes a novel test prioritization method called RLTCP, whose goal is to maximize the number of faults detected while reducing the amount of testing. Methods: We define a weighted coverage graph to model the underlying associations among test cases for user interface testing. Our method combines Reinforcement Learning (RL) and the coverage graph to prioritize test cases. While RL has been found to be suitable for rapidly changing projects with abundant historical data, the coverage graph considers the event-based aspects of user interface testing in depth and provides a fine-grained level at which the RL system can gain more insight into individual test cases. Results: We experiment with and assess the proposed method using nine data sets obtained from two mature web applications, finding that the method outperforms six existing methods, including the state of the art. Conclusions: Using both reinforcement learning and the underlying structure of user interface tests modeled via the coverage graph has the potential to improve the performance of test prioritization methods. Our study also shows the benefit of using the coverage graph to gain insight into test cases, their relationships, and execution history.
Article
There have been numerous studies on mining temporal specifications from execution traces. These approaches learn finite-state automata (FSA) from execution traces when running tests. To learn accurate specifications of a software system, many tests are required. Existing approaches generalize from a limited number of traces or use simple test generation strategies. Unfortunately, these strategies may not exercise uncommon usage patterns of a software system. To address this problem, we propose a new approach, adversarial specification mining, and develop a prototype, Diversity through Counter-examples (DICE). DICE has two components: DICE-Tester and DICE-Miner. After mining Linear Temporal Logic specifications from an input test suite, DICE-Tester adversarially guides test generation, searching for counterexamples to these specifications to invalidate spurious properties. These counterexamples represent gaps in the diversity of the input test suite. This process produces execution traces of usage patterns that were unrepresented in the input test suite. Next, we propose a new specification inference algorithm, DICE-Miner, to infer FSAs using the traces, guided by the temporal specifications. We find that the inferred specifications are of higher quality than those produced by existing state-of-the-art specification miners. Finally, we use the FSAs in a fuzzer for servers of stateful protocols, increasing its coverage.
Article
Software engineers regularly face the question of "how much testing is enough". Existing approaches in software testing management employ experience-, risk-, or value-based analysis to prioritize and manage testing processes. However, very few are applicable to the emerging crowdtesting paradigm, which must cope with extremely limited information about, and control over, unknown online crowdworkers. In practice, deciding when to close a crowdtesting task is largely done by experience-based guesswork and frequently results in ineffective crowdtesting; indeed, an average of 32% of testing cost was found to be wasteful spending in current crowdtesting practice. This article addresses this challenge by introducing automated decision support for monitoring crowdtesting tasks and determining the appropriate time to close them. To that end, it first investigates the necessity and feasibility of predicting the close time of crowdtesting tasks based on an industrial dataset. Next, it proposes a close prediction approach named iSENSE2.0, which applies an incremental sampling technique to process crowdtesting reports arriving in chronological order, organizing them into fixed-size groups as dynamic inputs. A duplicate tagger then analyzes the duplicate status of received crowd reports, and a CRC-based (Capture-ReCapture) close estimator generates the close decision based on the dynamic bug-arrival status. In addition, a coverage-based sanity checker is designed to reinforce the stability and performance of close prediction. Finally, iSENSE2.0 is evaluated on 56,920 reports from 306 crowdtesting tasks from one of the largest crowdtesting platforms. The results show that a median of 100% of bugs can be detected with 30% of cost saved. The performance of iSENSE2.0 does not differ significantly from that of the state-of-the-art approach iSENSE, while the latter relies on duplicate tags, which are generally considered time-consuming and tedious to obtain.
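The capture-recapture component can be illustrated with the classic Lincoln-Petersen estimator (our illustration of the general CRC idea, not iSENSE2.0's exact model): if two samples of reports mark n1 and n2 distinct bugs and share m of them, the total bug population is estimated as n1·n2/m.

```python
# Lincoln-Petersen capture-recapture estimate of the total bug population:
# sample1 and sample2 are sets of (de-duplicated) bug identifiers observed
# in two groups of crowdtesting reports.
def estimate_total_bugs(sample1, sample2):
    recaptured = len(sample1 & sample2)
    if recaptured == 0:
        return float("inf")  # no overlap: the population cannot be bounded
    return len(sample1) * len(sample2) / recaptured

# Closing-rule sketch: stop the task once the bugs seen so far approach
# the estimate, e.g. len(sample1 | sample2) >= 0.95 * estimate.
```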
Conference Paper
Full-text available
Although white-box regression test prioritization has been well-studied, the more recently introduced black-box prioritization approaches have neither been compared against each other nor against more well-established white-box techniques. We present a comprehensive experimental comparison of several test prioritization techniques, including well-established white-box strategies and more recently introduced black-box approaches. We found that Combinatorial Interaction Testing and diversity-based techniques (Input Model Diversity and Input Test Set Diameter) perform best among the black-box approaches. Perhaps surprisingly, we found little difference between black-box and white-box performance (at most 4% fault detection rate difference). We also found the overlap between black- and white-box faults to be high: the first 10% of the prioritized test suites already agree on at least 60% of the faults found. These are positive findings for practicing regression testers who may not have source code available, thereby making white-box techniques inapplicable. We also found evidence that both black-box and white-box prioritization remain robust over multiple system releases.
Article
Full-text available
Large open and closed source organizations like Google, Facebook and Mozilla are migrating their products towards rapid releases. While this allows faster time-to-market and user feedback, it also implies less time for testing and bug fixing. Since initial research results indeed show that rapid releases fix proportionally fewer reported bugs than traditional releases, this paper investigates the changes in software testing effort after moving to rapid releases through a case study on Mozilla Firefox, and performs a semi-systematic literature review. The case study analyzes the results of 312,502 execution runs of the 1,547 mostly manual system-level test cases of Mozilla Firefox from 2006 to 2012 (5 major traditional and 9 major rapid releases), and triangulates our findings with a Mozilla QA engineer. We find that rapid releases have a narrower test scope that enables a deeper investigation of the features and regressions with the highest risk. Furthermore, rapid releases make testing more continuous and have proportionally smaller spikes before the main release. However, rapid releases make it more difficult to build a large testing community, decrease test suite diversity, and make testing more deadline-oriented. In addition, our semi-systematic literature review presents the benefits, problems and enablers of rapid releases from 24 papers found using systematic search queries and a similar number of papers found through other means. The literature review shows that rapid releases are a prevalent industrial practice, utilized even in some highly critical domains of software engineering, and that rapid releases originated from several software development methodologies such as agile, open source, lean and internet-speed software development. However, empirical studies providing evidence of the claimed advantages and disadvantages of rapid releases are scarce.
Article
Full-text available
A common and natural intuition among software testers is that test cases need to differ if a software system is to be tested properly and its quality ensured. Consequently, much research has gone into formulating distance measures for how test cases, their inputs and/or their outputs differ. However, common to these proposals is that they are data type specific and/or calculate the diversity only between pairs of test inputs, traces or outputs. We propose a new metric to measure the diversity of sets of tests: the test set diameter (TSDm). It extends our earlier, pairwise test diversity metrics based on recent advances in information theory regarding the calculation of the normalized compression distance (NCD) for multisets. An advantage is that TSDm can be applied regardless of data type and on any test-related information, not only the test inputs. A downside is the increased computational time compared to competing approaches. Our experiments on four different systems show that the test set diameter can help select test sets with higher structural and fault coverage than random selection even when only applied to test inputs. This can enable early test design and selection, prior to even having a software system to test, and complement other types of test automation and analysis. We argue that this quantification of test set diversity creates a number of opportunities to better understand software quality and provides practical ways to increase it.
Conference Paper
Full-text available
Empirical studies in software testing research may not be comparable, reproducible, or characteristic of practice. One reason is that real bugs are too infrequently used in software testing research. Extracting and reproducing real bugs is challenging, and as a result hand-seeded faults or mutants are commonly used as a substitute. This paper presents Defects4J, a database and extensible framework providing real bugs to enable reproducible studies in software testing research. The initial version of Defects4J contains 357 real bugs from 5 real-world open source programs. Each real bug is accompanied by a comprehensive test suite that can expose (demonstrate) that bug. Defects4J is extensible and builds on top of each program's version control system. Once a program is configured in Defects4J, new bugs can be added to the database with little or no effort. Defects4J features a framework to easily access faulty and fixed program versions and corresponding test suites. This framework also provides a high-level interface to common tasks in software testing research, making it easy to conduct and reproduce empirical studies. Defects4J is publicly available at http://defects4j.org.
Article
Full-text available
Test suites often grow very large over many releases, such that it is impractical to re-execute all test cases within limited resources. Test case prioritization rearranges test cases to improve the effectiveness of testing. Code coverage has been widely used as a criterion in test case prioritization. However, simple coverage criteria may not reveal some bugs, decreasing the fault detection rate. In this paper, we use the ordered sequences of program entities to improve the effectiveness of test case prioritization. The execution frequency profiles of test cases are collected and transformed into ordered sequences. We propose several novel similarity-based test case prioritization techniques based on the edit distances of ordered sequences. An empirical study of five open source programs was conducted. The experimental results show that our techniques can significantly increase the fault detection rate and are effective in detecting faults in loops. Moreover, our techniques are more cost-effective than the existing techniques.
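As a building block, the edit distance between two ordered entity sequences can be computed with standard dynamic programming (a minimal sketch, not the authors' optimized implementation) and then used as the dissimilarity in a farthest-first prioritization loop:

```python
# Levenshtein edit distance between two ordered sequences of program
# entities (e.g. executed methods), usable as a test-pair dissimilarity.
def edit_distance(seq1, seq2):
    prev = list(range(len(seq2) + 1))
    for i, x in enumerate(seq1, 1):
        curr = [i]
        for j, y in enumerate(seq2, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("abcab", "abab"))  # 1: one deletion
```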
Conference Paper
Full-text available
Regression testing in a continuous integration environment is bounded by tight time constraints. To satisfy time constraints and achieve testing goals, test cases must be efficiently ordered for execution. Prioritization techniques are commonly used to order test cases to reflect their importance according to one or more criteria, such as reduced time to test or a high fault detection rate. In this paper, we present a case study of a test prioritization approach, ROCKET (Prioritization for Continuous Regression Testing), to improve the efficiency of continuous regression testing of industrial video conferencing software. ROCKET orders test cases based on historical failure data, test execution time, and domain-specific heuristics. It uses a weighted function to compute test priority. The weights are higher if tests uncovered regression faults in recent iterations of software testing and reduce the time to fault detection. The results of the study show that the test cases prioritized using ROCKET (1) provide faster fault detection, and (2) increase the regression fault detection rate, revealing 30% more faults for 20% of the test suite executed, compared to manually prioritized test cases.
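A minimal sketch of this kind of weighted scoring (the weights, history window, and time normalization below are illustrative assumptions, not the calibrated values from the study):

```python
# Illustrative weighted priority: failures in recent iterations weigh
# more, and cheaper tests are preferred so faults are reached sooner.
def priority(failure_history, exec_time, weights=(0.7, 0.2, 0.1)):
    """failure_history: 0/1 outcomes, most recent first (1 = failed).
    exec_time: seconds. weights: hypothetical recency weights."""
    score = sum((weights[i] if i < len(weights) else 0.05) * failed
                for i, failed in enumerate(failure_history))
    return score / (1.0 + exec_time)

# Tests are then scheduled in decreasing order of priority().
```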
Article
Full-text available
Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of finite multisets (a.k.a. multiples) of finite objects that is also a metric. Previous attempts to obtain such an NCD failed. We cover the entire trajectory from theoretical underpinning to feasible practice. The new NCD for multisets is applied to retinal progenitor cell classification questions and to related synthetically generated data that were earlier treated with the pairwise NCD. With the new method we achieved significantly better results, and similarly for questions about axonal organelle transport. We also applied the new NCD to handwritten digit recognition and improved classification accuracy significantly over that of the pairwise NCD by incorporating both the pairwise NCD and the NCD for multisets. In the analysis we use the incomputable Kolmogorov complexity, which for practical purposes is approximated from above by the length of the compressed version of the file involved, using a real-world compression program. Index terms: normalized compression distance, multisets or multiples, pattern recognition, data mining, similarity, classification, Kolmogorov complexity, retinal progenitor cells, synthetic data, organelle transport, handwritten character recognition.
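In practice, the pairwise NCD is approximated with an off-the-shelf compressor standing in for Kolmogorov complexity; a minimal sketch using zlib (the compressor choice and simple concatenation are simplifications):

```python
import zlib

def c(data: bytes) -> int:
    """Approximate Kolmogorov complexity by compressed length."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Similar inputs give an NCD near 0; unrelated inputs approach 1.
print(ncd(b"abcabcabc" * 50, b"abcabcabc" * 50))
print(ncd(b"abcabcabc" * 50, b"xyz123!?.." * 45))
```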
Article
Full-text available
An adaptive random (AR) testing strategy has recently been developed and examined by a growing body of research. More recently, this strategy has been applied to prioritizing regression test cases based on code coverage using the concepts of Jaccard Distance (JD) and Coverage Manhattan Distance (CMD). Code coverage, however, does not consider frequency; furthermore, no comparison between JD and CMD had yet been made. This research fills the gap by first investigating the fault-detection capabilities of using frequency information for AR test case prioritization, and then comparing JD and CMD. Experimental results show that "coverage" was more useful than "frequency", although the latter can sometimes complement the former, and that CMD was superior to JD. It is also found that, for certain faults, the conventional "additional" algorithm (widely accepted as one of the best algorithms for test case prioritization) can perform much worse than random testing on large test suites.
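To make the two distances concrete (a sketch over hypothetical per-entity coverage vectors): Jaccard Distance only compares which entities are covered, while Coverage Manhattan Distance compares the vectors element-wise and can therefore exploit execution frequencies when counts are supplied.

```python
# Jaccard Distance over covered-entity sets versus Coverage Manhattan
# Distance over per-entity vectors (binary flags or execution counts).
def jaccard_distance(cov_a, cov_b):
    a = {i for i, v in enumerate(cov_a) if v}
    b = {i for i, v in enumerate(cov_b) if v}
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def coverage_manhattan(cov_a, cov_b):
    return sum(abs(x - y) for x, y in zip(cov_a, cov_b))

t1 = [3, 0, 1, 0]  # hypothetical execution counts per code entity
t2 = [1, 0, 1, 2]
print(jaccard_distance(t1, t2))    # 0.33...: ignores how often entities ran
print(coverage_manhattan(t1, t2))  # 4: sensitive to frequencies
```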
Article
Full-text available
Random testing is not only a useful testing technique in itself, but also plays a core role in many other testing methods. Hence, any significant improvement to random testing has an impact throughout the software testing community. Recently, Adaptive Random Testing (ART) was proposed as an effective alternative to random testing. This paper presents a synthesis of the most important research results related to ART. In the course of our research and through further reflection, we have realised how the techniques and concepts of ART can be applied in a much broader context, which we present here. We believe such ideas can be applied in a variety of areas of software testing, and even beyond software testing. Amongst these ideas, we particularly note the fundamental role of diversity in test case selection strategies. We hope this paper serves to provoke further discussions and investigations of these ideas.
Conference Paper
Full-text available
Regression testing assures changed programs against unintended amendments. Rearranging the execution order of test cases is a key idea to improve their effectiveness. Paradoxically, many test case prioritization techniques resolve tie cases using the random selection approach, and yet random ordering of test cases has been considered ineffective. Existing unit testing research unveils that adaptive random testing (ART) is a promising candidate that may replace random testing (RT). In this paper, we not only propose a new family of coverage-based ART techniques, but also show empirically that they are statistically superior to the RT-based technique in detecting faults. Furthermore, one of the ART prioritization techniques is consistently comparable to some of the best coverage-based prioritization techniques (namely, the "additional" techniques) and yet involves much less time cost. Keywords: adaptive random testing; test case prioritization.
Article
Full-text available
Test case prioritisation aims at finding an ordering which enhances a certain property of an ordered test suite. Traditional techniques rely on the availability of code or a specification of the program under test. We propose to use string distances on the text of test cases for their comparison and elaborate a prioritisation algorithm. Such a prioritisation does not require code or a specification and can be useful for initial testing and in cases when code is difficult to instrument. In this paper, we also report on experiments performed on the “Siemens Test Suite”, where the proposed prioritisation technique was compared with random permutations and four classical string distance metrics were evaluated. The obtained results, confirmed by a statistical analysis, indicate that prioritisation based on string distances is more efficient in finding defects than random ordering of the test suite: the test suites prioritized using string distances are more efficient in detecting the strongest mutants, and, on average, have a better APFD than randomly ordered test suites. The results suggest that string distances can be used for prioritisation purposes, and Manhattan distance could be the best choice.
Conference Paper
The large body of existing research on Test Case Prioritization (TCP) techniques can be broadly classified into two categories: dynamic techniques (that rely on run-time execution information) and static techniques (that operate directly on source and test code). Absent from this body of work is a comprehensive study aimed at understanding and evaluating the static approaches and comparing them to dynamic approaches on a large set of projects. In this work, we perform the first extensive study aimed at empirically evaluating four static TCP techniques, comparing them with state-of-research dynamic TCP techniques at different test-case granularities (e.g., method and class level) in terms of effectiveness, efficiency and similarity of faults detected. This study was performed on 30 real-world Java programs encompassing 431 KLoC. In terms of effectiveness, we find that the static call-graph-based technique outperforms the other static techniques at test-class level, but the topic-model-based technique performs better at test-method level. In terms of efficiency, the static call-graph-based technique is also the most efficient when compared to other static techniques. When examining the similarity of faults detected by the four static techniques compared to the four dynamic ones, we find that, on average, the faults uncovered by these two groups of techniques are quite dissimilar, with the top 10% of test cases agreeing on only ≈ 25%-30% of detected faults. This prompts further research into the severity/importance of faults uncovered by these techniques, and into the potential for combining static and dynamic information for more effective approaches.
Conference Paper
Modern cloud-software providers, such as Salesforce.com, increasingly adopt large-scale continuous integration environments. In such environments, assuring high developer productivity is strongly dependent on conducting testing efficiently and effectively. Specifically, to shorten feedback cycles, test prioritization is popularly used as an optimization mechanism for ranking tests to run by their likelihood of revealing failures. To apply test prioritization in industrial environments, we present a novel approach (tailored for practical applicability) that integrates multiple existing techniques via a systematic framework of machine learning to rank. Our initial empirical evaluation on a large real-world dataset from Salesforce.com shows that our approach significantly outperforms existing individual techniques.
Conference Paper
A common and natural intuition among software testers is that test cases need to differ if a software system is to be tested properly and its quality ensured. Consequently, much research has gone into formulating distance measures for how test cases, their inputs and/or their outputs differ. However, common to these proposals is that they are data type specific and/or calculate the diversity only between pairs of test inputs, traces or outputs. We propose a new metric to measure the diversity of sets of tests: the test set diameter (TSDm). It extends our earlier, pairwise test diversity metrics based on recent advances in information theory regarding the calculation of the normalized compression distance (NCD) for multisets. A key advantage is that TSDm is a universal measure of diversity and so can be applied to any test set regardless of data type of the test inputs (and, moreover, to other test-related data such as execution traces). But this universality comes at the cost of greater computational effort compared to competing approaches. Our experiments on four different systems show that the test set diameter can help select test sets with higher structural and fault coverage than random selection even when only applied to test inputs. This can enable early test design and selection, prior to even having a software system to test, and complement other types of test automation and analysis. We argue that this quantification of test set diversity creates a number of opportunities to better understand software quality and provides practical ways to increase it.
Article
Test-case prioritization, proposed at the end of the last century, aims to schedule the execution order of test cases so as to improve test effectiveness. In the past years, test-case prioritization has gained much attention and achieved significant results in five aspects: prioritization algorithms, coverage criteria, measurement, practical concerns involved, and application scenarios. In this article, we first review the achievements of test-case prioritization from these five aspects and then give our perspectives on its challenges.
Conference Paper
In continuous integration development environments, software engineers frequently integrate new or changed code with the mainline codebase. This can reduce the amount of code rework that is needed as systems evolve and speed up development time. While continuous integration processes traditionally require that extensive testing be performed following the actual submission of code to the codebase, it is also important to ensure that enough testing is performed prior to code submission to avoid breaking builds and delaying the fast feedback that makes continuous integration desirable. In this work, we present algorithms that make continuous integration processes more cost-effective. In an initial pre-submit phase of testing, developers specify modules to be tested, and we use regression test selection techniques to select a subset of the test suites for those modules that render that phase more cost-effective. In a subsequent post-submit phase of testing, where dependent modules as well as changed modules are tested, we use test case prioritization techniques to ensure that failures are reported more quickly. In both cases, the techniques we utilize are novel, involving algorithms that are relatively inexpensive and do not rely on code coverage information -- two requirements for conducting testing cost-effectively in this context. To evaluate our approach, we conducted an empirical study on a large data set from Google that we make publicly available. The results of our study show that our selection and prioritization techniques can each lead to cost-effectiveness improvements in the continuous integration process.
Book
Written by leading authorities in database and Web technologies, this book is essential reading for students and practitioners alike. The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and can be applied successfully to even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. Other chapters cover the PageRank idea and related tricks for organizing the Web, the problems of finding frequent itemsets and clustering. This second edition includes new and extended coverage on social networks, machine learning and dimensionality reduction.
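Since the FAST techniques of the main paper borrow exactly this machinery, a minhash sketch may help; this is a minimal illustration (the hash family, parameter choices, and token-level shingling are assumptions; production LSH additionally bands the signatures into buckets):

```python
# Minhashing: compress a set (e.g. the tokens of a test case) into a short
# signature whose positionwise agreement estimates Jaccard similarity.
import random

random.seed(42)
N_HASHES, PRIME = 100, (1 << 61) - 1
PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(N_HASHES)]

def minhash(items):
    # Note: Python's hash() is salted per process, so signatures are
    # only comparable within a single run.
    hashed = [hash(x) for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in PARAMS]

def estimated_jaccard(sig1, sig2):
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

s1 = minhash("the quick brown fox jumps".split())
s2 = minhash("the quick brown dog jumps".split())
print(estimated_jaccard(s1, s2))  # roughly the true Jaccard, 4/6
```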
Chapter
The experiment data from the operation phase is the input to analysis and interpretation. After collecting experimental data in the operation phase, we want to be able to draw conclusions based on this data; to draw valid conclusions, we must interpret the experiment data.
Article
The increase in size and complexity of modern software systems requires scalable, systematic, and automated testing approaches. Model-based testing (MBT), as a systematic and automated test case generation technique, is being successfully applied to verify industrial-scale systems and is supported by commercial tools. However, scalability is still an open issue for large systems, as in practice there are limits to the amount of testing that can be performed in industrial contexts. Even with standard coverage criteria, the resulting test suites generated by MBT techniques can be very large and expensive to execute, especially for system level testing on real deployment platforms and network facilities. Therefore, a scalable MBT technique should be flexible regarding the size of the generated test suites and should be easily accommodated to fit resource and time constraints. Our approach is to select a subset of the generated test suite in such a way that it can be realistically executed and analyzed within the time and resource constraints, while preserving the fault revealing power of the original test suite to a maximum extent. In this article, to address this problem, we introduce a family of similarity-based test case selection techniques for test suites generated from state machines. We evaluate 320 different similarity-based selection techniques and then compare the effectiveness of the best similarity-based selection technique with other common selection techniques in the literature. The results based on two industrial case studies, in the domain of embedded systems, show significant benefits and a large improvement in performance when using a similarity-based approach. We complement these analyses with further studies on the scalability of the technique and the effects of failure rate on its effectiveness. We also propose a method to identify optimal tradeoffs between the number of test cases to run and fault detection.
Conference Paper
The importance of using requirements information in the testing phase has been well recognized by the requirements engineering community, but to date, a vast majority of regression testing techniques have primarily relied on software code information. Incorporating requirements information into the current testing practice could help software engineers identify the source of defects more easily, validate the product against requirements, and maintain software products in a holistic way. In this paper, we investigate whether the requirements-based clustering approach that incorporates traditional code analysis information can improve the effectiveness of test case prioritization techniques. To investigate the effectiveness of our approach, we performed an empirical study using two Java programs with multiple versions and requirements documents. Our results indicate that the use of requirements information during the test case prioritization process can be beneficial.
Article
Test case prioritization techniques, which are used to improve the cost-effectiveness of regression testing, order test cases in such a way that those cases expected to outperform others in detecting software faults are run earlier in the testing phase. The objective of this study is to examine what kinds of techniques have been widely used in papers on this subject, determine which aspects of test case prioritization have been studied, provide a basis for the improvement of test case prioritization research, and evaluate the current trends of this research area. We searched for papers in the following five electronic databases: IEEE Xplore, ACM Digital Library, Science Direct, Springer, and Wiley. Initially, the search string retrieved 202 studies, but upon further examination of titles and abstracts, 120 papers were identified as related to test case prioritization. There exists a large variety of prioritization techniques in the literature, with coverage-based prioritization techniques (i.e., prioritization in terms of the number of statements, basic blocks, or methods test cases cover) dominating the field. The proportion of papers on model-based techniques is on the rise, yet the growth rate is still slow. The proportion of papers that use datasets from industrial projects is 64%, while those that utilize public datasets for validation are only 38%. On the basis of this study, the following recommendations are provided for researchers: (1) give preference to public datasets rather than proprietary datasets; (2) develop more model-based prioritization methods; (3) conduct more studies comparing prioritization methods; (4) always evaluate the effectiveness of the proposed technique with well-known evaluation metrics and compare its performance with existing methods; (5) publish surveys and systematic review papers on test case prioritization; and (6) use datasets from industrial projects that represent real industrial problems.
Article
During regression testing, a modified system is often retested using an existing test suite. Since the size of the test suite may be very large, testers are interested in detecting faults in the modified system as early as possible during this retesting process. Test prioritization attempts to order tests for execution so that the chances of early detection of faults during retesting are increased. The existing prioritization methods are based on the source code of the system under test. In this paper, we present and evaluate two model-based selective methods and a dependence-based method of test prioritization utilizing the state-based model of the system under test. These methods assume that the modifications are made both on the system under test and its model. The existing test suite is executed on the system model and information about this execution is used to prioritize tests. Execution of the model is inexpensive as compared with execution of the system under test; therefore, the overhead associated with test prioritization is relatively small. In addition, we present an analytical framework for evaluation of test prioritization methods. This framework may reduce the cost of evaluation as compared with the framework that is based on observation. We have performed an empirical study in which we compared different test prioritization methods. The results of the empirical study suggest that system models may improve the effectiveness of test prioritization with respect to early fault detection.
Conference Paper
Our experience with applying model-based testing on industrial systems showed that the generated test suites are often too large and costly to execute given project deadlines and the limited resources for system testing on real platforms. In such industrial contexts, it is often the case that only a small subset of test cases can be run. In previous work, we proposed novel test case selection techniques that minimize the similarities among selected test cases and outperforms other selection alternatives. In this paper, our goal is to gain insights into why and under which conditions similarity-based selection techniques, and in particular our approach, can be expected to work. We investigate the properties of test suites with respect to similarities among fault revealing test cases. We thus identify the ideal situation in which a similarity-based selection works best, which is useful for devising more effective similarity functions. We also address the specific situation in which a test suite contains outliers, that is a small group of very different test cases, and show that it decreases the effectiveness of similarity-based selection. We then propose, and successfully evaluate based on two industrial systems, a solution based on rank scaling to alleviate this problem.
Article
Test case selection in model-based testing is discussed focusing on the use of a similarity function. Automatically generated test suites usually have redundant test cases. The reason is that test generation algorithms are usually based on structural coverage criteria that are applied exhaustively. These criteria may not be helpful to detect redundant test cases as well as the suites are usually impractical due to the huge number of test cases that can be generated. Both problems are addressed by applying a similarity function. The idea is to keep in the suite the less similar test cases according to a goal that is defined in terms of the intended size of the test suite. The strategy presented is compared with random selection by considering transition-based and fault-based coverage. The results show that, in most of the cases, similarity-based selection can be more effective than random selection when applied to automatically generated test suites. Copyright © 2009 John Wiley & Sons, Ltd.
Article
Where the creation, understanding, and assessment of software testing and regression testing techniques are concerned, controlled experimentation is an indispensable research methodology. Obtaining the infrastructure necessary to support such experimentation, however, is difficult and expensive. As a result, progress in experimentation with testing techniques has been slow, and empirical data on the costs and effectiveness of techniques remains relatively scarce. To help address this problem, we have been designing and constructing infrastructure to support controlled experimentation with testing and regression testing techniques. This paper reports on the challenges faced by researchers experimenting with testing techniques, including those that inform the design of our infrastructure. The paper then describes the infrastructure that we are creating in response to these challenges, and that we are now making available to other researchers, and discusses the impact that this infrastructure has had and can be expected to have.
Book
Like other sciences and engineering disciplines, software engineering requires a cycle of model building, experimentation, and learning. Experiments are valuable tools for all software engineers who are involved in evaluating and choosing between different methods, techniques, languages and tools. The purpose of Experimentation in Software Engineering is to introduce students, teachers, researchers, and practitioners to empirical studies in software engineering, using controlled experiments. The introduction to experimentation is provided through a process perspective, and the focus is on the steps that we have to go through to perform an experiment. The book is divided into three parts. The first part provides a background of theories and methods used in experimentation. Part II then devotes one chapter to each of the five experiment steps: scoping, planning, execution, analysis, and result presentation. Part III completes the presentation with two examples. Assignments and statistical material are provided in appendixes. Overall the book provides indispensable information regarding empirical studies in particular for experiments, but also for case studies, systematic literature reviews, and surveys. It is a revision of the authors' book, which was published in 2000. In addition, substantial new material, e.g. concerning systematic literature reviews and case study research, is introduced. The book is self-contained and it is suitable as a course book in undergraduate or graduate studies where the need for empirical studies in software engineering is stressed. Exercises and assignments are included to combine the more theoretical material with practical aspects. Researchers will also benefit from the book, learning more about how to conduct empirical studies, and likewise practitioners may use it as a "cookbook" when evaluating new methods or techniques before implementing them in their organization.
Article
A test coverage criterion defines a set E_r of entities of the program flowgraph and requires that every entity in this set be covered under some test case. Coverage criteria are also used to measure the adequacy of the executed test cases. In this paper, we introduce the notion of spanning sets of entities for coverage testing. A spanning set is a minimum subset of E_r such that a test suite covering the entities in this subset is guaranteed to cover every entity in E_r. When the coverage of an entity always guarantees the coverage of another entity, the former is said to subsume the latter. Based on the subsumption relation between entities, we provide a generic algorithm to find spanning sets for control-flow- and data-flow-based test coverage criteria. We suggest several useful applications of spanning sets: they help reduce and estimate the number of test cases needed to satisfy coverage criteria. We also empirically investigate how the use of spanning sets affects fault detection effectiveness.
Article
Test case prioritization techniques schedule test cases for execution in an order that attempts to increase their effectiveness at meeting some performance goal. Various goals are possible; one involves rate of fault detection, a measure of how quickly faults are detected within the testing process. An improved rate of fault detection during testing can provide faster feedback on the system under test and let software engineers begin correcting faults earlier than might otherwise be possible. One application of prioritization techniques involves regression testing, the retesting of software following modifications; in this context, prioritization techniques can take advantage of information gathered about the previous execution of test cases to obtain test case orderings. We describe several techniques for using test execution information to prioritize test cases for regression testing, including: 1) techniques that order test cases based on their total coverage of code components; 2) techniques that order test cases based on their coverage of code components not previously covered; and 3) techniques that order test cases based on their estimated ability to reveal faults in the code components that they cover. We report the results of several experiments in which we applied these techniques to various test suites for various programs and measured the rates of fault detection achieved by the prioritized test suites, comparing those rates to the rates achieved by untreated, randomly ordered, and optimally ordered suites
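The "total" and "additional" strategies described here are easy to state in code; the following is a minimal sketch over per-test sets of covered entities (tie-breaking and the exact reset policy are simplified):

```python
# "Total": order tests by overall coverage size. "Additional": greedily
# pick the test covering the most not-yet-covered entities, resetting the
# covered set once no remaining test contributes new coverage.
def total_prioritization(coverage):
    # coverage: dict test -> set of covered entities.
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def additional_prioritization(coverage):
    remaining, covered, order = dict(coverage), set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if covered and not remaining[best] - covered:
            covered = set()  # feedback reset: start a new coverage round
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order
```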
Article
To reduce the cost of regression testing, software testers may prioritize their test cases so that those which are more important, by some measure, are run earlier in the regression testing process. One potential goal of such prioritization is to increase a test suite's rate of fault detection. Previous work reported results of studies showing that prioritization techniques can significantly improve the rate of fault detection. Those studies, however, raised several additional questions: (1) can prioritization techniques be effective when targeted at specific modified versions; (2) what trade-offs exist between fine-granularity and coarse-granularity prioritization techniques; and (3) can the incorporation of measures of fault proneness into prioritization techniques improve their effectiveness? To address these questions, we have performed several new studies in which we empirically compared prioritization techniques using both controlled experiments and case studies.
Article
We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500MB to 1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all files similar to a query file using a preprocessed index. Applications of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection.
Article
Test case prioritization techniques schedule test cases for execution in an order that attempts to increase their effectiveness at meeting some performance goal. Various goals are possible; one involves rate of fault detection, a measure of how quickly faults are detected within the testing process. An improved rate of fault detection during testing can provide faster feedback on the system under test and let software engineers begin correcting faults earlier than might otherwise be possible. One application of prioritization techniques involves regression testing, the retesting of software following modifications; in this context, prioritization techniques can take advantage of information gathered about the previous execution of test cases to obtain test case orderings. In this paper, we describe several techniques for using test execution information to prioritize test cases for regression testing, including: (1) techniques that order test cases based on their total coverage of code components; (2) techniques that order test cases based on their coverage of code components not previously covered; and (3) techniques that order test cases based on their estimated ability to reveal faults in the code components that they cover.
Article
Test case prioritization techniques schedule test cases for execution in an order that attempts to maximize some objective function. A variety of objective functions are applicable; one such function involves rate of fault detection, a measure of how quickly faults are detected within the testing process. An improved rate of fault detection during regression testing can provide faster feedback on a system under regression test and let debuggers begin their work earlier than might otherwise be possible. In this paper, we describe several techniques for prioritizing test cases and report our empirical results measuring the effectiveness of these techniques for improving rate of fault detection. The results provide insights into the tradeoffs among various techniques for test case prioritization.
Let's assume we had to pay for testing. Keynote at AST 2016
  • Kim Herzig
Kim Herzig. 2016. Let's assume we had to pay for testing. Keynote at AST 2016. https://www.kim-herzig.de/2016/06/28/keynote-ast-2016/
Adaptive Random Testing: The ART of Test Case Diversity
  • Tsong Yueh Chen
  • Fei-Ching Kuo
  • Robert G. Merkel
  • T. H. Tse
Regression Testing Minimization, Selection and Prioritization: A Survey
  • Shin Yoo
  • Mark Harman