Conference Paper

REMAP: Using Rule Mining and Multi-objective Search for Dynamic Test Case Prioritization


Abstract

Test case prioritization (TP) orders test cases so that specific criteria (e.g., high fault detection capability) are achieved as early as possible. However, existing TP techniques usually produce only a static test case order before execution, without taking runtime test case execution results into account. In this paper, we propose an approach for black-box dynamic TP using rule mining and multi-objective search (named REMAP). REMAP has three key components: 1) Rule Miner, which mines execution relations among test cases from historical execution data; 2) Static Prioritizer, which defines two objectives (i.e., fault detection capability (FDC) and test case reliance score (TRS)) and applies multi-objective search to prioritize test cases statically; and 3) Dynamic Executor and Prioritizer, which executes statically-prioritized test cases and dynamically updates the test case order based on the runtime test case execution results. We empirically evaluated REMAP against random search, greedy based on FDC, greedy based on FDC and TRS, static search-based prioritization, and rule-based prioritization using two industrial and three open-source case studies. Results showed that REMAP significantly outperformed the other approaches for 96% of the case studies and achieved on average 18% higher Average Percentage of Faults Detected (APFD).
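APFD, the metric behind the abstract's headline 18% improvement, is straightforward to compute from a test order and fault-detection data. The sketch below is a generic illustration only (the test ids and fault matrix are invented, not taken from the REMAP case studies):

```python
def apfd(order, faults_detected_by):
    """Average Percentage of Faults Detected for a given test order.

    order: list of test case ids, in execution order.
    faults_detected_by: dict mapping fault id -> set of test ids detecting it.
    """
    n = len(order)
    m = len(faults_detected_by)
    position = {test: i + 1 for i, test in enumerate(order)}  # 1-based ranks
    # TF_i: rank of the first executed test that exposes fault i
    first_detection = [
        min(position[t] for t in detectors if t in position)
        for detectors in faults_detected_by.values()
    ]
    return 1 - sum(first_detection) / (n * m) + 1 / (2 * n)

# Toy example: 5 tests, 2 faults; fault f1 is caught at rank 2, f2 at rank 4.
order = ["t1", "t2", "t3", "t4", "t5"]
faults = {"f1": {"t2"}, "f2": {"t4", "t5"}}
print(round(apfd(order, faults), 2))  # → 0.5
```

Moving the fault-revealing tests earlier in `order` raises the score toward 1, which is exactly what a dynamic prioritizer such as REMAP aims to do at runtime.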
... For our empirical study, we collected 18 search problems from both industrial projects and the literature, across the topics of test case minimization (Wang et al. 2015; Zhang et al. 2019), test case prioritization (Pradhan et al. 2016b, 2018), rule mining and configuration generation (Safdar et al. 2017), requirements allocation for inspection (Yue and Ali 2014), test case selection (Pradhan et al. 2016a), uncertainty-wise requirements prioritization (Zhang et al. 2020), testing resource allocation (Pradhan et al. 2021; Wang et al. 2008), integration and test order (Guizzo et al. 2017; Pradhan et al. 2021), and software release planning (Dantas et al. 2015; Greer and Ruhe 2004). We obtained data for this empirical study by following the procedure described in Fig. 1. ...
... However, for large-scale and complex VCSs, it is difficult to test them exhaustively due to limited time and resources; thus, their testing needs to be optimized, e.g., with test minimization and test prioritization. To this end, in our previous works, we have implemented a test suite minimization approach (Wang et al. 2015) and two test case prioritization approaches (Pradhan et al. 2016b, 2018) to improve testing efficiency. In addition, we have implemented a search-based approach to discover faulty configurations caused by interactions among VCSs belonging to different families of VCSs (Safdar et al. 2017). ...
... Test Case Prioritization-2 (Pradhan et al. 2018): For this problem, we proposed an approach for black-box dynamic test case prioritization using rule mining and multi-objective search, defined two objectives (fault detection capability and test case reliance score), and used three case studies to empirically evaluate MOSAs. Two of the three case studies include 60 and 624 test cases, and the other one (consisting of 89 test cases) was from ABB Robotics for testing paint control systems of painting robots (Spieker et al. 2017). ...
Article
Full-text available
Multi-Objective Search Algorithms (MOSAs) have been applied to solve diverse Search-Based Software Engineering (SBSE) problems. In most cases, SBSE users select one or more commonly used MOSAs (for instance, Nondominated Sorting Genetic Algorithm II (NSGA-II)) to solve their search problems, without any justification (i.e., not supported by any evidence) on why those particular MOSAs are selected. However, when working with a specific multi-objective SBSE problem, users typically know what kind(s) of qualities they are looking for in solutions. Such qualities are represented by one or more Quality Indicators (QIs), which are often employed to assess various MOSAs to select the best MOSA. However, users usually have limited time budgets, which prevents them from executing multiple MOSAs and consequently selecting the best MOSA in the end. Therefore, for such users, it is highly preferred to select only one MOSA since the beginning. To this end, in this paper, we aim to assist SBSE users in finding appropriate MOSAs for their experiments, given their choices of QIs or quality aspects (e.g., Convergence, Uniformity). To achieve this aim, we conduct an extensive empirical evaluation with 18 search problems from a set of real-world, industrial, and open-source case studies, to study preferences among commonly used QIs and MOSAs in SBSE. We observe that each QI has its own specific most-preferred MOSA and vice versa; NSGA-II and Strength Pareto Evolutionary Algorithm 2 (SPEA2) are the most preferred MOSAs by QIs; no QI is the most preferred by all the MOSAs; the preferences between QIs and MOSAs vary across the search problems; QIs covering the same quality aspect(s) do not necessarily have the same preference for MOSAs. Based on our results, we provide discussions and guidelines for SBSE users to select appropriate MOSAs based on experimental evidence.
... Within the first application domain, we have integrated the proposed seeding strategies into the open-source framework for test case selection proposed in our previous work [7,8]. Within the second application domain, we used openly available datasets of industrial and real-world case studies [34,82,83,91]. • We perform an empirical evaluation using six case studies in the first application domain and four in the second one. ...
... However, the codebase can be extremely large, and it might not be possible to execute the entire test suite. Subsequently, regression test selection approaches have been investigated to make continuous integration testing more cost-effective [34,49,82,83,91,102]. ...
... CI environments can provide a large amount of historical information in relation to the execution of tests (e.g., tested versions, verdict information, test case duration). Inspired by previous works [34,82,83,95], we have defined the following objective functions adapted to the multi-objective test case selection context: ...
Article
The time it takes software systems to be tested is usually long. Search-based test selection has been a widely investigated technique to optimize the testing process. In this paper, we propose a set of seeding strategies for the test case selection problem that generate the initial population of pareto-based multi-objective algorithms, with the goals of (1) helping to find an overall better set of solutions and (2) enhancing the convergence of the algorithms. The seeding strategies were integrated with four state-of-the-art multi-objective search algorithms and applied into two contexts where regression-testing is paramount: (1) Simulation-based testing of Cyber-Physical Systems and (2) Continuous Integration. For the first context, we evaluated our approach by using six fitness function combinations and six independent case studies, whereas in the second context we derived a total of six fitness function combinations and employed four case studies. Our evaluation suggests that some of the proposed seeding strategies are indeed helpful for solving the multi-objective test case selection problem. Specifically, the proposed seeding strategies provided a higher convergence of the algorithms towards optimal solutions in 96% of the studied scenarios and an overall cost-effectiveness with a standard search budget in 85% of the studied scenarios.
... Dependencies and similarities between test cases are often considered to make informed decisions during TCP. On the one hand, dependencies refer to pairs of test cases that should be considered together (joint execution) [21], implementation dependencies among them [41], or interrelated outcomes (verdict patterns) expressed as association rules [90]. On the other hand, test case similarity is defined as the distance between test cases based on a specific criterion, e.g. ...
... In particular, historical effectiveness and execution time are the attributes most frequently appearing in each category. Test case dependencies (67%) have been used in a few studies that trace dependencies from their specification [41] or analyse coincidences of verdicts [90,91]. The cost associated with similarity-based attributes might be one of the reasons for their lack of application in industry. ...
... Under controlled environments, test cases are unitary, independent and do not need special considerations to be executed. These assumptions do not hold in industrial case studies, as reflected by the appearance of attributes for inclusion in the taxonomy that were specifically conceived to introduce practicalities (e.g., reliance on information used in manual testing at the system testing level [58], dependency analysis for test case dependencies [41,90,91], usage patterns as attributes when faults affect different users or functionalities [6], combinations of attributes and ad-hoc preprocessing for heterogeneous artefacts and languages [16], and runtime information during testing for real-time conditions and specific standards [109]). ...
Preprint
Most software companies have extensive test suites and re-run parts of them continuously to ensure recent changes have no adverse effects. Since test suites are costly to execute, industry needs methods for test case prioritisation (TCP). Recently, TCP methods use machine learning (ML) to exploit the information known about the system under test (SUT) and its test cases. However, the value added by ML-based TCP methods should be critically assessed with respect to the cost of collecting the information. This paper analyses two decades of TCP research, and presents a taxonomy of 91 information attributes that have been used. The attributes are classified with respect to their information sources and the characteristics of their extraction process. Based on this taxonomy, TCP methods validated with industrial data and those applying ML are analysed in terms of information availability, attribute combination and definition of data features suitable for ML. Relying on a high number of information attributes, assuming easy access to SUT code and simplified testing environments are identified as factors that might hamper industrial applicability of ML-based TCP. The TePIA taxonomy provides a reference framework to unify terminology and evaluate alternatives considering the cost-benefit of the information attributes.
... Pradhan et al. [25] combined rule-based mining and multi-objective search for optimizing prioritization and test sequence generation. The authors used the fault detection capability and reliance score of test cases for assigning priorities to test cases. ...
... The comparative evaluation and validation of the proposed model are done against the Random Search (RS) [25], Genetics (G) [25], and REMAP [25] methods. These are the optimization and classification methods defined with a different number of objectives. ...
Article
Network and real-time projects require special and effective testing consideration before implementation in a real environment. An effective test sequence not only reduces the actual testing time but also reduces cost and effort. The design-flow diagram and the module attributes play an essential role in generating a valid path sequence. In this paper, an automated and generalized framework is designed that processes the code project and generates the optimized test sequence. In the earlier stage of this framework, the structural and relational features of program code are extracted, and the design-flow diagram is constructed. While constructing the diagram, the design-time features are computed, connected, and updated with each node. The connectivity, dependency, positional, and contributional features are computed for each node. In the second stage, this weighted design-flow diagram and fault weights are used in combination for deciding the low-cost test sequence. The proposed framework is applied to five network, security, and robotics based code sources. The comparative analysis is done against the Random Search, Genetics, and REMAP methods for test sequence generation. The proposed model achieved an average APFDc score of 87.11%: a 3.3% gain over REMAP (Ripper + IBEA (3 Obj)), 7.9% gain over REMAP (Ripper + SPEA2 (2 Obj)), 20.63% gain over Genetics (3 Objects), 21.05% gain over Genetics (2 Objects), 34.5% gain over Random Forest (3 Objects), and 34.96% gain over Random Forests (3 Objs). The results confirm that the proposed model achieved a higher APFDc score than state-of-the-art methods.
... More similar to this study's approach is the one proposed by Pradhan et al. (2018, 2019). In their approach, they aim at re-prioritizing test cases based on rules inferred from historical data in the context of continuous integration, showing improvement over traditional static test prioritization approaches. ...
Article
Product line testing is challenging due to the potentially huge number of configurations. Several approaches have tackled this challenge; most of them focused on reducing the number of tested products by selecting a representative subset. However, little attention has been paid to product line test optimization using test results while tests are executed. This paper aims at optimizing the testing process of product lines by increasing the fault detection rate. To this end, we propose a dynamic test prioritization approach. In contrast to traditional static test prioritization, our dynamic test prioritization leverages information from tests being executed in specific products. Processing this information, the initially prioritized tests are rearranged in order to find non-discovered faults. The proposed approach is valid for any kind of product line, but we have adapted it to the context of configurable simulation models, an area where testing is especially time-consuming and optimization methods are paramount. The approach was empirically evaluated by employing two case studies. The results of this evaluation reveal that the proposed test prioritization approach improves both the static prioritization algorithm and the selected baseline technique. The results provide a basis for suggesting that the proposed dynamic test prioritization approach is appropriate to optimize the testing process of product lines.
... Test case prioritisation approaches Test Case Prioritisation (TCP) arranges test cases into an optimal order so that a specific criterion is met as early as possible. The work of [13] presents a black-box strategy called REMAP that incorporates three fundamental components: a rule miner, a static prioritiser, and a dynamic executor and prioritiser. ...
Article
Changes in the software necessitate confirmation testing and regression testing to be applied, since new errors may be introduced with the modification. Test case prioritization is one method that can be applied to optimize which test cases should be executed first, involving how to schedule them in a certain order that detects faults as soon as possible. The main aim of our paper is to propose a test case prioritization technique that considers defect prediction as a criterion for prioritization, in addition to the standard approach which considers the number of discovered faults. We have performed several experiments, considering only faults and the defect prediction values for each class. We compare our approach with random test case execution (for a theoretical example) and with the fault-based approach (for the Mockito project). The results are encouraging: for several class changes we obtained better results with our proposed hybrid approach.
... Nevertheless, it is worth noting that PSI indicators usually need to work together with generic quality indicators (e.g., those in Table 9) to provide reliable evaluations, since they may be irrelevant to Pareto-based optimization (e.g., only focusing on particular objectives in evaluation). For example, the study [106] mainly relies on APFD, the average percentage of faults detected, to evaluate the solution set of prioritized test cases. Indeed, APFD is a frequently used PSI in test case prioritization, but it can only reflect the rate of faults detected, not the reliance of test cases, both of which are objectives to be optimized for the problem. ...
Article
With modern requirements, there is an increasing tendency of considering multiple objectives/criteria simultaneously in many Software Engineering (SE) scenarios. Such a multi-objective optimization scenario comes with an important issue - how to evaluate the outcome of optimization algorithms, which typically is a set of incomparable solutions (i.e., being Pareto nondominated to each other). This issue can be challenging for the SE community, particularly for practitioners of Search-Based SE (SBSE). On one hand, multi-objective optimization could still be relatively new to SE/SBSE researchers, who may not be able to identify the right evaluation methods for their problems. On the other hand, simply following the evaluation methods for general multi-objective optimization problems may not be appropriate for specific SBSE problems, especially when the problem nature or decision maker's preferences are explicitly/implicitly known. This has been well echoed in the literature by various inappropriate/inadequate selection and inaccurate/misleading use of evaluation methods. In this paper, we first carry out a systematic and critical review of quality evaluation for multi-objective optimization in SBSE. We survey 717 papers published between 2009 and 2019 from 36 venues in seven repositories, and select 95 prominent studies, through which we identify five important but overlooked issues in the area. We then conduct an in-depth analysis of quality evaluation indicators/methods and general situations in SBSE, which, together with the identified issues, enables us to codify a methodological guidance for selecting and using evaluation methods in different SBSE scenarios.
Chapter
Test case prioritization (TCP) is a technique used to prioritize test cases based on some criteria, to achieve a high rate of early fault detection and to reduce the cost and time spent on executing the whole test suite. However, many of the proposed techniques are based on coverage information. Structural coverage, such as statement coverage, branch coverage, and method coverage information, has largely been used for implementing these prioritization techniques. This information is gathered in either of two states of the software (static or dynamic). Applying this coverage information as a criterion to prioritize test cases has at least two drawbacks stemming from the way it is collected. The first is the cost of collecting the information, where the cost of dynamic collection tends to be higher than that of static collection. Second, both kinds of information are imprecise, because static coverage information cannot deal with dynamic features, and dynamic coverage information is collected from an earlier version of the software rather than the current program version. This research was intended to develop a TCP technique without using any of this coverage information, and to explore potential metrics that could be used to develop test case prioritization that enhances the fault detection rate at the early execution of the test suite. With that being said, this study explores and examines static software metrics (McCabe's Cyclomatic complexity and Halstead's metrics) to develop our TCP algorithm. The developed algorithm was examined against two benchmark algorithms on four subject programs to evaluate its performance. APFD is used as the performance evaluation metric, and the developed algorithm outperforms both benchmark algorithms. Keywords: Software testing, Regression testing, Test case prioritization, Static metrics, APFD
Article
Given the large number of publications in software engineering, frequent literature reviews are required to keep current on work in specific areas. One tedious task in literature reviews is finding relevant studies amongst thousands of non-relevant search results. In theory, expert systems can assist in finding relevant work, but those systems have primarily been tested in simulations rather than in application to actual literature reviews. Hence, few researchers have faith in such expert systems. Accordingly, using a realistic case study, this paper assesses how well our state-of-the-art expert system can help with literature reviews. The assessed literature review aimed at identifying test case prioritization techniques for automated UI testing, specifically from 8,349 papers on IEEE Xplore. This corpus was studied with an expert system that incorporates an incrementally updated human-in-the-loop active learning tool. Using that expert system, in three hours we found 242 relevant papers, from which we identified 12 techniques representing the state of the art in test case prioritization when source code information is not available. These results were then validated by six other graduate students manually exploring the same corpus. Without the expert system, this task would have required 53 hours and would have found 27 additional papers. That is, our expert system achieved 90% recall with 6% of the human effort when compared to a conventional manual method. Significantly, the same 12 state-of-the-art test case prioritization techniques were identified by both the expert system and the manual method. That is, the 27 papers missed by the expert system would not have changed the conclusion of the literature review. Hence, if this result generalizes, it endorses the use of our expert system to assist in literature reviews.
Conference Paper
The importance of cost-effectively prioritizing test cases is undeniable in automated testing practice in industry. This paper focuses on prioritizing test cases developed to test product lines of Video Conferencing Systems (VCSs) at Cisco Systems, Norway. Each test case requires setting up configurations of a set of VCSs, invoking a set of test APIs with specific inputs, and checking statuses of the VCSs under test. Based on these characteristics and the available information related to test case execution (e.g., number of faults detected), we identified that the test case prioritization problem in our particular context should focus on achieving high coverage of configurations, test APIs, and statuses, and high fault detection capability, as quickly as possible. To solve this problem, we propose a search-based test case prioritization approach (named STIPI) by defining a fitness function with four objectives and integrating it with a widely applied multi-objective optimization algorithm (Non-dominated Sorting Genetic Algorithm II). We compared STIPI with random search (RS), a Greedy algorithm, and three approaches adapted from the literature, using three real sets of test cases from Cisco with four time budgets (25%, 50%, 75%, and 100%). Results show that STIPI significantly outperformed the selected approaches and achieved better performance than RS by on average 39.9%, 18.6%, 32.7%, and 43.9% for the coverage of configurations, test APIs, and statuses, and fault detection capability, respectively.
Article
Test case prioritization schedules test cases for execution in an order that attempts to accelerate the detection of faults. The order of test cases is determined by prioritization objectives such as covering code or critical components as rapidly as possible. The importance of this technique has been recognized in the context of Highly-Configurable Systems (HCSs), where the potentially huge number of configurations makes testing extremely challenging. However, current approaches for test case prioritization in HCSs suffer from two main limitations. First, the prioritization is usually driven by a single objective, which neglects the potential benefits of combining multiple criteria to guide the detection of faults. Second, instead of using industry-strength case studies, evaluations are conducted using synthetic data, which provides no information about the effectiveness of different prioritization objectives. In this paper, we address both limitations by studying 63 combinations of up to three prioritization objectives in accelerating the detection of faults in the Drupal framework. Results show that non-functional properties, such as the number of changes in the features, are more effective than functional metrics extracted from the configuration model. Results also suggest that multi-objective prioritization typically results in faster fault detection than mono-objective prioritization.
Conference Paper
Due to the limited time and resources available for execution, test case selection always remains crucial for cost-effective testing. It is even more prominent when test cases require manual steps, e.g., operating physical equipment. Thus, test case selection must consider complicated trade-offs between cost (e.g., execution time) and effectiveness (e.g., fault detection capability). Based on our industrial collaboration within the maritime domain, we identified a real-world, multi-objective test case selection problem in the context of robustness testing, where test case execution requires human involvement in certain steps, such as turning on the power supply to a device. The high-level goal is to select test cases for execution within a given time budget, where test engineers provide weights for a set of objectives depending on testing requirements, standards, and regulations. To address the identified test case selection problem, we defined a fitness function including one cost measure, i.e., Time Difference (TD), and three effectiveness measures, i.e., Mean Priority (MPR), Mean Probability (MPO), and Mean Consequence (MC), that were identified together with test engineers. We further empirically evaluated eight multi-objective search algorithms, including three weight-based search algorithms (e.g., the Alternating Variable Method) and five Pareto-based search algorithms (e.g., Strength Pareto Evolutionary Algorithm 2 (SPEA2)), using two weight assignment strategies (WASs). Random Search (RS) was used as a comparison baseline. We conducted two sets of empirical evaluations: 1) using a real-world case study developed based on our industrial collaboration; 2) simulating the real-world case study at a larger scale to assess the scalability of the search algorithms. Results show that SPEA2 with either of the WASs performed the best in both studies. Overall, SPEA2 managed to improve on average 32.7%, 39%, and 33% in terms of MPR, MPO, and MC, respectively, as compared to RS.
Conference Paper
Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there are uncertainties about the impact of committed code changes or if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI, with the goal of minimizing the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, previous last execution, and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under the guidance of a reward function and by observing previous CI cycles. By applying Retecs to data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.
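The reward-driven loop that the Retecs abstract describes can be illustrated with a deliberately simplified tabular sketch. This is not the actual Retecs implementation (which uses richer state features and a learned policy); the class name, test ids, and update rule here are invented for illustration:

```python
class ToyPrioritizer:
    """Toy tabular sketch of reward-driven test prioritization.

    Each test keeps a running value estimate; failing (informative)
    tests earn reward 1 and drift toward the front of the order.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # learning rate for the value update
        self.value = {}      # test id -> learned priority value

    def prioritize(self, tests):
        # Higher learned value first; unseen tests get a neutral 0.5.
        return sorted(tests, key=lambda t: self.value.get(t, 0.5), reverse=True)

    def observe(self, test, failed):
        # Blend the observed reward into the running estimate.
        reward = 1.0 if failed else 0.0
        old = self.value.get(test, 0.5)
        self.value[test] = old + self.alpha * (reward - old)

agent = ToyPrioritizer()
for _ in range(5):                       # five CI cycles where t2 keeps failing
    for t in agent.prioritize(["t1", "t2", "t3"]):
        agent.observe(t, failed=(t == "t2"))
print(agent.prioritize(["t1", "t2", "t3"])[0])  # → t2
```

After a few cycles the consistently failing test rises to the front, which is the adaptive behavior the abstract attributes to the reward function.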
Article
Associative classification is rule-based, involving candidate rules as criteria of classification that provide both highly accurate and easily interpretable results to decision makers. An important phase of associative classification is rule evaluation, consisting of rule ranking and pruning, in which bad rules are removed to improve performance. Existing association rule mining algorithms rely on frequency-based rule evaluation methods such as support and confidence, failing to provide sound statistical or computational measures for rule evaluation, and often suffer from many redundant rules. In this research we propose predictability-based collective class association rule mining based on cross-validation, with a new rule evaluation step. We measure the prediction accuracy of each candidate rule in inner cross-validation steps: we split a training dataset into inner training sets and inner test sets and then evaluate the candidate rules' predictive performance. Several experiments show that the proposed algorithm outperforms some existing algorithms while maintaining a large number of useful rules in the classifier. Furthermore, by applying the proposed algorithm to a real-life healthcare dataset, we demonstrate that it is practical and has potential to reveal important patterns in the dataset.
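Support and confidence, the frequency-based measures this abstract critiques, are easy to state concretely. A minimal sketch under invented toy data (the function name and the verdict-style transactions are illustrative, not from either paper):

```python
def support_confidence(transactions, antecedent, consequent):
    """Frequency-based evaluation of an association rule A -> C.

    transactions: list of item sets (here, sets of failing test ids per run).
    Returns (support, confidence) of the rule antecedent -> consequent.
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # runs containing A
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # runs containing A and C
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    return support, confidence

# Toy verdict data: each transaction is the set of tests that failed in one run.
runs = [{"t1", "t2"}, {"t1", "t2"}, {"t1"}, {"t3"}]
s, conf = support_confidence(runs, {"t1"}, {"t2"})
print(s, conf)  # support 0.5, confidence ≈ 0.667
```

Rules mined this way capture only co-occurrence frequency, which is precisely why the abstract argues for an additional predictability-based evaluation step via inner cross-validation.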
Conference Paper
App developers would like to understand the impact of their own and their competitors' software releases. To address this we introduce Causal Impact Release Analysis for app stores, and our tool, CIRA, that implements this analysis. We mined 38,858 popular Google Play apps, over a period of 12 months. For these apps, we identified 26,339 releases for which there was adequate prior and posterior time series data to facilitate causal impact analysis. We found that 33% of these releases caused a statistically significant change in user ratings. We use our approach to reveal important characteristics that distinguish causal significance in Google Play. To explore the actionability of causal impact analysis, we elicited the opinions of app developers: 56 companies responded, 78% concurred with the causal assessment, of which 33% claimed that their company would consider changing its app release strategy as a result of our findings.
Conference Paper
Test case prioritization is an essential part of test execution systems for large organizations developing software systems whose versions are released very frequently. Releases must be tested on a variety of compatible hardware with different configurations to ensure the correct functioning of a software version on compatible hardware. In practice, test case execution must not only execute cost-effective test cases in an optimal order, but also optimally allocate the required test resources, in order to deliver high-quality software releases. To optimize the current test execution system for testing software releases developed for Videoconferencing Systems (VCSs) at Cisco, Norway, in this paper we propose a resource-aware multi-objective optimization solution with a fitness function defined based on four cost-effectiveness measures. In this context, a set of software releases must be tested on a set of compatible VCS hardware (test resources) by executing a set of cost-effective test cases in an optimal order within a given test cycle, constrained by the maximum allowed time budget and the maximum available test resources. We empirically evaluated seven search algorithms regarding their performance and scalability by comparing them with the current practice (random ordering (RO)). The results show that the proposed solution with the best search algorithm (i.e., the Random-Weighted Genetic Algorithm) improved the current practice by reducing the time for test resource allocation and test case execution by on average 40.6%, improving test resource usage on average by 37.9%, and improving fault detection on average by 60%.