Received 15 May 2018; Revised xxxxx; Accepted xxxxx
DOI: xxx/xxxx
EXTENDED CONFERENCE PAPER
Reducing Ineffective Test Redundancy in Practical Continuous
Regression Testing
Dusica Marijan*1| Blind review2,3 | Blind review3
1Simula, Lysaker, Norway
2Org Division, Org Name, State name,
Country name
3Org Division, Org Name, State name,
Country name
Correspondence
*Dusica Marijan, Email: dusica@simula.no
Present Address
PO Box 134, 1325 Lysaker
Summary
Regression testing is an integral part of the software development lifecycle which
ensures that code changes are not negatively impacting the software functionality.
As, nowadays, software development is often carried out iteratively, with small code
increments continuously developed and regression tested, it is of critical importance
that regression testing is time-efficient. Yet, in practice, regression testing is often
long-lasting and faces scalability problems as software grows larger or as software
changes are made at higher frequency. One contributing factor to these issues is test redundancy, which causes the same software functionality to be tested multiple times across different test cases. In large-scale industrial software, redundancy in
regression testing can significantly grow the size of test suites and thus the cost of
regression testing. This paper presents a practical approach for reducing ineffective
test redundancy in regression suites for configurable software developed in continu-
ous integration. The novelty of the approach lies in learning and predicting the fault
detection effectiveness of integration tests using historical test records and combin-
ing this information with coverage based redundancy metrics, to identify ineffective
redundancy, which is eliminated from a regression test suite. We apply and evalu-
ate the approach in testing of industrial video conferencing software and industrial
mobile application software developed in a continuous integration practice, as well as
using a large set of artificial test subjects. The results show that the proposed redun-
dancy reduction approach based on coverage and history analysis can significantly
improve industry practice in terms of time-efficiency and fault detection effective-
ness, in two distinct domains. This suggests that the proposed approach contributes
to the state-of-the-practice in continuous integration testing.
KEYWORDS:
Test redundancy, selective regression testing, highly-configurable software, highly-interleaving tests, test
optimization, continuous regression testing
1 INTRODUCTION
Regression testing runs frequently in software development under continuous integration (CI), as a means of ensuring that
changes made to the software as part of frequent code development or bug fixing have not introduced new bugs. With changes
continuously made to the software, test suites for checking the correctness of the software often grow in size, covering new
functionality or new scenarios for the existing functionality. A growing test suite size often causes increased test redundancy.
Test redundancy happens when different test cases cover the same functionality (usually partially), and thus executing the whole
test suite means testing the same functionality multiple times. A test suite containing redundant tests increases the cost of testing
as well as test maintenance effort, which implies that for effective testing, test redundancy needs to be analyzed and removed,
or maximally reduced.
We have observed the problem of making continuous regression testing efficient in practice during a long-standing collaboration with Cisco Systems Norway on testing highly-configurable video-conferencing software (HCS). Specifically, we have
observed the need for time-efficient regression testing of industrial software, on the one hand, which is due to the nature of CI,
and the difficulty of systematically selecting the adequate set of test cases for covering particular changes, on the other hand,
which is due to the software complexity and size. Furthermore, the case that makes handling test redundancy particularly difficult is the testing of HCS using interleaved tests: tests that exercise high-degree interactions of individual features, forming chain configurations in which sub-configurations recur frequently across individual test cases. This makes it difficult to effectively reduce redundancy in the initial test suite for HCS using traditional coverage-based techniques.
A common approach to regression testing in practice is a retest-all approach, which runs all available test cases, or a selective
retest-all approach, which runs all test cases covering the parts of software affected by changes. However, for real software
systems, retest-all can lead to scalability problems if systems are large and software modifications (triggering regression testing) are made frequently [1]. To make continuous regression testing efficient, only the relevant minimal set of tests that adequately covers the changes should be selected and run. While the retest-all approach is easily automated, selective regression testing is mainly a manual process done by developers or testers, and thus often unsystematic and liable to subjectivity. Because test suites grow in
size over time, introducing redundancy, manually selecting an adequate non-redundant set of regression tests soon becomes
inefficient and impractical. This is especially evident in continuous integration (CI) development, where regression testing runs
as part of a timeboxed development iteration restricted to a specific duration. As these iterations are short, less time is available
for testing, making it difficult to select regression tests using a manual approach. Because regression test feedback needs to be
provided rapidly, efficient regression test selection needs to be automated and further optimized for minimal test redundancy.
Related approaches to improving regression testing have been mainly focused on test minimization, without specific focus
on redundancy analysis for highly-interleaved tests in HCS. Moreover, existing regression testing approaches rarely focus on
specific needs of continuous integration environments, such as high time-efficiency, especially important due to extensive test
suites typically found in testing industrial HCS. Unlike these approaches, we focus on reducing redundancy of highly-interleaved
regression tests for HCS in CI. In our previous work [reference removed due to blind review], we proposed an approach for
minimizing test redundancy of interleaved tests in HCS using the overlap information of configuration options in HCS across a
test suite and historical fault detection effectiveness of test cases. This approach helped classify regression test cases as unique,
totally redundant, and partially redundant tests. In this work, we extend the approach by introducing a more accurate classification of partial redundancy. Specifically, we use regression trees to predict the fault detection effectiveness of tests based on historical test execution records, which provides a more accurate classification of effective and ineffective partially redundant regression tests, and thus more efficient regression test suites.
To evaluate our approach, we conducted an empirical study on one large industrial video conferencing software (VCS), and
one industrial HCS mobile application, both developed in CI. We use historical test execution data in evaluation in both cases.
We complemented the industrial test sets with a large set of artificial HCS test suites. We compare the approach with current
industry practice, with an advanced retest-all approach, and with random test selection, in terms of fault-detection effectiveness
and time-efficiency. The results of our study show that our approach can improve the cost-effectiveness of regression testing of HCS compared to industry practice, as well as to commonly used practical strategies such as random selection and retest-all. To evaluate the
performance of our prediction model, we use metrics such as precision, recall, accuracy, and F-score, which demonstrated
good performance of the prediction model. Overall, the experimental results show that our proposed approach contributes to an
improved continuous regression testing with respect to fault detection effectiveness and time-effectiveness.
The paper makes the following contributions with respect to our previous work [reference removed due to blind review]:
1) We introduce a novel classification approach for effective and ineffective test cases based on regression trees.
2) We evaluate the proposed classification approach using precision, recall, accuracy, and F-score metrics, demonstrating the improvement in the performance of test redundancy reduction compared to the previous work.
3) We provide new experimental results for the same test data used in the previous work, showing the improved performance
of the overall approach in terms of fault detection effectiveness due to the novel algorithm for predicting ineffective partially
redundant test cases.
4) We perform a more comprehensive evaluation of the overall approach to test redundancy reduction by introducing a novel case study of continuous regression testing, which reduces the threat to the external validity of our results.
In the remainder of the paper we review the background work relevant for the understanding of our approach in Section 2. We
describe the running example of testing industrial video conferencing software that motivated our work in Section 3, followed
by the problem statement in Section 4. We describe our solution to reducing ineffective test redundancy in practical continuous
regression testing in Section 5. We present the experimental study performed to evaluate the approach in Section 6, and discuss
the results in Section 7. We discuss threats to validity in Section 8, and review related work in Section 9. Finally, we give the
conclusion and highlight further research in Section 10.
2 BACKGROUND
This section gives an overview of the background concepts relevant for our work, such as testing of HCS, and test redundancy
and regression test selection in the testing of HCS.
2.1 Testing of Highly-Configurable Software
Highly configurable software consists of a common code base and a set of configuration options (features) used for customizing the main software functionality. For example, features in configurable video communication software include the video resolution or the networking protocol type. We consider a test case for testing HCS to be an integration test aimed at checking the correctness of the interaction between the involved features. Given an HCS with a set of features FS = {f1, f2, ..., fn}, and a test suite TS = {t1, t2, ..., tm} for testing the HCS, we can define an association function Cov : TS → FS, associating a set of covered features with each test case. Cov(ti) = {f1, f2, ..., fk}, k ≤ n, represents the set of features tested by ti. We assume that the relation between ti ∈ TS and fi ∈ FS is "many-to-many", which means that a feature can be covered by multiple test cases, and a test case can cover multiple features. We further assume that ∀ti ∈ TS, Cov(ti) ≠ ∅, i.e., each test covers at least one feature.
Configurable software typically requires sophisticated testing approaches due to the large number of features found in realistic
HCS, which normally causes redundancy in tests. The problem becomes especially evident in continuous regression testing,
which runs frequently and is highly time-constrained. Test feedback on pass/fail regression tests needs to be provided rapidly,
which necessitates that regression test suites are optimized for less redundancy. A common approach for testing HCS has been combinatorial interaction testing (CIT), where tests are developed to cover combinations of features according to a target coverage criterion [2]. However, existing approaches predominantly consider fixed-strength interaction coverage (for example, pairwise), while we have noticed in practice that realistic systems often require testing combinations of features with varying degrees of interaction.
2.2 Test Case Redundancy
Test redundancy is often defined with respect to coverage metrics, such that for a given coverage criterion (for example, pairwise feature coverage), if two tests ti and tj execute the same pair of features, one of the tests in a suite TS = {ti, tj} contributes to test suite redundancy.
There are numerous causes of test redundancy, for example, test reuse in manual test specification, where existing tests are modified to test new, similar functionality while unintentionally retaining parts of already tested functionality. Other causes include incomplete requirements specification, redundancy of requirements, legacy, static test suites, parallel testing, or distributed testing [3]. In this work we focus on redundancy of integration tests, and in particular redundancy that is introduced during test
case design, where test cases are developed to test a varying number of feature interactions. Redundant combinations of feature
interactions in a test suite cause the same functionality being executed in multiple instances.
Formally, if we use FSet to denote the feature set covered by a test suite TS, and Cov(t) to denote the set of features covered by a test case t ∈ TS, then:
1. A test case t is considered redundant in TS if Cov(TS ∖ {t}) = FSet,
2. A test case ti is considered redundant with respect to tj if Cov(ti) ⊆ Cov(tj).
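Both definitions reduce directly to set operations; a minimal Python sketch (test and feature names hypothetical):

```python
def redundant_in_suite(t, ts_cov):
    """Def. 1: t is redundant in TS if the rest of the suite still covers FSet."""
    fset = set().union(*ts_cov.values())
    rest = set().union(*(cov for name, cov in ts_cov.items() if name != t))
    return rest == fset

def redundant_wrt(cov_ti, cov_tj):
    """Def. 2: t_i is redundant with respect to t_j if Cov(t_i) is a subset of Cov(t_j)."""
    return cov_ti <= cov_tj

# Hypothetical coverage data: t2 is redundant in the suite, t3 is not (it
# uniquely covers f3).
ts_cov = {"t1": {"f1", "f2"}, "t2": {"f2"}, "t3": {"f1", "f3"}}
```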
2.3 Regression Test Selection
Regression test selection is a technique used to improve the cost-efficiency of software testing after a change has been made, by
selecting an effective set of test cases that will check whether the change (e.g. defect fix) introduced new faults. For a given set of
changes Ch = {Ch1, Ch2, ..., Chn} made to a software system S, and an existing test suite TS = {T1, T2, ..., Tn} used to test S, regression test selection determines TS′ ⊆ TS, consisting of tests that are relevant for Ch, to be used for testing S′. Selection is performed according to a selection function f_sel that uses information about Chi and Ti to find those Ti that are relevant for Chi. Regression test selection may additionally use a set of objectives Obj = {Obj1, Obj2, ..., Objn} to optimize the execution of TS′. Typical objectives include high fault detection and low execution time/cost. Regression test selection may also use historical test execution data to find an optimal TS′ based on past test performance. Although various approaches have been proposed for regression test selection, the challenge remains to efficiently identify and reduce ineffective test redundancy for highly-interleaved tests.
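Under the simplifying assumption that change impact is expressed as a set of affected features, the selection function f_sel can be sketched as (names hypothetical; real change-impact analysis is considerably richer):

```python
def select_regression_tests(ts_cov, affected_features):
    """Sketch of f_sel: keep tests whose covered features intersect the
    set of features affected by the change set Ch."""
    return [t for t, cov in ts_cov.items() if cov & affected_features]

# Hypothetical suite: a change to f2 makes T1 and T3 relevant.
ts_cov = {"T1": {"f1", "f2"}, "T2": {"f3"}, "T3": {"f2", "f4"}}
selected = select_regression_tests(ts_cov, {"f2"})
```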
2.4 Regression Trees for Test Classification
Regression trees [4] are a variant of decision trees, a supervised learning algorithm used to create models that predict the value of a target dependent variable, represented by the leaves of a tree, based on the values of input independent variables. Dependent variables are called response variables, and independent variables are called predictors. In regression trees, the target variable is continuous. In our case of testing HCS consisting of a large number of features, regression trees can be used to predict whether a test case will detect failures in execution based on its previous fault detection effectiveness. In particular, the tree nodes represent software features. These features are the building blocks of test cases, so traversing the tree from the root to a leaf constructs a set of features that represents a test case. Dependent variables represent the predicted fault detection effectiveness of a test case, and independent variables represent the historical fault detection effectiveness of a test case. Historical test execution data indicate whether features covered by a test case caused failures in previous test executions. A decision based on a predictor is represented by an internal node in the tree, where each edge gives the next decision. Traversing the tree from the root to a leaf thus gives the prediction for the effectiveness of a test case, based on the values of the predictors. Finally, the response variable is assigned a binary value, meaning a test case is either effective or ineffective in upcoming runs. An example of a regression tree with an explanation is given in Section 5.
A variety of prediction models exists, such as support vector machines, random forests, and neural networks, to name a few. However, the reason why we chose regression trees is twofold. First, the graphical representation of regression trees resembles feature models [5], commonly used to represent variability in highly-configurable software [6]. Engineers in our industry case study, which motivates this work, were already familiar with feature models, and therefore considered decision trees simple to understand and interpret. Since the proposed work is aimed at improving industry practice, we believe that choosing the right technology plays an important role in the success of industry adoption of the solution. Second, according to these studies [7, 8], classification and regression trees are considered among the top 10 algorithms in data mining. They have also been successfully used for predicting software quality [9].
3 INDUSTRIAL CASE STUDY
We present a case study describing the industrial practice of testing highly-configurable video conferencing systems at Cisco.
This case study motivated our research, and the proposed solution is aimed at improving the existing practice in this domain.
In this section, we describe one class of video conferencing systems, the C90 codec, and the current industry practice of testing it. Finally, we introduce the notation and assumptions used throughout the paper.
Cisco video conferencing software is developed as a product line, with the core conferencing functionality which is common
to all product variants, and a set of features which can be used to configure the variants according to user requirements. C90 is an HCS consisting of a wide range of features, including multisite, audio, video, and security features, illustrated in Figure 1, which makes the testing of C90 complex. In the current industry practice (CIP) of testing C90, a QA engineer
FIGURE 1 Highly-configurable Cisco video conferencing system C90.
selects and combines features to create valid test cases, which are continuously executed as part of CI. This motivates defining a set of tests whose total execution time is minimized as far as possible. However, tests usually cover the same features multiple times across a test suite, which can create test overhead if not handled carefully.
Consider a video conferencing system CBC that consists of a set of features FS = {f1, f2, ..., fn}. FS is used to build a set of products CBCProd = {P1, ..., Pz}, which are software solutions representing various configurations of desktop or boardroom conferencing. There is a test suite TS = {T1, T2, ..., Tm} used for testing CBC. Cov(Ti) = {f1, f2, ..., fk}, k ≤ n, denotes the set of features tested by Ti. When TS is developed, association tags are established
between pairs Ti/fi, denoting which tests are relevant for which features. Building association tags between tests and features is a manual process; a similar tagging process already exists for another purpose in the software development stage, and when implementing the proposed approach we therefore inherited the tagging system for the purpose of test reduction. When the software changes (for example, features are modified or added), developers are prompted, via the tags, about which tests are affected by the changes, and can then update the affected tests accordingly. In this industrial practice of testing highly-configurable conferencing systems at Cisco, feature changes that require test modifications do not happen frequently, and therefore updating the association tags and tests manually does not create scalability problems.
Each test Ti ∈ TS is associated with a set of historical records Fde(Ti) = {Fde_Ti1, Fde_Ti2, ..., Fde_Tij} that correspond to the execution statuses of j past test case executions (fail/pass status). Fde_Tij = {status_ij, Cfail(Fde_Tij)} is associated with one or more configurations responsible for a failure, Cfail(Fde_Tij) = {C1, C2, ..., Cm}, where Ci = {f1, f2, ..., fk} and status ∈ {0, 1}.
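A minimal sketch of how such records might be encoded (field and variable names are our own, not from the Cisco tooling):

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionRecord:
    """One historical record Fde_Tij: a pass/fail status and, for failures,
    the configurations (feature sets) held responsible."""
    status: int                                    # 0 = pass, 1 = fail
    failing_configs: list = field(default_factory=list)

# Hypothetical history for one test: a pass, then a failure blamed on {f1, f2}.
Fde = {"T1": [ExecutionRecord(0), ExecutionRecord(1, [{"f1", "f2"}])]}
```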
CBC is developed and evolves incrementally and iteratively, following a CI practice, where much of the functionality is shared between different products. TS is developed to test all CBCProd, and because of the shared functionality between products, the same parts of CBC are executed multiple times. These conditions are illustrated in Figure 2. The five products in the figure consist of different functionality modules FS = {A, B, C, D, E, ..., N}. In the context of CBC, features include the audio protocol or video resolution of a conferencing call, or the network type, and they are reused across CBC. Tests in TS are highly-interleaved and specified with a varying degree of feature coverage, such as single-feature coverage:
Cov(T5) = {I}, Cov(T6) = {N}, Cov(T7) = {N}
[Figure 2 shows seven tests T1–T7 over five products built from features A–N, with Cov(T1) = {B,D}, Cov(T2) = {C,D}, Cov(T3) = {A,K}, Cov(T4) = {A,K,L,M}, Cov(T5) = {I}, Cov(T6) = {N}, Cov(T7) = {N}.]
FIGURE 2 Test redundancy in HCS: Test cases T1..T7 cover a set of features A..N in different combinations across different products, causing partial or total test overlap.
or multiple-feature coverage:
Cov(T1) = {B, D}, Cov(T2) = {C, D}, Cov(T3) = {A, K}, Cov(T4) = {A, K, L, M}.
TS evolves and grows, accumulating tests with different coverage criteria, which over time increases the risk of redundant testing. Under a single-feature coverage criterion, tests T1 and T2 overlap, as Cov(T1) = {B, D} and Cov(T2) = {C, D}, contributing to partial redundancy in TS. We say that T1 is partially redundant with respect to T2, and vice versa. Assuming that TS = {T1, T2}, eliminating either T1 or T2 would leave a feature in CBC (B or C) untested. However, in a test suite containing multiple instances of partial redundancy for the same set of features, partially redundant tests can be eliminated. In contrast, {T3, T4} contributes to total redundancy in TS, as Cov(T3) = {A, K} is a proper subset of Cov(T4) = {A, K, L, M}. We say that T3 is totally redundant with respect to T4.
Whenever CBC is modified, introducing changes Ch = {Ch1, Ch2, ..., Chn}, the affected products in CBCProd are regression tested to ensure that the modifications do not negatively affect existing functionality. Regression tests are selected mainly manually from TS, based on change impact analysis and an intuitive assessment of test relevance for a given Chi. To be able to benefit from CI, test feedback needs to be produced as quickly as possible. One way to improve the agility of the test cycle is to automate the process of regression test selection (time savings compared to manual selection). Another way is to reduce the size of the regression test suite by automatically identifying and removing redundancy (time savings in test execution). In this paper, we focus on the latter, enabling more cost-effective regression testing of HCS in CI by identifying and eliminating redundancy caused by overlapping feature combinations across tests.
4 PROBLEM STATEMENT
Let us consider the target test coverage to be single-feature coverage. Then, a test Ti ∈ TS can be classified as:
Definition 1. Ti is totally redundant in TS if ∃Tj ∈ TS, i ≠ j, such that Cov(Ti) ⊆ Cov(Tj).
Definition 2. Ti is partially redundant in TS if ∃Tj ∈ TS, i ≠ j, such that Cov(Ti) ≠ Cov(Tj) and Cov(Ti) ∩ Cov(Tj) ≠ ∅.
Definition 3. Ti is unique if ¬∃Tj ∈ TS, i ≠ j, such that Cov(Ti) = Cov(Tj), and Ti is neither partially nor totally redundant with respect to any Tj.
From the example given in Figure 2, T3 is totally redundant with respect to T4, because all feature combinations covered by T3 are also covered by T4. T5 is unique, because it uniquely covers feature I. T1 and T2 are mutually partially redundant with respect to single-feature coverage, as they both cover feature D but otherwise cover different features.
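The three definitions, applied to the Figure 2 coverage data, can be sketched as:

```python
def classify(ti, ts_cov):
    """Classify ti per Definitions 1-3 under single-feature coverage."""
    cov_i = ts_cov[ti]
    others = [cov for t, cov in ts_cov.items() if t != ti]
    if any(cov_i <= cov for cov in others):   # Def. 1: subset of another test
        return "totally redundant"
    if any(cov_i & cov for cov in others):    # Def. 2: overlaps another test
        return "partially redundant"
    return "unique"                           # Def. 3

# Coverage data from Figure 2 (T6 and T7 omitted for brevity).
cov = {"T1": {"B", "D"}, "T2": {"C", "D"}, "T3": {"A", "K"},
       "T4": {"A", "K", "L", "M"}, "T5": {"I"}}
```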
As observed previously [10], using coverage-based metrics alone does not yield good performance in test redundancy detection. However, supplementing coverage metrics with information about the fault detection effectiveness of tests can provide more reliable test redundancy detection. Furthermore, assuming a fixed feature-combination coverage as the target test coverage criterion is arguably an oversimplification for the vast majority of realistic and complex HCS. Features are typically integrated within different products with varying degrees of interaction. Thus, effective regression testing needs to be able to detect the cases where the same functionality (features) is executed as part of different feature combinations across different tests and products.
Based on the presented concepts, we formulate the problem of test redundancy reduction for highly-interleaved tests as
follows:
For a given software change Ch, an existing test suite TS, and fault detection records Fde_Ti for each test Ti ∈ TS based on its execution history, select TS′ ⊆ TS as a regression test suite with the following properties:
Property 1. ∀T ∈ TS′, T includes features affected by Ch.
Property 2. TS′ ⊆ TS ∖ {totally redundant tests}, i.e., TS′ excludes every totally redundant test in TS.
Property 3. {unique tests} ⊆ TS′, i.e., TS′ includes all unique tests in TS.
Property 4. TS′ ⊆ TS ∖ {ineffective tests}, where ineffective tests are tests in TS that do not contribute to increased fault detection effectiveness of TS′.
The major challenge in solving the problem formulated above lies in identifying ineffective test cases. The following section explains when a partially redundant test case is considered effective and when it is considered ineffective.
5 REDUNDANCY DETECTION AND REDUCTION
Given Definition 2 of a partially redundant test case, for {Ti, Tj ∈ TS ∣ i ≠ j, Cov(Ti) ≠ Cov(Tj), Cov(Ti) ∩ Cov(Tj) ≠ ∅}, we want to determine whether there exists {Tk ∈ TS ∣ k ≠ i, k ≠ j, Cov(Ti) ∖ Cov(Tj) ⊂ Cov(Tk)}. Then:
1) If true, Ti is an ineffective partially redundant test that can be eliminated from TS without decreasing the fault detection effectiveness of TS, as Cov(Tj) ∪ Cov(Tk) ⊇ Cov(Ti).
2) If false, Ti is a partially redundant test that executes a feature set not covered by any other test in TS.
This unique set could be a single feature or a combination of features. Since our work concerns HCS with a high degree of feature interaction, together with a test suite developed over time to cover these various interactions, combinations with varying (and high) degrees of feature interaction will usually exist in a test suite.
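The existence check above reduces to a set difference followed by a proper-subset test; a minimal sketch with hypothetical coverage data (these test names are not the Figure 2 tests):

```python
def ineffective_partially_redundant(ti, tj, ts_cov):
    """True if the features ti covers beyond tj are a proper subset of some
    other test's coverage, so eliminating ti loses no feature coverage."""
    residue = ts_cov[ti] - ts_cov[tj]
    return any(residue < ts_cov[tk]        # proper subset, as in the condition
               for tk in ts_cov if tk not in (ti, tj))

# T1's contribution beyond T2 is {B}, which T3 = {B, E} also covers,
# so T1 is ineffective; T2's contribution beyond T1 is {C}, covered nowhere else.
cov = {"T1": {"B", "D"}, "T2": {"C", "D"}, "T3": {"B", "E"}}
```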
The key step of our approach for reducing redundancy in regression test suites is distinguishing between effective and ineffective partially redundant test cases. To determine whether Ti is an effective or ineffective partially redundant test case, we make the following hypothesis: test cases covering combinations that have historically exhibited good fault-revealing performance can be classified as effective, and otherwise as ineffective. The hypothesis is based on the finding, suggested by previous studies [11, 12, 13, 1], that test execution history can help improve the effectiveness of regression testing. Therefore, to optimally detect and minimize test redundancy, we combine coverage analysis with information indicating the fault detection capability of tests exhibited in past test executions for specific configurations.
In particular, Fde(Ti) = {Fde_Ti1, Fde_Ti2, ..., Fde_Tij} denotes the execution history of Ti over j test runs, where each failure is associated with one or more failing configurations Cfail(Fde_Tij) = {C1, C2, ..., Cm}, where Ci = {f1, f2, ..., fn} is a set of features, and status ∈ {0, 1}. Each feature is assigned a fault_index, which denotes its historical fault proneness (the rate at which it is part of failing test cases). When a test case fails, all features covered by the failed test case are assigned fault indexes. A feature can be assigned fault indexes originating only from distinct failures; otherwise, that feature would become overemphasized. If a single feature led to the failure, the fault index is 1; if the failure occurred in a feature interaction, the fault index is assigned the value 1/n, where n is the number of features contributing to the failure. Based on fault indexes, we build statistical
regression models that predict whether a test case Ti is likely to detect a failure in the upcoming execution. We use the C4.5 algorithm for building regression trees [14]. The regression model predicts a value in the range [0, 1], where values in [0, 0.5) mean that Ti is ineffective (not likely to detect a failure), and values in [0.5, 1] mean that Ti is effective (likely to detect a failure). A simplified regression tree is illustrated in Figure 3. Nodes of the tree represent software features, which are the building blocks of test cases; therefore, a traversal from the root node to a leaf represents a test case. The numerical value under a leaf represents the predicted fault-detection effectiveness of the test case on the path from the root node to that leaf. The numerical values shown for the leaves in Figure 3 are selected for one arbitrary test case, for the purpose of illustration.
FIGURE 3 Simplified regression tree for the classification of effective and ineffective partially redundant test cases for highly
configurable video conferencing software.
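A hand-rolled sketch of the traversal behind Figure 3 (feature names, thresholds, and leaf values are illustrative assumptions; the paper builds the actual trees with C4.5):

```python
# Internal nodes test a feature's fault index; leaves hold the predicted
# fault detection effectiveness in [0, 1].
tree = {
    "feature": "video_resolution", "threshold": 0.5,
    "low": {"leaf": 0.2},                  # below 0.5 -> classified ineffective
    "high": {
        "feature": "network_protocol", "threshold": 0.5,
        "low": {"leaf": 0.6},
        "high": {"leaf": 0.9},             # 0.5 or above -> classified effective
    },
}

def predict(node, fault_indexes):
    """Traverse from the root to a leaf; the leaf predicts effectiveness."""
    while "leaf" not in node:
        value = fault_indexes.get(node["feature"], 0.0)
        node = node["high"] if value > node["threshold"] else node["low"]
    return node["leaf"]

score = predict(tree, {"video_resolution": 1.0, "network_protocol": 0.3})
effective = score >= 0.5                   # threshold from the text
```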
5.1 The Approach
The following algorithm describes the overall approach to test redundancy reduction based on coverage and history analysis:
FIGURE 4 Test suite properties: a) Ti totally redundant, b) Ti unique, c) Ti partially redundant.
6 EMPIRICAL EVALUATION
To evaluate the proposed approach for reducing ineffective test redundancy for continuous regression testing, we performed
extensive experiments using real-world test sets of two highly configurable software systems developed and tested in CI. In
addition, we used a large set of 110 artificial test sets created to emulate the test sets of the real HCS systems.
To evaluate the overall approach, we measure the following metrics: 1) fault-detection effectiveness, and 2) time-efficiency of
reduced regression test suites in comparison with i) existing industry practice of testing HCS in CI, ii) retest-all approach, and
iii) random test selection. To evaluate the performance of the classification model for predicting ineffective redundant test cases specifically, we use the following metrics: a) precision, b) recall, c) accuracy, and d) F-score.
The five experiments E1a/b-E4 address the following research questions respectively:
Algorithm 1 The algorithm for ineffective test redundancy reduction based on test coverage and history information
Input: The existing test suite 𝑇𝑆, and software changes 𝐶ℎ𝑖
Output: A regression test suite that satisfies the properties 𝑃1 − 𝑃4
Step 1:
Given the input, identify 𝑅𝑇𝑆′ as the set of tests affected by 𝐶ℎ𝑖, based on the associations between the features covered by a test and the software modules implementing these features, analogous to the associations between features and test cases. 𝑅𝑇𝑆′ represents an initial regression test suite that needs to be processed further to satisfy the properties 𝑃2, 𝑃3, and 𝑃4.
Step 2:
Analyze total redundancy in 𝑅𝑇𝑆′ to satisfy Property 𝑃2. Identify tests whose covered set of features is entirely covered by the other tests, 𝐶𝑜𝑣(𝑇𝑖) ⊆ 𝐶𝑜𝑣({𝑅𝑇𝑆′∖𝑇𝑖}) (Figure 4 a), and remove such tests from 𝑅𝑇𝑆′.
Step 3:
Look for tests in 𝑅𝑇𝑆′ that cover features not covered by any other test in 𝑅𝑇𝑆′ (Figure 4 b), extract these tests from 𝑅𝑇𝑆′, and construct 𝑅𝑇𝑆, satisfying Property 𝑃3.
Step 4:
At this point, 𝑅𝑇𝑆′ consists of partially redundant tests (Figure 4 c).
To partition effective and ineffective tests, for each 𝑇𝑖 ∈ 𝑅𝑇𝑆′ obtain the execution history 𝐹𝑑𝑒(𝑇𝑖) = {𝐹𝑑𝑒_𝑇𝑖1, 𝐹𝑑𝑒_𝑇𝑖2, ..., 𝐹𝑑𝑒_𝑇𝑖𝑗}, with 𝑗 = 6. This value has been experimentally evaluated as the optimal size of the history window for test optimization for HCS13. Longer windows, which capture more of the older failure information, were shown to provide a less accurate indication of potentially failing test cases.
Build a regression tree that predicts the likelihood of a test case detecting failures in upcoming test executions, as a numerical value in the range [0, 1].
After obtaining predictions for the fault-detection effectiveness of all partially redundant tests in 𝑅𝑇𝑆′, sort 𝑅𝑇𝑆′ such that tests with a higher likelihood of detecting failures come first in the sequence.
If two or more test cases are assigned equal weights, check their failure rate 𝐹𝑟 = 𝑡𝑜𝑡𝑎𝑙_𝑛𝑢𝑚_𝑜𝑓_𝑓𝑎𝑖𝑙𝑢𝑟𝑒𝑠 ∕ 𝑡𝑜𝑡𝑎𝑙_𝑛𝑢𝑚_𝑜𝑓_𝑒𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛𝑠, and give higher priority to the tests that have failed more frequently. If two or more of these test cases have the same failure rate, assign them different weights at random.
Step 5:
Append to 𝑅𝑇𝑆 the top 𝑛 test cases from 𝑅𝑇𝑆′ (those with the highest priority) to fill the available test budget for 𝑅𝑇𝑆. The test budget is defined for every CI test run; 𝑛 is determined by adding up the test execution times of tests from 𝑅𝑇𝑆 and 𝑅𝑇𝑆′, as obtained from their historical execution records.
At this point, 𝑅𝑇𝑆 is the selected regression test suite that satisfies all defined properties 𝑃1 − 𝑃4, as the solution of the coverage- and history-based test redundancy reduction problem.
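Steps 4 and 5 amount to a weighted sort with a failure-rate tie-break, followed by a budget fill. A simplified sketch, not the paper's implementation: the `weight` field stands in for the regression tree's prediction, and all names and values are illustrative:

```python
import random

def fill_budget(partial, rts_time, budget):
    """Rank partially redundant tests and fill the remaining CI budget.

    partial: list of dicts with a predicted `weight` (regression-tree
    output in [0, 1]), historical `failures`/`executions` counts, and
    execution `time`. rts_time: time already consumed by the unique
    tests in RTS. Returns the names of the appended tests, in order.
    """
    def priority(t):
        failure_rate = t["failures"] / t["executions"]   # tie-break 1
        return (t["weight"], failure_rate, random.random())  # tie-break 2
    ranked = sorted(partial, key=priority, reverse=True)
    selected, used = [], rts_time
    for t in ranked:
        if used + t["time"] > budget:   # take the top-n prefix only
            break
        selected.append(t["name"])
        used += t["time"]
    return selected
```

With a budget of 12 time units and three candidate tests of 5 units each, only the two highest-priority tests fit.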
RQ1 Can the reduced regression test suites (based on the proposed approach for redundancy reduction) improve time-efficiency
in regression testing compared to industry practice of testing HCS in CI, and what is the effect of reduction on fault-
detection effectiveness?
RQ2 Can the reduced regression test suites improve time-efficiency in regression testing compared to retest-all approach, and
what is the effect of reduction on fault-detection effectiveness?
RQ3 Can the reduced regression test suites improve fault detection effectiveness in regression testing compared to randomly
selected regression tests, when controlling for the total test suite execution time?
RQ4 What is the performance of the classification model for predicting ineffective partially redundant test cases?
The proposed approach combines coverage-based test reduction for lower test redundancy and history-based test prioritization
for increased fault-detection effectiveness. As such, we performed two additional experiments where we compare the proposed
approach with a state-of-the-art test reduction approach and a state-of-the-art test prioritization approach.
The remainder of this section describes the test sets in more detail, presents the experiment methodology and measures, and
finally presents and discusses the experiment results.
6.1 Experiment Test Sets
We evaluated the proposed approach using real test suites of two industrial HCS from a video conferencing domain and a mobile
application domain. In addition, to perform a more extensive evaluation of the approach on larger test instances with a varied number of test cases and requirements, as well as varied failure rates and degrees of feature interaction coverage across test suites, we use 110 artificial test suites created to resemble the realistic test suites.
6.1.1 Industrial Test Sets
Video Conferencing Software. One industrial experiment subject is the video conferencing software system 𝐶𝐵𝐶 introduced in Section 3 (Running Example). 𝐶𝐵𝐶 contains a test suite 𝐵𝐶𝑇𝑆 consisting of 460 test cases, which cover a set of 75 features. Each test case contains historical execution records for the last ten runs. The historical information includes test result status, execution duration, required resources, test identifiers, tested software version, and execution schedulers. In total, the execution history contains the results of 4600 executions (460 tests * 10 consecutive executions per test). The minimum degree of feature interaction for tests in 𝐵𝐶𝑇𝑆 varies from 1 (single features) to 3 (combinations of 3 features), and the maximum degree of feature interaction varies from 4 to 7.
Mobile Software. The other industrial experiment subject is a cross-platform mobile application software with a test suite 𝑀𝐴𝑇𝑆 containing 500 test cases. The mobile software is highly configurable, to enable adaptation to a wide range of mobile platform variants with different screen sizes, resolutions, communication protocols, and CPUs. Test history is available for the test suite for the last ten test executions, and it includes test result status, execution duration, required resources, and test identifiers. The test history contains 5000 execution records (500 tests * 10 consecutive executions per test). The minimum degree of feature interaction for tests in 𝑀𝐴𝑇𝑆 varies from 2 (combinations of 2 features) to 3 (combinations of 3 features), and the maximum degree of feature interaction varies from 4 to 9.
Table 1 provides summary information about the industrial test suites.
TABLE 1 INDUSTRIAL EXPERIMENT SUBJECTS BCTS AND MATS.
Video conferencing test suite (BCTS) Mobile application test suite (MATS)
Number of test cases 460 500
Number of test history records 4600 5000
Min feature interaction coverage 1-3 2-3
Max feature interaction coverage 4-7 4-9
6.1.2 Artificial Test Sets
The artificial experiment subjects are developed to resemble the described industrial test suite 𝐵𝐶𝑇𝑆 as follows. We created 110 test suites, larger in terms of the number of test cases and covered features compared to 𝐵𝐶𝑇𝑆. The artificial suites are grouped by size into 11 groups {𝐴𝑇𝑆1, 𝐴𝑇𝑆2, ..., 𝐴𝑇𝑆11}, and each 𝐴𝑇𝑆𝑖 consists of ten test suites 𝐴𝑇𝑆𝑖 = {𝐴𝑇𝑆𝑖_1, 𝐴𝑇𝑆𝑖_2, ..., 𝐴𝑇𝑆𝑖_10}. The test suites cover a set of 150 features. Association tags are established between features and test cases, maintaining the feature interaction coverage found in 𝐵𝐶𝑇𝑆. The minimum degree of feature interaction per 𝐴𝑇𝑆𝑖 varies from 1 (single features) to 3 (combinations of 3 features), and the maximum degree of feature interaction varies from 3 to 7 features. To enable a more thorough evaluation, the size of each test suite 𝐴𝑇𝑆𝑖_𝑗 is increased compared to 𝐵𝐶𝑇𝑆: the sizes vary from 500 to 1000 test cases with an increment of 50. Each 𝐴𝑇𝑆𝑖_𝑗, 𝑖 = 1..11, 𝑗 = 1..10, contains a test execution history that was developed according to the execution history of 𝐵𝐶𝑇𝑆. The history consists of test identifiers, software versions, test result status (pass/fail), and test execution duration, and is developed for six consecutive runs. For all artificial test suites, failure rates and the degrees of feature interaction are developed to resemble those observed in 𝐵𝐶𝑇𝑆. In particular, for each 𝐴𝑇𝑆𝑖, failure rates range from 29% to 63%, with deviations from 2 to 11. Table 2 provides summary information about the artificial test suites.
TABLE 2 ARTIFICIAL EXPERIMENT SUBJECTS ATS1-ATS11. EACH REPRESENTS A SET OF 10 TEST SUITES OF THE SAME SIZE BUT WITH VARYING FAILURE RATE AND FEATURE INTERACTION COVERAGE.
ATS1 ATS2 ATS3 ATS4 ATS5 ATS6 ATS7 ATS8 ATS9 ATS10 ATS11
Number of test cases 500 550 600 650 700 750 800 850 900 950 1000
Min % of failing test cases 29 29 29 29 29 29 29 29 29 29 29
Max % of failing test cases 63 63 63 63 63 63 63 63 63 63 63
Avg % of failing test cases 35 29 43 58 33 37 45 63 58 36 51
Avg min feature inter. cov. 1 2 1 2 1 2 3 2 1 2 1
Avg max feature inter. cov. 5 6 6 5 7 4 4 3 5 7 5
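A pass/fail execution history in the style described can be emulated in a few lines. This is a simplified sketch of the kind of generator involved, not the authors' generator; the parameter defaults are illustrative values drawn from the ranges reported above:

```python
import random

def make_history(num_tests, runs=6, fail_lo=0.29, fail_hi=0.63, seed=0):
    """Emulate a pass/fail test execution history.

    For each of `runs` consecutive runs, a failure rate is drawn from
    the observed 29-63% range and each of `num_tests` tests fails with
    that probability. Returns a list of runs, each a list of statuses.
    """
    rng = random.Random(seed)   # seeded for reproducibility
    history = []
    for _ in range(runs):
        rate = rng.uniform(fail_lo, fail_hi)
        history.append(["fail" if rng.random() < rate else "pass"
                        for _ in range(num_tests)])
    return history
```

Calling `make_history(500)` yields six runs of 500 statuses each, with per-run failure fractions scattered across the configured range.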
6.2 Measures and Methodology
Measures. We evaluate the proposed approach using two metrics, fault-detection effectiveness and time-efficiency, which are
in our context defined as follows:
1. Fault-detection effectiveness of a reduced test suite is the ratio of the number of non-repeated faults detected by the reduced test suite to the number of non-repeated faults detected by its original (non-reduced) test suite. The closer the value is to 1, the better the fault-detection effectiveness of the test suite. Non-repeated faults are faults counted only once, regardless of the number of test cases that detect them.
2. Time-efficiency of a reduced test suite is the reduction of the execution time of the test suite compared to the execution
time of its original (non-reduced) test suite. In this work, we focus on the reduction of test suite execution time as a
measure of efficiency, rather than the test suite size, since test cases can have varying execution time.
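The two measures above reduce to simple ratios; a reference sketch (fault identifiers are illustrative, and repeated detections of the same fault are deduplicated, matching the non-repeated-faults definition):

```python
def fault_detection_effectiveness(reduced_faults, original_faults):
    """Ratio of distinct faults found by the reduced suite to distinct
    faults found by the original suite (1.0 means no loss)."""
    return len(set(reduced_faults)) / len(set(original_faults))

def time_efficiency(reduced_times, original_times):
    """Fractional reduction in total test suite execution time."""
    return 1 - sum(reduced_times) / sum(original_times)
```

For example, a reduced suite that finds 2 of 3 distinct faults has effectiveness 2/3, and one that needs 10 of 20 time units has time-efficiency 0.5.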
Methodology. The first experiment E1a is designed to answer RQ1 using the video conferencing test suites. We compare the proposed method for test redundancy reduction (RHS) with the industry practice of testing video conferencing systems (CIP), described in Section 3, in terms of the total test execution time and fault-detection effectiveness of the regression suite. In the first step of the experiment, we retrieve modified features from history, and select a set of tests from 𝐵𝐶𝑇𝑆 affected by the modifications. The resulting test suite represents the initial regression test suite, which we refer to as 𝐵𝐶_𝑅𝐻𝑆. Then we apply the RHS approach to 𝐵𝐶_𝑅𝐻𝑆, and, using six historical execution records of 𝐵𝐶𝑇𝑆, we analyze which feature combinations showed good fault-detection performance in the past. Based on this information, we eliminate totally redundant and ineffective partially redundant tests from 𝐵𝐶_𝑅𝐻𝑆, creating the final regression test suite. Further, we measure the loss of fault detection of the final "non-redundant" test suite, as the percentage of failures that were caused by configurations not covered by tests in 𝐵𝐶_𝑅𝐻𝑆, due to the reductions, but covered by 𝐵𝐶𝑇𝑆. Finally, for the final "non-redundant" test suite, we measure the percentage reduction of the test suite size compared to the size of 𝐵𝐶𝑇𝑆. As we have available test execution history from ten consecutive runs of 𝐵𝐶𝑇𝑆, and our approach uses a history window of size 6 (explained previously), we run the experiment 5 times. We start from the oldest six historical execution records, and in every experiment replace the oldest record with a newer one, until we have used all the available records.
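The sliding-window setup above can be written down directly; with ten records and a window of six, it yields exactly the five experiment runs described:

```python
def history_windows(records, size=6):
    """All consecutive windows of `size` records, oldest first.

    Each successive window drops the oldest record and adds the next
    newer one, until every available record has been used."""
    return [records[i:i + size] for i in range(len(records) - size + 1)]
```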
The second experiment E1b is designed to answer RQ1 using the mobile application test suites. In this experiment we repeat the methodology used in E1a, replacing 𝐵𝐶𝑇𝑆 with 𝑀𝐴𝑇𝑆.
The third experiment E2 addresses RQ2. We compare the proposed approach for test redundancy reduction based on coverage and history analysis (RHS) with the retest-all approach (RA) in terms of the total test execution time and fault-detection effectiveness of a non-redundant regression suite. We compare with RA because it is a common approach to regression testing in practice, used instead of tedious manual regression test selection/reduction. However, to allow for a fairer comparison, we implemented a modified retest-all (MRA), which subsets the original test suite that would normally be executed by RA such that only the tests affected by a change are selected as regression tests. This was implemented using the association tags between the features covered by a test and the software modules implementing these features, similarly to the associations existing between test cases and their covering features.
In the first step of the experiment, we randomly choose 15 features as modified (10% of the total number of features in the artificial test suites). Then, from each test suite 𝐴𝑇𝑆𝑖_𝑗, 𝑖 = 1..11, 𝑗 = 1..10, we automatically select a set of tests affected by the changed features. The selection is based on the existing association tags between test cases and their covering features. The resulting test suites represent the initial regression test suites, which we refer to as 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗, 𝑖 = 1..11, 𝑗 = 1..10. Then we apply the RHS approach to each 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗. We use six consecutive historical execution records of each corresponding test suite indicating which feature combinations exhibited good fault-revealing capability, and, according to the described approach, from each 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗 we eliminate any totally redundant tests, as well as tests contributing to partial ineffective redundancy. The resulting test suites represent the final selected regression test suites, which we refer to as 𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗, 𝑖 = 1..11, 𝑗 = 1..10. Finally, for each 𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗 we measure the loss of fault detection compared to the fault-detection capability of its corresponding 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗. The loss is measured as the percentage of failures that were caused by configurations not covered by test cases in 𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗, due to the reduction, but covered by 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗. Then for each 𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗 we measure the percentage reduction of the test suite size compared to the size of its corresponding 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗.
The fourth experiment E3 addresses RQ3. We compare the proposed approach for test redundancy reduction (RHS) with random test selection (RS) in terms of fault-detection effectiveness, while controlling for the total execution time of the regression test suite. The motivation to compare with RS comes from the observation that random selection is sometimes used as an alternative to automated regression test reduction, to reduce the cost of regression testing in cases where running the whole regression suite selected after code modifications would exceed the test budget.
In the first step of E3, we obtain 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗 and 𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗, 𝑖 = 1..11, 𝑗 = 1..10, created in E2, and for each 𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗 we measure the total test execution time 𝐸𝑇(𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗). Then we start randomly selecting tests from 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗, and as each test is selected we accumulate its execution time into 𝐸𝑇(𝐼𝑛𝑖𝑡𝑅𝑆𝑖_𝑗). We continue randomly selecting tests until 𝐸𝑇(𝐼𝑛𝑖𝑡𝑅𝑆𝑖_𝑗) equals 𝐸𝑇(𝐹𝑖𝑛𝑅𝐻𝑆𝑖_𝑗). We refer to the resulting test suites as 𝐹𝑖𝑛𝑅𝑆𝑖_𝑗. Then for each 𝐹𝑖𝑛𝑅𝑆𝑖_𝑗 we measure the loss of fault detection compared to the fault-detection capability of its corresponding 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗. The loss is measured as the percentage of failures that were caused by configurations not covered by test cases in 𝐹𝑖𝑛𝑅𝑆𝑖_𝑗 but covered by 𝐼𝑛𝑖𝑡𝑅𝐻𝑆𝑖_𝑗. Finally, we compare 𝑅𝐻𝑆 and 𝑅𝑆 in terms of the loss of fault detection effectiveness. To account for randomness in RS, we repeat each experiment 100 times and determine the statistical significance of the results using non-parametric Mann-Whitney U-tests, with a significance level of 0.01.
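The budget-matched random baseline (RS) can be sketched as drawing tests without replacement until the accumulated execution time reaches that of the reduced suite. A simplified sketch, not the authors' implementation (names are illustrative):

```python
import random

def random_baseline(tests, target_time, seed=None):
    """tests: dict mapping test name -> execution time.

    Randomly selects tests without replacement, accumulating their
    execution time, until the total reaches `target_time` (the
    execution time of the RHS-reduced suite it is matched against).
    Returns the selected test names and the accumulated time.
    """
    rng = random.Random(seed)
    pool = list(tests)
    rng.shuffle(pool)                 # random selection order
    chosen, used = [], 0.0
    for name in pool:
        if used >= target_time:       # budget matched; stop
            break
        chosen.append(name)
        used += tests[name]
    return chosen, used
```

With uniform per-test times the selection size is fully determined by the budget; with varying times the match is approximate, stopping at the first test that reaches the target.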
The fifth experiment E4 addresses RQ4 by measuring the quality of the predictions of the effectiveness of partially redundant test cases produced by our regression model. The quality is evaluated using the confusion matrix shown in Figure 5. Since the model's task is to predict ineffective test cases, that is the positive class: true positives (TP) mean that the model predicted that a test case is ineffective and it actually is ineffective. False positives (FP) mean that the model predicted that a test case is ineffective, while it actually is effective. False negatives (FN) mean that the model predicted that a test case is effective, while it actually is ineffective. True negatives (TN) mean that the model predicted that a test case is effective and it actually is effective. We used the following four measures in the evaluation15:
1. Precision is the ratio of the number of test cases correctly predicted as ineffective to the total number of test cases predicted as ineffective. Precision = 1 means that every test case that was predicted to be ineffective actually is ineffective.
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃 ∕ (𝑇𝑃 + 𝐹𝑃)
2. Recall is the ratio of the number of test cases correctly predicted as ineffective to the total number of test cases that are actually ineffective. Recall = 1 means that every test case that is ineffective was also predicted to be ineffective.
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃 ∕ (𝑇𝑃 + 𝐹𝑁)
3. Accuracy is the ratio of the number of correct predictions to the total number of test cases. Accuracy = 1 means that the classification model did not make any mistakes.
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = (𝑇𝑃 + 𝑇𝑁) ∕ (𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁)
4. F-score is defined as the harmonic mean of Precision and Recall.
𝐹-𝑠𝑐𝑜𝑟𝑒 = 2 ∗ 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙 ∕ (𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙)
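The four measures follow mechanically from the confusion-matrix counts; a small reference helper:

```python
def prediction_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F-score from confusion-matrix
    counts (tp, fp, fn, tn), with ineffective as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_score
```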
FIGURE 5 Confusion matrix for assessing the quality of predictions of the effectiveness of partially redundant test cases.

In E4, we use the artificial test suites 𝐴𝑇𝑆 with 5-fold cross-validation to evaluate the prediction performance in the following manner. Each 𝐴𝑇𝑆𝑖 is first randomly divided into five sets of tests. Then, four sets are selected for model training and
the fifth set is used for testing. The process is repeated five times, and each time a different set is selected for testing. This whole
process is run 10 times to account for randomness.
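The repeated cross-validation protocol can be sketched as an index shuffle followed by a round-robin split into folds. A simplified sketch, not the authors' implementation:

```python
import random

def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold CV.

    For each of `repeats` rounds, the n indices are reshuffled and
    split into k folds; every fold serves once as the test set."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]   # round-robin split
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test
```

With n = 100, k = 5, and 10 repeats this yields 50 train/test splits, each with an 80/20 partition of the indices.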
7 RESULTS AND ANALYSIS
This section contains the results of the experiments E1a, E1b, E2, E3, and E4, which address the research questions RQ1 (E1a and E1b), RQ2, RQ3, and RQ4, respectively.
7.1 Industry Practice
In experiment E1a, we evaluate whether RHS is able to reduce regression test feedback while achieving equal fault-detection effectiveness compared to the CIP of testing video conferencing systems. In terms of fault-detection effectiveness, RHS exhibited performance equal to CIP in all five experiments. In terms of total regression test execution time, RHS showed improvements over CIP of 30% on average (15% in the worst case, and 39% in the best case). The time-efficiency results are shown in Figure 6. The x-axis represents the five experiments corresponding to different historical execution data, and the y-axis denotes the percentage of the test suite's total execution time, with CIP as the baseline (100%).
FIGURE 6 Percentage reduction of regression test execution time for RHS compared to CIP for video conferencing software (BCTS1-BCTS5; left bars: RHS, right bars: CIP, baseline: CIP = 100%).
In experiment E1b, we evaluate whether RHS is able to reduce regression test feedback while achieving equal fault-detection effectiveness compared to the CIP of testing mobile application software. In terms of fault-detection effectiveness, RHS exhibited performance equal to CIP in all five experiments. In terms of total regression test execution time, RHS showed improvements over CIP of 39% on average. The time-efficiency results are shown in Figure 7. The x-axis represents the five experiments corresponding to different historical execution data, and the y-axis denotes the percentage of the test suite's total execution time, with CIP as the baseline (100%).
FIGURE 7 Percentage reduction of regression test execution time for RHS compared to CIP for mobile application software (MATS1-MATS5; left bars: RHS, right bars: CIP, baseline: CIP = 100%).
In summary, the results of the experiment E1 indicate that the proposed approach for redundancy reduction based on coverage
and history analysis successfully reduces regression test feedback compared to industry practice, without compromising fault-
detection effectiveness.
7.2 Time Effectiveness
Figure 8 shows the time-efficiency of RHS compared to MRA for each subject. The x-axis represents the test suites, and the y-axis shows the percentage reduction of the total regression test execution time for RHS compared to MRA. The results show that RHS can reduce the total regression test execution time by 31% on average compared to MRA. In the best case, using RHS reduced the execution time by 45%, and in the worst case by 15%. Higher reduction is achieved for test suites with a higher degree of feature interaction, and vice versa. This is because many of the feature combinations with a lower interaction degree are subsumed within those covering larger combinations, which simplifies detecting redundant tests. We consider execution time to be directly proportional to the size of a test suite, because the tests have similar execution times. Shorter test suite execution time leads to shorter test feedback in continuous integration testing.
FIGURE 8 Percentage reduction of the total regression test execution time for RHS compared to MRA (boxplots per ATS1-ATS11, with min/max outliers marked).
7.3 Fault Detection Effectiveness
Figure 9 shows the fault detection effectiveness of RHS compared to MRA for each subject. The x-axis represents the test suites, and the y-axis shows the percentage of fault-detection effectiveness loss for RHS compared to MRA. The results indicate that RHS exhibits performance comparable to MRA, with less than 0.3% loss in fault-detection effectiveness on average. For about half of the experiment subjects, RHS achieved performance equal to MRA. In the worst case, RHS led to at most a 2% fault-detection loss, for 𝐴𝑇𝑆8.
FIGURE 9 Percentage reduction of fault detection effectiveness for RHS compared to MRA (boxplots per ATS1-ATS11, with min/max outliers marked).
In summary, the results of the experiment E2 show that the proposed approach for redundancy reduction based on coverage
and history analysis can reduce test feedback compared to an advanced retest-all approach without significantly compromising
the fault-detection effectiveness of a regression suite.
In experiment E3, we compare RHS and RS in terms of fault detection effectiveness, while controlling for the total test execution time of the regression test suites. We present the results of the experiment in Figure 10. The boxplot shows the distribution of the percentage of fault detection effectiveness loss of RS compared to RHS for each subject, averaged over 𝐴𝑇𝑆𝑖. The results indicate that RHS achieves a significant improvement in the fault-detection effectiveness of regression test suites over RS. RHS consistently exhibits better performance across all subjects, producing test suites capable of detecting 80% more faults on average compared to RS, with a deviation of about 10 on average. The Mann-Whitney U-tests showed that the improvements in fault-detection effectiveness are statistically significant.
FIGURE 10 Distribution of the percentage of fault detection effectiveness loss of RS compared to RHS (boxplots per ATS1-ATS11, with min/max outliers marked).
In summary, the results of the experiment E3 indicate that the proposed approach for redundancy reduction based on coverage
and history analysis can significantly improve fault detection effectiveness of a test suite compared to random test selection, for
an equal test suite execution time.
7.4 Prediction Performance
In experiment E4, we evaluate the performance of the regression model in predicting the fault-detection effectiveness of partially redundant test cases. The results for the four measures (precision, recall, accuracy, and F-score) are shown in Table 3. The mean and standard deviation are shown for each of the 11 groups 𝐴𝑇𝑆𝑖. A mean closer to 1 indicates better prediction performance of our regression model. The results in Table 3 show consistent prediction performance across all 11 test suite groups. The overall precision is in the range 0.88-0.95, the overall recall is in the range 0.87-1.00, the overall accuracy is in the range 0.85-0.93, and the overall F-score is in the range 0.85-0.94. These results demonstrate good performance of
the regression model in detecting ineffective partially redundant test cases, which suggests that the proposed approach for test redundancy reduction can be a useful and reliable tool for reducing the effort and improving the quality of continuous integration testing.
TABLE 3 PREDICTION PERFORMANCE MEASURES FOR THE ARTIFICIAL TEST SUITES.
Precision Recall Accuracy F-score
Mean SDev Mean SDev Mean SDev Mean SDev
ATS_1 0.90 0.04 0.87 0.06 0.85 0.25 0.91 0.08
ATS_2 0.89 0.05 0.90 0.29 0.90 0.11 0.90 0.27
ATS_3 0.88 0.01 0.95 0.03 0.93 0.63 0.94 0.06
ATS_4 0.94 0.25 0.94 0.18 0.90 0.37 0.91 0.23
ATS_5 0.95 0.09 1.00 0.20 0.89 0.20 0.91 0.33
ATS_6 0.95 0.13 0.90 0.25 0.91 0.04 0.89 0.09
ATS_7 0.89 0.14 0.95 0.19 0.92 0.05 0.92 0.56
ATS_8 0.95 0.63 0.96 0.17 0.89 0.65 0.93 0.03
ATS_9 0.87 0.09 0.90 0.05 0.88 0.19 0.85 0.05
ATS_10 0.92 0.02 0.93 0.29 0.92 0.15 0.87 0.20
ATS_11 0.91 0.20 0.90 0.21 0.90 0.24 0.90 0.53
8 THREATS TO VALIDITY
External Validity: A threat to the external validity of the experimental results relates to the possibility that the experiment subjects are not sufficiently representative. In this paper, we reduce this threat by evaluating the approach on two distinct industrial data sets from two different HCS domains. In addition, to provide more variety in the test data, we used larger artificial subjects resembling the realistic one, while varying parameters such as the number of test cases, the degree of feature interaction coverage, and the number of failing tests.
Internal Validity: A threat to internal validity relates to potential faults in our implementations of RHS, MRA, and RS. To mitigate this threat, we have carefully analyzed and thoroughly tested all implementations used in the experimentation. Another threat to internal validity may be the effect of randomness in the evaluation. To mitigate this threat, we repeated the experiments 100 times and used non-parametric Mann-Whitney U-tests, with a significance level of 0.01, to confirm the statistical significance of the results.
Construct Validity: A threat to construct validity is the choice of experimental measures. To mitigate this threat, we used
the measures commonly used in evaluation of regression test selection, such as execution time reduction and fault detection
effectiveness of a test suite, as well as measures commonly used to evaluate prediction performance, such as precision, recall,
accuracy, and F-score.
9 RELATED WORK
Our work touches upon and extends the following two principal research areas.
Test Redundancy. Existing approaches to test redundancy detection are based on different test coverage metrics, such as statement coverage16, branch coverage17, decision coverage, condition coverage, or path coverage. However, Koochakzadeh et al. report that coverage-based metrics are imprecise in test redundancy detection and should be combined with additional information10,18. Fraser et al. present an approach in which redundancy in tests generated with a model checker is identified based on the analysis of paths of the Kripke structure19. In a later stage, the algorithm modifies redundant parts of tests to eliminate redundancy. However, the authors report the high run-time complexity of the algorithm as a drawback. Jeffrey et al. propose to use a modified HGS heuristic20 for selectively retaining redundant tests in a suite based on two sets of test requirements, branch coverage and all-uses coverage, while preserving fault-detection effectiveness21,22. However, the approach is limited by the large size of the reduced test suites. Test redundancy has also been studied in the context of test suite minimization techniques6,16,17,23,24,25,26,27, which aim to reduce the size of the original test suite while preserving its fault detection effectiveness. However, it has been shown that test minimization techniques often compromise the fault detection effectiveness of a test suite. Rothermel et al. analyze the effect of test suite minimization on fault detection capability, and show that minimizing a test suite can significantly lower its effectiveness in detecting faults17,28. Heimdahl et al. find that test suite reduction based on structural coverage exhibits a high loss in fault detection capability29. These findings imply that the performance of test suite reduction could be improved by more precisely identifying redundant tests. This is particularly important for HCS, where redundancy incurs high testing and maintenance costs. Unlike other approaches, our work uses classification trees in combination with test coverage metrics to improve the precision of identifying and reducing redundant test cases.
Regression Test Selection. Several approaches have been proposed for improving the effectiveness of regression test selection. Ren et al. use change impact analysis to identify regression tests affected by a change, by constructing a call graph for each test that consists of interdependent changes at the method level30. Orso et al. use change impact analysis and field data, although the approach is considered unsafe31. Other techniques include regression test selection based on specification changes32,33,34, as well as code changes32,35,36,37, or adaptive test prioritization38. Several researchers have proposed to use model-based techniques for regression test selection. Korel et al. use Extended Finite State Machine model dependence analysis for identifying modifications39. Andrews et al. apply model-based selective regression testing to a railroad crossing control system based on a behavioral model of the system and a behavioral test suite40. However, a common limitation of model-based regression testing techniques is the ambiguous quality of the resulting regression test suites, because these techniques use only a modified system model instead of the original one39. Despite the numerous approaches proposed for improving regression test selection, limited attention has been given to the problem of regression testing of highly configurable software developed in continuous integration, especially in the case of highly-interleaved tests.
10 CONCLUSIONS
In this paper, we address the problem of inefficient regression testing of HCS in continuous integration, which runs frequently
and therefore needs to be time-efficient. On the other hand, HCS environments typically imply complex and sizable test setups,
due to the large number of software features, which need to be tested in different combinations. To enable efficient regression
testing for software systems bound by these conflicting objectives, in this paper we present a practical approach for improving
the efficiency of regression test selection based on identifying and eliminating test redundancy. The approach uses coverage
metrics to classify tests as unique, totally redundant, or partially redundant. Then, for partial redundancy, regression models
are built to classify partially redundant test cases as likely or unlikely to detect failures, based on their historical fault detection effectiveness. Finally, based on this information, partially redundant tests are classified into effective and ineffective
tests, and a regression test suite is created consisting of only unique and effective partially redundant tests. The approach has been
evaluated using a large set of subjects, including one industrial HCS video conferencing system, one HCS mobile application
software, and eleven artificial HCS systems resembling the industrial one in structure and complexity. The results show that
the proposed approach improves current industry practice, enabling faster regression test feedback (better time-efficiency of
testing), without compromising fault-detection effectiveness. The results further show that our proposed approach can reduce
test feedback compared to an advanced retest-all approach without significantly compromising the fault-detection effectiveness
of a regression suite. The results also show that the proposed approach can significantly improve fault detection effectiveness
compared to random testing, which is often used to reduce the effort of regression testing in time- and resource-constrained
environments. Finally, the results show that the proposed approach for predicting fault detection effectiveness of tests achieves
good prediction performance, in terms of accuracy and precision.
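The selection pipeline summarized above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned regression model is replaced here by a simple historical failure-rate threshold, and all names (`Test`, `select_regression_suite`, the `0.2` threshold) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Test:
    name: str
    coverage: frozenset                           # items (features, requirements) this test exercises
    history: list = field(default_factory=list)   # past runs: 1 = detected a fault, 0 = passed

def classify_redundancy(tests):
    """Split tests into unique, totally redundant, and partially redundant."""
    unique, total, partial = [], [], []
    for t in tests:
        covered_elsewhere = set().union(*(u.coverage for u in tests if u is not t))
        if t.coverage.isdisjoint(covered_elsewhere):
            unique.append(t)        # nothing it covers is covered by any other test
        elif t.coverage <= covered_elsewhere:
            total.append(t)         # everything it covers is covered by other tests
        else:
            partial.append(t)       # overlaps with others, but has some unique coverage
    return unique, total, partial

def is_effective(test, threshold=0.2):
    """Stand-in for the learned model: keep tests with a high historical fault-detection rate."""
    if not test.history:
        return True                 # no history yet: keep the test conservatively
    return sum(test.history) / len(test.history) >= threshold

def select_regression_suite(tests):
    unique, total, partial = classify_redundancy(tests)
    # Totally redundant tests are dropped; partially redundant tests are kept only if effective.
    return unique + [t for t in partial if is_effective(t)]

# Toy example (hypothetical data):
t1 = Test("t1", frozenset({"a", "b"}), [1, 1, 0])   # partially redundant, effective
t2 = Test("t2", frozenset({"b", "c"}), [0, 0, 0])   # partially redundant, ineffective
t3 = Test("t3", frozenset({"b"}))                   # totally redundant
t4 = Test("t4", frozenset({"z"}))                   # unique coverage
selected = select_regression_suite([t1, t2, t3, t4])
print(sorted(t.name for t in selected))             # ['t1', 't4']
```

In the paper's setting, `is_effective` would be replaced by a regression model trained on historical test records; the coverage-based classification and the final suite composition follow the same shape.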
ACKNOWLEDGMENTS
This work is supported by the Certus SFI project, funded by the Norwegian Research Council. We also thank the people on the
Cisco QA team for their input and contributions to this work.