Using Bandit Algorithms for Selecting Feature Reduction
Techniques in Software Defect Prediction
Masateru Tsunoda
Kindai University
Higashi-osaka, Japan
tsunoda@info.kindai.ac.jp
Kwabena Ebo Bennin
Wageningen UR
Wageningen, Netherlands
kwabena.bennin@wur.nl
Akito Monden
Okayama University
Okayama, Japan
monden@okayama-u.ac.jp
Keitaro Nakasai
Kumamoto College, NIT
Kirishima, Japan
nakasai@kagoshima-ct.ac.jp
Koji Toda
Fukuoka Institute of Tech.
Fukuoka, Japan
toda@fit.ac.jp
Masataka Nagura
Nanzan University
Nagoya, Japan
nagura@nanzan-u.ac.jp
Amjed Tahir
Massey University
Palmerston North, NZ
a.tahir@massey.ac.nz
Kenichi Matsumoto
NAIST
Ikoma, Japan
matumoto@is.naist.jp
ABSTRACT
Background: Selecting a suitable feature reduction technique,
when building a defect prediction model, can be challenging.
Different techniques can result in the selection of different
independent variables which have an impact on the overall
performance of the prediction model. To help in the selection,
previous studies have assessed the impact of each feature reduction
technique using different datasets. However, there are many
reduction techniques, and therefore some of the well-known
techniques have not been assessed by those studies. Aim: The goal
of the study is to select a high-accuracy reduction technique from
several candidates without preliminary assessments. Method: We utilized a bandit algorithm (BA) to help select the best feature reduction technique from a list of candidates. To select the best technique, BA evaluates the prediction accuracy of the candidates, comparing the testing results of different modules with their prediction results. By substituting the reduction technique for the prediction method, BA can then be used to select the best reduction technique. In the experiment, we evaluated the performance of BA in selecting a suitable reduction technique. We performed cross-version defect prediction using 14 datasets. As feature reduction techniques, we used two assessed and two non-assessed techniques. Results: Using BA, the prediction accuracy was, on average, higher than or equivalent to that of existing approaches, i.e., techniques selected based on a prior assessment. Conclusions: BA can have a larger impact on improving prediction models by helping not only in selecting suitable models, but also in selecting suitable feature reduction techniques.
CCS CONCEPTS
• Software and its engineering → Software creation and management → Software verification and validation → Software defect analysis → Software testing and debugging
KEYWORDS
Software fault prediction, online optimization, variable selection,
external validity
ACM Reference format:
Masateru Tsunoda, Akito Monden, Koji Toda, Amjed Tahir, Kwabena Ebo
Bennin, Keitaro Nakasai, Masataka Nagura, and Kenichi Matsumoto. 2022.
Effect of Bandit Algorithms on Selecting Feature Reduction Techniques in
Software Defect Prediction. In Proceedings of the 19th International
Conference on Mining Software Repositories (MSR’22). ACM, Pittsburgh,
PA, USA, 12 pages. https://doi.org/10.1145/1234567890
1 Introduction
Software defect prediction is an important mechanism to find
defects earlier and thus enhance software quality. However, models
with low prediction accuracy and recall could negatively impact the
quality of the software. Therefore, to enhance the accuracy of
defect prediction models, previous studies [12][14][16][37] have
adopted several approaches including feature reduction techniques
(i.e., a technique to select independent variables of a defect
prediction model). Applying a feature reduction technique [17][23]
is one of the approaches to enhance the accuracy of such models.
We consider a scenario where predictive models are used during the software testing process, and one may wish to choose which technique to use during that process. When building a defect prediction model, there are often many candidate explanatory variables that can be used to predict defects, such as product and process metrics. Intuitively, feature
reduction techniques remove variables which are not considered to
enhance prediction accuracy. Depending on the reduction
technique, the explanatory variables included in the model can vary.
As a result, the accuracy of defect prediction can also vary.
Therefore, selecting a feature reduction technique is also important
in enhancing the accuracy of the defect prediction models [17][23].
There is a known external validity issue with software defect
prediction in general [12]. The issue means that even when the
accuracy of a model is high with the learning dataset (where a
prediction model is built), the accuracy is not always high on the
test dataset (the dataset where the prediction model is applied). Therefore, to avoid this degradation of accuracy, a preliminary assessment using many datasets is needed before selecting a suitable reduction technique.
Previous studies have assessed the effectiveness of different
feature reduction techniques in defect prediction [7][17][23] using
multiple datasets. However, those studies covered only a limited
number of reduction techniques, and there are non-assessed
techniques such as Akaike Information Criterion (AIC) [2]
stepwise feature selection and Bayesian information criterion (BIC)
[29] stepwise feature selection.
The goal of our study is to enhance defect prediction accuracy
by selecting a high-accuracy reduction technique from a number of
candidates. To achieve this, we included non-assessed techniques
as they might enhance the prediction accuracy. To ensure that we
select the best performing feature reduction technique, we adopt
and apply the Bandit Algorithm (BA) [32]. BA selects an optimal solution from candidates whose performance is unknown. To improve the
selection process of BA, this paper proposes a new approach of BA
in defect prediction called BANP (Bandit Algorithm to handle
Negative Prediction). In other words, our goal is not to compare
reduction techniques, but to evaluate BANP which is an alternative
approach to the extensive evaluations of reduction techniques.
In the past, BA has been applied in the context of defect
prediction to select the best prediction method (i.e., the method
with the highest accuracy) amongst candidates of methods [19]. In
general, BA evaluates prediction methods to select the best
prediction method for the test dataset. Therefore, BA can suppress
the influence of the external validity issue [12] of prediction
methods. Likewise, when we replace prediction methods with the
reduction techniques on BA, BA can select the reduction
techniques in the same manner, suppressing the external validity
issue.
In [19], BA evaluates prediction methods based on the
comparison of test and prediction results on tested modules. This
can be done because modules are often tested sequentially in a
defined order during the software testing phase [1][22], and we can
acquire a subset of test results during the testing phase. Although
study [19] assumes that all modules included in the prediction are
sequentially tested during the testing phases, in reality some of
these modules might not have been tested (for example, in order to
reduce testing effort [37]). Regarding the untested modules, we
cannot compare predictions with test results, and this could
negatively impact the evaluation by BA. To avoid such a case, we
propose a new BA-based method that considers software testing in the prediction. Study [19] simply counted correct and incorrect predictions and used the sum to select prediction methods. However,
prediction models are generally evaluated by accuracy criteria such
as the area under the ROC curve (AUC). To improve the
performance of BA, we propose using an accuracy criterion to
select the reduction techniques.
To clarify the effectiveness of our proposal, we formulate the
following research questions:
RQ1: Is the prediction accuracy of BANP higher than naïve
BA?
RQ2: Is the prediction accuracy of BANP higher than
conventional approaches?
RQ3: Does BANP enable us to use non-assessed techniques,
avoiding degradation of the accuracy?
RQ4: To what extent can BANP select high accuracy arms?
RQ5: To what extent can BANP improve the arm selection,
compared with the naïve BA?
RQ1, 2, and 3 focus on the prediction accuracy, while RQ4 and 5 focus on the arms selected by BANP. RQ1, 2, and 3 are explained in Section 6.1, and RQ4 and 5 are explained in Section 7.3.
2 Bandit Algorithms
BAs were proposed to solve multi-armed bandit problems. This type of problem is often explained through an analogy with slot machines. Assume that a player has 100 coins to bet on several slot machines, and the player wants to maximize their reward. Instead of selecting only one slot machine and betting all 100 coins, BA suggests that the player bet only one coin on each slot machine. By calculating the average reward of each machine after each bet, the player can then recognize which slot machine is the best (i.e., has the highest average reward). The name of the problem derives from the fact that each slot machine has an arm (lever), and the machine is likened to a bandit that steals money from players. The multi-armed bandit problem sequentially seeks the best candidates (referred to as arms), whose expected rewards are unknown, in order to maximize the total reward.
BA is sometimes regarded as a form of online learning [8]. Typically, online learning deals with a continuous flow of training instances that continuously updates the decision model [25]. In addition, online learning includes online optimization as a sub-category [30].
Online optimization makes decisions (e.g., setting parameters) to
maximize the gain (e.g., prediction accuracy) based on the flow of
instances such as test results of each module [20], but the decision
model is not always updated. Thus, BA is considered an online
optimization method [20] since BA iteratively chooses an arm to
maximize the total rewards.
In the context of online learning, our study uses BA to select a reduction technique online (i.e., during software testing), while existing approaches such as those considered in [7][17][23] mostly work offline (i.e., before testing). During software testing, BA decides which reduction technique is used, to maximize the gain (i.e., prediction accuracy), based on the flow of instances (i.e., testing results). There are multiple implementations of BA, with the most popular ones being the epsilon-greedy and Upper Confidence Bound (UCB) algorithms [35]; both are explained below:
Epsilon-greedy Algorithm: This algorithm chooses a random arm with probability ε. That is, it selects the arm whose average reward is the highest with probability 1 − ε (0 ≤ ε ≤ 1). When the value of ε is 0, arms are always selected based on the average reward of each arm. In contrast, when the value of ε is 1, arms are always selected randomly.
Upper Confidence Bounds: The UCB algorithm [5] focuses not only on the average reward of each arm, but also on the amount of information about each arm. For instance, when an arm has been selected only once, we do not have enough information about the arm, and it is not clear whether its average is valid. After the arm has been selected repeatedly, we can know the valid average. UCB positively selects such low-information arms because they might
have a high average reward. To do that, UCB selects the arm for which the value of r in the following equation is the highest:

r = x̄ + √(2 ln t / s)   (1)

In the equation, x̄ denotes the average reward of the arm, s is the number of times the arm has been selected, and t is the total number of trials. The rightmost term denotes the amount of information about the arm. When s is small, the term is large, and the arm is selected with a high probability.
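To make the two selection rules concrete, the following is a minimal Python sketch of epsilon-greedy and UCB arm selection following Eq. (1); the arm indexing, the handling of never-selected arms, and the tie-breaking are illustrative assumptions rather than details taken from the paper.

```python
import math
import random

def select_arm_epsilon_greedy(avg_rewards, epsilon=0.1):
    """Choose a random arm with probability epsilon; otherwise choose the
    arm with the highest average reward."""
    if random.random() < epsilon:
        return random.randrange(len(avg_rewards))
    return max(range(len(avg_rewards)), key=lambda a: avg_rewards[a])

def select_arm_ucb(avg_rewards, counts, t):
    """Choose the arm maximizing r = x_bar + sqrt(2 ln t / s) (Eq. 1), where
    s = counts[a] is how often arm a has been selected and t is the total
    number of trials. Never-selected arms are tried first."""
    for a, s in enumerate(counts):
        if s == 0:
            return a
    return max(range(len(avg_rewards)),
               key=lambda a: avg_rewards[a] + math.sqrt(2 * math.log(t) / counts[a]))
```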
3 Defect Prediction Based on the Bandit
Algorithm
3.1 Procedure
To apply BA to select a suitable feature reduction technique, we assume that m modules (m is a natural number) are tested. We also assume that the learning dataset and the prediction method have been decided. Following the procedure proposed in [19] for selecting better prediction methods, we propose to replace the prediction methods with reduction techniques in the first step of the procedure, to select the techniques.
In what follows, steps B1 … B3 are applied in an offline manner (i.e., before software testing), in order to prepare the predictions optimized by online learning. Steps D1 … D3 are applied in an online manner (i.e., during software testing) to optimize prediction accuracy online.
B1. Select n feature reduction techniques (n is a natural number).
B2. Make n prediction models, applying each reduction technique and using the learning dataset (collected from a project where testing has completed).
B3. Predict defects of the m modules based on the models, using the test dataset (collected from a project where testing has not started). The prediction of each model is treated as an arm, so n arms are made.
These steps are performed before the target modules (i.e., t7, t5, t3, t9, t1, ... in Figure 1) are tested. The procedure is illustrated in Figure 1. For step B1, three reduction techniques are selected (i.e., n = 3). It is recommended to set n to around four, because BA has been shown to work well when the number of arms is four [19] or six [4] (the number of arms is decided based on the number of available candidates). For step B2, three prediction models (i.e., models A, B, and C in the figure) are made, combining the reduction techniques with a prediction method (e.g., logistic regression). The independent variables can differ between the models, as the reduction technique is different for each model. For instance, as shown in Figure 1, DIT, NOC, and CBO are selected by ConFS for model A, and NOC, CBO, and RFC are selected by CFS for model B. For step B3, defects of modules t1 … t10 (i.e., m = 10) are predicted by the models. As a result, three arms are made, resulting in three prediction values for each module.
Note that although step B3 predicts defects of the m modules, the prediction is optimized in an online manner during steps D1 … D3. Additionally, note that BA directly selects an arm of prediction results, and thereby indirectly selects a reduction technique. This is because the results are derived from the prediction models, and the models are built using the techniques. For example, in Figure 1, when BA selects arm A, it indirectly selects ConFS, since arm A is derived from model A, and the model is built using ConFS. Additionally, BA dynamically selects the arms, but does not change the subset of independent variables.
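As an illustration of the offline steps B1–B3, the sketch below builds one logistic regression model per reduction technique and turns each model's predictions into an arm. It is a minimal sketch assuming scikit-learn feature selectors as stand-ins for CFS, ConFS, and stepwise AIC/BIC (the paper's techniques were run in R); the selector choices and k = 3 are illustrative assumptions.

```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression

def build_arms(X_learn, y_learn, X_test, selectors):
    """B1-B3: fit one model per reduction technique; each arm is that
    model's vector of defect probabilities for the m test modules."""
    arms = []
    for selector in selectors:                                       # B1: n techniques
        X_sel = selector.fit_transform(X_learn, y_learn)
        model = LogisticRegression(max_iter=1000).fit(X_sel, y_learn)    # B2
        arms.append(model.predict_proba(selector.transform(X_test))[:, 1])  # B3
    return arms

# Stand-in selectors (not CFS/ConFS themselves):
selectors = [SelectKBest(f_classif, k=3), SelectKBest(mutual_info_classif, k=3)]
```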
After the above preparation, BA selects arms by the following procedure, as illustrated in Figure 1. In this procedure, the average reward means the prediction accuracy of each arm. As the average reward, we used the tentative AUC explained in the next subsection.
Figure 1: Procedure of defect prediction based on BA. (Before testing, three reduction techniques (ConFS, CFS, and stepwise AIC) are applied to the learning dataset (WMC, DIT, NOC, CBO, RFC, LCOM) to build models A, B, and C, whose predictions for modules t1 ... t10 form three arms; during testing, BA selects an arm for each module, compares its prediction with the test result, and updates each arm's reward as a tentative AUC.)
D1. Select an arm based on the average reward (tentative AUC) of the arms.
D2. Test a module based on the prediction of the selected arm.
D3. Recalculate the average reward of each arm, comparing the prediction and the test result obtained in step D2.
D4. Go back to step D1.
The steps are iteratively performed during testing. Step D1 is performed based on an algorithm such as epsilon-greedy, explained in Section 2. In Figure 1, we assume that epsilon-greedy (ε = 0) is applied. Initially, the average reward of all arms is zero, and therefore an arm is selected randomly in this step. In step D2, when the defect prediction result is "defective," developers make more test cases, and when it is "non-defective," fewer test cases are typically made [26], to save resources spent on testing [37].
For the initial iteration of BA, arm A is selected randomly in step D1. In step D2, module t7 is tested thoroughly, because the prediction of the selected arm is "defective." In step D3, the predictions of arms A and C for t7 are evaluated as true positives, and that of arm B is evaluated as a false negative (defects are found in t7). Based on this evaluation, we calculate the average reward for each arm.
For the second iteration, arm A is still selected in step D1, because the average reward of arm A is the highest among the three arms. In step D2, fewer test cases are made for module t5, because the prediction of arm A is "non-defective." In step D3, based on the test result, the prediction of all arms for t5 is evaluated as a true negative. Likewise, the third iteration is performed. On the fourth iteration, the average reward of arm C is the highest, and therefore arm C is selected.
Prediction of BA: Based on the above iterations, the prediction of BA consists of the predictions of the selected arms. For example, as shown in the column "BA" of Figure 1, the prediction of BA consists of P, N, P, and P, since arms A, A, A, and C are sequentially selected. Evaluation criteria of BA such as AUC are calculated by comparing this prediction with the test results.
Cutoff Value: Most prediction models output predictions using
real numbers. For instance, in Figure 1, the prediction of model A
to module t7 is 0.7. The cutoff value is set as 0.5 in the figure, and
therefore the prediction is regarded as “defective”. To apply BA,
we must set the cutoff value, and the prediction values should be
converted into binary values (i.e., defective, or non-defective)
based on the cutoff value. We set the cutoff value as the threshold corresponding to the closest point to the top-left corner of the ROC curve on the learning dataset. This is one of the common methods to set the cutoff value.
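A minimal sketch of this cutoff rule, assuming scikit-learn's roc_curve; the function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cutoff_closest_to_top_left(y_learn, scores_learn):
    """Return the threshold whose ROC point on the learning dataset is
    closest to the top-left corner (FPR = 0, TPR = 1)."""
    fpr, tpr, thresholds = roc_curve(y_learn, scores_learn)
    distances = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)
    return thresholds[np.argmin(distances)]

# Predicted probabilities at or above the cutoff are treated as "defective" (P).
```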
Major application: We assume that BA is mainly applied
during integration testing. This is because during integration testing,
each module is combined with other modules and they are tested
sequentially, so we can acquire the results in the middle of the
testing phase. Modules are tested incrementally as they are being
added to each other, and their behavior might be tested as a
collective unit. Companies may also use defect prediction for
testing prioritization [24]. Although some test cases might be written before a module is tested, most of the testing effort for a module, such as applying test cases and evaluating test results, is dedicated to each module sequentially.
To perform integration testing, top-down and bottom-up testing
is typically applied. Top-down testing sequentially tests each
module from higher (i.e., callers) to lower-level ones (i.e., callees).
In contrast, bottom-up testing tests modules from lower to higher
level ones. Previous studies proposed methods to decide the test
order in detail [1][22].
3.2 Tentative AUC
As the average reward of each arm, we propose to use an evaluation criterion of prediction accuracy such as AUC or F-score. In this paper, we used AUC as the average reward. AUC is often used to evaluate the prediction accuracy of defect prediction [17][23]. We call it tentative AUC, because it changes during the iterations of BA, as shown in Figure 1. We cannot know the actual AUC of each arm until all tests are completed.
In previous studies [4][19], the average reward was almost the same as in the ordinary BA explained in Section 2. For instance, in study [19], when the prediction is a true positive or a true negative, 1 is added to the total reward, and when it is a false positive or a false negative, −1 is added. The average reward is calculated by dividing the total reward by the number of iterations of BA.
Instead of such a calculation, we use an evaluation criterion directly. This is because a prediction model is evaluated by that criterion, and therefore optimizing the model based on the criterion is the most reasonable.
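The following sketch ties steps D1–D3 together with the tentative AUC as the average reward, assuming epsilon-greedy with ε = 0 and binary arm predictions already derived from the cutoff value; the run_tests stub, the deterministic first-arm tie-break (the paper picks the first arm randomly), and the 0.5 default before both classes have been observed are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def tentative_auc(observed, predicted):
    """AUC of one arm over the modules tested so far; 0.5 until both
    classes have been observed (roc_auc_score needs two classes)."""
    if len(set(observed)) < 2:
        return 0.5
    return roc_auc_score(observed, predicted)

def run_bandit(arm_preds, test_order, run_tests):
    """arm_preds: one list of binary predictions per arm, indexed by module.
    run_tests(module, prediction) returns the observed label (step D2)."""
    n_arms = len(arm_preds)
    rewards = [0.0] * n_arms
    history = {a: ([], []) for a in range(n_arms)}   # per arm: (observed, predicted)
    ba_predictions = []
    for module in test_order:
        arm = int(np.argmax(rewards))                # D1: arm with best tentative AUC
        pred = arm_preds[arm][module]
        ba_predictions.append(pred)                  # the prediction of BA
        observed = run_tests(module, pred)           # D2: test the module
        for a in range(n_arms):                      # D3: update each arm's reward
            obs, prd = history[a]
            obs.append(observed)
            prd.append(arm_preds[a][module])
            rewards[a] = tentative_auc(obs, prd)
    return ba_predictions
```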
4 Approach to Handle Negative Prediction
4.1 Influence of Negative Prediction on
Tentative AUC
Defect overlooking by negative prediction: When a defect
prediction model predicts a negative result (i.e., “non-defective”),
developers will typically write fewer test cases for those modules
[26], in order to efficiently allocate resources for testing [31][37].
As a result, defects are overlooked by the test, and the module
might be regarded as “non-defective” in most cases, even if the
module includes defects. We call this case defect overlooking by negative prediction. This means that the overlooking of defects occurs due to fewer test cases being written based on a negative prediction. This is inevitable when testing resources are allocated by defect prediction.
Defect overlooking by negative prediction makes the tentative AUC inaccurate, and based on the inaccurate AUC, BA could end up erroneously selecting a low-accuracy arm. In Figure 2, arm B is randomly selected on the first iteration, and the test order is different from the one shown in Figure 1. The column "Test result" considers only defects found during testing, while "Actual result after testing" also considers defects found after testing was done, such as when the software is released. In the example, we assume that defects are always overlooked when the prediction is negative, due to fewer test cases. That is, when the "Pred." column of naïve BA is "N" in Figure 2, the "Test result" column is "Non-defective" with 100% probability. Additionally, we removed arm C of Figure 1 for simplicity.
The reward of arm B is set as a true negative on modules t1, t9, and t7 based on the test outcomes. However, based on the actual results, this reward is incorrect, and it should be set as a false negative. Likewise, the reward of arm A is erroneously set as a false positive. As a result, the tentative AUC becomes incorrect, and arm B, which has low accuracy, is erroneously selected.
This overlooking was not considered in previous studies [4][19]. Those studies assume risk-based testing [15], where the prediction is used to decide the test order but not to control testing effort. Therefore, BA is not affected by negative prediction in such a case.
Defect overlooking by positive prediction: Even when the
prediction is positive (i.e., “defective”), and many test cases are
applied, defects are sometimes overlooked during testing. We call this case defect overlooking by positive prediction. This could occur even when testing resources are not allocated by defect prediction; module t3 in the figure is an example of such a case. This is because we cannot find all defects perfectly by testing. Based on large-scale cross-company data [21], about 17% of defects are overlooked during integration testing. This overlooking was simulated in the experiments reported in previous studies [4][19].
4.2 Procedure
To suppress the influence of negative prediction, we propose
BANP (Bandit Algorithm to handle Negative Prediction). In what
follows, the BA explained in Section 3 is called naïve BA. BANP
forcibly sets prediction as positive during early iteration of BA, to
enhance the accuracy of tentative AUC. Figure 3 illustrates how
BANP works (Conditions of Figure 3 are the same as Figure 2,
except for module t8). The procedure of BANP is as follows:
(a) Forcibly set the prediction by BA as positive when p iterations of BA have not finished.
(b) Select an arm by naïve BA when p iterations are finished, or when the prediction of all arms is equal (e.g., module t5 in Figure 3).
Item (a) restrains the overlooking caused by negative prediction, and keeps the tentative AUC accurate. This is because when the "Pred." column of BANP is "P," many test cases are applied to the module, and one can then determine whether the module is actually defective. Concretely, the overlooking occurs on modules t1, t9, and t7 in Figure 2, while it does not occur on them in Figure 3. As shown in Figure 3, this makes the rewards and tentative AUC accurate.
Item (a) forcibly sets the values of the "Pred." column of BANP to "P" in Figure 3, but does not change the "Pred." columns of arms A and B. This does not affect the tentative AUC, because it is derived by comparing the "Pred." columns of arms A and B with the "Test result" column in Figure 3.
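A minimal sketch of items (a) and (b), assuming a select_arm callback (e.g., UCB) for the naïve-BA fallback; returning None when no arm is chosen mirrors the "--" entries in Figure 3 and is an illustrative convention.

```python
def banp_prediction(iteration, p, arm_preds_for_module, select_arm):
    """Return (selected_arm, prediction) for the current module.
    arm_preds_for_module: binary prediction (1 = defective) of each arm."""
    all_equal = len(set(arm_preds_for_module)) == 1
    if iteration < p and not all_equal:
        return None, 1              # item (a): forced positive, no arm is chosen
    arm = select_arm()              # item (b): fall back to naive BA
    return arm, arm_preds_for_module[arm]
```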
Figure 2: Influence of overlooking on BA. (With naïve BA, negative predictions lead to overlooked defects, so arm B is erroneously rewarded with true negatives on t1, t9, and t7 and is selected despite its low accuracy.)
Figure 3: Procedure of BANP and setting parameter p. (The prediction is forced to positive for the first p iterations; with r = 0.60 for arm A and r = 0.33 for arm B, the average r is 0.47, so i = 0.1 and p = 0.1 * 40 = 4.)
Figure 4: Relationship between i and the average of r.
The parameter p denotes how many times the prediction is set as positive. For instance, p = 4 in Figure 3. The value of p is 0-10% of the whole set of test target modules. The main idea of using the value p is to make sure that the first few (p) modules are thoroughly tested, by forcing the prediction to be defective. The details of p are explained later.
Item (b) prescribes when item (a) is not applied. For example, item (a) is not applied to module t5 in Figure 3, although 4 iterations have not finished. This is because the prediction of all arms is the same (i.e., negative). When the prediction of all arms is the same, the tentative AUC of all arms is uniformly improved or worsened. In this case, the overlooking does not affect the arm selection by BA, and therefore item (a) is not applied.
Setting Parameter p: To set the parameter p, we should be aware of the following characteristics of item (a):
• It can negatively affect prediction accuracy in some cases.
• Its effect is small when most of the predictions are positive.
Module t8 in Figure 3 is an example of the first item. Although the prediction of arm A is correct, the prediction of BANP is incorrect, and the actual AUC of BANP degrades. This is because item (a) corrects the tentative AUC, but risks degrading the prediction accuracy on each module.
For the second item, assume that 90 out of 100 modules are predicted as positive. In this case, the overlooking by negative prediction rarely occurs, and the accuracy of the tentative AUC would be high, even when item (a) is not applied at all. That is, when many predictions are positive, the advantage of item (a) is limited as explained in the second item, but the disadvantage explained in the first item remains.
Therefore, to optimize the effect of item (a), the value of p should change according to the number of positive predictions. We set the parameter p by the following procedure (see Figure 3). The parameter is settled before software testing.
P1. Calculate the ratio r as the number of positive predictions divided by the number of negative ones on each arm.
P2. Calculate the average of r over the arms.
P3. Using the average r and the graph in Figure 4, settle i.
P4. Calculate p by multiplying i by the number of test target modules (this number equals the total number of iterations of BA).
For instance, in step P1, when the number of positive predictions is 15 and the number of negative ones is 25 on arm A, r is 0.60, as shown in Figure 3. In the figure, r is 0.60 on arm A and 0.33 on arm B, and therefore the average r is 0.47 in step P2. In step P3, i is settled as 0.1, referring to Figure 4. In the figure, the number of target modules is 40, and therefore p is 4 in step P4.
We assume that when the ratio of positive to negative predictions is larger than 1:1, the influence of the overlooking by negative prediction is small, considering the frequency of positive predictions. In contrast, when the ratio is smaller than 1:2, the influence should be addressed to the maximum extent. Hence, to make the graph of Figure 4, we set p to 10% of the test target modules (i.e., i = 0.1) when the average of r ≤ 0.5, and set p to 0% (i.e., i = 0) when the average of r ≥ 1.0.
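A sketch of steps P1–P4 follows. The endpoints i = 0.1 for an average r ≤ 0.5 and i = 0 for an average r ≥ 1.0 follow the text above; the linear segment in between is an assumption, since the exact shape is given only by Figure 4.

```python
def parameter_p(arm_predictions, n_modules):
    """arm_predictions: one list of binary predictions (1 = positive) per arm."""
    ratios = []
    for preds in arm_predictions:                 # P1: positives / negatives per arm
        positives = sum(preds)
        negatives = len(preds) - positives
        ratios.append(positives / negatives if negatives else 1.0)  # all positive: ratio >= 1
    avg_r = sum(ratios) / len(ratios)             # P2: average ratio
    if avg_r <= 0.5:                              # P3: map average r to i (Figure 4)
        i = 0.10
    elif avg_r >= 1.0:
        i = 0.0
    else:
        i = 0.10 * (1.0 - avg_r) / 0.5            # assumed linear segment
    return round(i * n_modules)                   # P4: p = i * number of modules

# Example from Figure 3: r = 0.60 and 0.33 give an average of about 0.47,
# so i = 0.1 and p = 0.1 * 40 = 4.
```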
5 Feature Reduction Techniques
In this study, we examined four feature reduction techniques as
follows:
• CFS: Correlation-based feature selection [18]
• ConFS: Consistency-based feature selection [13]
• Stepwise AIC: AIC [2] stepwise feature selection
• Stepwise BIC: BIC [29] stepwise feature selection
Kondo et al. [23] recommended CFS and ConFS for supervised defect prediction models such as logistic regression. As the prediction model, we selected logistic regression, because we followed the above recommendation and it is one of the most used models for defect prediction [12][23].
In contrast, previous studies [17][23] which assessed the effects
of the reduction techniques do not cover Stepwise AIC and
stepwise BIC. The techniques are used to select features of logistic
regression in other fields such as biosciences [11]. As far as we
know, only one study [3] used AIC and BIC for defect prediction,
but it is not clear whether they perform better than other techniques.
Therefore, there is a possibility that stepwise AIC or stepwise BIC
have higher performance compared to CFS and ConFS.
AIC is calculated as:

AIC = n ln(σ²) + 2k

where σ² is the residual variance of the model, n is the number of data points, and k is the number of independent variables. Generally, the higher the value of k, the lower the value of σ². However, a larger value of k causes overfitting. AIC selects a prediction model considering both σ² (i.e., the accuracy) and k (i.e., the overfitting). That is, when the AIC of a model is smaller, the model is preferable for avoiding overfitting.
BIC focuses not only on k but also on n, in order to consider this overfitting. That is, BIC is calculated by the following equation:

BIC = n ln(σ²) + k ln(n)
6 Experiment
6.1 Prediction Models
As conventional approaches, we assumed that a feature reduction technique is typically selected in one of the following ways:
• Control: the technique recommended by previous studies, i.e., [17][23].
• Ad hoc: a technique selected arbitrarily.
• Experiment: the technique which shows the highest accuracy among candidates on the learning dataset.
To compare BA with the conventional approaches, the
approaches selected one of the reduction techniques explained in
Section 5, and based on the selection, the following models were
made. As mentioned in Section 5, we used logistic regression as the prediction method, and hence each model was built using this method and a reduction technique.
C1: Model built by CFS
C2: Model built by ConFS
A1: Model built by stepwise AIC
A2: Model built by stepwise BIC
A3: Model randomly selected from C1, C2, A1, and A2
L: Model which showed the highest accuracy among C1, C2,
A1, and A2 on learning dataset
B: BANP using C1, C2, A1, and A2 as arms
C1 and C2 (Control) are knowledge-based approaches because they are recommended in a previous study [23]. A1 and A2 (ad hoc) are ad hoc approaches, as it is not clear whether the performance of stepwise AIC and BIC is higher than that of C1 and C2; that is, there is no explicit rationale for this selection. A3 is also an ad hoc approach. L (learning dataset) selects the model which showed the highest accuracy among C1, C2, A1, and A2 on the learning dataset. Note that only BA and BANP can select feature reduction techniques on a validation set during actual software testing.
We also compared BANP with naïve BA. As with BANP, C1, C2, A1, and A2 were used as arms. When naïve BA was used, we applied the following algorithms, explained in Section 2. We set the parameter ε to 0, 0.1, 0.2, and 0.3, because these values are often used to set ε [35]. When BANP was used, we applied UCB, because the accuracy of UCB was higher in a preliminary analysis, and UCB does not require parameter settings. We omit this analysis due to the page limitation.¹
E0: Epsilon-greedy algorithm (ε = 0)
E1: Epsilon-greedy algorithm (ε = 0.1)
E2: Epsilon-greedy algorithm (ε = 0.2)
E3: Epsilon-greedy algorithm (ε = 0.3)
U: UCB algorithm
6.2 Dataset
We used 14 datasets published on PROMISE [9] and D’Ambros
et al. [12] (DAMB for simplicity) repositories. From these
repositories, we selected datasets which were collected from
different versions of the software, to perform cross-version defect
prediction (CVDP). In CVDP, when the dataset collected from a certain version is used as the test dataset, the dataset collected from the previous version is used as the learning dataset. We used the newest version in the repository of each software system as the test dataset. Each dataset includes 20 independent variables, which include product metrics such as the CK metrics [10] (e.g., WMC, DIT, and NOC), but do not include process metrics. Table 1 shows details of the datasets used in the experiment.
6.3 Evaluation criteria
AUC: AUC is widely used to evaluate defect prediction models
[17][23]. The maximum value of AUC is 1, and a larger value for a prediction model means higher prediction accuracy. As a criterion, we only used AUC, because BA optimizes defect prediction based on AUC, as explained in Section 3.2, and we intended to check the validity of this optimization. Note that in [23], the authors used AUC and AUC-based indices (i.e., ImpR and Win%; these are explained later) as evaluation criteria and did not use other measures such as F-score.
Generally, AUC is calculated by changing the cutoff value of the prediction to draw the ROC (Receiver Operating Characteristic) curve on the test dataset. However, as explained in Section 3.1, the cutoff value is fixed on the test dataset when BA is applied. As a result, the AUC of an ROC curve drawn with a fixed cutoff could be smaller than that drawn by varying the cutoff. To align the conditions, we fixed the cutoff value of all models explained in Section 6.1, in the way explained in Section 3.1.
ImpR: To see whether the accuracy of a model is improved by a method, we used the following criterion [23]:

ImpR = AUC_foc / AUC_base

In the equation, AUC_base is the AUC of a baseline method such as a conventional approach, and AUC_foc is the AUC of a focused method such as BANP. That is, the criterion denotes the extent to which AUC is improved by the focused method. When the value is larger than 1, AUC is improved.
Summary statistics: Based on the above criteria, we summarized the results as follows:
• Average of AUC
• Standard deviation (SD) of AUC
• Average of ImpR
• Win%
We calculated the average and SD of AUC over the 14 datasets for each model. As explained in Section 1, due to the external validity issue of defect prediction, the accuracy of a feature reduction technique differs across datasets [17]. Hence, the stability of the prediction accuracy is also a preferable characteristic. Therefore, we focused on SD in addition to the average. When SD is low, the stability of the model is regarded as high.
To comprehend the results easily, we also ranked the averages and SDs. When the rank of the average of a model is top, its average is the largest among the models. Likewise, when the rank of SD is top, its SD is the smallest. That is, when these ranks are top, the performance is the highest among the models.
Table 1: Datasets used in the experiment

Software  Learning: Ver.  # Modules  Defective (%)   Test: Ver.  # Modules  Defective (%)
ant       1.6             351        92 (26.2)       1.7         745        166 (22.3)
camel     1.4             872        145 (16.6)      1.6         965        188 (19.5)
forrest   0.7             29         5 (17.2)        0.8         32         2 (6.3)
ivy       1.4             241        16 (6.6)        2.0         352        40 (11.4)
jedit     4.2             367        48 (13.1)       4.3         492        11 (2.2)
log4j     1.1             109        37 (33.9)       1.2         205        189 (92.2)
lucene    2.2             247        144 (58.3)      2.4         340        203 (59.7)
pbeans    1               26         20 (76.9)       2           51         10 (19.6)
poi       2.5             385        248 (64.4)      3.0         442        281 (63.6)
prop      5               8516       1299 (15.3)     6           660        66 (10.0)
synapse   1.1             222        60 (27.0)       1.2         256        86 (33.6)
velocity  1.5             214        142 (66.4)      1.6         229        78 (34.1)
xalan     2.6             885        411 (46.4)      2.7         909        898 (98.8)
xerces    1.3             453        69 (15.2)       1.4         588        437 (74.3)
¹ We performed BANP and the simulation using code written in Python and Microsoft Excel VBA. The models were built in R using the MASS, pROC, FSelector, and RcmdrMisc packages. We made our full replication package available online at https://zenodo.org/record/5886599
Additionally, to validate the differences in AUC between models, we applied the Wilcoxon signed-rank test (i.e., a non-parametric test). Hence, we could suppress the influence of potential outliers, although we statistically tested the difference of reduction techniques across datasets. We set the significance level (p-value) at 0.05.
Similar to [23], we counted the number of datasets where ImpR is larger than 1. Using this count, we defined Win%, which is calculated by dividing that number of datasets by the total number of datasets (i.e., 14). When Win% is larger than 50%, the focused method is more effective than the baseline on a majority of the datasets.
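A minimal sketch of these summary statistics, assuming two aligned lists of per-dataset AUC values (one for the focused method such as BANP and one for a baseline); the Wilcoxon signed-rank test used in the paper could be applied to the same pair of lists (e.g., scipy.stats.wilcoxon).

```python
import statistics

def summarize(auc_foc, auc_base):
    """Per-dataset ImpR, Win%, and the average/SD of AUC for the focused method."""
    imp_r = [f / b for f, b in zip(auc_foc, auc_base)]            # ImpR per dataset
    win_pct = 100.0 * sum(r > 1 for r in imp_r) / len(imp_r)      # Win%
    return {
        "avg_auc": statistics.mean(auc_foc),
        "sd_auc": statistics.stdev(auc_foc),
        "avg_imp_r": statistics.mean(imp_r),
        "win_pct": win_pct,
    }
```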
6.4 Experimental procedure
In the experiment, we assumed that defect overlooking by negative prediction always occurs when the prediction of BA is negative (see Section 4.1). This is a disadvantageous assumption for BA. Additionally, we assumed that defect overlooking by positive prediction occurs with 20% probability, because about 17% of defects are overlooked during integration testing [4][21]. The "actual result after testing" (cf. Figure 2) is recorded in the test datasets used in the experiment. To simulate Figure 2, we artificially generated the "test result" from the "actual result after testing," following the above assumptions.
As explained in Section 4.1, the order of the test modules and the arm selected on the first iteration of BA could affect the accuracy of BA. Considering this, we randomly changed the order and the initial arm on the test dataset 40 times (e.g., the test order and the selected arms are different between Figures 1 and 2, although the tested modules are the same). The AUC of BA is the average over the 40 differently ordered test datasets. Note that both repeated holdout and k-fold cross-validation calculate AUC in the same manner [6]. Also, to evaluate the performance of A3 explained in Section 6.1, we randomly selected prediction results from C1, C2, A1, and A2 40 times and calculated the AUC.
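The simulation of the "test result" can be sketched as follows, under the two assumptions above; the function name and the 20% default are illustrative.

```python
import random

def simulate_test_result(actual_defective, ba_prediction_positive,
                         p_overlook_positive=0.2):
    """Return True if a defect is observed during testing.
    Negative predictions always overlook defects; positive predictions
    overlook them with probability p_overlook_positive."""
    if not actual_defective:
        return False                  # nothing to find
    if not ba_prediction_positive:
        return False                  # fewer test cases: defect overlooked
    return random.random() >= p_overlook_positive
```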
7 Result
7.1 Comparison to Naïve BA
We compared BANP with naïve BA. Table 2 shows the AUC of each algorithm on each of the 14 datasets, and Table 4 shows the summary statistics and p-values explained in Section 6.3.
Table 2: Prediction accuracy of BANP and naïve BA

Method  ant   camel  forrest  ivy   jedit  log4j  lucene  pbeans  poi   prop  synapse  velocity  xalan  xerces
E0      0.74  0.59   0.65     0.66  0.65   0.56   0.62    0.69    0.69  0.57  0.59     0.66      0.71   0.66
E1      0.74  0.59   0.65     0.67  0.65   0.56   0.63    0.69    0.69  0.57  0.59     0.66      0.72   0.67
E2      0.75  0.59   0.69     0.68  0.65   0.57   0.62    0.70    0.68  0.57  0.60     0.66      0.72   0.66
E3      0.75  0.60   0.65     0.67  0.66   0.56   0.62    0.70    0.68  0.57  0.59     0.67      0.72   0.66
U       0.74  0.58   0.65     0.66  0.64   0.56   0.63    0.69    0.69  0.57  0.60     0.66      0.71   0.66
B       0.75  0.61   0.68     0.67  0.65   0.57   0.63    0.70    0.69  0.57  0.61     0.66      0.73   0.67
Table 3: Prediction accuracy of BA and conventional approaches

Method  ant   camel  forrest  ivy   jedit  log4j  lucene  pbeans  poi   prop  synapse  velocity  xalan  xerces
C1      0.76  0.56   0.65     0.69  0.62   0.55   0.66    0.72    0.69  0.60  0.61     0.65      0.71   0.66
C2      0.74  0.59   0.85     0.73  0.62   0.56   0.61    0.67    0.64  0.54  0.59     0.66      0.71   0.68
A1      0.71  0.63   0.68     0.65  0.70   0.59   0.63    0.72    0.65  0.54  0.62     0.66      0.75   0.65
A2      0.75  0.63   0.68     0.69  0.71   0.59   0.61    0.72    0.62  0.50  0.60     0.69      0.70   0.66
A3      0.74  0.60   0.72     0.69  0.66   0.57   0.63    0.71    0.66  0.54  0.61     0.66      0.72   0.66
L       0.71  0.63   0.68     0.65  0.70   0.59   0.63    0.72    0.65  0.54  0.62     0.66      0.75   0.65
E2      0.75  0.59   0.69     0.68  0.65   0.57   0.62    0.70    0.68  0.57  0.60     0.66      0.72   0.66
B       0.75  0.61   0.68     0.67  0.65   0.57   0.63    0.70    0.69  0.57  0.61     0.66      0.73   0.67
Table 4: Summary statistics of BANP and naïve BA

Method  Avg. AUC    SD AUC      Avg. ImpR  p-value  Win%
E0      0.6466 (6)  0.0545 (2)  1.016      0.00     100.0
E1      0.6482 (4)  0.0546 (3)  1.014      0.00     92.9
E2      0.6517 (2)  0.0552 (6)  1.009      0.00     85.7
E3      0.6504 (3)  0.0551 (5)  1.010      0.00     78.6
U       0.6472 (5)  0.0548 (4)  1.016      0.00     92.9
B       0.6571 (1)  0.0538 (1)  -          -        -
Table 5: Summary statistics of BANP and existing approaches

Method  Avg. AUC    SD AUC      Avg. ImpR  p-value  Win%
C1      0.6541 (6)  0.0603 (5)  1.006      0.93     50.0
C2      0.6564 (2)  0.0839 (8)  1.008      0.00     78.6
A1      0.6556 (3)  0.0545 (2)  1.003      0.82     50.0
A2      0.6537 (7)  0.0672 (7)  1.009      0.60     42.9
A3      0.6547 (5)  0.0607 (6)  1.005      0.01     57.1
L       0.6551 (4)  0.0555 (4)  1.004      0.68     50.0
E2      0.6517 (8)  0.0552 (3)  1.009      0.00     85.7
B       0.6571 (1)  0.0538 (1)  -          -        -
To compare the performance of BANP with naïve BA, we calculated ImpR, setting the AUC of BANP as AUC_foc and the AUC of naïve BA as AUC_base. In Table 2, shaded cells denote that ImpR is larger than 1 (i.e., the AUC of BANP is larger than that of naïve BA). We compared them considering digits after the first decimal place. Numbers in parentheses in Table 4 indicate the rank of each algorithm (see Section 6.3).
In Table 4, both the SD and the average AUC of BANP were ranked first. Also, the average ImpR across datasets was larger than 1 for all algorithms. Based on Win%, the AUC of BANP was higher than that of naïve BA on 78.6% to 100.0% of the 14 datasets. The differences in AUC values between BANP and naïve BA were statistically significant at a significance level of 0.05. To answer RQ1, we found that the prediction accuracy of BANP is higher than that of the naïve BA approach.
7.2 Comparison to Existing Approach
We compared the prediction accuracy of BANP to seven existing approaches. Tables 3 and 5 show the AUC of the models on each dataset and the summary statistics. In Table 5, ImpR was calculated by setting the AUC of BANP as AUC_foc and the AUC of the other models as AUC_base. Shaded cells in Table 3 denote that ImpR is larger than 1. In Table 5, numbers in parentheses indicate the rank of each algorithm. We statistically tested the difference in AUC between BANP and the other models. As a reference, Tables 3 and 5 include E2, which showed the highest accuracy among the naïve BA variants in Table 4.
In Table 5, the ranks of both the SD and the average AUC of BANP were first among the methods. For all methods, the average ImpR was larger than 1. Therefore, based on the summary statistics, the accuracy of BANP was at least not inferior to the existing approaches.
Based on Win%, the AUC of BANP was higher than that of the consistency-based method (C2) on 78.6% of the 14 datasets, and higher than that of the randomly selected model (A3) on 57.1% of the datasets. These differences were statistically significant. Therefore, the accuracy of BANP is higher than that of these methods. Also, based on Win%, the AUC of BANP was higher than C1, A1, and L on half of the 14 datasets. These differences were not statistically significant, and hence the prediction accuracy of BANP is at least equivalent to these methods.
Although the AUC of BANP was higher than that of stepwise BIC (A2) on 42.9% of the datasets, the difference was not statistically significant. Additionally, the ranks of the SD and the average of A2 were the second lowest. Therefore, we do not consider that the prediction accuracy of stepwise BIC is higher than that of BANP.
It should be noted that we assumed that the probability of overlooking by negative prediction is 100%, as explained in Section 6.4. This assumption is disadvantageous to BANP. Based on the results, and to answer RQ2, we found that the prediction accuracy of BANP using UCB is at least equivalent to that of the existing approaches.
We compared the techniques assessed in [17][23] (i.e., C1 and C2) with the non-assessed ones (i.e., A1 and A2). The ranks of the SD and the average AUC of A1 were higher than those of C1, while those of A2 were lower than those of C1 and C2. The result suggests that some non-assessed techniques such as A1 have the potential to enhance prediction accuracy, but others such as A2 risk degrading the accuracy. Although BANP used the latter as one of its arms, the accuracy was not degraded, as stated in the answer to RQ2. To answer RQ3, we found that BANP enables us to use non-assessed techniques while avoiding accuracy degradation of the model.
In the study [23], the median AUC value of the used reduction techniques across the different datasets is 0.7, whereas the median AUC of BANP is 0.67, showing that the difference between the two values is not that large. We applied CVDP and assumed that the probability of overlooking by negative prediction is 100%, as explained in Section 6.4. That is, although the performance of defect prediction was not very high in the experiment, it is almost equivalent to that of the study [23].
7.3 Rate of Selected Arms
Basically, the accuracy of defect prediction by BA has the following characteristics:
(a) The accuracy depends on the accuracy of the arms.
(b) The accuracy is lower than that of the highest-accuracy arm on a test dataset.
The reason for item (a) is that the prediction of BA consists of the selected arms. Therefore, when the differences in accuracy among the arms are small, the improvement in accuracy by BA also becomes small. Thus, it is natural that the average value of ImpR was not large in Tables 4 and 5.
The reason for item (b) is that, until the best arm is identified, BA uses various arms which are not the highest-accuracy arm. As a result, in Table 3, each column has one or more non-shaded cells (i.e., there were arms whose accuracy was higher than BANP). Nonetheless, the average AUC of BANP was the highest in Table 5. This is because the best arm differs among datasets, as shown in Table 3, and the accuracy of BANP is stable. This suggests the existence of the external validity issue for reduction techniques. Considering items (a) and (b), Win% is important to analyze the effect of BA. To analyze the effect of BANP from another viewpoint, we focused on the selected arms and set RQ4 and RQ5.
We counted how many times BANP selected the highest-accuracy arm and the up-to second-highest-accuracy arms from the four arms (i.e., C1, C2, A1, and A2), and calculated the rates. Assume that in Figure 1, the highest-accuracy arm is arm C, and the second-highest-accuracy arm is arm A. When 10 modules are tested, BA selects one of the arms 10 times.
Table 6: Selection rate (%) of the highest and the up-to second highest accuracy arms

Method    ant   camel  forrest  ivy   jedit  log4j  lucene  pbeans  poi   prop  synapse  velocity  xalan  xerces  Avg.
E2 (1)    17.5  30.3   10.2     51.3  29.4   18.0   20.5    92.7    95.4  52.5  25.8     39.4      14.8   20.2    37.0
E2 (1&2)  31.3  46.9   62.7     51.3  39.1   28.8   35.7    92.7    99.2  60.7  38.8     47.3      43.4   74.6    53.8
B (1)     53.8  41.5   12.2     18.4  34.6   14.0   40.0    99.1    99.7  60.4  30.8     35.1      49.5   65.1    46.7
B (1&2)   67.5  83.3   67.9     55.3  38.4   36.7   57.4    99.1    99.8  65.2  58.2     40.7      52.2   67.9    63.5
If arm C is selected four times and arm A is selected three times, the selection rate of the highest-accuracy arm is 40% (four out of 10), and the rate of the up-to second-highest arms is 70% (seven out of 10).
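A small sketch of this rate computation, assuming a per-iteration log of the chosen arms and a ranking of the arms by their actual AUC; iterations where BANP forces a positive prediction without choosing an arm would simply be omitted from the log.

```python
def selection_rates(selected_arms, ranked_arms):
    """selected_arms: arm chosen at each iteration.
    ranked_arms: arm indices ordered from highest to lowest actual AUC."""
    best = ranked_arms[0]
    top_two = set(ranked_arms[:2])
    n = len(selected_arms)
    rate_best = 100.0 * sum(a == best for a in selected_arms) / n
    rate_top_two = 100.0 * sum(a in top_two for a in selected_arms) / n
    return rate_best, rate_top_two

# The example above: arm C chosen 4 times and arm A 3 times out of 10
# gives rates of 40% and 70%.
```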
Table 6 shows the selection rates when naïve BA (E2) and BANP (B) are used. In the table, shaded cells show where the rate of B was larger than that of E2. BANP selected the highest-accuracy arm in 46.7% of iterations, and the up-to second-highest arms in 63.5% of iterations on average (see the "Avg." column of the table). Therefore, the answer to RQ4 is that BANP selects the best arm about 50% of the time, and one of the two best arms about 60% of the time on average. The result suggests the selection by BANP works well, although there is still room for improvement.
In contrast, the rates of naïve BA were lower than those of BANP on most datasets. It selected the highest-accuracy arm in 37.0% of iterations, and the up-to second-highest arms in 53.8% of iterations. That is almost a 10-point degradation compared with BANP. Therefore, to answer RQ5, we found that the improvement in arm selection by BANP is about 10 percentage points on average. This strengthens the result that the effect of BANP is larger than that of naïve BA (see Section 7.1).
7.4 Arm Selection over Time
We analyzed whether the choice of a feature reduction technique really varies over time, and whether it stabilizes to a single choice after enough modules are tested by BANP (i.e., whether BANP finally identifies the best reduction technique, or keeps using various techniques all the time). To do so, we counted the number of techniques selected by BANP within each successive 10% of the tested modules. When the number of selected arms is almost one, it means that BANP has reached a single choice.
Figure 5 shows the relationship between the average number of used techniques and the percentage of tested modules across the 14 datasets. In the figure, 2 to 3 techniques were used before 10% of the modules were tested, while the number of arms becomes almost one when the percentage of tested modules is larger than 30%. Therefore, BANP actually tried various techniques in the early stage of the testing phase, and reached a single choice by the middle stage of the phase. The result also suggests that the behavior of BANP is different from ensemble learning, which uses multiple techniques in parallel all the time.
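The counting behind Figure 5 can be sketched as follows, assuming a per-iteration log of selected arms (None for iterations where BANP forced a positive prediction); the window boundaries are an illustrative choice.

```python
def arms_per_decile(selected_arms, n_windows=10):
    """Number of distinct arms used within each successive 10% of the
    tested modules."""
    n = len(selected_arms)
    counts = []
    for w in range(n_windows):
        lo = w * n // n_windows
        hi = (w + 1) * n // n_windows
        window = [a for a in selected_arms[lo:hi] if a is not None]
        counts.append(len(set(window)))
    return counts
```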
8 Discussion
Can BA help in selecting feature reduction techniques? As shown in Section 7.2, the prediction accuracy of BANP is at least equivalent to that of existing approaches, and in most cases it performs better than those approaches. This performance is attained without any prior assessment of feature reduction techniques, such as surveying previous studies about the techniques [17][23]. Therefore, BANP can be a better alternative to the other approaches.
Note that automating the BA process is not difficult, and therefore applying BA to defect prediction is not effort-consuming. This is because test results are often recorded, and using these results, steps D3 and D1 explained in Section 3.1 can be performed automatically.
How do our results compare to previous studies? Previous studies such as [17][23], which evaluated feature reduction techniques, tried to identify the best reduction technique (i.e., the technique with the highest accuracy on average). When using BA, the combination of techniques is important. For example, technique A may be effective on one dataset, and technique B on another. In such a case, the combination of techniques A and B might attain high prediction accuracy when BA is applied. That is, when researchers evaluate reduction techniques, it would be better to also identify techniques which have high accuracy on some datasets, even if their average accuracy is low, considering the application of BA.
Threats to validity: As shown in Section 7.2, BANP did not improve prediction accuracy drastically. This could be a threat to the internal validity of the effect of BANP. To reduce the threat, we analyzed the arms selected by BANP, as explained in Section 7.3. In addition, to minimize this threat, we compared the experimental results with the previous study of Kondo et al. [23], because we selected the feature reduction techniques based on the same study. Additionally, that study used 15 datasets from the PROMISE and DAMB repositories, and nine of them are the same as ours. Note that the study did not apply CVDP but applied a sampling method to validate the results. Therefore, some of their datasets (test datasets) were different from ours.
As explained in Section 6.3, the study [23] used ImpR, setting the AUC of a model without feature reduction as AUC_base, and that of a model with feature reduction as AUC_foc. The average ImpR was 1.012 across 15 datasets when CFS was used; when ConFS was used, it was 1.008. Their baseline is a model without feature reduction, and therefore the values tend to be a bit larger. In contrast, when the ImpR of BANP was calculated, a model with feature reduction was set as the baseline. Therefore, the extent of the improvement by BANP (i.e., an average ImpR of at least 1.003 and at most 1.009 in Table 5) is reasonable, considering the results of the study [23].
The number of datasets used is almost the same as in the study of Kondo et al. [23]. Therefore, we might reduce threats to external validity to the same extent as [23], with regard to the evaluation against the selected datasets. We used only a part of the prediction methods and reduction techniques of the study [23] (it used eight techniques and 10 methods), but we plan to examine the other methods and techniques in the future.
Figure 5: Relationship between the average number of used techniques and the percentage of tested modules (one line per dataset; y-axis: number of used arms, x-axis: percentage of tested modules).
9 Related Work
Some defect prediction approaches [14][28][33][34][36] are somewhat akin to defect prediction based on BA. These approaches adjust prediction models on test datasets. Although the approaches might be applied to select a high-accuracy reduction technique, the previous studies did not do so. We leave this to future work, in which we will consider how to apply these approaches to select reduction techniques and compare their performance with BA. Besides the application of BA to feature reduction technique selection, there are some differences between BA and the other approaches. In what follows, we explain the details of the approaches.
Online learning: In the software engineering field, several online learning methods have been applied to defect prediction in previous studies such as [33][34]. In these studies, online learning means that the prediction models are dynamically rebuilt using an updated learning dataset. That is, the assumption in these studies is that models should be rebuilt continuously, as the software systems targeted by the prediction might change over time. Based on this assumption, it is recommended to build the models continuously using new data which are acquired sequentially during development (i.e., online).
Tabassum et al. [33] applied online learning to Just-In-Time software
defect prediction models in order to compare the outcomes of three
different proposed methods. The authors used majority voting (simply
counting the majority). The study assumes that the prediction models are
used not only in the testing phase but throughout the whole software
development process. Most of the datasets used in the study were
collected from open-source projects over a period of 6-14 years. It is
natural to assume that the performance of the prediction models will
vary over time, because modules are modified considerably over time. In
such a situation, online learning is considered more appropriate than
offline approaches. Wang et al. [34] applied online oversampling and
undersampling techniques to software defect prediction in order to
address the class imbalance issue in defect datasets.
However, these previous studies do not take the accuracy of the
prediction models into consideration to optimize the models. Our study
is different in that we assume the prediction models do not change over
time and, based on this assumption, the accuracy of the prediction
models is taken into consideration.
BA might be regarded as a sort of feedback control. Xiao et al. [36]
proposed defect prediction based on feedback from the software testing
process. The feedback in that study means that test results are used as
a learning dataset; it does not refer to the accuracy of the prediction
models, as in [33][34]. Additionally, their method uses only one model
and does not select a model from a set of candidates. That is, the
method is regarded as online learning.
Dynamic model selection: Some previous studies [14][28] focused on the
issue that prediction models do not achieve the best possible
performance across different datasets. In these studies, the focus was
on the dynamic selection of prediction models from a set of available
models. Di Nucci et al. [14] selected prediction models dynamically to
predict defective modules, whereas Rathore et al. [28] did the same to
predict the number of defects. However, the selection process was based
on the characteristics of the prediction target modules, such as code
and design metrics (i.e., independent variables), and it did not
consider the accuracy of the prediction models (i.e., the dependent
variable) to optimize the prediction. Additionally, these studies did
not apply online learning, and therefore the selection process itself
does not change during the prediction (i.e., the process is static).
Ensemble methods: In addition to online learning, ensemble methods such
as [27] are somewhat similar to BA. However, they do not consider
prediction accuracy on the test dataset. Therefore, ensemble methods do
not solve the external validity issue [12]. BA can use various ensemble
methods, such as stacking and voting, as its arms. That is, we can use
BA and ensemble methods together.
10 Conclusions
Using feature reduction techniques can improve the accuracy of software
defect prediction models. In this paper, we propose to apply the bandit
algorithm (BA) to select a suitable feature reduction technique for
defect prediction. Intuitively speaking, BA dynamically selects the best
technique from a set of candidate techniques, based on the comparison of
test results and prediction results on tested modules. Therefore, BA is
expected to suppress degradation of accuracy. In this regard, we
proposed BANP (Bandit Algorithm Considering Overlooking by Negative
Prediction). In software testing, when the prediction outcome of a
module is negative (i.e., non-defective), the module is not tested
thoroughly, and defects are often overlooked. Overlooking such potential
defects negatively affects the selection of techniques by BA. BANP
forcibly sets the prediction as positive during the early iterations of
BA, to suppress such potential negative influence.
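As a minimal sketch of this idea (our own illustration, not the exact implementation), the selection loop could treat predictions as positive for the first few tested modules so that negative predictions, whose overlooked defects would distort the rewards, do not dominate the early reward estimates. The warm-up length of 10 modules and the function name are assumptions.

```python
WARMUP_MODULES = 10  # assumed length of the early phase; not stated here in the paper

def banp_prediction(raw_prediction, iteration):
    # One possible reading of BANP (illustrative): during the early
    # iterations, the prediction is forcibly treated as positive
    # (defect-prone), so that defects overlooked after negative
    # predictions do not bias the rewards used to compare the
    # candidate reduction techniques.
    if iteration < WARMUP_MODULES:
        return 1  # force positive
    return raw_prediction
```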
In our experiment, we selected 14 datasets to perform CVDP. We applied
four well-known feature reduction techniques: CFS, ConFS, AIC stepwise,
and BIC stepwise. As existing approaches, we built six kinds of models
based on control (i.e., a survey of previous studies), experiments
(i.e., evaluation of the accuracy of the techniques on the learning
dataset), ad hoc selection, and ordinary BA. We compared their
prediction accuracy with that of BANP.
As a result, the prediction accuracy of BANP was higher than that of
naïve BA. The accuracy of BANP was almost the same as or higher than
that of the existing approaches. That is, BANP can reduce the effort of
assessing reduction techniques while avoiding degradation of prediction
accuracy. The approach suggested here can help in selecting a suitable
feature reduction technique that improves the overall accuracy of the
prediction model.
ACKNOWLEDGMENTS
This research is partially supported by the Japan Society for the
Promotion of Science [Grants-in-Aid for Scientific Research (C) and (S),
No. 21K11840 and No. 20H05706]. A. Tahir is also partially supported by
a NZ National Science Challenge grant.
REFERENCES
[1] A. Abdurazik, and J. Offutt, 2009. Using Coupling-Based Weights for the
Class Integration and Test Order Problem. The Computer Journal 52, 5, 557-
570.
[2] H. Akaike, 1974. A new look at the statistical model identification. IEEE
Transactions on Automatic Control 19, 6, 716-723.
[3] S. Albahli, G. Yar, 2022. Defect Prediction using Akaike and Bayesian
Information Criterion, Computer Systems Science & Engineering 41, 3, 1117-
1127.
[4] T. Asano, M. Tsunoda, K. Toda, A. Tahir, K. Bennin, K. Nakasai, A. Monden,
and K. Matsumoto, 2021. Using Bandit Algorithms for Project Selection in
Cross-Project Defect Prediction. In Proc. of International Conference on
Software Maintenance and Evolution (ICSME). 649-653.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer, 2002. Finite-time Analysis of the
Multiarmed Bandit Problem. Machine Learning 47, 235-256.
[6] R. Bali, D. Sarkar, B. Lantz, and C. Lesmeister. 2016. R: Unleash Machine
Learning Techniques. Packt Publishing.
[7] A. Balogun, S. Basri, S. Mahamad, S. Abdulkadir, M. Almomani, V. Adeyemo,
Q. Al-Tashi, H. Mojeed, A. Imam, A. Bajeh, 2020. Impact of Feature Selection
Methods on the Predictive Performance of Software Defect Prediction Models:
an Extensive Empirical Study. Symmetry 12, 7, 1147.
[8] R. Busa-Fekete, and E. Hüllermeier, 2014. A Survey of Preference-Based
Online Learning with Bandit Algorithms. Algorithmic Learning Theory ALT
2014, Lecture Notes in Computer Science 8776, 18-39.
[9] B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. 2012. The
PROMISE repository of empirical software engineering data.
[10] S. Chidamber, and C. Kemerer, 1994. A metrics suite for object oriented design.
IEEE Transactions on Software Engineering 20, 6, 476-493.
[11] G. Claeskens, C. Croux, and J. Kerckhoven. 2006. Variable Selection for
Logistic Regression Using a Prediction-Focused Information Criterion.
Biometrics 62, 4, 972-979.
[12] M. D’Ambros, M., Lanza, and R. Robbes, 2012. Evaluating defect prediction
approaches: a benchmark and an extensive comparison. Empirical Software
Engineering 17, 4-5, 531-577.
[13] M. Dash, and H. Liu, 2003. Consistency-based search in feature selection.
Artificial Intelligence 51, 1-2, 155-176.
[14] D. Di Nucci, F. Palomba, R. Oliveto and A. De Lucia, 2017. Dynamic Selection
of Classifiers in Bug Prediction: An Adaptive Method, IEEE Transactions on
Emerging Topics in Computational Intelligence 1, 3, 202-212.
[15] M. Felderer, and R. Ramler, 2014. Integrating risk-based testing in industrial
test processes. Software Quality Journal 22, 3, 543-575.
[16] B. Ghotra, S. McIntosh, and A. Hassan, 2015. Revisiting the impact of
classification techniques on the performance of defect prediction models. In
Proc. of International Conference on Software Engineering (ICSE). 789-800.
[17] B. Ghotra, S. McIntosh, and A. Hassan, 2017. A Large-Scale Study of the
Impact of Feature Selection Techniques on Defect Classification Models. In
Proc. of International Conference on Mining Software Repositories (MSR).
146-157.
[18] M. Hall. 1999. Correlation-based Feature Subset Selection for Machine
Learning, PhD dissertation, Department of Computer Science, University of
Waikato.
[19] T. Hayakawa, M. Tsunoda, K. Toda, K. Nakasai, A. Tahir, K. Bennin, A.
Monden, and K. Matsumoto, 2021. A Novel Approach to Address External
Validity Issues in Fault Prediction Using Bandit Algorithms. IEICE
Transactions on Information and Systems E104.D, 2, 327-331.
[20] E. Hazan. 2016. Introduction to Online Convex Optimization. Foundations and
Trends in Optimization 2, 3-4, 157-325.
[21] Information-technology Promotion Agency (IPA), Japan. 2018. The 2018-
2019 White Paper on Software Development Projects. IPA (in Japanese).
[22] S. Jiang, M. Zhang, Y. Zhang, R. Wang, Q. Yu, and J. Keung, 2021. An
Integration Test Order Strategy to Consider Control Coupling. IEEE
Transactions on Software Engineering 47, 7, 1350-1367.
[23] M. Kondo, C. Bezemer, Y. Kamei, A. Hassan, and O. Mizuno, 2019. The
impact of feature reduction techniques on defect prediction models. Empirical
Software Engineering 24, 4, 1925-1963.
[24] P. Li, J. Herbsleb, M. Shaw, and B. Robinson, 2006. Experiences and results
from initiating field defect prediction and product test prioritization efforts at
ABB Inc. In Proc. of international conference on Software engineering (ICSE).
413–422.
[25] J. Ma, L. Saul, S. Savage, and G. Voelker. 2009, Identifying suspicious URLs:
an application of large-scale online learning. In Proc. of Annual International
Conference on Machine Learning (ICML). 681-688.
[26] S. Mahfuz. 2016. Software Quality Assurance - Integrating Testing, Security,
and Audit, CRC Press.
[27] F. Matloob, T. Ghazal, N. Taleb, S. Aftab, M. Ahmad, M. Khan, S. Abbas,
and T. Soomro, 2021. Software Defect Prediction Using Ensemble Learning: A
Systematic Literature Review. IEEE Access 9, 98754-98771.
[28] S. Rathore and S. Kumar, 2019. An Approach for the Prediction of Number of
Software Faults Based on the Dynamic Selection of Learning Techniques.
IEEE Transactions on Reliability 68, 1, 216-236.
[29] G. Schwarz, 1978. Estimating the Dimension of a Model. Annals of Statistics
6, 2, 461-464.
[30] S. Shalev-Shwartz, 2011. Online Learning and Online Convex Optimization.
Foundations and Trends in Machine Learning 4, 2, 107-194.
[31] M. Shepperd, D. Bowes, and T. Hall, 2014. Researcher Bias: The Use of
Machine Learning in Software Defect Prediction. IEEE Transactions on
Software Engineering 40, 6, 603-616.
[32] R. Sutton, and A. Barto. 1998. Reinforcement Learning: An Introduction. A
Bradford Book.
[33] S. Tabassum, L. Minku, D. Feng, G. Cabral, and L. Song, 2020. An
Investigation of Cross-Project Learning in Online Just-In-Time Software
Defect Prediction. In Proc. of International Conference on Software
Engineering (ICSE).
[34] S. Wang, L. Minku, and X. Yao, 2013. Online Class Imbalance Learning and
Its Applications in Fault Detection. International Journal of Computational
Intelligence and Applications 12, 4.
[35] J. White. 2012. Bandit Algorithms for Website Optimization: Developing,
Deploying, and Debugging. O'Reilly Media.
[36] P. Xiao, B. Liu, and S. Wang, 2018. Feedback-based integrated prediction:
Defect prediction based on feedback from software testing process. Journal of
Systems and Software 143, 159-171.
[37] T. Zimmermann, and N. Nagappan, 2018. Predicting defects using network
analysis on dependency graphs. In Proc. of International Conference on
Software Engineering (ICSE). 531-540.