
A Learning Algorithm for Change Impact Prediction: Experimentation on 7 Java Applications

Abstract

Change impact analysis consists in predicting the impact of a code change in a software application. In this paper, we take a learning perspective on change impact analysis and consider the problem formulated as follows. The artifacts that are considered are methods of object-oriented software; the change under study is a change in the code of the method, the impact is the set of test methods that fail because of the change that has been performed. We propose an algorithm, called LCIP, that learns from past impacts to predict future impacts. To evaluate our system, we consider 7 Java software applications totaling 214,000+ lines of code. We simulate 17574 changes and their actual impact through code mutations, as done in mutation testing. We find that LCIP can predict the impact with a precision of 69% and a recall of 79%, corresponding to an F-score of 55%.
A Learning Algorithm for Change Impact Prediction
Vincenzo Musco, Antonin Carette, Martin Monperrus, Philippe Preux
University of Lille and INRIA
Lille, France
Contact: vincenzo.musco@inria.fr
ABSTRACT
Change impact analysis (CIA) consists in predicting the impact of a code change in a software application. In this paper, the artifacts that are considered for CIA are methods of object-oriented software; the change under study is a change in the code of a method, and the impact is the set of test methods that fail because of the change that has been performed. We propose LCIP, a learning algorithm that learns from past impacts to predict future impacts. To evaluate LCIP, we consider Java software applications that are strongly tested. We simulate 6000 changes and their actual impact through code mutations, as done in mutation testing. We find that LCIP can predict the impact with a precision of 74% and a recall of 85%, corresponding to an F-score of 64%. This shows that taking a learning perspective on change impact analysis lets us achieve good precision and recall in change impact analysis.
1. INTRODUCTION
Change impact analysis consists in predicting the impact
of a code change in a software application [1, 2]. There are
different kinds of impacts: the software modules that may
be broken [1], the documentation that may be impacted [3],
and the developers who must be kept informed about this change
[4]. In this paper, we focus on the first kind: impacts at the
code level.
There are two important challenges for change impact
analysis. First, it must scale to the size of today's software systems, which often consist of thousands of modules.
Second, the predictions must be accurate. The accuracy of
impact prediction has two dimensions: whether predicted
impacted elements are actually impacted (false positives)
and whether actually impacted elements are all predicted
(false negatives). This boils down to the classical informa-
tion retrieval measures: precision, recall, and F-score. Our
motivation is to design a change impact prediction system
that addresses both challenges: scaling to large systems while
retaining a high accuracy (high precision, high recall, high
F-score).
We favor a learning perspective for the problem of change
impact analysis. By learning, we mean that the change im-
pact analysis could exploit previous actual impacts to form
some kind of knowledge. This knowledge then guides the
prediction of future impacts. In this respect, a learning-based change impact analysis technique is the opposite of a priori approaches, which specify upfront when and how an impact propagates, such as those based on program dependence graphs [5], to name but a few.
In this paper, we consider change impact analysis formu-
lated as follows. The artifacts that are considered are meth-
ods of object-oriented software; the change under study is
a change in the code of the method, the impact is the test
methods that fail because of the change that has been per-
formed. We propose an algorithm, called LCIP, that learns
from past impacts to predict future impacts. For instance,
the algorithm may learn that a change in method foo is al-
ways followed by a failure observation in bar. Our algorithm
is based on the call graph, where each node is a method and
each edge a potential call between two methods, as observed
statically. We consider a standard Class-Hierarchy-Analysis
(CHA) call graph. The learning strategy consists of decorat-
ing each call graph edge with a weight between zero and one.
This weight represents the likelihood of an edge to propagate
an impact. The weight is updated based on actual impacts
that are given as training data to the system.
To evaluate our system, we consider 2 Java software ap-
plications totaling 120,000+ lines of code. We simulate 6000
changes and their actual impact through code mutations, as
done in mutation testing. We use ten-fold cross validation
to measure the precision, recall and F-score of change im-
pact prediction. We compare our system to two algorithms:
one being a standard transitive closure on the call graph,
the other one is a basic learning strategy that we introduce
for the sake of understanding the actual learning that hap-
pens in LCIP. We find that LCIP can predict the impact
with a precision of 74%, a recall of 85%, corresponding to
an F-score of 64%. This validates our intuition that learning
can be done for change impact analysis. In addition, our
approach does not trade performance for accuracy: learning takes on average 26 seconds and prediction 256 milliseconds. This indicates that our approach may scale up to very large systems.
To sum up, our contributions are:
• An algorithm for learning the impact of software changes. It is based on decorating call graph edges with a weight representing the likelihood of propagating a change. Mutants are used to learn those weights.
• An experiment on 2 popular Java programs, totaling 120,000+ lines of code. The experiment shows that learning is a promising approach for change impact prediction.
• We publish all our code as open-source. The frameworks for mutation, impact prediction and learning can foster future research in this direction.
The remainder of this paper is structured as follows. In
Section 2, we present concepts used in this paper. In Sec-
tion 3, we introduce our learning approach. In Section 4,
we present our framework and dataset. We also pose our
research question and answer it by presenting and analyzing
our experimental results. In Section 5, we present works re-
lated to our paper in fields such as machine learning, impact
analysis and mutation testing. In Section 6, we conclude this
paper.
2. BACKGROUND
2.1 Change Impact Analysis
Software is made of interconnected pieces of code (e.g.
methods, classes, etc.). Through those connections, the ef-
fects of a change to a given part of the code can propagate to
many other parts of the software. Those other parts, which
can be potentially anywhere in the software, can then be
impacted by the initial change, acting like a ripple. Change
Impact Analysis (a.k.a. CIA) is defined as “the determina-
tion of potential effects to a subject system resulting from a
proposed software change” [6].
In this paper, we use Bohner's definition of the basic software change impact analysis process [6]. Assuming a change has been performed, Bohner defines the following sets used in impact analysis: (i) the "starting impact set" (SIS) is the list of software parts which can be impacted by the change; (ii) the "candidate impact set" (CIS) (also called the "estimated impact set" [7]) is the list of software parts predicted as impacted by a change impact analysis technique; (iii) the "actual impact set" (AIS) is the list of parts of the software which are actually impacted by the change; (iv) the "false negative impact set" (FNIS) is the list of impacts missed by the technique (Bohner named this set the "discovered impact set" (DIS), but this naming is not appropriate in our context and may be confusing); (v) the "false positive impact set" (FPIS) is the list of over-estimated impacts returned by the technique (i.e. false positives).
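Restated in set notation (this is only a rewriting of the definitions above, using the same symbols):

FNIS = AIS \ CIS    (impacts missed by the technique)
FPIS = CIS \ AIS    (false positives returned by the technique)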
2.2 Mutation Analysis
Mutation analysis [8] consists of assessing test suite qual-
ity by applying minor changes to a software resulting in
slightly modified versions of the software which are called
software mutants. These changes result from the applica-
tion of a mutation operator which describes what should be
changed in the code, and how it should be modified.
A standard mutation testing scenario is that (i) the soft-
ware tests are run on software mutants to ensure the change
is noticed by at least one test and (ii) if no test kills a mu-
tant, complementary tests are written.
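As a toy illustration (our own example, not taken from the subject programs of Section 4), consider the following method and one of its mutants:

// Original method
int price(int quantity, int unitPrice) {
    return quantity * unitPrice;
}

// Mutant: the arithmetic operator has been replaced (AOR, see Section 4.1.4)
int price(int quantity, int unitPrice) {
    return quantity + unitPrice;
}

// A test asserting that price(2, 3) == 6 fails on the mutant: the mutant is "killed".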
2.3 Call Graph
Grove et al. [9] define “the program call graph [as] a di-
rected graph that represents the calling relationships between
the program’s procedures (...) each procedure is represented
by a single node in the graph. Each node has an indexed
set of call sites, and each call site is the source of zero or
more edges to other nodes, representing possible callees of
that site”. This is the definition adopted in this paper.
As we study Java applications, which are object-oriented, we also take class hierarchy analysis (a.k.a. CHA) into consideration, i.e. inheritance and the use of interfaces.
3. CONTRIBUTION
In this section, we describe the Learning Change Impact
Prediction (LCIP) algorithm: a new learning algorithm for
change impact analysis. This two-phase algorithm is based on decorating call graph edges with weights.
3.1 Approach Overview
We use the call graph to estimate the change and er-
ror propagation. Our key insight is that error propagation
through a call graph edge is not systematic: some edges may
propagate errors, others not.
Thus, we propose a stochastic call graph by adding weights
on the call graph edges. Those weights are first learned from
existing data, and then used to compute impacts. These
weights range from 0 to 1 where 0 means that the edge
never propagates the error, while 1 means that the edge always propagates it. Any value in between means that propagation sometimes occurs and sometimes does not. The initial
weights are set to 0 meaning that we start by assuming that
no impact is propagated at all. In this paper, we consider
that the impacted nodes are always test methods; this is
motivated by the fact that broken assertions unambiguously
denote impacted behavior.
Our approach is thus composed of two distinct phases.
The learning phase consists of learning the weights based
on a set M of changes and their impacts. In our graph-
based context, a change is modeled as a modified node and
an impact is a set of nodes whose behavior is impacted by
the change. The prediction phase is when, given a call
graph and a specific change in a method by a developer (i.e.
on a node of the call graph), an impact prediction algorithm
computes the candidate impacted set CIS composed of all
test nodes using the weights. The prediction represents all
tests that may be broken by the change.
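As a minimal sketch of the data structure behind this idea (our own illustration, not the actual implementation of our framework), the weighted call graph can be stored as an adjacency map whose edges carry a weight in [0, 1]. In this sketch, edges are oriented in the direction in which an impact may propagate, i.e. from a method towards its callers; this orientation is an assumption of the sketch.

import java.util.*;

// A weighted call graph: node -> (neighbor -> weight in [0, 1]).
// All weights start at 0: initially we assume that no edge propagates an impact.
class WeightedCallGraph {
    final Map<String, Map<String, Double>> edges = new HashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, 0.0);
    }

    double weight(String from, String to) {
        return edges.getOrDefault(from, Map.of()).getOrDefault(to, 0.0);
    }

    void setWeight(String from, String to, double w) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, w);
    }
}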
Figure 1 is an example of a call graph. Three types of
nodes are presented: application nodes (plain circle), test
nodes (circle with a T) and the changed node (double circle).
We have four methods: mul, which computes a multiplication, and two methods depending on it, pow and fac, which compute the power and the factorial, respectively. Then, a test function is defined for each method. The op method uses reflection to call mul; this call is not resolved statically, resulting in the absence of an edge between op and mul.
Consider that an error is introduced by a developer in the
mul method. Bohner's sets are illustrated in the figure. The SIS is made of all software test nodes; the top part of the horizontal line shows the SIS. The AIS (thin rectangle) is composed of the test nodes which fail when actually running the software tests. The CIS (thick rectangle) is composed of the candidate test nodes determined by a virtual change impact prediction technique (in this example it is an arbitrary set). The FNIS (dashed rectangle) shows a false negative test due to the absence of an edge in the call graph. The FPIS (dotted rectangle) shows a false positive case: the pow method seems to be impacted according to the change impact analysis technique (as it is part of the CIS), but when running the tests, it did not fail.

Figure 1: Example of a call graph in which a change has been introduced. The call graph includes application nodes (bottom), test nodes (top) and call graph edges. The rectangles illustrate Bohner's sets used in Change Impact Analysis.
3.2 Learning
The LCIP learning phase requires as input data: (i) a call graph G as described in Section 2.3; (ii) a set M made of pairs containing a changed method and the impacted nodes (other methods): M = {(m, AIS)}, with one pair per changed version S of the software, where m is the method on which the change has occurred in S and AIS is the set of methods which have been impacted by the change at run time.
M can be seen as a set of "examples" that our impact learning algorithm uses as its input. The algorithm estimates the weights to assign to the edges of the call graph in order to accurately predict the propagation of changes. Our call graph weight learning algorithm is shown in Algorithm 1.
Algorithm 1: The call graph weight learning algorithm. update_weight is a sub-algorithm which updates the weights.
Input: G the call graph and M the data composed of changed points and their actual impacts.
Output: a weighted graph
1 begin
2     L ← G with weights = 0 for all edges
3     for each (m, AIS) ∈ M do
4         for each t ∈ AIS do
5             update_weight(m, t, L)
6     return L
For each changed node m in the program (line 3) and each actually impacted node t (line 4), we update the weights of the edges belonging to all paths from m to t using an update algorithm (line 5). In this paper, we propose two algorithms for updating the weights.
Figure 2: An illustration of impact prediction based on weighted call graphs. Edges with a low weight (< 0.2) are considered not to propagate the impact of the change.
Figure 2 illustrates our impact analysis technique based
on decorating call graphs with weights. It is based on real
data obtained in our experiments. Assume that a change has
occurred at the method denoted by a blue cross. The edges
are method calls, and the thickness of the edges represents
the weight of the edge after learning. A thicker edge means
a weight close to 1 and a thinner means a weight close to
0.2. The dashed edges are those which have a weight smaller
than 0.2 and which are not considered for propagation. The
green squares and red diamonds represent nodes predicted
as impacted by our approach. The prediction is highly accurate because the weights of the two edges (a) and (b) are low after learning. Consequently, the impact of the change is stopped and does not ripple to the left-hand side of the graph. On the contrary, a basic impact prediction based on a transitive closure of the call graph would predict far too many tests (all the true negatives would become false positives).
3.2.1 Binary Update Algorithm
This algorithm assumes a binary impact propagation: im-
pact is propagated or not. Thus, the model consists in as-
signing 0 or 1 to the edge weights as follows. If at least one
impact has been observed between a graph node and the
changed node, then all edges belonging to all paths going
from the former to the latter are labeled as 1, otherwise,
these edges are labeled with a 0. Thus, this approach considers that if an edge has once propagated an error, it will always do so (as 1 is always greater than or equal to the threshold, whatever its value). Algorithm 2 formalizes this idea.
This algorithm is deterministic. Thus, running the same al-
gorithm on the same data several times will always produce
the same results.
Algorithm 2: Algorithm Binary for updating the edge weights in paths between a node and the changed point.
Input: m the node which has changed, t the considered impacted node, and L the weighted call graph
Output: the weights of L are updated
begin
    for each edge in all paths from m to t in L do
        w_edge ← 1

3.2.2 Dichotomic Update Algorithm
We now explore a more realistic model where error propagation is conditioned by the current state (e.g. the error propagates if coming from the if-branch of a condition, and is not propagated if coming from the else branch). This means that some edges propagate errors but only sometimes, in particular cases. This is represented by a weight that is neither 0 nor 1 but in between.
The Dichotomic algorithm updates the weights according
to an estimation of the probability that a node would be
broken by a change. This estimation is based on training
data as follows: p_{t,m} = α_t / β_m, where α_t is the number of times the node t is impacted over all changes occurring on the same method m, and β_m is the number of times the method m has been changed.
The idea of Algorithm Dichotomic (Algorithm 3) is to slowly converge to p_{t,m} example after example. For each training example, the weight w of each edge which belongs to a path between the changed node m and an impacted node t is updated in a dichotomic way: the new weight is the mean of the current weight and the empirical probability.
Note that this approach cannot be used online, as we require a set of changes and their impacts in order to compute the weights.
Algorithm 3: Algorithm Dichotomic for updating the edge weights between a node and the changed point.
Input: m the node which has changed, t the considered impacted node, p_{t,m} the empirical probability for (m, t), and L the weighted call graph
Output: the weights of L are updated
begin
    for each edge in all paths from m to t in L do
        w_edge ← (w_edge + p_{t,m}) / 2
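To make the two update rules concrete, here is a minimal Java sketch of the learning phase (our own illustration, under simplifying assumptions: methods are identified by plain strings, and "all paths from m to t" are enumerated by a depth-first search, which is only tractable for small graphs; the open-source framework referenced in Section 4.1 is the actual implementation).

import java.util.*;

// Sketch of the LCIP learning phase (Algorithms 1-3), using the representation
// node -> (neighbor -> weight), with all weights initially 0.
class LcipLearning {

    // Algorithm 1: for every training pair (changed method m, impacted node t),
    // update the weight of each edge on each path from m to t.
    static void learn(Map<String, Map<String, Double>> graph,
                      List<Map.Entry<String, Set<String>>> examples,
                      boolean binary) {
        // Empirical probabilities p_{t,m} = alpha_t / beta_m for the Dichotomic rule.
        Map<String, Integer> beta = new HashMap<>();   // number of changes per method m
        Map<String, Integer> alpha = new HashMap<>();  // number of impacts of t for changes of m
        for (var ex : examples) {
            beta.merge(ex.getKey(), 1, Integer::sum);
            for (String t : ex.getValue()) alpha.merge(ex.getKey() + "#" + t, 1, Integer::sum);
        }
        for (var ex : examples) {
            String m = ex.getKey();
            for (String t : ex.getValue()) {
                double p = alpha.get(m + "#" + t) / (double) beta.get(m);
                for (List<String> path : allPaths(graph, m, t)) {
                    for (int i = 0; i + 1 < path.size(); i++) {
                        Map<String, Double> out = graph.get(path.get(i));
                        double w = out.get(path.get(i + 1));
                        // Binary rule: an edge that propagated once gets weight 1 forever.
                        // Dichotomic rule: move halfway towards the empirical probability.
                        out.put(path.get(i + 1), binary ? 1.0 : (w + p) / 2.0);
                    }
                }
            }
        }
    }

    // All simple paths from 'from' to 'to' (depth-first enumeration).
    static List<List<String>> allPaths(Map<String, Map<String, Double>> graph,
                                       String from, String to) {
        List<List<String>> paths = new ArrayList<>();
        dfs(graph, from, to, new ArrayDeque<>(List.of(from)), new HashSet<>(List.of(from)), paths);
        return paths;
    }

    private static void dfs(Map<String, Map<String, Double>> graph, String node, String to,
                            Deque<String> current, Set<String> visited, List<List<String>> paths) {
        if (node.equals(to)) {
            paths.add(new ArrayList<>(current));
            return;
        }
        for (String next : graph.getOrDefault(node, Map.of()).keySet()) {
            if (visited.add(next)) {
                current.addLast(next);
                dfs(graph, next, to, current, visited, paths);
                current.removeLast();
                visited.remove(next);
            }
        }
    }
}

For instance, the hypothetical training example Map.entry("mul", Set.of("testMul", "testFac")) would encode that a change in mul made testMul and testFac fail.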
3.3 Prediction
At prediction time, LCIP is based on Algorithm 4. This
algorithm takes as input the node n corresponding to the point in the code where the change has occurred and a threshold value th lying in the range [0, 1]. It returns a candidate impacted set CIS composed of all nodes pre-
dicted as impacted. To do so, starting at the node being
changed, the graph edges are followed to determine which
nodes can be reached. The weights are used to prune some
edges which are unlikely to propagate the change, according
to the threshold value (line 6). If the weight is lower than
the threshold value, the error does not propagate across the
edge; otherwise the edge propagates the error.
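Algorithm 4 below gives the pseudocode. As an illustration only (a sketch under the same assumptions as the previous listings, reusing the node -> (neighbor -> weight) representation), the prediction boils down to a graph traversal that ignores edges whose learned weight is below the threshold:

import java.util.*;

// Sketch of the prediction phase (Algorithm 4): starting from the changed node,
// follow only edges whose learned weight reaches the threshold, and collect the
// reachable nodes as the candidate impact set (CIS).
class LcipPrediction {
    static Set<String> predict(Map<String, Map<String, Double>> graph,
                               String changedNode, double threshold) {
        Set<String> cis = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>(List.of(changedNode));
        Deque<String> todo = new ArrayDeque<>(List.of(changedNode));
        while (!todo.isEmpty()) {
            String node = todo.pop();
            for (var edge : graph.getOrDefault(node, Map.of()).entrySet()) {
                if (edge.getValue() < threshold) continue;  // prune unlikely edges
                if (visited.add(edge.getKey())) {
                    cis.add(edge.getKey());
                    todo.push(edge.getKey());
                }
            }
        }
        // LCIP only reports test methods; filtering out non-test nodes is omitted here.
        return cis;
    }
}

Note that if the weights are ignored (or if every weight were 1), this traversal degenerates into the transitive-closure baseline described in Section 4.1.2.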
Algorithm 4: The impact prediction algorithm that uses the learned weights of the call graph.
Input: L the weighted call graph, n the changed node, th the threshold
Output: the set of nodes which are predicted as impacted
1 begin
2     CIS ← {}
3     for each node i connected to n in L do
4         if i is not visited then
5             mark node i as visited
6             if w_i ≥ th then
7                 CIS ← CIS ∪ {i}
8                 CIS ← CIS ∪ visit(i)
9     return CIS

4. EVALUATION
We now present the evaluation of LCIP for impact prediction. We are especially interested in answering the following research question: To what extent does LCIP improve the accuracy of impact prediction?
We want to determine whether our prediction algorithm
based on learning can improve the prediction scores com-
pared to the use of a baseline transitive closure prediction
technique described in Section 4.1.2.
4.1 Experimental Protocol
We explain how we evaluate our approach, the dataset
and the configuration parameters we use. In this paper, we
use mutation injection to simulate changes in the software
in order to learn the error propagation profile.
The evaluation follows several steps: (i) we create mu-
tants for a software application, they simulate changes; (ii) we
extract the corresponding call graphs with and without class
hierarchy analysis; (iii) we split the dataset of mutants in a
training set and a testing set; (iv) we run our learning algo-
rithm based on the mutants of the training set. This results
in a weighted call graph.
Then, for each mutant of the testing set: (i) we compute
the actual impact set by running the original test cases that
come with the software package under study; (ii) we predict
the impact set for each mutant with our technique, using
the weights learned in the previous step; (iii) we compute
performance metrics by comparing the predicted impact set
and the actual impact set.
The idea of creating mutants is to create synthetic changes that have an impact observable in test cases; this is further discussed in Section 4.3. In addition, we use 10-fold cross-validation
[10]: for each software, we partition the mutants into 10
subsets of equal size. We take 9 subsets to train the model
(with Algorithm 1) and the one remaining is used to as-
sess the model (with Algorithm 4). This process is run 10
times. We compute the mean value of the evaluation met-
rics considered over these 10 runs. For Dichotomic, we use a
threshold value of 0.2 for prediction. This value is the best
one according to a systematic grid search of all thresholds
ranging from 0 to 1 with an increment of 0.1.
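As a sketch of this selection step (the evaluation function is a hypothetical stand-in, not an API of our framework; it is assumed to return the mean F-score over the 10 folds for a given threshold):

import java.util.function.DoubleUnaryOperator;

// Grid search over the prediction threshold for the Dichotomic variant:
// try 0.0, 0.1, ..., 1.0 and keep the value with the best cross-validated F-score.
class ThresholdSelection {
    static double bestThreshold(DoubleUnaryOperator meanFscoreAt) {
        double best = 0.0, bestF = Double.NEGATIVE_INFINITY;
        for (int step = 0; step <= 10; step++) {
            double th = step / 10.0;                   // avoids floating-point drift
            double f = meanFscoreAt.applyAsDouble(th); // mean F-score over the 10 folds
            if (f > bestF) { bestF = f; best = th; }
        }
        return best;
    }
}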
Our mutation, learning and evaluation framework is open-source and freely available online: https://github.com/v-m/PropagationAnalysis and https://github.com/k0pernicus/PropL.
Table 1: Precision, recall and F-score obtained for change impact prediction with three techniques: transitive
closure (TC), Algorithm Binary and Algorithm Dichotomic. The bold faced values indicate the best results
for a single metric (up to rounding precision).
Precision Recall F-score
Package Op. #mut. TC Bin Dic TC Bin Dic TC Bin Dic
Collections ABS 600 0.55 0.75 0.81 0.80 0.80 0.83 0.43 0.59 0.67
AOR 600 0.65 0.78 0.81 0.68 0.70 0.63 0.44 0.55 0.51
LCR 600 0.56 0.69 0.72 0.77 0.78 0.77 0.43 0.51 0.53
ROR 600 0.59 0.80 0.84 0.64 0.68 0.66 0.40 0.55 0.56
UOI 600 0.63 0.80 0.81 0.68 0.70 0.70 0.44 0.57 0.56
All 3000 0.60 0.76 0.80 0.71 0.73 0.72 0.43 0.55 0.57
Lang ABS 600 0.30 0.68 0.69 0.99 0.99 0.98 0.33 0.69 0.70
AOR 600 0.47 0.65 0.67 0.99 0.99 0.99 0.52 0.68 0.70
LCR 600 0.38 0.55 0.59 0.98 0.96 0.93 0.43 0.59 0.62
ROR 600 0.49 0.66 0.67 0.98 0.97 0.97 0.54 0.68 0.69
UOI 600 0.51 0.74 0.75 0.98 0.98 0.97 0.55 0.75 0.77
All 3000 0.43 0.66 0.67 0.98 0.98 0.97 0.47 0.68 0.70
Total 6000 0.52 0.71 0.74 0.85 0.86 0.85 0.45 0.62 0.64
4.1.1 Evaluation Metrics
To assess and compare the performance of our impact pre-
diction techniques, we use metrics computed on CIS and
AIS presented by Arnold and Bohner [7].
The size of the CIS has to be as close as possible to the size of the AIS, as it quantifies the number of elements retrieved by the impact analysis. We express this in terms of precision, recall and F-score [11]. There is one preci-
sion, recall and F-score per mutant of the testing set. We
then compute average precision, recall and F-score over all
mutants of a fold. The precision P is the proportion of predicted tests which are actually impacted. The recall R is the proportion of truly impacted tests that are retrieved. P and R are computed using Equation (1):

P = |AIS ∩ CIS| / |CIS|;    R = |AIS ∩ CIS| / |AIS|.    (1)
To make comparisons easier, the F-score F is the harmonic mean of P and R. Since precision, recall and F-score are computed for each mutant, it is necessary to consider a mean value over all mutants of the testing set. Our key goal is to improve the F-score of impact prediction, as it takes into consideration both precision and recall.
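A minimal sketch of this computation for one mutant (our own illustration; the convention used for empty sets is an assumption of the sketch):

import java.util.*;

// Precision, recall and F-score for one mutant, from the actual impact set (AIS)
// and the candidate impact set (CIS), following Equation (1).
class ImpactMetrics {
    static double[] precisionRecallFscore(Set<String> ais, Set<String> cis) {
        Set<String> common = new HashSet<>(ais);
        common.retainAll(cis);                                   // AIS ∩ CIS
        double p = cis.isEmpty() ? 0.0 : (double) common.size() / cis.size();
        double r = ais.isEmpty() ? 0.0 : (double) common.size() / ais.size();
        double f = (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r); // harmonic mean
        return new double[] { p, r, f };
    }
}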
4.1.2 Baseline
As a baseline for change impact analysis, we compute the
transitive closure of call graph nodes from the mutated node
[1]. The result of the transitive closure is a list of all nodes
potentially impacted by the change. Since the impact is only
computed on test nodes, we remove application nodes from
the impacted nodes in the transitive closure.
4.1.3 Dataset and Setup
In this paper, we have selected 2 Java software packages
from the Apache Commons family: Apache Commons Lang
3.5 (git #6965455) and Apache Commons Collections 4.1
(svn r1610049). Using the open-source tool cloc, we com-
puted a total of 122590 LOC for both projects (67509 for
Lang and 55081 for Collections). The first one contains
2657 tests and the second one 5262 tests. The call graph
size is 6195 nodes for Lang and 6271 for Collections with
9653 edges for Lang and 12130 for Collections.
4.1.4 Mutation Operators
We consider the 5 classical mutation operators validated
by King and Offutt [12], adapted to the Java language (as those operators were originally intended for Fortran). As shown
by Offutt et al., these five mutation operators are sufficient
to effectively implement mutation testing [13]. The five mutation operators are: (i) Absolute value insertion (ABS), in
which each numerical expression or literal is replaced by its
absolute value. (ii) Arithmetic operator replacement (AOR)
in which each arithmetic expression is replaced by a new one
with a different operator but same operands. (iii) Logical
connector replacement (LCR) in which each logical expres-
sion is replaced by a new one with a different operator but
same operands. (iv) Relational operator replacement (ROR)
in which each relational expression is replaced by a new one
with a different operator but same operands. (v) Unary operator inversion (UOI), where each arithmetic expression x is mutated to its opposite value (i.e. x * -1), its incremented value (i.e. x + 1) and its decremented value (i.e. x - 1). Moreover, each logical expression x is mutated to its complemented form (i.e. !x). For AOR and LCR, the whole expression can also be mutated to the true or false constant. For LCR and ROR, possible mutations also include keeping the left or right operand alone.
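For illustration, here is how these operators would mutate a small, hypothetical Java method (one mutant per line; each mutant changes exactly one expression):

// Original (toy example):
int clamp(int x, int max) {
    if (x < max && x > 0) { return x + 1; }
    return 0;
}

// ABS:  x + 1            ->  Math.abs(x) + 1
// AOR:  x + 1            ->  x - 1
// LCR:  x < max && x > 0 ->  x < max || x > 0
// ROR:  x < max          ->  x <= max
// UOI:  x + 1            ->  (x * -1) + 1   (plus the incremented and decremented variants)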
4.2 Results
In this section, we answer our research question: To what extent does LCIP improve the accuracy of impact prediction?
All the raw empirical data is publicly available online3.
Table 1 gives the values of the evaluation metrics pre-
sented in Section 4.1.1. The first, second, and third columns
are respectively the name of the package, the mutation op-
erator and the number of mutants considered. Then we
have three multi-columns, one for each metric (precision,
recall and F-score). Each multi-column is made of three
columns which are the value obtained using the transitive
closure (TC) baseline, the value obtained with the Binary
algorithm and the value obtained with the Dichotomic al-
gorithm. These values are the average over ten-fold cross
validation. For each multi-column, the best value is shown
in bold face.
3https://github.com/v-m/PropagationAnalysis-dataset
First, if we compare the performance of the transitive clo-
sure algorithm (TC) and the two learning algorithms, we
observe that in 100% of cases, both the precision and the
F-score are improved using our learning algorithms. For
the recall, on Collections, our learning algorithms give better recall scores. For Lang, our learning algorithms give scores similar to TC for 3 out of 5 mutation operators. This means that in 8 cases out of 10, our learning algorithms give recall scores at least as good as those of TC.
For the precision, the best improvement is for the Lang project with the ABS operator, where the precision rises from 0.30 for TC to 0.68 using the Binary algorithm and to 0.69 using the Dichotomic algorithm. Overall, the Binary and
the Dichotomic algorithms respectively provide an average
precision improvement of 0.20 and 0.22 when averaged over
all projects and all mutation operators.
For the recall, the values are quite stable; the best improvements for the Binary and Dichotomic algorithms are both obtained with Collections. For Binary, the best improvement is ob-
tained with the ROR operator for which the recall is im-
proved from 0.64 for the baseline to 0.68 (+0.04). For Di-
chotomic, the best improvement is obtained with the ABS
operator for which the recall is improved from 0.80 for the
baseline to 0.83 (+0.03). However, on average, there is no real improvement over the transitive closure baseline with our learning algorithms (±0.01).
For the F-score, the best improvement is also obtained on the Lang project with the ABS operator, where the F-score rises from 0.33 using the transitive closure baseline to 0.69 (+0.36) using the Binary algorithm, and to 0.70 (+0.37) using the Dichotomic approach. Globally, the Binary and the Dichotomic algorithms yield an average F-score improvement of 0.17 and 0.18, respectively. Figure 3 shows a scatter plot of the average F-scores obtained for all projects and mutation operators for TC (x-axis) and Dichotomic (y-axis). The line represents y = x. Since all points are in the upper-left part, above y = x, this figure graphically highlights that Dichotomic clearly improves the prediction. In addition, we also note that some operators are far above y = x, which indicates a strong improvement.
If we compare the two learning variants (Dichotomic and
Binary), we see that the Dichotomic algorithm has better
precision values than the Binary for 100% of the cases. But,
Dichotomic only has better recall in 1 case out of 10 (in 3
cases they have the same score). F-scores are also better
for the Dichotomic algorithm in 8 cases out of 10. The fact
that Binary gives better recall while Dichotomic gives bet-
ter precision is linked to the fact that Binary has an “all or nothing” spirit. If an edge is activated by only one mutant at learning time, then it will be considered as important as another edge activated several times, by several mutants. In contrast, with Dichotomic, rarely visited edges have
a low weight, and the threshold prunes them at prediction
time. Consequently, Binary considers more edges compared
to Dichotomic, yielding more predicted nodes, hence higher
recall and lower precision.
To assess the significance of the performance difference
between Dichotomic and Binary, we run a Mann-Whitney-
Wilcoxon statistical test on precision, recall and F-score over
all mutants. The null hypothesis (H0) is: “the metrics (ei-
ther precision, or recall, or F-score) obtained for the Binary
and the Dichotomic algorithms are drawn from the same
population”. Using the usual significance level α = 0.05, the p-values are: 10 × 10^-16 for precision, 4.1 × 10^-12 for recall, and 8.7 × 10^-14 for F-score. For each metric, the conclusion is that the null hypothesis may be rejected, which means that Dichotomic is better in a statistically significant sense.

Figure 3: The performance improvement of the Dichotomic algorithm over the basic TC impact prediction. One point represents a mutation operator for a given project and is located at coordinates (x = F_TC, y = F_Dichotomic).
To sum up, LCIP achieves better results for change
impact prediction compared to a standard transitive clo-
sure approach. Of the two algorithms we propose in this paper, the Dichotomic approach is the best.
4.3 Threats to Validity
Synthetic Changes In our evaluation, we use synthetic
changes for exploring the performance of our impact predic-
tion algorithm. Our motivation for using synthetic changes
is to have a large amount of data, which is necessary for
learning. Another option would be to use impact of real
code changes, however, those are extremely difficult to ob-
tain, because developers never commit a change that breaks
a test case. This is the reason why the related work only
considers a very small evaluation benchmark. For instance,
the evaluation of SENSA [14] uses 27 changes, which is much
smaller than the 6000 simulated changes of our experiments.
Also our results are dependent on the choice of synthetic
mutation operators.
Internal Validity Our results are of computational na-
ture. A major bug in our software can invalidate our find-
ings. We have published all our code on GitHub to facilitate reproduction and, if necessary, falsification of our results.
External Validity In this paper, we use 600 mutants
per mutation operator and per project. Using ten-fold cross
validation, this results in testing sets of up to 60 items. Since
the aggregate performance measures (precision, recall and
F-score) are rather stable over folds, we have confidence
that this is enough to back up our conclusions. However, future work with more mutants is required.
The impact prediction depends greatly on the structure of the call graphs [15]. For instance, the presence of large utility methods, with many incoming and outgoing edges, has a direct impact on the prediction performance. Our results may only be valid for Java software or, even worse, only for the projects under study. Future work in this field will
strengthen the external validity of our findings.
5. RELATED WORK
Strug and Strug [16] use data and control flow graphs and
classification for detecting similar mutants. They use similar tools to ours (learning and mutation) to reduce the number of mutants considered when doing mutation testing. In our
work, we use learning for change impact analysis. Do and
Rothermel [17] described a protocol to study test case prior-
itization techniques based on mutation. Their protocol and ours share the same intuition: using test cases to determine which tests are impacted by the change. However, we have a different goal: they study test case prioritization, whereas we study impact prediction. Hattori et al. [11] have used an approach
based on call graphs to study the propagation. Their evaluation is made on a dataset of three projects. Their goal was to show that precision and recall are good measures for evaluating the performance of an impact analysis technique. In contrast, we propose a concrete impact analy-
sis technique. Law and Rothermel [18] have proposed an
approach for impact analysis; their technique is based on
a code instrumentation to analyze execution stack traces.
They compare their technique against simple call graphs on one small software system. Our technique is compared to similar graphs, but on two different software packages. Gethers et
al. [19] also address impact analysis. However, they con-
sider a slightly different problem setting, because they take
as input a bug report or a modification request and not a
single source code element as we do.
6. CONCLUSION
In this paper we have explored the possibilities of using
learning to improve impact prediction based on call graphs.
We have shown that we can improve the prediction perfor-
mance with a learning algorithm based on the addition of
weights to the call graph edges. From a software engineering
perspective, our results show that having a learning perspec-
tive on change impact prediction is a promising research di-
rection. Our future work focuses on improving the external
validity, in particular with learning on more projects and
mutants.
7. REFERENCES
[1] S. A. Bohner and R. S. Arnold, Software Change
Impact Analysis. IEEE Computer Society Press,
1996.
[2] B. Li, X. Sun, H. Leung, and S. Zhang, “A Survey of
Code-based Change Impact Analysis Techniques,”
Software Testing, Verification and Reliability, vol. 23,
no. 8, pp. 613–646, 2013.
[3] R. J. Turver and M. Munro, “An Early Impact
Analysis Technique for Software Maintenance,”
Journal of Software Maintenance: Research and
Practice, vol. 6, no. 1, pp. 35–52, 1994.
[4] G. Jeong, S. Kim, and T. Zimmermann, “Improving
Bug Triage with Bug Tossing Graphs,” in Proceedings
of the European Software Engineering Conference and
the Symposium on The Foundations of Software
Engineering, 2009, pp. 111–120.
[5] J. P. Loyall and S. A. Mathisen, “Using Dependence
Analysis to Support the Software Maintenance
Process,” in Proceedings of the Conference on Software
Maintenance, 1993, pp. 282–291.
[6] S. Bohner, “Software Change Impacts - An Evolving
Perspective,” in Proceedings of the International
Conference on Software Maintenance, 2002, pp.
263–272.
[7] R. S. Arnold and S. A. Bohner, “Impact Analysis -
Towards a Framework for Comparison,” in Proceedings
of the Conference on Software Maintenance, 1993, pp.
292–301.
[8] T. A. Budd, R. J. Lipton, R. A. DeMillo, and F. G.
Sayward, “Mutation Analysis,” Yale University,
Department of Computer Science, Tech. Rep., 1979.
[9] D. Grove, G. DeFouw, J. Dean, and C. Chambers,
“Call Graph Construction in Object-oriented
Languages,” in Proceedings of the Conference on
Object-oriented Programming, Systems, Languages,
and Applications, 1997, pp. 108–124.
[10] R. Kohavi, “A Study of Cross-validation and
Bootstrap for Accuracy Estimation and Model
Selection,” International Joint Conference on Artificial
Intelligence, pp. 1137–1143, 1995.
[11] L. Hattori, D. Guerrero, J. Figueiredo, J. Brunet, and
J. Damásio, “On the Precision and Accuracy of
Impact Analysis Techniques,” in Proceedings of the
International Conference on Computer and
Information Science, 2008, pp. 513–518.
[12] K. N. King and A. J. Offutt, “A Fortran Language
System for Mutation-based Software Testing,”
Software: Practice and Experience, vol. 21, no. 7, pp.
685–718, 1991.
[13] A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and
C. Zapf, “An Experimental Determination of Sufficient
Mutant Operators,” ACM Transactions on Software
Engineering and Methodology, vol. 5, no. 2, pp.
99–118, 1996.
[14] H. Cai, S. Jiang, R. Santelices, Y.-J. Zhang, and
Y. Zhang, “SENSA: Sensitivity Analysis for
Quantitative Change-Impact Prediction,” in
Proceedings of the International Working Conference
on Source Code Analysis and Manipulation, 2014, pp.
165–174.
[15] V. Musco, M. Monperrus, and P. Preux, “A
Generative Model of Software Dependency Graphs to
Better Understand Software Evolution,” ArXiv
e-prints, 1410.7921v2, 2014.
[16] J. Strug and B. Strug, “Machine Learning Approach in
Mutation Testing,” in Testing Software and Systems,
2012, no. 7641, pp. 200–214.
[17] H. Do and G. Rothermel, “A Controlled Experiment
Assessing Test Case Prioritization Techniques via
Mutation Faults,” in Proceedings of International
Conference on Software Maintenance, 2005, pp.
411–420.
[18] J. Law and G. Rothermel, “Whole Program
Path-Based Dynamic Impact Analysis,” in Proceedings
of the International Conference on Software
Engineering, 2003, pp. 308–318.
[19] M. Gethers, B. Dit, H. Kagdi, and D. Poshyvanyk,
“Integrated Impact Analysis for Managing Software
Changes,” in Proceedings of the International
Conference on Software Engineering, 2012, pp.
430–440.