SQAPlanner: Generating Data-Informed
Software Quality Improvement Plans
Dilini Rajapaksha, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee,
Christoph Bergmeir, John Grundy, and Wray Buntine
Abstract— Software Quality Assurance (SQA) planning aims to define proactive plans, such as defining maximum file size, to prevent
the occurrence of software defects in future releases. To aid this, defect prediction models have been proposed to generate insights, such as
the most important factors that are associated with software quality. Such insights that are derived from traditional defect models are
far from actionable—i.e., practitioners still do not know what they should do or avoid to decrease the risk of having defects, and what is
the risk threshold for each metric. A lack of actionable guidance and risk threshold can lead to inefficient and ineffective SQA planning
processes. In this paper, we investigate the practitioners’ perceptions of current SQA planning activities, current challenges of such
SQA planning activities, and propose four types of guidance to support SQA planning. We then propose and evaluate our AI-Driven
SQAPlanner approach, a novel approach for generating four types of guidance and their associated risk thresholds in the form of rule-
based explanations for the predictions of defect prediction models. Finally, we develop and evaluate an information visualization for our
SQAPlanner approach. Through the use of qualitative survey and empirical evaluation, our results lead us to conclude that SQAPlanner
is needed, effective, stable, and practically applicable. We also find that 80% of our survey respondents perceived that our visualization
is more actionable. Thus, our SQAPlanner paves a way for novel research in actionable software analytics—i.e., generating actionable
guidance on what practitioners should and should not do to decrease the risk of having defects to support SQA planning.
Index Terms—Software Quality Assurance, SQA Planning, Actionable Software Analytics, Explainable AI.
1 INTRODUCTION
Software Quality Assurance (SQA) planning is the pro-
cess of developing proactive SQA plans. One of the
most important SQA activities is to define development
policies and their associated risk thresholds [12] (e.g.,
defining the maximum file size, the maximum code com-
plexity, and the minimum degree of code ownership).
Such SQA plans will be later enforced for a whole team
to ensure the highest quality of software systems. These
policies are essential to improve software quality and
software maintainability [29].
Recently, top software companies have released sev-
eral commercial AI-driven defect prediction tools. For
example, Microsoft’s Code Defect AI and Amazon’s
CodeGuru. Such tools heavily rely on the concept of defect
prediction models that have been well-studied in the
past decades [17]. In particular, Microsoft’s Code Defect
AI is built on top of the concept of explainable Just-In-
Time defect prediction [21, 44]—i.e., explaining the pre-
dictions of defect models using a LIME model-agnostic
technique [37]. The crux of Microsoft’s Code Defect AI
tool is similar to the recent parallel work by Jiarpakdee et
al. [21] who also suggested to use a LIME model-agnostic
technique to explain the predictions of defect models.
•D. Rajapaksha, C. Tantithamthavorn, J. Jiarpakdee C. Bergmeir, J. Grundy,
and W. Buntine are with the Faculty of Information Technology, Monash
University, Melbourne, Australia.
E-mail: {dilini.rajapakshahewaranasinghage, chakkrit, jirayus.jiarpakdee,
christoph.bergmeir, john.grundy, wray.buntine}@monash.edu
•Corresponding author: C. Tantithamthavorn.
However, these current state-of-the-art defect predic-
tion approaches can only indicate the most important
features, which are still far from actionable. Thus, prac-
titioners still do not know (1) what they should do to
decrease the risk of having defects, and what they should
avoid to not increase the risk of having defects and (2)
what is a risk threshold for each metric (e.g., how large
is a file size that would be risky? and how small is a file
size that would be non-risky?).
A lack of actionable guidance and its risk threshold
can lead to inefficient and ineffective SQA planning
processes. Such ineffective SQA planning processes will
result in the recurrence of software defects, slow project
progress, high costs of development, unsatisfactory soft-
ware products, and unhappy end-users. These chal-
lenges are very significant to the practical applications
of defect prediction models, but still remain largely
unexplored.
We aim to help practitioners to make better data-
informed SQA planning decisions by generating action-
able guidance derived from defect prediction models.
Thus, we first propose the following four types of guid-
ance to support SQA planning:
(G1) Risky current practices that lead the defect model
to predict a file as defective are needed to help
practitioners understand what are the current risky
practices.
(G2) Non-risky current practices that lead the defect
model to predict a file as clean are needed to
help practitioners understand what are the non-
risky current practices.
(G3) Potential practices to avoid to not increase the
risk of having defects are needed to help prac-
titioners understand which currently not imple-
mented practices to avoid to not increase the risk
of having defects.
(G4) Potential practices to follow to decrease the risk
of having defects are needed to help practitioners
understand which practices to newly implement to
decrease the risk of having defects.
To achieve this aim, our research study has the follow-
ing 3 key objectives:
(Obj1) Investigating practitioners’ perceptions and chal-
lenges of carrying out current SQA planning
activities and the perceptions of our proposed
four types of guidance;
(Obj2) Developing and evaluating our novel SQAPlan-
ner approach and comparing it with state-of-the-
art approaches;
(Obj3) Developing and evaluating an information vi-
sualization for our SQAPlanner approach and
comparing it with the visualization of Microsoft’s
Code Defect AI tool.
To achieve the first objective, we first conducted a
qualitative survey with practitioners to address the fol-
lowing research questions:
(RQ1) How do practitioners perceive SQA planning
activities? For SQA planning activities, 86% of
the respondents perceived them as important and
70% perceived them as being used in practice. However,
66% perceived them as time-consuming and 58%
perceived them as difficult, indicating that a data-informed
SQA planning tool is needed to support QA teams
in better data-informed decision-making and policy-making.
(RQ2) How do practitioners perceive our proposed
four types of guidance to support SQA plan-
ning? Both (G1) the guidance on risky current
practices that lead a model to predict a file as
defective and (G4) the guidance on the potential
practices to follow to decrease the risk of having
defects are perceived by the respondents as among
the most useful, the most important, and the most
likely to be adopted.
Motivated by the findings of RQ1 and RQ2, we pro-
posed an AI-Driven SQAPlanner—i.e., an approach to
generate four types of guidance in the form of rule-
based explanations [34] to support data-informed SQA
planning. Our AI-Driven SQAPlanner is a significant ad-
vancement over the LIME model-agnostic technique [37],
since LIME only indicates what factors are the most im-
portant to support the predictions towards defective (G1)
and clean (G2) classes, while our AI-Driven SQAPlanner
can additionally provide actionable guidance on what
should developers avoid (G3) and should do (G4) to
decrease the risk of having defects. Then, we conduct
an empirical evaluation to evaluate our SQAPlanner
approach and compare it with two state-of-the-art local
rule-based model-agnostic techniques, i.e., Anchor [38]
(an extension of LIME [37]) and LORE [15]. Through a
case study of 32 releases across 9 open-source software
projects, we addressed the following research questions:
(RQ3) How effective are the rule-based explanations
generated by our SQAPlanner approach when
compared to the state-of-the-art approaches?
The rule-based guidance generated by our SQA-
Planner approach achieves the highest coverage
(at the median 89%), confidence (at the median
99%), and lift scores (at the median 6.6) when
compared to the baseline techniques.
(RQ4) How stable are the rule-based explanations
generated by our SQAPlanner approach when
they are regenerated? Our SQAPlanner approach
produces the most consistent (a median Jaccard
coefficient of 0.92) rule-based guidance when
compared to the baseline techniques, suggesting that
our approach can generate the most stable rule-
based guidance when they are regenerated.
(RQ5) How applicable are the rule-based explana-
tions generated by our SQAPlanner approach
to minimize the risk of having defects in the
subsequent releases? For 55%-87% of the defec-
tive files, our SQAPlanner approach can gener-
ate rule-based guidance that is applicable to the
subsequent release to decrease the risk of having
defects.
To evaluate the practical usefulness of our SQAPlan-
ner, we developed a proof-of-concept prototype to vi-
sualize the actual generated actionable guidance. The
visualization of our SQAPlanner is designed to provide
the following key information: (1) the list of guidance
that practitioners should follow and should avoid; (2) the
actual feature value of that file; and (3) its threshold and
range values for practitioners to follow to mitigate the
risk of having defects. Then, we compare our visualiza-
tion with the visualization of Microsoft’s Code Defect AI
(see Figure 2). Finally, we conducted a qualitative survey
to address the following research questions:
(RQ6) How do practitioners perceive the visualiza-
tion of SQAPlanner when comparing to the
visualization of the state-of-the-art? 80% of the
respondents agree that the visualization of our
SQAPlanner is best to provide actionable guid-
ance compared to the visualization of Microsoft’s
Code Defect AI.
(RQ7) How do practitioners perceive the actual guid-
ance generated by our SQAPlanner? 63%-90% of
the respondents agree with the seven statements
derived from the actual guidance generated by
our SQAPlanner.
The key contributions of this paper are:
•An empirical investigation of the practitioners’ per-
ceptions and their challenges of current SQA plan-
ning activities.
•An empirical investigation of the practitioners’ per-
ceptions of our proposed four types of guidance.
•The development of our novel AI-Driven SQAPlan-
ner approach to generate the proposed four types of
guidance in the form of rule-based explanations to
better support SQA planning. The implementation
is available at https://github.com/awsm-research/
SQAPlanner-implementation.
•The empirical investigation of the effectiveness, the
stability, and the applicability of rule-based expla-
nations generated by our SQAPlanner.
•The development of the visualization of our SQA-
Planner approach and the empirical investigation of
the practitioners’ perceptions on our visualization
and the actual guidance.
The rest of the paper is organized as follows. Sec-
tion 2 discusses the significance of SQA planning, the
limitations of current AI-driven defect prediction tools,
and the motivation of the proposed four types of guid-
ance to support SQA planning. Section 3 presents the
overview of our case study and the motivation of the
research questions. Section 4 presents the results of the
practitioners’ perceptions of SQA planning activities and
the four types of guidance to support SQA planning.
Section 5 presents our SQAPlanner approach, while Sec-
tion 6 presents the empirical results of our SQAPlanner
approach. Section 7 presents the empirical investigation
of the visualization of our SQAPlanner and the actual
guidance generated by our SQAPlanner approach. Sec-
tion 8 summarizes the threats to the validity of our study,
and Section 9 discusses related work. Finally, Section 10
draws the conclusions.
2 BACKGROUND AND MOTIVATION
In this section, we first discuss the significance of Soft-
ware Quality Assurance (SQA) planning. Then, we dis-
cuss the limitations of current AI-driven defect predic-
tion tools. Finally, we propose the four types of guidance
to support SQA planning.
2.1 “Prevention is better than cure”
This is a classic principle that is commonly applied to
SQA processes to prevent software defects [27]. It is
widely known that the cost of software defects rises
significantly if they are discovered later in the process.
Thus, finding and fixing software defects prior to releas-
ing software is usually much cheaper and faster than
fixing after the software is released [3]. Therefore, SQA
teams play a critical role in software companies as a
gatekeeper, i.e., not allowing software defects to pass
through to end-users.
Consider an example of an SQA practice inside the
Atlassian company, Australia’s largest software com-
pany with a variety of well-known software products
e.g., JIRA Issue Tracking System, BitBucket, and Trello.
Fig. 1: A JIRA software development process and how
QA engineers interact with developers prior to releasing
a software product.
Figure 1 provides an overview of a JIRA software de-
velopment process.1 During this process, a QA engineer
has multiple points at which he or she provides feedback
into the way the feature is developed and tested—i.e.,
providing every form of quality improvement guidance
for all steps of the software development process from
planning to completion. This process allows for imme-
diate active feedback to ensure that knowledge gained
from previous software defects is fed back into the
testing notes for future releases to prevent defects in the
next iteration.
2.2 AI-Driven Defect Prediction and Limitations
An AI-driven defect prediction (aka. defect prediction
model) is a classification model which is trained on
historical data in order to predict if a file is likely to be
defective in the future. Defect models serve two main
purposes. First is to predict. The predictions of defect
models can help developers to prioritize their limited
SQA resources on the most risky files [9, 31, 46, 48].
Therefore, developers can save their limited SQA effort
on the most risky files instead of wasting their time
on inspecting less risky files. Second is to explain. The
insights that are derived from defect models could help
managers chart quality improvement plans to avoid the
pitfalls that lead to defects in the past [2, 30, 49]. For
example, if the insights suggest that code complexity
shares the strongest relationship with defect-proneness,
managers must initiate quality improvement plans to
control and monitor the code complexity of that system.
Recently, top software companies have released sev-
eral commercial AI-driven defect prediction tools. For
example, Microsoft’s Code Defect AI2 and Amazon’s
CodeGuru.3 Such tools heavily rely on the concept of defect
prediction models that have been well-studied in the
1. https://www.atlassian.com/blog/inside-atlassian/jira-qa-process
2. https://www.microsoft.com/en-us/ai/ai-lab-code-defect
3. https://aws.amazon.com/codeguru/
Fig. 2: An example visualization of the Microsoft’s Code Defect AI tool (http://codedefectai.azurewebsites.net/).
However, this tool does not suggest what practitioners should do to decrease the risk of having defects, and what
practitioners should avoid in order not to increase the risk of having defects. In addition, this tool does not suggest
a risk threshold for each metric.
past decades [17]. In particular, Microsoft’s Code Defect
AI is built on top of the concept of explainable Just-In-
Time defect prediction [21, 44]—i.e., explaining the pre-
dictions of defect models using a LIME model-agnostic
technique [37]. LIME is a model-agnostic technique for
explaining the predictions of any AI/ML algorithms.
The crux of Microsoft’s Code Defect AI tool is similar
to the recent parallel work by Jiarpakdee et al. [21]—
i.e., extracting several software metrics (e.g., Churn),
building a classification model (e.g., random forests),
generating a prediction for each file in a commit, and
generating an explanation of each prediction using the
LIME model-agnostic technique [37].
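To make this pipeline concrete, the following minimal sketch illustrates how the prediction of a single file could be explained with the LIME Python package. The variable names (X_train, feature_names, model, x_file) are assumptions for illustration; this is not the exact configuration of Microsoft’s Code Defect AI or of [21].

# A minimal sketch (assumed variable names) of explaining one file's prediction with LIME.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.array(X_train),   # software metrics of the training release (assumed)
    feature_names=feature_names,
    class_names=["CLEAN", "DEFECT"],
    discretize_continuous=True)

# Explain the prediction of one file using its top-3 most important features.
exp = explainer.explain_instance(
    np.array(x_file), model.predict_proba, num_features=3)
print(exp.as_list())  # e.g., [("CountDeclMethod > 20", 0.21), ...]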
Figure 2 presents an example visualization of
Microsoft’s Code Defect AI product for the file
ErrorHandlerBuilderRef.java of the Apache
Camel Release 2.9.0. This figure shows that this file
is predicted as defective with a confidence score of
70%. There are three most important factors that are
associated with this prediction as defective, i.e., the
number of lines of class and method declaration, the
number of distinct developers, and the degree of code
ownership. Thus, these insights can help managers
chart quality improvement plans to control for these
metrics. However, there exist the following limitations.
•First, practitioners still do not know what they
should do to decrease the risk of having defects,
and what they should avoid to not increase the risk
of having defects. We find that LIME only indicates
which factors are the most important to support
the predictions towards defective (G1) and clean
(G2) classes, without providing actionable guidance
on what should they avoid (G3) and should do (G4)
to decrease the risk of having defects.
•Second, practitioners still do not know a risk
threshold for each metric (e.g., how large is a file
size that would be risky? and how small is a file
size that would be non-risky?).
A lack of these types of guidance and its risk thresh-
old could lead to inefficient and ineffective SQA plan-
ning processes. Such ineffective SQA planning processes
could result in the recurrence of software defects, slow
project progress, high costs of development, unsatisfac-
tory software products, and unhappy end-users. To the
best of our knowledge, the aforementioned challenges
are very significant to the practical applications of defect
prediction models, but still remain largely unexplored.
2.3 A Motivating Scenario for our SQAPlanner
To address the aforementioned challenges, we propose
an AI-driven SQAPlanner—i.e., an approach for gener-
ating four types of guidance and its risk threshold in
the form of rule-based explanation for the predictions of
defect models. Below, we discuss a motivating scenario
of how our AI-Driven SQAPlanner could be used in a
software development process to assist SQA planning.
Without our SQAPlanner. Consider Bob who is a QA
manager joining a new software development project.
His main responsibility is to apply SQA activities (e.g.,
code review and testing) to find defects and develop
quality improvement plans to prevent them in the next
iteration. However, he has little knowledge of the soft-
ware projects. Therefore, he decides to deploy a defect
prediction model to guide his QA team about where
the risky areas of source code are, so his team can effectively
allocate their limited effort on these risky areas. However,
Bob still encounters various SQA planning problems
during the planning steps to prevent software defects in
the next iteration. In particular, without AI-driven SQA
planning tools, he cannot understand what are the risky
practices and what are the non-risky practices for this
team and this project, what are key actions to avoid that
increase the risk of having defects, and what are the key
actions to do to decrease the risk of having defects. A
lack of AI-driven SQA planning tools could lead to a failure
to develop the most effective SQA plans. Ultimately, this
results in the recurrence of software defects, slow project
progress, high costs of software development, unsatisfactory
software products, and unhappy end-users.
With our SQAPlanner. Now consider that Bob adopts
our AI-driven SQAPlanner tool. In particular, given a
file that is predicted as defective by defect prediction
models, our SQAPlanner can further generate rule-based
explanations to better understand what are key risky
practices, non-risky practices, actions to avoid that in-
crease the risk of defects, and actions to do to decrease
the risk of having defects for that file. Bob can use our
SQAPlanner to make data-informed decisions when devel-
oping SQA plans. This could result in more optimal SQA
plans, leading to higher quality of software systems,
fewer software defects, lower costs of software
development, satisfactory software products, and happy
end-users.
2.4 The Design Rationale for the Four Types of Guid-
ance
First, we propose to generate the guidance in the form
of rule-based explanations, since our recent work [21]
found that decision trees/rules are the most preferred
representation of explanations by software practitioners
as they involve logic reasoning that they are familiar
with. Formally, a rule-based explanation (e) is an as-
sociation rule e={r=p⇒q}that describes the
association between p(a Boolean condition of feature
values (i.e., antecedent, left-hand-side, LHS)) and q(the
consequence (i.e., consequent, right-hand-side, RHS)) for
the decision value y=f0(x). In this paper, we use an
arrow ( associate
=====⇒) to describe the association between the
Boolean condition (p) of feature values for a file and the
predictions (q) towards a {DEFECT,CLEAN} class. Note
that an association in general doesn’t mean that there is
a causal relationship.
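As a minimal illustration of this notation (with hypothetical helper names, not part of SQAPlanner itself), an association rule can be represented as an antecedent predicate over a file’s metric values together with a consequent class label:

# A minimal sketch of representing a rule-based explanation e = {p => q}.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class AssociationRule:
    antecedent: Callable[[Dict[str, float]], bool]  # p: Boolean condition on feature values (LHS)
    consequent: str                                  # q: "DEFECT" or "CLEAN" (RHS)
    description: str

rule = AssociationRule(
    antecedent=lambda m: m["LOC"] > 100,
    consequent="DEFECT",
    description="{LOC > 100} => DEFECT")

file_metrics = {"LOC": 200}
print(rule.antecedent(file_metrics), rule.consequent)  # True DEFECT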
Second, motivated by the limitations of Microsoft’s
Code Defect AI tool (see Figure 2), we hypothesize
that the following four types of guidance (G) that are
presented in a form of rule-based explanations are bene-
ficial to guide practitioners when developing SQA plans.
Below, we present the definition, the motivation, and an
example of the four types of guidance.
G1: Risky current practices that lead the defect model
to predict a file as defective are needed to help
practitioners understand what current practices are
problematic. For example, an association rule of
{LOC > 100} ⇒ DEFECT indicates that a
file with LOC greater than 100 is associated with
the predictions towards a defective class. Thus,
practitioners should consider decreasing the LOC
to less than 100, as this may likely decrease the risk
of having defects.
Fig. 3: An overview of our study design and research
questions.
G2: Non-risky current practices that lead the defect
model to predict a file as clean are needed to
help practitioners understand what current prac-
tices contribute towards a low risk of having defects.
For example, an association rule of {Ownership >
0.8} ⇒ CLEAN indicates that a file with an
ownership value greater than 0.8 is associated with
the predictions towards a clean class. Thus, prac-
titioners should consider maintaining or increasing
the ownership value to more than 0.8 to potentially
decrease the risk of having defects.
G3: Potential practices to avoid to not increase the risk
of having defects are needed to help practition-
ers understand which currently not implemented
practices to avoid to not increase the risk of hav-
ing defects. For example, an association rule of
{MinorDeveloper > 0} ⇒ DEFECT indicates
that a file with a number of minor developers of
greater than 0 is associated with the predictions
towards a defective class. Thus, practitioners should
avoid increasing the number of minor developers to
greater than zero to not increase the risk of having
defects.
G4: Potential practices to follow to decrease the risk
of having defects are needed to help practitioners
understand which practices to newly implement to
decrease the risk of having defects. For example,
an association rule of {RatioCommentToCode >
0.6} ⇒ CLEAN indicates that a file with
a proportion of comments to code that is larger
than 60% is associated with the predictions towards
the clean class. Thus, practitioners should consider
increasing the proportion of comments to code to
greater than 60% to decrease the risk of having
defects.
3 STUDY DESIGN AND RESEARCH QUESTIONS
In this paper, we aim to help practitioners make data-
informed SQA planning by providing guidance on (1)
what practitioners should do to decrease the risk of
having defects and (2) what practitioners should avoid in
order not to increase the risk of having defects with (3) a
risk threshold in the form of rule-based explanations for
the predictions of defect prediction models. To achieve
this aim, we design our case study according to the
following objectives (see Figure 3):
Objective 1—Investigating the practitioners’ percep-
tions of SQA planning and the proposed four types
of guidance. SQA planning activities are important in
software development processes (e.g., to define initial
software development policies), but often vary from
organization to organization [11]. However, there exist
no empirical studies that investigate how practitioners
perceive the importance of SQA planning activities in
their organization and what are their key challenges.
Thus, we formulate the following research question:
•(RQ1) How do practitioners perceive SQA plan-
ning activities?
One of the most important SQA planning activities is
to define development policies and their associated risk
thresholds [12]. Such development policies will be later
enforced for the whole team to ensure the highest quality
of software systems (e.g., the maximum file size, the
maximum code complexity, the minimum code to com-
ment ratio, and the minimum degree of code ownership).
Such policies are essential to improve software quality
and software maintainability. Recently, Microsoft’s Code
Defect AI tool has been released to the public where the
crux of this tool is defect prediction models. However,
Figure 2 shows that such a tool only indicates the
importance scores of features that are generated by LIME,
which are still far from actionable. That means LIME
only indicates which factors are the most important to
support the predictions towards defective (G1) and clean
(G2) classes, but does not actually guide developers on
what they should avoid (G3) and should do (G4) to decrease
the risk of having defects. We hypothesize that our pro-
posed four types of guidance that are presented in a form
of rule-based explanation would be more actionable to
guide practitioners when developing SQA plans. Thus,
we formulate the following research question:
•(RQ2) How do practitioners perceive our proposed
four types of guidance to support SQA planning?
Objective 2—Developing and Evaluating our AI-
Driven SQAPlanner Approach. To address the practi-
tioners’ challenges of SQA planning and the limitations
of Microsoft’s Code Defect AI tool, we propose SQA-
Planner to help practitioners make data-informed deci-
sions when developing SQA plans. First, SQAPlanner
develops a defect prediction model to generate a predic-
tion. Then, SQAPlanner generates a rule-based explana-
tion of the prediction to provide actionable guidance.
However, there are different local rule-based model-
agnostic techniques for generating explanations in the
eXplainable AI (XAI) domain available (e.g., Anchor [38]
and LORE [15]). Thus, it remains unclear whether our
SQAPlanner outperforms the state-of-the-art rule-based
model-agnostic techniques. Therefore, we conduct an
empirical study to evaluate our approach and compare
with the baseline techniques. Thus, we formulate the
following research questions.
•(RQ3) How effective are the rule-based expla-
nations generated by our SQAPlanner approach
when compared to the state-of-the-art approaches?
•(RQ4) How stable are the rule-based explanations
generated by our SQAPlanner approach when they
are regenerated?
•(RQ5) How applicable are the rule-based expla-
nations generated by our SQAPlanner approach
to minimize the risk of having defects in the
subsequent releases?
Objective 3—Developing the Visualization of SQA-
Planner and Investigating the Practitioners’ Percep-
tions. While the rule-based explanations of our SQA-
Planner are designed to help practitioners understand
the logic behind the predictions of defect models, such
rule-based explanations may not be immediately ac-
tionable and easily understandable by practitioners.
Thus, we develop a proof-of-concept by translating the
rule-based explanations of the actionable guidance into
human-understandable explanations. The visualization
of our SQAPlanner is designed to provide the following
key information: (1) the list of guidance that practitioners
should follow and should avoid; (2) the actual metric
values of that file; and (3) the risk threshold and range
values for practitioners to follow to mitigate the risk
of having defects. Then, we conduct a post-validation
qualitative survey with practitioners to evaluate their
perceptions of the visualization of our SQAPlanner when
comparing to the existing visualization of Microsoft’s
Code Defect AI (see Figure 2). Thus, we formulate the
following research questions:
•(RQ6) How do practitioners perceive the visual-
ization of SQAPlanner when comparing to the
visualization of the state-of-the-art?
•(RQ7) How do practitioners perceive the actual
guidance generated by our SQAPlanner?
4 PRACTITIONERS’ PERCEPTIONS ON SQA PLANNING AND THE FOUR TYPES OF GUIDANCE
In this section, we aim to investigate the practitioners’
perceptions of (1) the SQA planning activities (RQ1) and
(2) the proposed four types of guidance to support SQA
planning (RQ2). Below, we describe the approach and
present the results.
TABLE 1: (RQ1 and RQ2) A summary of the agreement percentage, the disagreement percentage, and the agreement
factor for the practitioners’ perception of SQA planning activities and our proposed four types of guidance.

Dimension | Statement | %Agreement | %Disagreement | Agreement Factor
(RQ1) Perceived importance | SQA planning activities | 86% | 6% | 14.33
(RQ1) Being used in practice | SQA planning activities | 70% | 10% | 7.00
(RQ1) Perceived time-consuming | SQA planning activities | 66% | 10% | 6.60
(RQ1) Perceived difficulty | SQA planning activities | 58% | 24% | 2.42
(RQ2) Perceived usefulness | G1: Risky current practices that lead the defect model to predict a file as defective | 82% | 6% | 13.67
(RQ2) Perceived usefulness | G2: Non-risky current practices that lead the defect model to predict a file as clean | 64% | 10% | 6.40
(RQ2) Perceived usefulness | G3: Potential practices to avoid to not increase the risk of having defects | 52% | 20% | 2.60
(RQ2) Perceived usefulness | G4: Potential practices to follow to decrease the risk of having defects | 80% | 8% | 10.00
(RQ2) Perceived importance | G1: Risky current practices that lead the defect model to predict a file as defective | 64% | 10% | 6.40
(RQ2) Perceived importance | G2: Non-risky current practices that lead the defect model to predict a file as clean | 60% | 10% | 6.00
(RQ2) Perceived importance | G3: Potential practices to avoid to not increase the risk of having defects | 64% | 24% | 2.67
(RQ2) Perceived importance | G4: Potential practices to follow to decrease the risk of having defects | 82% | 6% | 13.67
(RQ2) Willingness to adopt | G1: Risky current practices that lead the defect model to predict a file as defective | 74% | 12% | 6.17
(RQ2) Willingness to adopt | G2: Non-risky current practices that lead the defect model to predict a file as clean | 66% | 12% | 5.50
(RQ2) Willingness to adopt | G3: Potential practices to avoid to not increase the risk of having defects | 52% | 22% | 2.36
(RQ2) Willingness to adopt | G4: Potential practices to follow to decrease the risk of having defects | 72% | 12% | 6.00
4.1 Approach
To investigate practitioners’ perceptions of SQA plan-
ning activities and their feedback on our proposed four
types of data-driven guidance to support such activities,
we conducted a survey study with 50 software practi-
tioners. As suggested by Kitchenham and Pfleeger [24],
we considered the following steps when conducting our
study: (1) design and develop a survey, (2) evaluate a
survey, (3) recruit and select participants, (4) verify data,
and (5) analyse data. We describe each step below.
(Step 1) Design and develop a survey. We first
devised the concept of data-driven software quality as-
surance (SQA) planning with respect to the 4 types of
rules generated by our approach. We then wanted to in-
vestigate practitioners’ perceptions along 4 dimensions,
i.e., perceived importance, being used in practice, perceived
time-consumption, and perceived difficulty. We
designed our survey as a cross-sectional study where
participants provide their responses at one fixed point in
time. The survey consists of 16 closed-ended questions
and 4 open-ended questions. For closed-ended ques-
tions, we use agreement and evaluation ordinal scales.
To mitigate any inconsistency of the interpretation of
numeric ordinal scales, we labeled each level of the
ordinal scales with words as suggested by Krosnick [26]
(e.g., strongly disagree, disagree, neutral, agree, and
strongly agree). The format of the survey is an online
questionnaire where we use an online questionnaire ser-
vice as provided by Google Forms. When accessing the
survey, each participant is provided with an explanatory
statement that describes the purpose of the study, why
the participant is chosen for this study, possible benefits
and risks, and confidentiality. The survey takes approx-
imately 15 minutes to complete and is anonymous.
(Step 2) Evaluate a survey. We carefully evaluated the
survey via pre-testing [28] to assess the reliability and
validity of the survey. We iterated this evaluation process
to identify and fix potential problems (e.g., missing,
unnecessary, or ambiguous questions) until reaching a
consensus. Finally, the survey has been rigorously re-
viewed and approved by the Monash University Human
Research Ethics Committee (MUHREC ID: 22542).
(Step 3) Recruit and select participants. The target
population of the survey is software practitioners. To
reach the target population, we used a recruiting service
provided by the Amazon Mechanical Turk to recruit
50 participants as a representative subset of the tar-
get population. We use the participant filter options of
"Employment Industry - Software & IT Services" and
"Job Function - Information Technology" to ensure that
the recruited participants are valid samples representing
the target population. We pay 6.4 USD as a monetary
incentive for each participant [10, 40].
Fig. 4: (RQ1) The Likert scores of the practitioners’ perceptions
of SQA planning along four dimensions, i.e., importance,
being used in practice, time-consuming, and difficulty.
(Step 4) Verify data. To verify our survey response
data, we manually read all of the open-question re-
sponses to check the completeness of the responses i.e.,
whether all questions were appropriately answered. We
excluded 11 responses that were missing or not related
to the questions. In the end, we had a set of 989
responses. We summarized and presented the results of
closed-ended responses in a Likert scale with stacked
bar plots, while we discussed and provided examples of
open-ended responses.
(Step 5) Analyse data. We manually analysed the re-
sponses of the open-ended questions to extract in-depth
insights. For closed-ended questions, we summarise and
present key statistical results. We compute the agree-
ment and disagreement percentage of each closed-ended
question. The agreement percentage of a statement is the
percentage of respondents who strongly agree or agree
with a statement (% strongly agree + % agree), while the
disagreement percentage of a statement is the percentage
of respondents who strongly disagree or disagree with a
statement (% strongly disagree + % disagree). We also com-
pute an agreement factor of each statement as suggested
by Wan et al. [51]. The agreement factor is a measure of
agreement between respondents, which is calculated for
each statement using the following equation: (% strongly
agree + % agree)/(% strongly disagree + % disagree). High
values of agreement factors indicate a high agreement
of respondents to a statement. The agreement factor of 1
indicates that the numbers of respondents who agree and
disagree with a statement are equal. Finally, low values
of agreement factors indicate a high disagreement of
respondents with a statement.
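For clarity, the agreement percentage, disagreement percentage, and agreement factor of a statement can be computed as in the following minimal sketch (the Likert counts are illustrative only, not the actual survey data):

# A minimal sketch of computing the agreement statistics of one statement.
def agreement_stats(counts):
    total = sum(counts.values())
    agreement = 100.0 * (counts["strongly agree"] + counts["agree"]) / total
    disagreement = 100.0 * (counts["strongly disagree"] + counts["disagree"]) / total
    factor = agreement / disagreement if disagreement > 0 else float("inf")
    return agreement, disagreement, factor

# Illustrative counts for 50 respondents (not the actual survey data).
counts = {"strongly agree": 18, "agree": 25, "neutral": 4,
          "disagree": 2, "strongly disagree": 1}
print(agreement_stats(counts))  # (86.0, 6.0, 14.33...)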
4.2 Respondent Demographics
The demographics of our 50 practitioner survey respon-
dents are as follows:
•Country of Birth: India (58%) and US (36%)
•Roles: developers (50%), managers (42%), and others
(8%)
•Years of Professional Experience: less than 5 years
(26%), 6–10 years (38%), 11–15 years (22%), 16–20
years (12%), and more than 25 years (2%)
•Programming Language: Java (44%), Python (30%),
C/C++/C# (28%), and JavaScript (12%)
•Use of Static Analysis Tools: Yes (62%) and No (38%)
Fig. 5: (RQ2) The Likert scores of the perceived usefulness,
the perceived importance, and the willingness to
adopt of the respondents for each proposed guidance.
These demographics indicate that the responses are
collected from practitioners who reside in various coun-
tries, have a range of roles, varied years of experience,
and varied programming language expertise. This indi-
cates that our findings are likely not bound to specific
characteristics of practitioners.
4.3 Results
(RQ1) How do practitioners perceive SQA planning
activities?
Results. For SQA planning activities, 86% of the
respondents perceived them as important and 70% perceived
them as being used in practice. However, 66% perceived
them as time-consuming and 58% perceived them as difficult.
Figure 4 shows the distributions of the Likert scores of the
practitioners’ perceptions of SQA planning activities.
The survey results show that SQA planning activities are
perceived as important by 86% of the respondents, and
are being used in practice by 70% of the respondents.
However, they are perceived as time-consuming by 66%
of the respondents and as difficult to do by 58% of the re-
spondents. Table 1 also shows that the agreement factor
of all studied dimensions of SQA planning activities are
of above 1 with the values of 2.42 - 14.33. This indicates
that most respondents agree (while having very few
respondents who disagree) that SQA planning activities
are important, being used in practice, time-consuming,
and difficult.
Respondents described that some of the SQA planning
activities in their organisations involve human heuris-
tics in decision-making. For example, they used docu-
mentation and review checklists [7] (e.g., R34: “Lessons
learnt from projects are documented and common mistakes
are included in review checklists to ensure that they are not
repeated.”), and team meetings (e.g., R10: “team meetings,
brainstorm, and in house system”, and R48: “... through
step by step manual processes working together in a core
team”). These findings indicate that a data-informed
SQA planning tool is needed to support QA teams in making
better data-informed decisions and policies.
(RQ2) How do practitioners perceive our proposed
four types of guidance to support SQA planning?
Results. Both (G1) the guidance on risky practices that
lead a model to predict a file as defective and (G4)
the guidance on the practices to follow to decrease
the risk of having defects are perceived by the respondents
as among the most useful, the most important, and the
most likely to be adopted. Figure 5 shows
the Likert scores of the perceived usefulness, the perceived
importance, and the willingness to adopt for each of the
proposed four types of guidance. The sur-
vey results show that all types of guidance are perceived
as useful by 52%-80% of the respondents, important by
60%-82% of the respondents, and considered willing to
adopt by 52%-72% of the respondents. Similar to RQ1,
we observed that the values of the agreement factor for all
of the proposed guidance are higher than 1 for all of the
studied dimensions. This suggests that most respondents
(with very few who disagree) perceive all four types of
proposed guidance as useful and important, and are
willing to adopt them.
Respondents provided positive feedback on our pro-
posed four types of guidance since these types of guid-
ance can help with SQA planning (e.g., R37: “It allows the
QA team who might not necessarily know the changes that
have gone into each program to focus their energy on the most
risky components, programs, or functionalities. It also gives
managers a great view of the risks involved and how it could
potentially be reduced or mitigated.”).
However, some respondents raised critical concerns
related to the potential negative impact of these four
types of guidance on the development process.
For example, cost of implementation and internal re-
sistance (e.g., R27: “Some extra time spent improving the
process. Needing to implement the process including training.
Employee resistance to adoption.”), and lax development
practice (e.g., R30: “Sometimes we get too reliant on the
automated processes and other things slip through ...”).
5 OUR AI-DRIVEN SQAPLANNER APPROACH
Our SQAPlanner consists of two major phases: (1)
developing defect prediction models; and (2) generating
four types of guidance using a local rule-based model-
agnostic technique to explain the predictions of defect
models. Figure 6 presents an overview workflow of our
SQAPlanner approach.
5.1 Phase 1: Developing Defect Prediction Models
There is a plethora of classification techniques that have
been used to develop defect prediction models [13,
17, 46]. We first select the following five classification
techniques, i.e., Decision Trees (DT), Logistic Regression
(LR), multi-layer Neural Network (NN), Random Forest
(RF), and Support Vector Machine (SVM). These classi-
fication techniques are popularly-used in defect predic-
tion studies. Since the performance of defect prediction
models may vary depending on the studied datasets,
we first conduct a preliminary analysis to identify the
most accurate classification techniques for our study. We
use the implementation of the selected five classification
techniques provided by the scikit-learn Python package.
For each training dataset, we build defect prediction
models using all of the 65 software metrics (see Table 3
and Table 4). To ensure that our experiment is strictly-
controlled and fair across the studied classification tech-
niques, we use the default setting of the classification
techniques provided by the scikit-learn Python package,
do not apply feature selection techniques, and do not ap-
ply class rebalancing techniques. This setting will ensure
that the results are not bound to (i.e., not sensitive to)
the randomization of the non-deterministic optimization
algorithms [48], feature selection algorithms [22], and
class rebalancing algorithms [43]. Then, we evaluate
the performance of each classification technique using
testing datasets. Then, we measure the predictive ability
of defect models using an Area Under the Receiver
Operating Characteristic Curve (AUROC or AUC). AUC
measures the ability to distinguish defective and clean
files. The values of AUC range from 0 to 1. The AUC
value of 0 is considered the worst performance, the AUC
value of 0.5 is considered as merely random guessing,
and the AUC value of 1 is considered the best perfor-
mance [18].
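The following self-contained sketch illustrates this preliminary analysis with the scikit-learn default settings; a synthetic dataset stands in for the 65 software metrics of a training and a testing release, and probability estimates are enabled for SVM (a non-default option) so that AUC can be computed.

# A minimal sketch of training the five studied classification techniques with the
# scikit-learn default settings and measuring AUC on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a defect dataset with 65 software metrics.
X, y = make_classification(n_samples=1000, n_features=65, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(),
    "NN": MLPClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(probability=True),  # probability=True (non-default) to obtain class probabilities
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    prob = clf.predict_proba(X_test)[:, 1]    # predicted probability of the defective class
    print(name, roc_auc_score(y_test, prob))  # AUC of each studied technique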
Then, we use the Non-Parametric Scott-Knott ESD
test (Version 3.0) to find the classification techniques
that perform best across our studied datasets. We chose
the Non-Parametric Scott-Knott ESD test, since it does
not produce overlapping groups like other post-hoc
tests (e.g., Nemenyi’s test) [47] and it does not require
the assumptions of normal distributions, homogeneous
distributions, and the minimum sample size.
Fig. 6: An overview diagram of our SQAPlanner to
generate four types of guidance in the form of rule-based
explanations for each file.
The Non-
Parametric ScottKnott ESD test is a multiple compar-
ison approach that leverages a hierarchical clustering
to partition the set of median values of techniques
(e.g., medians of variable importance scores, medians
of model performance) into statistically distinct groups
with non-negligible difference. The mechanism of the
Non-Parametric Scott-Knott ESD test consists of 2 steps:
(Step 1) Find a partition that maximizes the median
of each distribution between groups using the non-
parametric Kruskal-Wallis test with Chi-square statistics.
(Step 2) Split the distributions into two groups or merge
them into one group using the non-parametric Cliff’s |δ|
effect size. The implementation of the Non-Parametric
ScottKnott ESD test is available in the ScottKnott ESD R
package (Version 3.0).4
Random Forest is the most accurate studied classifi-
cation technique with a median AUC value of 0.77.
Figure 7 presents the Scott-Knott ESD ranking of the
studied classification techniques with the distribution
of the AUC values. We find that other classification
techniques achieve a median AUC value of 0.74, 0.63,
0.65, and 0.59 for SVM, DT, NN, LR, respectively. Finally,
the ScottKnottESD test confirms that random forest
statistically outperforms the other classification techniques.
For the rest of the paper, we focus on the random forest
models due to the following reasons:
•Random Forest is one of the most accurate studied
classification techniques for our case study and is
less sensitive to parameter settings [46, 48];
•Random Forest is a classification technique that is
to a certain degree explainable with its own built-in
feature importance techniques (e.g., Gini importance
and permutation importance) [5, 21–23]. Since SVM
does not have its own built-in feature importance
techniques, we excluded SVM from our analysis;
and
4. http://github.com/klainfo/ScottKnottESD
Fig. 7: The Non-Parametric Scott-Knott ESD ranking of
the studied classification techniques with the distribution
of the AUC values.
•Random Forest is a classification technique that is
robust to overfitting [46], outliers [43], and class
mislabelling [45].
5.2 Phase 2: Generating Four Types of Guidance
Using a Local Rule-based Model-agnostic Technique
There are 5 major steps for generating four types of guid-
ance using a local rule-based model-agnostic technique.
First, for each instance to be explained (iexplain), we
select the nearest instances surrounding such an instance
to be explained from the training set (Inearest), cf. Line
1. Second, we generate synthetic instances (Isynthetic)
around the neighbourhood of each instance to be ex-
plained, cf. Line 2. Then, we create a set of combined
instances as I_combined = I_nearest ∪ I_synthetic, cf. Line 3,
which is a combination of the nearest instances and
the synthetic instances. Third, we use the global defect
prediction models to generate the predictions of the
combined instances (i.e., PIcombined ), cf. Line 4. Fourth, to
learn the associations between the synthetic features and
the predictions of the global defect prediction models,
we use the Magnum Opus association rule learning
algorithm [53] to generate a set of optimal association
rules that are the most predictive (i.e., rules with the
highest confidence) and the most interesting (i.e., rules
with the highest lift) from the combined instances and
their predictions, cf. Line 5. Finally, we classify the set of
association rules into four types of rule-based guidance
with respect to a contingency table of such association
rules and identify the best rule for each type of guidance,
cf. Line 6. Below, we explain each major step in details.
Phase 2-1: Select the nearest instances surrounding an
instance to be explained
We assume that instances from the neighbourhood of the
instance to be explained have approximately equivalent
characteristics to an instance to be explained.
Fig. 8: An approach to select instances around the neighbourhood.
Figure 8
presents an overview of the steps to select the nearest
instances from the neighbourhood of the instance to be
explained. In particular, there are three steps as follows:
(Step 1) – Normalize feature values. Different features
may have different units and thus their range values may
vary greatly. For example, LOC (e.g., 100 lines of code)
and Ownership (e.g., an ownership score of 0.5). Thus,
we first apply a Z-score normalization to each feature in
defect datasets.
(Step 2) – Compute the similarity scores of instances
in training data. To do so, we first compute the Eu-
clidean distance between the instances in the training
data (Tr_x) and the instance to be explained (i_e). Then,
we apply an exponential kernel function to convert
such Euclidean distances into similarity scores, making
the distances more linearly distributed.
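A minimal sketch of Steps 1 and 2 is shown below; the kernel width is an assumed parameter, since its value is not specified above.

# A minimal sketch of Steps 1-2: z-score normalization, Euclidean distances to the
# instance to be explained, and conversion to similarity scores with an exponential kernel.
import numpy as np

def similarity_scores(Tr_x, i_explain, kernel_width=1.0):
    mu, sigma = Tr_x.mean(axis=0), Tr_x.std(axis=0) + 1e-12
    Z = (Tr_x - mu) / sigma                             # Step 1: z-score normalization
    z_e = (i_explain - mu) / sigma
    dist = np.sqrt(((Z - z_e) ** 2).sum(axis=1))        # Step 2: Euclidean distances
    return np.exp(-(dist ** 2) / (kernel_width ** 2))   # exponential kernel -> similarity scores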
(Step 3) – Select the smallest number of the most
similar instances using the top-N instances of each
class. To do so, we first sort the similarity scores of
instances (sim) in descending order for each class. Then,
we select the top Ninstances of each class from the
sorted similarity scores. The lowest similarity score of the
top Ninstances of each class (i.e., Min(simTrue
Nth , simFalse
Nth ))
is used as a threshold to select the minimum number
of the most similar instances. Such the lowest similarity
score among the top Ninstances of both classes is
used to determine the boundary of the neighbourhood.
For example, given an example of N= 10, the lowest
similarity scores of the top-10 instances with the highest
similarity scores of DEFECT and CLEAN classes are
0.8 and 0.9, respectively. Therefore, in this example,
the similarity score of 0.8 (the 10th instance from class
DEFECT) is used to determine the boundary of the
neighbourhood. The selected instances are instances that
have similarity scores above 0.8 (i.e., sim ≥ 0.8).
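Building on the similarity scores sketched above, Step 3 can be sketched as follows, with N = 10 as in the example; sim and y denote the similarity scores and class labels of the training instances.

# A minimal sketch of Step 3: use the lowest similarity score among the top-N instances
# of each class as the threshold, and select all instances at or above that threshold.
import numpy as np

def select_neighbourhood(sim, y, N=10):
    thresholds = []
    for cls in (True, False):                               # DEFECT and CLEAN classes
        scores = np.sort(sim[y == cls])[::-1]               # similarity scores, descending
        thresholds.append(scores[min(N, len(scores)) - 1])  # score of the N-th instance
    threshold = min(thresholds)                             # min(sim^True_Nth, sim^False_Nth)
    return np.where(sim >= threshold)[0]                    # indices of the selected instances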
Phase 2-2: Generate synthetic instances to expand the
neighbourhood
The number of selected nearest instances in the neigh-
bourhood may not be enough to accurately learn the
behaviour of the instance to be explained. Thus, we
generate synthetic instances to expand the neighbour-
hood. To do so, we use the crossover (or interpolation)
technique and the mutation technique to generate new
Algorithm 1: A Local Rule-based Model Interpretability with k-optimal Associations
Input: Tr_x — training instances without target (class label)
    Tr_y — target (class label) of training instances
    i_explain — an instance to be explained
    M — a global defect prediction model
    N_features — # of features
    N_synthetic — # of the new instances to be generated
Output: G_iexplain — four types of rule-based guidance for the instance to be explained, i_explain
1 I_nearest ← SelectFromNeighbourhood(Tr_x, i_explain)
2 I_synthetic ← GenerateFromNeighbourhood(I_nearest, N_features, N_synthetic, i_explain)
3 I_combined ← I_nearest ∪ I_synthetic
4 P_Icombined ← GetPredictFromGlobalModel(I_combined, M)
5 R_iexplain ← GenerateMagnumOpusRules(I_combined, P_Icombined)
6 G_iexplain ← GenerateRuleGuidance(R_iexplain, i_explain, P_iexplain)
7 return G_iexplain
synthetic instances while ensuring that the majority of
such synthetic instances are within the neighbourhood
of the instance to be explained. Below, we describe how
we generate synthetic instances using the crossover and
the mutation techniques in details.
Generate synthetic instances using the crossover
technique. To do so, we randomly select two different
instances from the neighbourhood of the instance to
be explained. Then, we generate the synthetic instances
based on the crossover technique using the following
equation:
I_crossover = x + (y − x) ∗ α    (1)
where x and y are random parent instances from the
training set, and α is a randomly generated number
between 0 and 1.
Generate synthetic instances using the mutation
techniques. To do so, we randomly select three different
instances from the neighbourhood of the instance to be
explained. Then, we generate synthetic instances based
on the mutation technique [42] using the following equa-
tion:
I_mutation = x + (y − z) ∗ µ    (2)
where x, y, and z are random parent instances from the
training set, and µ is a randomly generated number
between 0.5 and 1.
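A minimal sketch of both operators is given below; how many instances each operator contributes is an assumption (the sketch simply alternates between them up to N_synthetic).

# A minimal sketch of generating synthetic instances with the crossover (Eq. 1) and
# mutation (Eq. 2) techniques, drawing parents from the selected neighbourhood (a 2-D array).
import numpy as np

rng = np.random.default_rng()

def crossover(neighbourhood):
    x, y = neighbourhood[rng.choice(len(neighbourhood), size=2, replace=False)]
    alpha = rng.uniform(0.0, 1.0)
    return x + (y - x) * alpha        # Eq. (1)

def mutation(neighbourhood):
    x, y, z = neighbourhood[rng.choice(len(neighbourhood), size=3, replace=False)]
    mu = rng.uniform(0.5, 1.0)
    return x + (y - z) * mu           # Eq. (2)

def generate_synthetic(neighbourhood, n_synthetic):
    ops = [crossover, mutation]       # assumed 50/50 mix of the two operators
    return np.array([ops[i % 2](neighbourhood) for i in range(n_synthetic)])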
Phase 2-3: Generate the predictions of the nearest in-
stances and the synthetic instances from defect prediction
models
Firstly, we refer to the set of the nearest instances
(generated in Phase 2-1) and the synthetic instances
(generated in Phase 2-2) as the combined instances
I_combined, where I_combined = I_nearest ∪ I_synthetic. Then, we
generate the predictions of such combined instances in
the neighbourhood (i.e., P_Icombined) from the
defect prediction models to learn the behaviour and the
logic of such defect prediction models.
Phase 2-4: Generate association rules using Magnum
Opus association rule mining
The Magnum OPUS association rule mining algorithm
performs statistically sound association rule mining
by combining k-optimal association discovery tech-
niques [54] and the OPUS search algorithm [53] to find
the kmost interesting associations according to a defined
criterion (e.g., lift, confidence, coverage). The effective-
ness of our SQAPlanner relies on this algorithm to gen-
erate the rule-based explanations. With the functionality
of the OPUS search algorithm, it will effectively prune
the search space by discarding the associations which
are likely to be spurious, and removing false positives
by performing Fisher’s exact hypothesis test. We use an
implementation of the k-optimal association rule mining
technique as provided by the BigML platform.5
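Magnum Opus (as provided by BigML) is a commercial implementation; for illustration only, the sketch below mines association rules with the open-source mlxtend package as a stand-in, ranking them by confidence and lift. It assumes `onehot` is a Boolean DataFrame of discretized metric conditions (e.g., "LOC>100") plus the predicted class columns, and it is not the algorithm used by SQAPlanner.

# For illustration only: mining association rules from the combined instances and their
# predictions with mlxtend as a stand-in for the Magnum Opus k-optimal rule discovery.
from mlxtend.frequent_patterns import apriori, association_rules

frequent = apriori(onehot, min_support=0.05, use_colnames=True)              # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)  # candidate rules

# Keep rules whose consequent is a predicted class and rank by confidence, then lift.
rules = rules[rules["consequents"].apply(lambda c: c <= {"DEFECT", "CLEAN"})]
rules = rules.sort_values(["confidence", "lift"], ascending=False)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())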
Phase 2-5: Generate four types of rule-based guidance
Finally, we classify the optimal set of association rules
that are identified by Magnum OPUS into four categories
with respect to a contingency table of the LHS and RHS
of the association rules. Then, we identify the best rule
that is the most predictive and the most interesting for
each type of guidance as the output of SQAPlanner.
To better illustrate how we classify the output rules
generated by Magnum OPUS, we use four examples
of an association rule as a subject of this explanation.
Given an instance to be explained, i_explain, that has 200 lines
of code (LOC = 200) and is predicted as DEFECT
by the global defect prediction model, our SQAPlanner
framework generates the following four types of rule-
based explanations:
G1: Risky current practices that lead the defect model
to predict a file as defective.
Technical Name. Supporting Rules (R+).
Definition. if LHS = true, then RHS = true.
Example. {LOC > 150} =associate=> DEFECT
Interpretation. This example is a supporting rule, since (1) the antecedent (LHS) of the rule holds true, as the actual LOC of i_explain (i.e., 200) is higher than 150, and (2) the consequent (RHS) of the rule holds true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.
G2: Non-risky current practices that lead the defect
model to predict a file as clean.
Technical Name. Contradicting Rules (R−).
Definition. if LHS = true, then RHS = false.
Example. {LOC < 500} =associate=> CLEAN
Interpretation. This example is a contradicting rule, since (1) the antecedent (LHS) of the rule holds true, as the actual LOC of i_explain (i.e., 200) is lower than 500, yet (2) the consequent (RHS) of the rule does not hold true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.
5. https://bigml.com/
G3: Potential practices to avoid to not increase the risk
of having defects.
Technical Name. Hypothetical Supporting Rules (RH+).
Definition. if LHS = false, then RHS = true.
Example. {LOC > 300} =associate=> DEFECT
Interpretation. This example is a hypothetical supporting rule, since (1) the antecedent (LHS) of the rule does not hold true, as the actual LOC of i_explain (i.e., 200) is not higher than 300, yet (2) the consequent (RHS) of the rule holds true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.
G4: Potential practices to follow to decrease the risk
of having defects.
Technical Name. Hypothetical Contradicting Rules or Counterfactual Rules (RH−).
Definition. if LHS = false, then RHS = false.
Example. {LOC < 100} =associate=> CLEAN
Interpretation. This example is a hypothetical contradicting rule, since (1) the antecedent (LHS) of the rule does not hold true, as the actual LOC of i_explain (i.e., 200) is not lower than 100, and (2) the consequent (RHS) of the rule does not hold true, as the prediction of i_explain generated by the global defect prediction model is DEFECT.
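The classification of a rule into G1-G4 therefore reduces to a two-by-two contingency between (1) whether the rule's antecedent holds for the instance's actual metric values and (2) whether its consequent matches the global model's prediction. The sketch below illustrates this logic; the lhs_holds helper and the rule.consequent attribute are assumptions for illustration, not the authors' API.

```python
# A minimal sketch of Phase 2-5: classify a mined rule into one of the four
# guidance types by checking (1) whether its antecedent (LHS) holds for the
# actual metric values of the instance to explain and (2) whether its
# consequent (RHS) matches the prediction of the global defect model.
# `lhs_holds(rule, instance)` is an assumed helper that evaluates the rule's
# conditions (e.g., LOC > 150) against the instance's feature values.

def classify_rule(rule, instance, prediction, lhs_holds):
    lhs = lhs_holds(rule, instance)            # antecedent true/false
    rhs = (rule.consequent == prediction)      # consequent matches prediction
    if lhs and rhs:
        return "G1: supporting rule (risky current practice)"
    if lhs and not rhs:
        return "G2: contradicting rule (non-risky current practice)"
    if not lhs and rhs:
        return "G3: hypothetical supporting rule (practice to avoid)"
    return "G4: hypothetical contradicting / counterfactual rule (practice to follow)"
```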
6 EXPERIMENTAL DESIGN AND RESULTS
In this section, we aim to investigate (RQ3) the effec-
tiveness, (RQ4) the stability, and (RQ5) the applicability
of the rule-based explanations generated by our SQA-
Planner. Below, we describe the studied projects, the
experimental design, and present the results.
6.1 Studied Projects
To select some suitable projects, we identified three
important criteria that need to be satisfied:
•Criterion 1 — Publicly-available defect datasets:
To support verifiability and foster replicability of
our study, we choose to train our defect prediction
models using publicly available defect datasets.
•Criterion 2 — Multiple releases: The central hy-
pothesis of our approach is that the guidance that is
derived from past knowledge (a release k−1) can be
used to explain the predictions of defective files in
the target releases (a release k) and be applicable to
prevent software defects in future releases (a release
k+ 1). Thus, we need multiple releases for each
software project to validate our hypothesis.
•Criterion 3 — Labels of defective files are based on
actual affected releases: Prior work raises concerns
that the approximation of post-release window periods (e.g., 6 months) that are popularly used in many defect datasets may introduce bias to the construct validity of our results [56]. Instead of
relying on traditional post-release window periods,
we choose to use defect datasets that are labeled
TABLE 2: A statistical summary of the studied systems.
Name Description #DefectReports No. of files Defective Rate KLOC Studied Releases
ActiveMQ Messaging and Integration Patterns server 3,157 1,884-3,420 6%-15% 142-299 5.0.0, 5.1.0, 5.2.0, 5.3.0, 5.8.0
Camel Enterprise Integration Framework 2,312 1,515-8,846 2%-18% 75-383 1.4.0, 2.9.0, 2.10.0, 2.11.0
Derby Relational Database 3,731 1,963-2,705 14%-33% 412-533 10.2.1.6, 10.3.1.4, 10.5.1.1
Groovy Java-syntax-compatible OOP for JAVA 3,943 757-884 3%-8% 74-90 1.5.7, 1.6.0.Beta_1, 1.6.0.Beta_2
HBase Distributed Scalable Data Store 5,360 1,059-1,834 20%-26% 246-534 0.94.0, 0.95.0, 0.95.2
Hive Data Warehouse System for Hadoop 3,306 1,416-2,662 8%-19% 287-563 0.9.0, 0.10.0, 0.12.0
JRuby Ruby Programming Lang for JVM 5,475 731-1,614 5%-18% 105-238 1.1, 1.4, 1.5, 1.7
Lucene Text Search Engine Library 2,316 805-2,806 3%-24% 101-342 2.3.0, 2.9.0, 3.0.0, 3.1.0
Wicket Web Application Framework 3,327 1,672-2,578 4%-7% 109-165 1.3.0.beta1, 1.3.0.beta2, 1.5.3
TABLE 3: A summary of the studied code metrics.
File (37 metrics): AvgCyclomatic, AvgCyclomaticModified, AvgCyclomaticStrict, AvgEssential, AvgLine, AvgLineBlank, AvgLineCode, AvgLineComment, CountDeclClass, CountDeclClassMethod, CountDeclClassVariable, CountDeclFunction, CountDeclInstanceMethod, CountDeclInstanceVariable, CountDeclMethod, CountDeclMethodDefault, CountDeclMethodPrivate, CountDeclMethodProtected, CountDeclMethodPublic, CountLine, CountLineBlank, CountLineCode, CountLineCodeDecl, CountLineCodeExe, CountLineComment, CountSemicolon, CountStmt, CountStmtDecl, CountStmtExe, MaxCyclomatic, MaxCyclomaticModified, MaxCyclomaticStrict, RatioCommentToCode, SumCyclomatic, SumCyclomaticModified, SumCyclomaticStrict, SumEssential
Class (5 metrics): CountClassBase, CountClassCoupled, CountClassDerived, MaxInheritanceTree, PercentLackOfCohesion
Method (12 metrics): CountInput_{Min, Mean, Max}, CountOutput_{Min, Mean, Max}, CountPath_{Min, Mean, Max}, MaxNesting_{Min, Mean, Max}
TABLE 4: A summary of the studied process and own-
ership metrics.
Metrics Description
Process Metrics
COMM The number of Git commits
ADDED_LINES The normalized number of lines added to the module
DEL_LINES The normalized number of lines deleted from the module
ADEV The number of active developers
DDEV The number of distinct developers
Ownership Metrics
MINOR_COMMIT The number of unique developers who have contributed
less than 5% of the total code changes (i.e., Git commits)
on the module
MINOR_LINE The number of unique developers who have contributed
less than 5% of the total lines of code on the module
MAJOR_COMMIT The number of unique developers who have contributed
more than 5% of the total code changes (i.e., Git commits)
on the module
MAJOR_LINE The number of unique developers who have contributed
more than 5% of the total lines of code on the module
OWN_COMMIT The proportion of code changes (i.e., Git commits) made
by the developer who has the highest contribution of
code changes on the module
OWN_LINE The proportion of lines of code written by the developer
who has the highest contribution of lines of code on the
module
based on the affected releases, as suggested by recent studies [8, 56].
Thus, we finally selected a corpus of publicly available
defect datasets provided by Yatish et al. [56] where the
ground-truths are labeled based on the affected releases.
These datasets consist of 32 releases that span 9 open-
source, real-world, non-trivial software systems. Table 2
shows a statistical summary of the studied datasets. Each
dataset has 65 software metrics along 3 dimensions, i.e.,
54 code metrics, 5 process metrics, and 6 human metrics.
Table 3 shows a summary of the static code metrics,
while Table 4 shows a summary of the process and
human metrics. The full details of the data collection
process are available at Yatish et al. [56].
6.2 Experimental Design
We hypothesize that the guidance that is derived from
past knowledge (a release k−1) can be used to explain
the predictions of defective files in the target releases (a
release k) and be applicable to prevent software defects
in future releases (a release k+ 1). Thus, we evaluate
our approach (see Figure 9) using a set of three con-
secutive releases (k-1, k, and k+1) for training, testing,
and explanation evaluation, respectively. We first trained
our defect models using a random forest classification
technique on a training release (i.e., a release k−1). Then,
we generate rule-based explanations for each file in the
testing release (i.e., a release k). Finally, we evaluate
the applicability of the rule-based explanations with the
explanation evaluation release (i.e., a release k+1). For example, in the ActiveMQ project, we first use the release 5.0.0 for training, the release 5.1.0 for testing, and the release 5.2.0 for explanation evaluation. We
repeat the experiment similarly for the other consecutive
releases (i.e., {5.1.0, 5.2.0, 5.3.0}, {5.2.0, 5.3.0, 5.8.0}) and
for other projects.
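A hedged sketch of this sliding-window evaluation is shown below; the load_release and explain_with_sqaplanner helpers, the defective label column, and the random forest settings are assumptions for illustration rather than the authors' exact setup.

```python
# A hedged sketch of the evaluation setup: train on release k-1, generate
# rule-based explanations for release k, and evaluate them against release
# k+1. `load_release(project, release)` is an assumed helper returning a
# DataFrame with the software metrics and a boolean `defective` label.
from sklearn.ensemble import RandomForestClassifier

def run_experiment(project, releases, load_release, explain_with_sqaplanner):
    for train_rel, test_rel, eval_rel in zip(releases, releases[1:], releases[2:]):
        train = load_release(project, train_rel)        # release k-1
        test = load_release(project, test_rel)          # release k
        evaluation = load_release(project, eval_rel)    # release k+1

        features = [c for c in train.columns if c != "defective"]
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(train[features], train["defective"])

        # Generate rule-based explanations for each file in the testing
        # release, then check their applicability against release k+1.
        rules = explain_with_sqaplanner(model, train[features], test[features])
        yield test_rel, rules, evaluation

# Example usage with the ActiveMQ releases listed in Table 2:
# for k, rules, eval_data in run_experiment(
#         "ActiveMQ", ["5.0.0", "5.1.0", "5.2.0", "5.3.0", "5.8.0"],
#         load_release, explain_with_sqaplanner):
#     ...
```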
6.3 Results
(RQ3) How effective are the rule-based explanations
generated by our SQAPlanner approach when com-
pared to the state-of-the-art approaches?
Motivation. Our SQAPlanner is based on the assump-
tion that our rule-based explanations are generated
based on the approximation of the characteristics of files
that are similar to the file to be explained. This assump-
tion is similar to those of many local rule-based model-
agnostic techniques [15, 37, 38] that the behaviour of the
instance to be explained is similar to the behaviours of
the instances around its neighbourhood. According to
the definition of rule-based explanations in Section 2.4,
[Fig. 9: An evaluation framework of our SQAPlanner approach. Defect models are trained on the training dataset (release k-1); SQAPlanner rules are generated for the testing dataset (release k) to assess effectiveness (RQ3) and stability (RQ4); the explanation evaluation dataset (release k+1) is used to assess applicability (RQ5).]
the rule-based explanations generated by our SQAPlanner are considered effective if they achieve high coverage and high confidence values.
Approach. To address RQ3, we evaluate the rule-based
explanations generated by our SQAPlanner using the
traditional association rule evaluation measures (i.e.,
coverage, confidence, and lift).
Coverage measures the support of the antecedent of an association rule, i.e., the percentage of files that support the rule conditions. Formally, Coverage(p → q) = Support(p), where Support(p) is the proportion of files that fulfill p:

Support(p) = |{files ∈ Dataset such that the files fulfill p}| / #total files
For example, a rule-based explanation (G1) of {DEV > 10} =associate=> DEFECT with a coverage value of 0.9 indicates that 90% of the files fulfill the risky practice of having more than ten developers who touch a file. A high coverage value of the G1 guidance indicates that such a risky practice is common across many files in the dataset.
Confidence (i.e., Precision or Strength) measures the
percentage of files that fulfill the antecedent and conse-
quent together over the number of files that only fulfill
the antecedent, which can be defined as follows:
Confidence(p→q) = Support(p→q)/Support(p).
For example, a rule-based explanation (G1) of {DEV > 10} =associate=> DEFECT with a confidence value of 0.8 indicates that 80% of the files that fulfill the risky practice of having more than ten developers who touch a file are actually defective. A high confidence value of the G1 guidance indicates that such a risky practice is strongly associated with defective files in the dataset.
Lift measures how many times more often the antecedent and consequent occur together compared to what would be expected if they (i.e., both antecedent and consequent) were statistically independent, which can be defined as follows:

Lift(p → q) = Support(p → q) / (Support(p) × Support(q))
[Fig. 10: (RQ3) The distribution of the evaluation measures (coverage, confidence, and lift) of our rule-based explanations when compared to baseline approaches (i.e., LORE and Anchor).]
For example, a rule-based explanation (G1) of {DEV > 10} =associate=> DEFECT with a lift value of 5 indicates that the file will be 5 times (i.e., 500%) more likely to be defective if the rule is fulfilled. A lift value greater than one means that a file is likely to be defective if the conditions are fulfilled, while a lift value less than one means a file is unlikely to be defective if the conditions are fulfilled. A high lift value of the G1 guidance indicates that there is a high chance that a file is defective if such a risky practice is fulfilled. Thus, practitioners should pay attention to guidance rules with a high lift value.
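For reference, the three measures can be computed directly from boolean masks over a dataset of files, as in the following minimal sketch; the antecedent and consequent Series are assumed inputs (e.g., df["DEV"] > 10 and df["label"] == "DEFECT").

```python
# A minimal sketch of the three association-rule measures used in RQ3,
# computed from boolean masks over a dataset of files. `antecedent` and
# `consequent` are assumed to be boolean pandas Series (one entry per file).
def coverage(antecedent):
    return antecedent.mean()                                   # Support(p)

def confidence(antecedent, consequent):
    return (antecedent & consequent).mean() / antecedent.mean()

def lift(antecedent, consequent):
    return ((antecedent & consequent).mean()
            / (antecedent.mean() * consequent.mean()))
```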
Baseline comparison. We compare our SQAPlanner with the two state-of-the-art local rule-based model-agnostic techniques, i.e., Anchor [38] and LORE [15].
Anchor, an extension of LIME [37], was proposed by
Ribeiro et al. [38]. The key idea of Anchor is to select if-
then rules – so-called anchors – that have high confidence,
in a way that features that are not included in the rules
do not affect the prediction outcome if their feature
values are changed. In particular, Anchor selects only
rules with a minimum confidence of 95%, and then
selects the rule with the highest coverage if multiple
rules have the same confidence value.
LORE is proposed by Guidotti et al. [15]. For each
instance to be explained, LORE generates files around
the neighbourhood using a genetic algorithm. LORE
then obtains predictions of the generated files from the
global defect models to learn the behaviour and the
logic of the defect models. Finally, a decision tree is
built on the defined neighbourhood of the instance to
be explained and is then later converted to rules.
Results. Figure 10 presents the results for coverage, con-
fidence, and lift of the local rule-based model-agnostic
techniques.
(Coverage) At the median, 89% of files are supported
by the rule-based explanations, suggesting that our
SQAPlanner outperforms the LORE and Anchor local
rule-based model-agnostic techniques. Figure 10 shows
that the median coverage is 89%, 34%, and 6% for
our SQAPlanner, LORE, and Anchor, respectively. We
suspect that the high coverage values that are achieved
by our SQAPlanner are due to the flexibility of the k-
optimal search that allows us to search particularly for
rules with high coverage. High coverage is important because it measures how representative a rule is of a given dataset; thus, our results suggest that our SQAPlanner achieves the most representative rules.
(Confidence) At the median, 99% of the files that fulfill the antecedent of our rule-based explanations also fulfill the consequent, which outperforms the LORE and Anchor model-agnostic techniques. Fig-
ure 10 shows the distributions of the confidence for
our SQAPlanner, LORE, and Anchor, respectively. We
find that LORE and Anchor achieve high confidence
with median confidence of 95% and 98%, respectively.
We find that the comparable confidence values achieved
by LORE and Anchor have to do with the main opti-
mization goal of Anchor and LORE, since both LORE
and Anchor techniques aim to search for rules with
the highest confidence. Nevertheless, we find that our
SQAPlanner achieves the highest median confidence of
99%.
(Lift) The rule-based explanations generated by our
SQAPlanner achieve a median lift value of 6.6, which
outperforms the LORE and Anchor model-agnostic
techniques. Figure 10 shows that the median lift is 6.6,
5.2, 0.98 for our SQAPlanner, LORE and Anchor respec-
tively. The highest lift value of 6.6 indicates that files will
be 6.6 times (i.e., 660%) more likely to be defective if the
rule is matched. Similarly, the highest lift value of our
SQAPlanner can be attributed to the flexibility of the k-
optimal search that allows us to search particularly for
rules with the highest lift. On the other hand, Anchor
achieves a lower lift score, since Anchor constructs the
neighbourhood in a way that it contains only files of the
same class as the instance in consideration. Thus, the lift
scores for Anchor under these circumstances are equal
to the confidence values.
(RQ4) How stable are the rule-based explanations
generated by our SQAPlanner approach when they
are regenerated?
Motivation. Our SQAPlanner approach and the two
state-of-the-art local rule-based model-agnostic tech-
niques (i.e., LORE and Anchor) involve random data
generation when generating synthetic instances around
the neighbourhood. As such, the randomization bias
may produce different rule-based explanations when the
approaches are re-executed. Thus, we aim to investigate
the consistency of the rule-based explanation of the same
instance when these model-agnostic techniques are re-
executed.
[Fig. 11: (RQ4) The distribution of the Jaccard coefficients of the rule-based model-agnostic techniques.]
Approach. To address RQ4, we repeat our experiment
ten times to investigate the stability of our rules. Since
the rules generated by the baseline comparison are
optimized based on confidence only, we focus on the
rules generated by our approach that are optimized for
confidence as well.
For each rule-based explanation of each file, we use the
Jaccard coefficient to measure the consistency of the gen-
erated rule-based explanations. The Jaccard coefficient
compares the common and the distinct features in two
given sets (e.g., X and Y) using the following equation: J(X, Y) = |X ∩ Y| / |X ∪ Y|. The coefficient ranges from 0% to 100%; the higher the coefficient, the higher the similarity of the rules over two independent runs.
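A minimal sketch of this stability analysis is shown below; representing each re-generated explanation by the set of features appearing in its rule, and averaging pairwise Jaccard coefficients over independent runs, are assumptions of the sketch rather than a description of the authors' exact procedure.

```python
# A minimal sketch of the RQ4 stability analysis: re-run the explanation
# generation several times for the same file and compare the sets of features
# that appear in the resulting rules using the Jaccard coefficient.
def jaccard(features_x, features_y):
    x, y = set(features_x), set(features_y)
    return len(x & y) / len(x | y) if (x | y) else 1.0

def stability(runs):
    # `runs` is assumed to be a list of feature sets, one per independent
    # re-execution for the same instance; pairwise Jaccard is averaged.
    pairs = [(a, b) for i, a in enumerate(runs) for b in runs[i + 1:]]
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```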
Results. Our SQAPlanner approach produces the most
consistent rule-based explanations when compared to
LORE and Anchor. Figure 11 shows that our SQAPlan-
ner achieves a median Jaccard coefficient of 0.92, while
LORE and Anchor achieve a median Jaccard coefficient
of 0.42, and 0.79, respectively. In other words, for each
prediction of an instance to be explained, our rule-
based explanations are (at the median) 92% consistent
with the rule-based explanations when re-executing our
framework in multiple independent runs. In addition,
our SQAPlanner’s rule-based explanations are (at the
median) 13% and 50% more consistent than the rule-
based explanations generated by Anchor and LORE,
respectively. We suspect that the highest consistency
achieved by our approach is a result of the more ro-
bust nature of our framework when selecting similar
instances from the training data and when generating
synthetic instances around the neighbourhood (as de-
scribed in Sections 5.2 and 7). In contrast, Anchor uses
a bandit algorithm [25] to generate neighbours, while
LORE uses a genetic algorithm to generate neighbours.
(RQ5) How applicable are the rule-based explanations
generated by our SQAPlanner approach to minimize
the risk of having defects in the subsequent re-
leases?
Motivation. The central hypothesis of our approach
is that the rule-based explanations derived from past
knowledge (a release k−1) can be used to explain the
[Fig. 12: (RQ5) An approach to evaluate the applicability of the hypothetical contradicting rules. RQ5-a measures the number of instances where the hypothetical contradicting rule follows the actual feature values in the validation data when the prediction changes (e.g., A.java: rule LOC < 900 ⇒ CLEAN, actual LOC = 850 in release k+1); RQ5-b measures the number of instances where the rule does not follow the actual feature values when the prediction does not change (e.g., B.java: rule LOC < 1,100 ⇒ CLEAN, actual LOC = 1,500 in release k+1).]
predictions of defective files in a target release (a release
k), and thus be applicable to guide SQA planning to
prevent software defects in future releases (release k+1).
We want to investigate the proportion of files for which the rule-based explanations are satisfied, and not satisfied, by the actual feature values in the subsequent release.
Approach. To address RQ5, we focus on the hypo-
thetical contradicting rules, which are rules that guide
what are the practices to follow to decrease the risk
of having defects (i.e, whether the prediction of the
same file could be reversed if the rule is followed in
a subsequent release). We note that Anchor and LORE cannot generate hypothetical contradicting rules for all of the files: LORE produces hypothetical contradicting rules for at most 69% of the files across projects (with a median of 41% per project), and Anchor by definition does not generate any hypothetical contradicting rules. Since our
approach is the only one that can generate hypothetical
contradicting rules, we focus only on our SQAPlanner
approach. Figure 12 presents an approach to evaluate
the applicability of the hypothetical contradicting rules.
We analyze the applicability of the hypothetical contra-
dicting rules along 2 perspectives:
RQ5-a: Are hypothetical contradicting rules applied
when the prediction of an instance changes from defec-
tive in a testing release k to clean in a validation release k+1? Hypothetical contradicting rules are considered as
applicable if such rules follow the actual feature values
in the validation data when the prediction of the instance
changes from defective in kto clean in k+1. For example,
A.java is predicted to be defective in the testing data (k)
but predicted to be clean in the validation data (k+1). We
consider that the generated hypothetical contradicting rule (e.g., {LOC < 900} =associate=> CLEAN) is correct if such a rule is in accordance with the actual feature values in the validation data (i.e., LOC = 850). In this
example, the hypothetical contradicting rule suggests
developers reduce the lines of code to less than 900 to
potentially reverse the decision of the defect models from
defective to clean, which is consistent with the validation
data (LOC = 850).
RQ5-b: Are hypothetical contradicting rules
not applied when the prediction of an instance
does not change from defective in a testing release
k to clean in a validation release k+1? Hypothetical contradicting rules are considered applicable if such rules do not follow the actual feature values in the validation data when the prediction of the instance does not change from defective in k to clean in k+1. For example, we consider B.java to be predicted to be defective in both the testing data and the validation data. Thus, we consider that the generated hypothetical contradicting rule (e.g., {LOC < 1,100} =associate=> CLEAN) is applicable if such a rule does not follow the actual feature values in the validation data (i.e., LOC = 1,500).
For each perspective, we compute the number of
instances where the hypothetical contradicting rule does
follow and does not follow the actual feature values in
the subsequent release in RQ5-a and RQ5-b, respectively.
Figure 13 presents the proportion of files whose
hypothetical contradicting rule does follow (RQ5-a) and
does not follow (RQ5-b) the actual feature values in the
subsequent release for each measure.
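A hedged sketch of this applicability check for a single file is shown below; the rule_lhs_holds helper, which evaluates a rule's conditions against the file's actual metric values in the validation release, is an assumed function.

```python
# A hedged sketch of the RQ5 applicability check for a hypothetical
# contradicting (counterfactual) rule of one file. `rule_lhs_holds(rule, row)`
# is an assumed helper that evaluates the rule's conditions (e.g., LOC < 900)
# against the file's actual metric values in the validation release (k+1).
def rq5_applicable(rule, pred_k, pred_k1, file_row_k1, rule_lhs_holds):
    if pred_k == "DEFECT" and pred_k1 == "CLEAN":
        # RQ5-a: prediction reversed -> the counterfactual rule should now be
        # satisfied by the actual feature values in release k+1.
        return rule_lhs_holds(rule, file_row_k1)
    if pred_k == "DEFECT" and pred_k1 == "DEFECT":
        # RQ5-b: prediction unchanged -> the counterfactual rule should still
        # NOT be satisfied by the actual feature values in release k+1.
        return not rule_lhs_holds(rule, file_row_k1)
    return None  # other transitions are not analysed in RQ5
```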
Results.For 55%-87% of the instances in the subse-
quent releases, our SQAPlanner’s hypothetical con-
tradicting rules are correctly applicable when the
prediction of rules changes from defective to clean.
Figure 13 shows that there are 87%, 82% and 55% of
the instances in the subsequent releases that our hy-
pothetical contradicting rules follow the actual feature
values in the validation data with respect to coverage,
confidence, and lift, respectively. This finding indicates
that our SQAPlanner’s hypothetical contradicting rules
learned from past knowledge (k−1) to explain the
predictions of instances from the target release (k) could
potentially reverse the predictions of the same instance
in the subsequent release (k+ 1) from having defects to
clean.
For 67%-81% of the instances in the subsequent
releases, our hypothetical contradicting rules are cor-
[Fig. 13: The percentage of the instances in the subsequent releases where our hypothetical contradicting rules (RQ5-a) follow the actual feature values in the validation data when the decision is changed, and (RQ5-b) do not follow the actual feature values in the validation data when the decision is not changed, for each measure.]
rectly non-applicable when the prediction of rules does
not change. Figure 13 shows there are 67%, 81% and
71% of the instances in the subsequent releases that our
SQAPlanner’s hypothetical contradicting rules do not fol-
low the actual feature values in the validation data when
the prediction of rules does not change with respect
to coverage, confidence, and lift, respectively. In other
words, when files are still defective in the subsequent
release, our hypothetical contradicting rules are still
largely in agreement (i.e., our hypothetical contradicting
rules are correctly non-applicable).
6.4 Discussion & Qualitative Analysis
We conducted a qualitative analysis to illustrate the
effectiveness of our guidance generated by our SQAPlan-
ner. We selected the ErrorHandlerBuilderRef.java of the
release 2.9.0 of the Camel software system as the subject
of this qualitative analysis. Our SQAPlanner approach
correctly predicts this file as defective with a probability
score of 70%. Below, we discuss the implications of our
rule-based explanations to guide developers on what
they could follow and could avoid to decrease the risk
of having defects.
What are risky practices that lead a model to predict a file
as defective?
To answer this question, we use the supporting rule to
generate guidance (G1) for this file as follows:
R+ = {LOCDeclaration > 28.150 & DistinctDeveloper > 1.68 & Ownership < 0.85} =associate=> DEFECT
Implication. This supporting rule indicates that this
file is being predicted as defective since it is associated
with the conditions of having more than 28 lines of
declarative code, more than 1.68 distinct developers,
and a line-based ownership score of less than 85%.
When comparing this to the actual feature values of
the file {LOCDeclaration = 34,DistinctDeveloper = 3,
Ownership = 0.65}, we find that the conditions of this
supporting rule are consistent with the actual feature
values and the consequent is consistent with our SQA-
Planner’s prediction (i.e., defective).
What are the non-risky practices that lead a model to
predict a file as clean?
To answer this question, we use our contradicting rule
to generate guidance (G2) for this file as follows:
R− = {0.440 <= RatioCommentToCode <= 0.960} =associate=> CLEAN
Implication. We find that our contradicting rule is consistent with the actual feature values of the file to be explained. The actual feature value of this file is {RatioCommentToCode = 0.51}, meaning that 51% of the total lines of code are comment lines (i.e., #comments/#LOC). The contradicting rule (R−) indicates that the condition that supports its prediction as not being defective is {0.440 <= RatioCommentToCode <= 0.960}, indicating that files that have a RatioCommentToCode of more than 44% but less than 96% are likely not to be defective. Developers should thus adhere to the contradicting rule, i.e., keep the comment-to-code ratio above 44%, to not increase the risk of having defects.
What are practices to avoid to not increase the risk of
having defects?
To answer this question, we use our hypothetical sup-
porting rule to generate guidance (G3) for this file as
follows:
RH+ = {MinorCommit > 0.000} =associate=> DEFECT
Implication. Having more than zero minor developers
will increase the risk of having defects. The actual feature
value of this file is {MinorCommit = 0}, meaning that this file has no minor developers who edit or change the file. This finding is consistent with Bird et al. [2] and Rahman [32], who found that minor developers often introduce defects. Thus, developers should avoid the practice described by this hypothetical supporting rule in order not to increase the risk of having defects.
What are the practices to follow to decrease the risk of
having defects?
To answer this question, we use our hypothetical con-
tradicting rule to generate guidance (G4) for this file as
follows:
RH− = {LOCBlank < 7.62 & OutputMean < 1.98} =associate=> CLEAN
Implication. If developers changed the file to have less
than 8 blank lines and less than 2 output variables,
this could reverse the prediction of having defects to
being clean. The actual feature values of this file are
{LOCBlank = 19,OutputMean = 3.72}, meaning that
this file has 19 blank lines and an average of 3.7 output
variables (i.e., fan-out) of functions in a file. The hypo-
thetical contradicting rule indicates that if {LOCBlank < 7.62 & OutputMean < 1.98}, then the prediction of the file is likely to be reversed from having defects to being clean. Thus, our hypothetical contradicting rule provides
suggestions to the developers of what they should do to
decrease the risk of having defects. It should be noted
that our contradicting rule shows correlations that may
not necessarily be causal.
7 PRACTITIONERS’ PERCEPTIONS OF OUR
SQAPLANNER VISUALIZATION
In this section, we aim to investigate the practitioners’
perceptions of the visualization of SQAPlanner when
compared to the visualization of the state of the art
(RQ6) and the actual guidance generated by SQAPlanner
(RQ7). Below, we describe the approach and present the
results.
7.1 Approach
To address RQs 6 and 7, we developed a proof-of-
concept to visualize the actionable guidance generated
by our SQAPlanner. Traditionally, the importance scores
of Random Forests or LIME’s model-agnostic techniques
are commonly presented using a bar chart. However,
such bar charts can only indicate the importance scores,
without providing guidance on what to do and what not
to do.
To address this challenge, we propose to use a bullet
plot (see Figure 14). The visualization of our SQAPlanner
is designed to provide the following key information:
(1) the list of guidance that practitioners should follow
and should avoid; (2) the actual feature value of that file;
and (3) its threshold and range values for practitioners to
follow to mitigate the risk of having defects. The green
shades indicate the non-risky range values of features,
while the red shades indicate the risky range values
of features. The vertical bars indicate the actual values
of features for a given file. The green arrows provide
directions of how a feature should be changed (i.e.,
increase or decrease). The list of guidance is structured
into two parts: (1) what to do to decrease the risk of
being defective; and (2) what to avoid to not increase the
risk of being defective. For each guidance, we translate
a rule-based explanation into an actionable guidance. Each guidance is presented in the form of natural language to ensure that it is actionable and understandable by practitioners.
To translate the rule-based explanations into actual
guidance, we focus on only the ErrorHandlerBuilder-
Ref.java of the release 2.9.0 of the Camel software system.
We use the rule-based explanations from Section 6.4 as
a reference. Finally, we derive the following statements
according to the reference rule-based explanations in
Section 6.4:
•(S1) Decreasing the number of class and method
declaration lines to less than 29 lines to decrease
the risk of being defective.
•(S2) Decreasing the number of distinct developers to
less than 2 developers to decrease the risk of being
defective.
•(S3) Increasing the ownership code proportion to
more than 0.85 to decrease the risk of being defec-
tive.
•(S4) Avoid decreasing the comment to code ratio
to less than 0.44 to not increase the risk of being
defective.
•(S5) Avoid increasing the number of minor devel-
opers to more than 0 developers to not increase the
risk of being defective.
•(S6) Decreasing the number of blank lines to less than 8 lines to decrease the risk of being defective.
•(S7) Decreasing the number of output variables to
less than 2 variables to decrease the risk of being
defective.
To implement the visualization of our SQAPlanner
approach, we decided to use Microsoft's Code Defect AI as our core infrastructure. We first downloaded the repository of Code Defect AI from GitHub.6 Then, we
carefully studied their repository and deployed Code
Defect AI in our local environment with continuous sup-
port from the core developer of Code Defect AI. Then,
we integrated our SQAPlanner approach and replaced
their visualization (bar plots) with our visualization
generated by SQAPlanner using the implementation of
bullet plots as provided by the d3.js Javascript library.7
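Although the deployed tool renders the bullet plots with d3.js, a rough matplotlib sketch of a single guidance row (using statement S1 as an example) is shown below; the value ranges and the helper name are illustrative assumptions, not the tool's actual code.

```python
# A rough matplotlib sketch of one guidance row of the SQAPlanner
# visualization: a risky range (red shade), a non-risky range (green shade),
# and a bold vertical line marking the file's actual feature value.
import matplotlib.pyplot as plt

def bullet_row(ax, label, lo, hi, threshold, actual, risky_above=True):
    green = (lo, threshold) if risky_above else (threshold, hi)
    red = (threshold, hi) if risky_above else (lo, threshold)
    ax.axvspan(*green, color="green", alpha=0.3)    # non-risky value range
    ax.axvspan(*red, color="red", alpha=0.3)        # risky value range
    ax.axvline(actual, color="black", linewidth=2)  # actual feature value
    ax.set_xlim(lo, hi)
    ax.set_yticks([])
    ax.set_title(label, loc="left", fontsize=9)

fig, ax = plt.subplots(figsize=(6, 1.2))
bullet_row(ax, "Decreasing the number of class and method declaration lines "
               "to less than 29 lines (actual = 34)", 0, 50, 29, 34)
plt.tight_layout()
plt.show()
```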
To investigate the practitioners’ perceptions of our
SQAPlanner visualization, we used a qualitative survey
as a research method. We also used the visualization of
Microsoft’s Code Defect AI (see Figure 2) as a baseline
comparison. The objectives of the survey are as follows:
(1) to investigate the practitioners’ perceptions of the
visualization of our SQAPlanner; and (2) to investigate
the practitioners’ perceptions of the actionable guidance
generated by our SQAPlanner. Similar to Section 4, we
designed the survey as a cross-sectional study where
participants provide their responses at one fixed point
in time. The design of our survey is described below.
Part 1—Practitioners’ perceptions of the visualizations of
our SQAPlanner: We first provided the concept of defect
prediction models and described how our SQAPlanner
6. https://github.com/aricent/codedefectai
7. https://bl.ocks.org/mbostock/4061961
[Fig. 14 screenshot: the SQAPlanner visualization for ErrorHandlerBuilderRef.java of Apache Camel (release 2.9.0), showing a bug risk prediction of "Yes" with a risk score of 70%, one bullet plot per guidance statement under "What to do to decrease the risk of having defects?" and "What to avoid to not increase the risk of having defects?", red/green shades for the risky/non-risky value ranges, and a bold vertical line for the actual feature value of the file.]
Fig. 14: The visualization of our SQAPlanner is designed to provide the following key information: (1) the list of
guidance that practitioners should follow and should avoid; (2) the actual feature value of that file; and (3) its
threshold and range values for practitioners to follow to mitigate the risk of having defects.
can be used to support SQA planning. Then, we
presented the visualization of our SQAPlanner and the
visualization of Microsoft’s Code Defect AI. We asked
the participants a closed-ended question to inquire which of the visualizations is the best to provide action-
able guidance on how to mitigate the risk of having
defects. We also asked the participants an open-ended
question to inquire about their rationale on why the
selected visualization is preferred over another visual-
ization.
Part 2—Practitioners’ perceptions of the actual guidance
generated by SQAPlanner: We again presented the vi-
sualization of SQAPlanner. Then, we asked the participants a closed-ended question to inquire whether they agree with each of the seven statements that we translated from the rule-based explanations.
We used an online questionnaire service as provided
by Google Forms. We carefully evaluated the survey via
pre-testing [28] to assess the reliability and validity of
the survey. The survey has been rigorously reviewed
and approved by the Monash University Human Re-
search Ethics Committee (MUHREC Project ID: 27209).
We used a recruiting service provided by MTurk to recruit participants. We received 240 closed-ended and
30 open-ended responses from 30 respondents. Finally,
we manually verified and analyzed the survey responses
to ensure that the responses are of high quality.
[Fig. 15: (RQ6, RQ7) The results of a qualitative survey with practitioners. (a) Perceptions of visualization: 80% of respondents preferred SQAPlanner over the baseline (Code Defect AI, 20%). (b) Perceptions of the actual guidance generated by our SQAPlanner: agreement with statements S1-S7 ranges from 63% to 90%.]
7.2 Results
(RQ6) How do practitioners perceive the visualization
of SQAPlanner when comparing to the visualization
of the state-of-the-art?
Results.80% of our respondents agree that the visu-
alization of our SQAPlanner is better for providing
actionable guidance when compared to the visualiza-
tion of Microsoft’s Code Defect AI. Figure 15a shows
the percentage of the respondents who select which
visualization is best to provide actionable guidance on
how to mitigate the risk of having defects.
After analyzing the open-ended responses, practitioners
(e.g., R10 and R12) provided rationales that the sug-
gested threshold values of each factor and directional
arrows provided by SQAPlanner make the visualization
more clear on what developers should do and should
avoid to decrease the risk of having defects. Respondents
(e.g., R19, R20, and R23) also pointed out that the
summary of "What to do" and "What to avoid" is straight
to the point and helpful.
On the other hand, 20% of the respondents rate the
visualization of Microsoft’s Code Defect AI as better. Re-
spondents (e.g., R5 and R16) provided rationales that the
visualization of Microsoft's Code Defect AI is presented in a simpler and more concise manner (i.e., it only presents the most important factors that are associated with the risk of having defects). Thus, future research should
take into consideration the complexity of the provided
information when designing a novel visualization for AI-
driven defect prediction.
(RQ7) How do practitioners perceive the actual guid-
ance generated by our SQAPlanner?
Results.63%-90% of the respondents agree with the
seven statements derived from the actual guidance gen-
erated by our SQAPlanner. Figure 15b presents the percentage of the respondents who agree with the seven
statements derived from the actual guidance generated
by our SQAPlanner. We find that 90% of the respondents
agree the most with (S1) Decreasing the number of class and
method declaration lines to less than 29 lines to decrease the
risk of being defective.
On the other hand, only 63% of the respondents agree
with (S3) Increasing the ownership code proportion to more
than 0.85 to decrease the risk of being defective. We suspect
that the wide range of agreement rates for our state-
ments has to do with the degree of understandability of
the software metrics, since practitioners may find that
the number of class and method declaration lines for
S1 is more intuitive and easy to understand than the
ownership code proportion for S3. Thus, future research
should take into consideration the degree of understand-
ability of the software metrics when designing a novel
visualization for AI-driven defect prediction.
8 THREATS TO VALIDITY
Construct Validity. Many local model-agnostic tech-
niques could be used to generate many forms of explana-
tions e.g., feature importance and rules. In this paper, we
focused only on rule-based explanations by comparing
with LORE and Anchor, an extension of LIME. We also
studied only a limited number of available classification
techniques. Thus, our results may not be applicable or
generalise to the use of other techniques. Nonetheless,
other classification techniques can be explored in future
work to see if they improve on our results.
Internal Validity. The practicality of rule-based expla-
nations heavily relies on software metrics that are used
to train the models. In this paper, we chose to generate
rules based on 65 well-known and hand-crafted software
metrics, rather than using advanced automated feature
generation like deep learning. Future work may focus on
trying to explain other machine learning-based models,
such as explaining deep learning models used in an SQA
context.
The goal of our SQAPlanner (aka. the local rule-based
model-agnostic technique) is a post-hoc analysis of the
global defect prediction models. That means, SQAPlan-
ner can only explain the behavior of the (global) defect
prediction models, regardless of the correct or incor-
rect predictions. If the predictions of the global defect
models for the testing dataset are incorrect, SQAPlanner
will explain why the global defect prediction models
generate wrong predictions in the form of rule-based
explanations. Therefore, the robustness or the sensitivity
of our SQAPlanner does not depend on the accuracy of
the predictions of the global defect prediction models.
External Validity. We applied our SQAPlanner approach to a limited number of software systems. Thus, our results may not generalize to other datasets, domains, or ecosystems. However, we mitigated this by
choosing a range of different non-trivial, real-world,
open-source software applications. Nonetheless, addi-
tional replication studies in a proprietary setting and
other ecosystems will prove useful to compare to our
results reported here.
SQA planning involves various activities. However,
this paper only focused on helping practitioners to
define development policies and their associated risk
thresholds [12], without considering other activities. In
addition, the dependent variable that we used in this
study only focused on software quality (i.e., defective or
clean), without considering other aspects (e.g., testability,
reusability, robustness, and maintainability). Thus, other
SQA planning activities and other quality attributes can
be explored in future work.
9 RE LATED WORK
In this section, we discuss related work and gaps to
highlight the contributions of our work to the literature.
9.1 Explainable AI in Software Engineering
Despite the advances of AI/ML techniques that are
tightly integrated into software development practices
(e.g., defect prediction [17], automated code review [1,
50], automated code completion [19, 20]), such AI/ML
techniques come with their own limitations. The central
problem of AI/ML techniques is that most AI/ML mod-
els are considered black-box models i.e., we understand
the underlying mathematical principles without explicit
declarative knowledge representation. In other words,
developers do not understand how decisions are made
by such AI/ML techniques. In addition, the current de-
fect modelling practices do not uphold the current data
privacy laws and regulations, which require justifications
of individual predictions for any decisions made by
an AI/ML model. Therefore, applying such black-box
AI/ML techniques in the software development prac-
tices for safety-critical and cyber-physical systems [4, 55]
which involve safety, security, business, personal, or mil-
itary operations is unfavourable and must be avoided.
Explainable AI is essential in software engineering to build appropriate trust (including Fairness, Accountability, and Transparency (FAT)). Developers can
then (1) understand the reasons and the logic behind
every decision and (2) effectively improve the predic-
tion models by understanding any unsound predictions
made by the models. Recently, explainable AI has been
employed in software engineering [44], by making defect
prediction models more practical [52] (i.e., using LIME
to explain which tokens and which lines are likely to be
defective in the future) and explainable [21] (i.e, using
LIME to explain a prediction why a file is predicted as
defective). However, no existing studies are able to provide concrete guidance on what developers should do or should not do to support SQA planning. To the
best of our knowledge, this paper is the first to generate
local rule-based explanations to help QA teams make
data-informed decisions in software quality assurance
planning.
9.2 Towards Explainable and Actionable Analytics
for Software Defects
There are two key approaches for achieving explainabil-
ity in defect prediction models. The first is to make the
entire decision process transparent and comprehensible
(i.e., global explainability). The second is to explicitly
provide an explanation for each individual prediction
(i.e., local explainability).
Examples of global explainability methods are re-
gression models [33, 35], decision trees [57], decision
rules [39], and Fast-and-Frugal trees [6]. These trans-
parent AI/ML techniques often provide built-in model
interpretation techniques to uncover the relationships
between the studied features and defect-proneness. For
example, an ANOVA analysis provided for logistic re-
gression or a variable importance analysis provided for
random forest. However, the insights derived from these
transparent AI/ML techniques do not provide justifica-
tions for each individual prediction.
Model-agnostic techniques are techniques for explic-
itly providing an instance explanation for each decision
of AI/ML models (i.e., local explainability) for a given
testing instance [16]. Formally, given a defect model f
and an instance x, the instance explanation problem
aims to provide an explanation e for the prediction f(x) = y. To do so, we address the problem by building a local interpretable model f′ that mimics the local behaviour of the global defect model f. An explanation of the prediction is then derived from the local interpretable model f′. The local interpretable model focuses
on learning the behaviour of the defect models in the
neighbourhood of the specific instance x, without aiming
at providing a single description of the logic of the
black box for all possible instances. Thus, an explanation
e ∈ E is obtained through f′, if e = ε(f′, x) for some explanation logic ε(·, ·) which reasons over f′ and x.
Two common ways to represent explanations are feature-
importance explanations and rule-based explanations.
Unlike model-specific explanation techniques discussed
above, the great advantage of model-agnostic techniques
is their flexibility. Such model-agnostic techniques (1) can interpret any learning algorithm (e.g., regression, random forest, and neural networks); (2) are not limited to a certain form of explanation (e.g., feature importance or rules); and (3) are able to process any input data (e.g., features, words, and images [36]).
There are a plethora of model-agnostic techniques [16]
for identifying the most important feature at the instance
level. For example, LIME (i.e., Local Interpretable Model-
agnostic Explanations) [37] is a model-agnostic technique
that mimics the behaviour of the black-box model with
a local linear model to generate the explanations of
the predictions. BreakDown [14, 41] is a model-agnostic
technique that uses the greedy strategy to sequentially
measure contributions of metrics towards the expected
prediction. However, none of these techniques can generate explanations with the logic behind them.
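For illustration, the following hedged sketch shows how such a feature-importance explanation can be obtained with LIME for a defect model; the fitted model and the metric DataFrames are assumed inputs, and the output is a list of weighted features rather than if-then rules.

```python
# A hedged illustration (not the authors' code) of a feature-importance
# model-agnostic technique: LIME applied to a trained defect model.
# `model`, `X_train`, and `X_test` are assumed to be a fitted classifier and
# pandas DataFrames of software metrics.
from lime.lime_tabular import LimeTabularExplainer

def lime_explanation(model, X_train, X_test, row=0):
    explainer = LimeTabularExplainer(
        training_data=X_train.values,
        feature_names=list(X_train.columns),
        class_names=["CLEAN", "DEFECT"],
        mode="classification")
    # Returns weighted feature contributions for one prediction, but no
    # if-then rules (i.e., no explicit logic behind the prediction).
    exp = explainer.explain_instance(
        X_test.iloc[row].values, model.predict_proba, num_features=5)
    return exp.as_list()
```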
Despite the advances of model-agnostic techniques in
the Explainable AI communities, such techniques have
not been employed in practical software engineering
contexts. To the best of our knowledge, this paper is the
first to generate local rule-based explanations to help QA
teams make data-informed decisions in software quality
assurance planning.
10 CONCLUSIONS
Defect prediction models have been proposed to gen-
erate insights (e.g., the most important factors that are
associated with software quality). However, such in-
sights derived from traditional defect models are far
from actionable—i.e., practitioners still do not know
what they should do and should avoid to decrease the
risk of having defects, and what is a risk threshold for
each metric. A lack of actionable guidance and its risk
threshold could lead to inefficient and ineffective SQA
planning processes.
In this paper, we investigate practitioners' perceptions of and their challenges with current SQA planning activities
and the perceptions of our proposed four types of guid-
ance. Then, we propose and evaluate our SQAPlanner
approach—i.e., an approach for generating four types of
guidance and its risk threshold in the form of rule-based
explanation for the predictions of defect prediction mod-
els. Finally, we develop and evaluate the visualization of
our SQAPlanner approach.
Through the use of qualitative survey and empirical
evaluation, our results lead us to conclude that SQA-
Planner is needed, important, effective, stable, and ap-
plicable. We also find that 80% of respondents perceived
that our visualization is more actionable. Thus, our
SQAPlanner paves a way for novel research in actionable
software analytics.
Finally, we note that we do not seek to claim the
generalization and causation of our proposed guidance.
Instead, the key message of our study is that our rule-
based guidance can explain the behaviour of the defect models that learn the relationship between software features and defect-proneness from past release data. Thus, they can indicate important relationships in
the data and provide a useful tool to support decision-
and policy-making in software quality assurance. Our
rule-based guidance could be used as a guidance tool
for supporting decision-making so that developers can
(1) understand the reasons and the logic behind every
prediction, and (2) effectively improve the prediction
models by understanding any unsound prediction made
by the models.
ACKNOWLEDGMENTS
C. Tantithamthavorn was partially supported by the
Australian Research Council’s Discovery Early Ca-
reer Researcher Award (ARC DECRA) funding scheme
(DE200100941). C. Bergmeir was partially supported by
the Australian Research Council’s Discovery Early Ca-
reer Researcher Award (ARC DECRA) funding scheme
(DE190100045). J. Grundy was partially supported by
the Australian Research Council’s Laureate Fellowship
funding scheme (FL190100035).
REFERENCES
[1] S. Asthana, R. Kumar, R. Bhagwan, C. Bird, C. Bansal, C. Maddila,
S. Mehta, and B. Ashok, “Whodo: automating reviewer sugges-
tions at scale,” in Proceedings of the 2019 27th ACM Joint Meeting
on European Software Engineering Conference and Symposium on the
Foundations of Software Engineering. ACM, 2019, pp. 937–945.
[2] C. Bird, B. Murphy, and H. Gall, “Don’t Touch My Code !
Examining the Effects of Ownership on Software Quality,” in
Proceedings of the European Conference on Foundations of Software
Engineering (ESEC/FSE), 2011, pp. 4–14.
[3] B. Boehm and V. R. Basili, “Software defect reduction top 10 list,”
Foundations of empirical software engineering: the legacy of Victor R.
Basili, vol. 426, no. 37, pp. 426–431, 2005.
[4] M. Borg, S. Gerasimou, N. Hochgeschwender, and N. Khakpour,
“Explainability for safety and security,” Explainable Software for
Cyber-Physical Systems (ES4CPS), Report from the GI Dagstuhl Sem-
inar 19023, p. 15, 2019.
[5] L. Breiman, A. Cutler, A. Liaw, and M. Wiener, “randomForest
: Breiman and Cutler’s Random Forests for Classification and
Regression. R package version 4.6-12.” Software available at URL:
https://cran.r-project.org/package=randomForest.
[6] D. Chen, W. Fu, R. Krishna, and T. Menzies, “Applications of
Psychological Science for Actionable Analytics,” in Proceedings of
the 2018 26th ACM Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of Software Engineer-
ing. ACM, 2018, pp. 456–467.
[7] C. Y. Chong, P. Thongtanunam, and C. Tantithamthavorn, “As-
sessing the students understanding and their mistakes in code re-
view checklists–an experience report of 1,791 code review check-
lists from 394 students,” in International Conference on Software
Engineering: Joint Software Engineering Education and Training track
(ICSE-JSEET), 2021.
[8] D. A. da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho, and
A. E. Hassan, “A Framework for Evaluating the Results of the SZZ
Approach for Identifying Bug-introducing Changes,” Transactions
on Software Engineering (TSE), vol. 43, no. 7, pp. 641–657, 2017.
[9] M. D’Ambros, M. Lanza, and R. Robbes, “An Extensive Compar-
ison of Bug Prediction Approaches,” in Proceedings of the Interna-
tional Conference on Mining Software Repositories (MSR), 2010, pp.
31–41.
[10] P. Edwards, I. Roberts, M. Clarke, C. DiGuiseppi, S. Pratap,
R. Wentz, and I. Kwan, “"increasing response rates to postal
questionnaires: Systematic review",” Bmj, vol. 324, no. 7347, p.
1183, 2002.
[11] S. Farooqui and W. Mahmood, “A survey of pakistan’s sqa
practices: A comparative study,” in 29th International Business
Information Management Association Conference, 2017.
[12] D. Galin, Software quality: concepts and practice. John Wiley &
Sons, 2018.
[13] B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the Impact
of Classification Techniques on the Performance of Defect Pre-
diction Models,” in Proceedings of the International Conference on
Software Engineering (ICSE), 2015, pp. 789–800.
[14] A. Gosiewska and P. Biecek, “iBreakDown: Uncertainty of Model
Explanations for Non-additive Predictive Models,” arXiv preprint
arXiv:1903.11420, 2019.
[15] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and
F. Giannotti, “Local rule-based explanations of black box decision
systems,” arXiv preprint arXiv:1805.10820, 2018.
[16] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi,
and F. Giannotti, “A Survey Of Methods For Explaining Black
Box Models,” vol. 51, no. 5, pp. 1–45, 2018. [Online]. Available:
http://arxiv.org/abs/1802.01933
[17] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell,
“A Systematic Literature Review on Fault Prediction
Performance in Software Engineering,” Transactions on Software
Engineering (TSE), vol. 38, no. 6, pp. 1276–1304, 2012.
[Online]. Available: http://ieeexplore.ieee.org.pc124152.oulu.fi:
8080/xpls/abs{_}all.jsp?arnumber=6035727
[18] J. A. Hanley and B. J. McNeil, “The meaning and use of the area
under a receiver operating characteristic (ROC) curve,” Radiology,
vol. 143, no. 1, pp. 29–36, Apr. 1982. [Online]. Available:
http://dx.doi.org/10.1148/radiology.143.1.7063747
[19] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, “Deep
learning type inference,” in Proceedings of the 2018 26th ACM Joint
Meeting on European Software Engineering Conference and Symposium
on the Foundations of Software Engineering. ACM, 2018, pp. 152–
162.
[20] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, “When
code completion fails: a case study on real-world completions,”
in Proceedings of the 41st International Conference on Software Engi-
neering. IEEE Press, 2019, pp. 960–970.
[21] J. Jiarpakdee, C. Tantithamthavorn, H. K. Dam, and J. Grundy,
“An empirical study of model-agnostic techniques for defect
prediction models,” 2020.
[22] J. Jiarpakdee, C. Tantithamthavorn, and A. E. Hassan, “The
Impact of Correlated Metrics on Defect Models,” Transactions on
Software Engineering (TSE), p. To Appear, 2019.
[23] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, “AutoSpear-
man: Automatically Mitigating Correlated Software Metrics for
Interpreting Defect Models,” in Proceedings of the International
Conference on Software Maintenance and Evolution (ICSME), 2018,
pp. 92–103.
[24] B. A. Kitchenham and S. L. Pfleeger, “Personal opinion surveys,”
in Guide to Advanced Empirical Software Engineering. Springer,
2008, pp. 63–92.
[25] L. Kocsis and C. Szepesvári, “Bandit based monte-carlo plan-
ning,” in European conference on machine learning. Springer, 2006,
pp. 282–293.
[26] J. A. Krosnick, “Survey research,” Annual Review of Psychology,
vol. 50, no. 1, pp. 537–567, 1999.
[27] S. Kumaresh and R. Baskaran, “Defect analysis and prevention
for software process quality improvement,” International Journal
of Computer Applications, vol. 8, no. 7, pp. 42–47, 2010.
[28] M. S. Litwin, How to Measure Survey Reliability and Validity. Sage,
1995, vol. 7.
[29] B. R. Maxim and M. Kessentini, “An introduction to modern soft-
ware quality assurance,” in Software Quality Assurance. Elsevier,
2016, pp. 19–46.
[30] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, “The Impact
of Code Review Coverage and Code Review Participation on
Software Quality,” in Proceedings of the International Conference on
Mining Software Repositories (MSR), 2014, pp. 192–201.
[31] T. Menzies, J. Greenwald, and A. Frank, “Data Mining Static Code
Attributes to Learn Defect Predictors,” Transactions on Software
Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.
[32] F. Rahman and P. Devanbu, “Ownership, experience and defects:
a fine-grained study of authorship,” in Proceedings of the Interna-
tional Conference on Software Engineering (ICSE), 2011, pp. 491–500.
[33] ——, “How, and Why, Process Metrics are Better,” in Proceedings
of the International Conference on Software Engineering (ICSE), 2013,
pp. 432–441.
[34] D. Rajapaksha, C. Bergmeir, and W. Buntine, “LoRMIkA: Local
rule-based model interpretability with k-optimal associations,”
Information Sciences, vol. 540, pp. 221–241, 2020.
[35] G. K. Rajbahadur, S. Wang, Y. Kamei, and A. E. Hassan, “The
Impact of Using Regression Models to Build Defect Classifiers,”
in Proceedings of the International Conference on Mining Software
Repositories (MSR), 2017, pp. 135–145.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-agnostic Inter-
pretability of Machine Learning,” arXiv preprint arXiv:1606.05386,
2016.
[37] ——, “Why should I trust you?: Explaining the Predictions of
Any Classifier,” in Proceedings of the International Conference on
Knowledge Discovery and Data Mining (KDDM), 2016, pp. 1135–
1144.
[38] ——, “Anchors: High-precision model-agnostic explanations,” in
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[39] D. Rodríguez, R. Ruiz, J. C. Riquelme, and J. S. Aguilar-Ruiz,
“Searching for Rules to Detect Defective Modules: A Subgroup
Discovery Approach,” Information Sciences, vol. 191, pp. 14–30,
2012.
[40] E. Smith, R. Loftin, E. Murphy-Hill, C. Bird, and T. Zimmermann,
“"improving developer participation rates in surveys",” in Proceed-
ings of the International Workshop on Cooperative and Human Aspects
of Software Engineering (CHASE), 2013, pp. 89–92.
[41] M. Staniak and P. Biecek, “Explanations of Model Predictions with
live and breakDown Packages,” arXiv preprint arXiv:1804.01955,
2018.
[42] R. Storn and K. Price, “Differential evolution – a simple
and efficient heuristic for global optimization over continuous
spaces,” Journal of Global Optimization, vol. 11, no. 4, pp. 341–
359, Dec. 1997. [Online]. Available: https://doi.org/10.1023/A:
1008202821328
[43] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, “The
Impact of Class Rebalancing Techniques on The Performance
and Interpretation of Defect Prediction Models,” Transactions on
Software Engineering (TSE), p. To Appear, 2019.
[44] C. Tantithamthavorn, J. Jiarpakdee, and J. Grundy, “Explainable
ai for software engineering,” arXiv preprint arXiv:2012.01614, 2020.
[45] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, A. Ihara, and
K. Matsumoto, “The Impact of Mislabelling on the Performance
and Interpretation of Defect Prediction Models,” in Proceeding of
the International Conference on Software Engineering (ICSE), 2015,
pp. 812–823.
[46] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Mat-
sumoto, “Automated Parameter Optimization of Classification
Techniques for Defect Prediction Models,” in Proceedings of the
International Conference on Software Engineering (ICSE), 2016, pp.
321–332.
[47] ——, “An Empirical Comparison of Model Validation Techniques
for Defect Prediction Models,” Transactions on Software Engineering
(TSE), vol. 43, no. 1, pp. 1–18, 2017.
[48] ——, “The Impact of Automated Parameter Optimization on
Defect Prediction Models,” Transactions on Software Engineering
(TSE), pp. 683–711, 2018.
[49] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revis-
iting Code Ownership and its Relationship with Software Quality
in the Scope of Modern Code Review,” in Proceedings of the
International Conference on Software Engineering (ICSE), 2016, pp.
1039–1050.
[50] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida,
H. Iida, and K.-i. Matsumoto, “Who Should Review My Code?
A File Location-based Code-reviewer Recommendation Approach
for Modern Code Review,” in Proceedings of the International Con-
ference on Software Analysis, Evolution, and Reengineering (SANER),
2015, pp. 141–150.
[51] Z. Wan, X. Xia, A. E. Hassan, D. Lo, J. Yin, and X. Yang,
“Perceptions, expectations, and challenges in defect prediction,”
IEEE Transactions on Software Engineering, 2018.
[52] S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn,
H. Hata, and K. Matsumoto, “Predicting defective lines using a
model-agnostic technique,” 2020.
[53] G. I. Webb, “Opus: An efficient admissible algorithm for un-
ordered search,” Journal of Artificial Intelligence Research, vol. 3,
pp. 431–465, 1995.
[54] G. I. Webb and S. Zhang, “K-optimal rule discovery,” Data Mining
and Knowledge Discovery, vol. 10, no. 1, pp. 39–79, 2005.
[55] Y. Yang, D. Falessi, T. Menzies, and J. Hihn, “Actionable analytics
for software engineering,” IEEE Software, vol. 35, no. 1, pp. 51–53,
2017.
[56] S. Yathish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamtha-
vorn, “Mining Software Defects: Should We Consider Affected
Releases?” in Proceedings of the International Conference on Soft-
ware Engineering (ICSE), 2019, p. To Appear.
[57] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy,
“Cross-project Defect Prediction,” in Proceedings of the European
Software Engineering Conference and the Symposium on the Founda-
tions of Software Engineering (ESEC/FSE), 2009, pp. 91–100.
Dilini Rajapaksha received the BSc(hons) de-
gree from Sri Lanka Institute of Information
Technology (SLIIT). She is currently a Ph.D.
candidate at Monash University, Australia. Her
research interests include Machine Learning
and Time-series Forecasting. The goal of her
Ph.D. is to provide local explanations for the predictions given by
time-series forecasting and machine learning models.
Chakkrit Tantithamthavorn is a Lecturer in
Software Engineering and a 2020 ARC DECRA
Fellow in the Faculty of Information Technol-
ogy, Monash University, Australia. His current
fellowship is focusing on the development of
“Practical and Explainable Analytics to Prevent
Future Software Defects”. His work has been
published at several top-tier software engineer-
ing venues, such as the IEEE Transactions on
Software Engineering (TSE), the Springer Jour-
nal of Empirical Software Engineering (EMSE)
and the International Conference on Software Engineering (ICSE). More
about Chakkrit and his work is available online at http://chakkrit.com.
Jirayus Jiarpakdee is a Ph.D. candidate at
Monash University, Australia. His research inter-
ests include empirical software engineering and
mining software repositories (MSR). The goal of
his Ph.D. is to apply the knowledge of statistical
modelling, experimental design, and software
engineering to improve the explainability of de-
fect prediction models.
Christoph Bergmeir is a Lecturer in Data Sci-
ence and Artificial Intelligence, and a 2019 ARC
DECRA Fellow in the Monash Faculty of In-
formation Technology. His fellowship is on the
development of "efficient and effective analyt-
ics for real-world time series forecasting". He
also works as a Data Scientist in a variety of
projects with external partners in diverse sec-
tors, e.g. in healthcare or infrastructure main-
tenance. Christoph holds a PhD in Computer
Science from the University of Granada, Spain,
and an M.Sc. degree in Computer Science from the University of Ulm,
Germany.
John Grundy is an Australian Laureate Fellow and
Professor of Software Engineering at Monash
University, Australia. He has published widely
in automated software engineering, domain-
specific visual languages, model-driven engi-
neering, software architecture, and empirical
software engineering, among many other areas.
He is a Fellow of Automated Software Engineering
and a Fellow of Engineers Australia.
Wray Buntine is a Professor in Machine Learn-
ing in the Faculty of Information Technology at
Monash University in Melbourne, Australia, and
a recipient of the 2019 Google AI Impact Chal-
lenge. He is known for his theoretical and applied
work in probabilistic methods for document and
text analysis, social networks, data mining, and
machine learning. His work is supported by
many high-profile organisations such as Google
and the NASA Ames Research Centre, as well
as Wall Street and Silicon Valley startups.