Evaluating the effectiveness of local explanation methods on
source code-based defect prediction models
Yuxiang Gao
School of Computer Science
Jiangsu Normal University
Xuzhou, Jiangsu, China
gaoyx@jsnu.edu.cn
Yi Zhu*
School of Computer Science
Jiangsu Normal University
Xuzhou, Jiangsu, China
zhuy@jsnu.edu.cn
Qiao Yu
School of Computer Science
Jiangsu Normal University
Xuzhou, Jiangsu, China
yuqiao@jsnu.edu.cn
ABSTRACT
Interpretation has been considered one of the key factors for applying defect prediction in practice. As one way to interpret models, local explanation methods have been widely used to explain individual predictions on datasets of traditional features. There have also been attempts to use local explanation methods on source code-based defect prediction models, but unfortunately with poor results. Since it is unclear how effective those local explanation methods are, we evaluate such methods with automatic metrics that focus on local faithfulness and explanation precision. Based on the results of our experiments, we find that the effectiveness of local explanation methods depends on the adopted defect prediction models. They are effective on token frequency-based models, while they may not be effective enough to explain all predictions of deep learning-based models. Besides, we also find that the hyperparameter of local explanation methods should be carefully optimized to get more precise and meaningful explanations.
CCS CONCEPTS
Software and its engineering → Maintaining software.
KEYWORDS
Software Defect Prediction, Local Explanation, Explainable
Machine Learning, LIME
1 INTRODUCTION
Interpretation of defect prediction models has been considered one of the key factors for applying defect prediction in practice [1, 2]. As one important way to explain defect prediction models, local explanation methods have been adopted in defect prediction studies to explain why a certain prediction is buggy [3-7]. However, many of them focus on defect prediction models built on traditional features (i.e., handcrafted features), whereas an explanation that can point to defects in source code may be more useful in practice. Motivated by this, Wattanakriengkrai et al. [6] attempt to use LIME to explain token frequency-based defect prediction models and predict defective lines according to the key tokens provided by LIME. Unfortunately, by manually investigating the relationship between local explanations and actual defects, Aleithan [7] finds that the local explanations of defect prediction models built on token frequency features poorly represent the ground truth (i.e., the key tokens are not related to actual defects) in the scenario of Just-In-Time defect prediction. Ineffective local explanation methods may lead to such poor results, and it is still unclear how effective these local explanation techniques are when applied to source code-based defect prediction models.
Thus, in this paper, we evaluate the effectiveness of local explanation methods when they are applied to source code-based defect prediction models. In detail, we first apply several popular local explanation methods to different file-level source code-based defect prediction models and then introduce automatic metrics that focus on local faithfulness and explanation precision to evaluate the effectiveness of such methods. We find that those local explanation methods are much more effective than random guessing in general, and they are effective on token frequency-based defect prediction models. However, when they are applied to deep learning-based models (e.g., CNN and DBN), they may not be as effective as expected. We observe that though the explanations of those methods on the CNN model are precise (i.e., the explanation itself can be predicted correctly), those explanations cannot characterize all the factors contributing to the prediction of deep learning-based models (i.e., the prediction will not be shifted when the code tokens in the explanations are removed). Besides, we also find that the hyperparameter k (i.e., the number of features used for explanation) of local explanation methods heavily impacts the effectiveness.
Based on the above findings, we suggest:
(1) It is feasible to apply explanation methods based on individual code tokens (such as LIME and Word Omission) to bag of words-based defect prediction models. However, when applying such local explanation methods to deep learning-based defect prediction models (such as DBN and CNN), it is necessary to evaluate the effectiveness of the local explanation methods first instead of using them directly.
(2) Hyperparameter k should be carefully optimized to get meaningful explanations.
We also share our implementation, which includes source code, data, and simple instructions, at https://github.com/gyx1997/msr22-defects.
The rest of this paper is organized as follows. Section 2 gives
the background of source code-based defect prediction and local
explanation methods. Section 3 presents the design of our case
study. Section 4 gives the results of our experiments and brief discussions about our findings. We discuss threats to validity in
Section 5. In Section 6 we present our conclusion.
2 BACKGROUND
2.1 Source Code-based Defect Prediction
Figure 1. Framework for constructing deep semantic defect
prediction models
Source code carries structural and semantic information through its AST [8, 9]. To automatically capture the semantic features of programs, several studies use deep learning-based approaches to construct defect prediction models directly from source code based on the AST with various deep learning techniques [10-14]. They share the common framework proposed by [10], which is described in Figure 1. First, source files are parsed into abstract syntax trees (ASTs). Next, the trees are traversed to get sequences of AST nodes that belong to specific types (e.g., control flow nodes, method invocation nodes, etc.). Each node is represented by a string code token (e.g., method names for method invocation nodes and the node type for control flow nodes), and each unique code token is assigned an integer identifier to transform the sequences of code tokens into integer vectors. Finally, those vectors are fed into deep neural networks for semantic feature extraction and model construction. Various deep learning techniques have been adopted. For example, Wang et al. first used a deep belief network (DBN) to extract semantic features from the code token sequences of AST nodes [10]. Based on their proposed framework, more advanced techniques such as CNN [11], LSTM with attention mechanism [12], gated LSTM [13] and novel word embedding techniques [14] have been used to improve performance.
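For illustration, here is a minimal sketch of the token-to-identifier encoding step described above; the vocabulary policy (padding, out-of-vocabulary handling) is an assumption, as the cited papers differ in such details.

```python
# Sketch: map code tokens to integer identifiers (0 reserved for padding).
# The vocabulary policy is an assumption; the cited papers differ in details.
def encode(token_sequences):
    vocab = {}
    encoded = []
    for seq in token_sequences:
        ids = []
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
            ids.append(vocab[tok])
        encoded.append(ids)
    return encoded, vocab

seqs = [["IfStatement", "readLine"], ["ForStatement", "readLine"]]
vectors, vocab = encode(seqs)
print(vectors)  # [[1, 2], [3, 2]]
```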
Besides the deep learning-based approaches, token frequency-based approaches, which use Bag of Words (BoW) feature vectors (vectors of frequencies for each code token) to represent source code, have also been used by prior studies since they are simple and computationally efficient [6, 15]. As Figure 2 shows, for token frequency-based approaches, irrelevant characters such as semicolons and comments are removed first. Then, vectors of token frequencies are built by counting the code tokens. Finally, models are trained on those vectors.
Figure 2. Framework for constructing token frequency-based defect prediction models
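As a rough illustration of this pipeline, the following sketch builds token frequency vectors with scikit-learn; the token strings and the classifier choice are hypothetical, since the exact implementations differ across studies.

```python
# Sketch: Bag-of-Words (token frequency) features with scikit-learn.
# The token strings and classifier are illustrative; the original studies'
# exact preprocessing may differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

token_docs = [
    "if readLine while close",   # code tokens of file 1 (hypothetical)
    "for add get get",           # code tokens of file 2 (hypothetical)
]
labels = [1, 0]  # 1 = defective, 0 = clean

vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
bow = vectorizer.fit_transform(token_docs)  # frequency vector per file

model = LogisticRegression().fit(bow, labels)
```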
2.2 Local Explanation Methods
As one way for interpreting black-box machine learning models,
local explanation methods are post-hoc methods which explain
why a certain prediction is made [16]. For a certain instance $x$ in the test set $D_{test}$, a local explanation method can generate a corresponding local explanation $e_l = \varepsilon(\mathrm{pred}, x)$, where $\varepsilon$ is the explanation function and $\mathrm{pred}$ is the prediction function of the black-box model. Compared with other explanation methods, a local explanation method can interpret any black-box model since only the prediction function is required. It can also deal with any input data with flexible forms of representation. For example, the local explanation of a black-box model built on tabular data will be a set of features with contributions to the prediction, while explanations of image or text models are usually parts of the input (i.e., super-pixels for images, and words or phrases for texts) that mainly contribute to the prediction.
Recently, for tabular data with traditional features, local explanation methods such as LIME have been adopted and well evaluated on defect prediction models [3]. Further, rule-based methods that are more effective and actionable (i.e., providing clearer guidance for practitioners) have been proposed for file-level defect prediction [4] and just-in-time defect prediction [5]. However, those studies only focus on tabular data with traditional features. To the best of our knowledge, though LIME has been adopted on token frequency-based models to predict defective lines [6] and to explain just-in-time models [7], the effectiveness of local explanation methods has not been evaluated when they are applied to source code-based defect prediction models. This unclear effectiveness may result in pitfalls when applying them in practice, and thus we evaluate their effectiveness in our case study.
3 CASE STUDY DESIGN
The overall framework of our case study is shown in Figure 3. Following the practice of prior studies, we first select projects with consecutive versions from the PROMISE repository and clone the source code from GitHub. Then we construct source code-based defect prediction models on the old version of each project. Next, we use local explanation methods to explain each file that is predicted defective in the new version and calculate the automatic metrics. Finally, averaged results are reported after repeating these steps 5 times.
Figure 3. Framework of the case study.
3.1 Dataset
In this study, we select projects in the PROMISE repository [17] which satisfy: (1) the project has at least 2 consecutive versions; (2) the defect labels and source code of the corresponding versions are available; and (3) the training data are nearly balanced, since imbalanced data has a great negative impact on interpretation [18]. As a result, we pick lucene, poi and xalan and clone the source code of the corresponding versions from GitHub. Details are shown in Table 1.
Table 1. Studied Projects

Project   Old version   New version   % Defective
Lucene    2.0           2.2           46.67
Poi       1.5           2.0           59.49
Xalan     2.5           2.6           48.19
3.2 Defect Prediction Models
Figure 4. Framework of adopted source code-based defect prediction models
We adopt DBN [10] and CNN [11] as deep learning-based models and TF [6] as the token frequency-based defect prediction model. The framework of the adopted models is shown in Figure 4. We first parse Java source files into ASTs with the Python package javalang, and then we traverse the parsed ASTs to get sequences of code tokens. For deep learning-based models, we map each code token to a unique integer identifier to obtain vectors of code tokens that use integer identifiers to represent the sequences. For token frequency-based models, we count the frequencies of code tokens to obtain vectors based on bag of words features. Finally, we train the defect prediction models.
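The following sketch illustrates the AST parsing and traversal step with javalang; the selected node types are an assumption based on the description in Section 2.1, not necessarily the exact node set used by the original models.

```python
# Sketch: parse a Java file with javalang and collect code tokens from
# method-invocation and control-flow nodes. The selected node set is an
# assumption, not the papers' exact configuration.
import javalang

def extract_tokens(source):
    tokens = []
    for _, node in javalang.parse.parse(source):
        if isinstance(node, javalang.tree.MethodInvocation):
            tokens.append(node.member)          # method name as token
        elif isinstance(node, (javalang.tree.IfStatement,
                               javalang.tree.WhileStatement,
                               javalang.tree.ForStatement)):
            tokens.append(type(node).__name__)  # node type as token
    return tokens

src = "class A { void m() { if (x > 0) { System.out.println(x); } } }"
print(extract_tokens(src))  # ['IfStatement', 'println']
```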
3.3 Local Explanation Methods
In Section 3.2 we get a sequence of code tokens for a certain file, which can be thought of as a sentence in text classification. Thus, we adopt several local explanation techniques from the NLP field.
LIME. LIME [19] is a model-agnostic approach that trains a local regression model on perturbed samples around the specific instance and uses the coefficients of the regression model as the interpretation. In this study, it first generates perturbed samples by randomly removing code tokens from the given sample. Then, the labels of the perturbed samples are generated by the black-box model. Finally, it trains the local regression model on those perturbed samples based on bag of words features. Since the number of perturbed samples highly impacts the explanation [20], we set the number of perturbed samples to 5,000 in all experiments for LIME.
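As an illustration, a LIME explanation can be obtained through the lime package's text explainer roughly as follows; predict_proba is a hypothetical wrapper around the trained black-box model.

```python
# Sketch: applying LIME's text explainer to a token-sequence classifier.
# predict_proba is a hypothetical wrapper that maps a list of token strings
# to an (n, 2) array of [clean, defective] probabilities.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(token_docs):
    # Placeholder for the trained black-box model's prediction function.
    return np.array([[0.3, 0.7]] * len(token_docs))

explainer = LimeTextExplainer(class_names=["clean", "defective"],
                              split_expression=r"\s+")
exp = explainer.explain_instance("IfStatement readLine WhileStatement close",
                                 predict_proba,
                                 num_features=5,      # the hyperparameter k
                                 num_samples=5000)    # as set in our study
print(exp.as_list())  # [(code_token, weight), ...]
```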
Word Omission. Word Omission [21] aims to estimate the contribution of individual words, measuring the effect of a certain word on a black-box model. In this study, we first get the set of unique tokens appearing in the given sample. Then, we remove all appearances of each unique code token from the given sample and calculate the difference in prediction probability $\hat{p}(\hat{y} \mid s) - \hat{p}(\hat{y} \mid s_{\setminus t})$ as the importance score for each code token. Here $\hat{y}$ is the predicted class, and $s$, $s_{\setminus t}$ are the sequences of code tokens with and without the specific code token $t$, respectively. Finally, we output the k code tokens with the highest scores as the explanation, i.e., the code tokens which contribute most to the prediction.
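A minimal sketch of this scoring procedure, assuming predict_proba is a hypothetical helper returning the probability of the predicted (defective) class for a token sequence:

```python
# Sketch: Word Omission scoring. predict_proba(s) is a hypothetical helper
# returning the probability of the predicted (defective) class for the
# token sequence s.
def word_omission(tokens, predict_proba, k):
    base = predict_proba(tokens)
    scores = {}
    for t in set(tokens):
        reduced = [x for x in tokens if x != t]  # drop every occurrence of t
        scores[t] = base - predict_proba(reduced)
    # the k tokens whose removal lowers the probability the most
    return sorted(scores, key=scores.get, reverse=True)[:k]
```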
We also adopt Random Guessing, i.e., randomly choosing k unique code tokens as the most important features, as a baseline for comparison.
3.4 Automatic Metrics
We adopt automatic metrics to evaluate the local explanation techniques from the perspectives of local faithfulness and explanation precision. Following the practice in [20], we assume the gold labels of instances in the test project are unknown and examine the instances by their predicted labels. For simplicity, we only investigate instances which are predicted to be defective, since the non-defective instances are usually trivial.
3.4.1 Local Faithfulness
Local faithfulness is one of the key criteria for local explanation techniques [20, 22]. Thus, we mainly consider automatic metrics which evaluate local faithfulness in this study. In detail, we evaluate the local faithfulness of the adopted local explanation methods with the deletion-based metrics AOPC and Percentage of Decision Flip.
Area over the perturbation curve (AOPC) estimates the average probability change when important features are removed [23], and it has been widely used to evaluate the quality of local explanations on text classification models [20, 22]. In our case study, it is defined as

$$AOPC(k) = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{p}(\hat{y} \mid s_i) - \hat{p}(\hat{y} \mid s_i^{\setminus k}) \right], \quad (1)$$

where $N$ is the number of investigated instances, $\hat{p}(\hat{y})$ is the probability of the predicted class (defective in this study), and $s_i$, $s_i^{\setminus k}$ are the token sequences with and without the top-k features (i.e., the explanation), respectively. A larger AOPC indicates better local faithfulness.
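A sketch of how AOPC can be computed under this definition; explain and predict_proba are hypothetical helpers standing in for the explanation method and the black-box model:

```python
# Sketch: AOPC as defined in Eq. (1). explain(s, k) returns the top-k code
# tokens for sequence s, and predict_proba(s) returns the probability of
# the predicted (defective) class; both are hypothetical helpers.
def aopc(sequences, explain, predict_proba, k):
    drops = []
    for s in sequences:
        top_k = set(explain(s, k))
        s_without = [t for t in s if t not in top_k]  # remove explanation tokens
        drops.append(predict_proba(s) - predict_proba(s_without))
    return sum(drops) / len(drops)
```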
Motivated by Decision Flip-Most Informative Token [24], we also propose a new metric, percentage of decision flip (PDF), to estimate the extent of prediction shift when the features in the explanation are removed. It is the percentage of investigated instances whose prediction is shifted when the k features from the explanation are removed, which is defined as

$$PDF(k) = \frac{1}{N} \sum_{i=1}^{N} \left[ I(s_i) - I(s_i^{\setminus k}) \right]. \quad (2)$$

Here $I(\cdot)$ is the prediction function; $I(s) = 1$ indicates that the token sequence $s$ is predicted to be defective, while $I(s) = 0$ means that $s$ is predicted to be clean. A larger PDF means the explanation better reflects the features (code tokens in this study) which contribute to the prediction.
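Analogously, a sketch of PDF, where predict is a hypothetical 0/1 prediction function:

```python
# Sketch: percentage of decision flip (PDF) as defined in Eq. (2).
# predict(s) is a hypothetical 0/1 prediction function (1 = defective);
# all investigated sequences are predicted defective, so each term is 1
# exactly when removing the explanation flips the prediction.
def pdf(sequences, explain, predict, k):
    flips = 0
    for s in sequences:
        top_k = set(explain(s, k))
        s_without = [t for t in s if t not in top_k]
        flips += predict(s) - predict(s_without)
    return flips / len(sequences)
```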
3.4.2 Explanation Precision
Intuitively, an optimal explanation should be able to account for the certain prediction independently, without any other features. In other words, the subset of the code token sequence picked by the explanation should also result in the same prediction. Inspired by the evaluation of rationales in text classification tasks [25], we evaluate the precision of explanations by calculating the percentage of explanations which result in the same prediction, named explanation precision (EP), which is defined as follows.

For the i-th sequence of code tokens $s_i = (x_1, x_2, \ldots, x_n)$ and its explanation (with k features) $Expl(s_i) = \{t_1, t_2, \ldots, t_k\}$, we replace all $x_j \notin Expl(s_i)$ with a special padding token $t_{pad}$ and get a new sequence $s_{i,Expl}$ whose elements are either $t_{pad}$ or in $Expl(s_i)$. Then, we feed $s_{i,Expl}$ into the black-box model and get the prediction $I(s_{i,Expl})$. The EP over all sequences is defined as

$$EP(k) = \frac{1}{N} \sum_{i=1}^{N} I(s_{i,Expl}). \quad (3)$$

A larger EP indicates that the explanations are more likely to be classified as defective by the black-box model.
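A sketch of EP under the masking scheme above; the padding token value and the helpers are assumptions:

```python
# Sketch: explanation precision (EP) as defined in Eq. (3). Tokens outside
# the explanation are replaced by a padding token; the padding value and
# the helpers are assumptions.
PAD = "<pad>"

def ep(sequences, explain, predict, k):
    same = 0
    for s in sequences:
        expl = set(explain(s, k))
        masked = [t if t in expl else PAD for t in s]  # keep only explanation
        same += predict(masked)  # 1 iff still predicted defective
    return same / len(sequences)
```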
4 RESULTS AND DISCUSSION
Figure 5. AOPC of different local explanation methods on
different defect prediction models.
Figure 6. PDF of different local explanation methods on
different defect prediction models.
Figure 7. EP of different local explanation methods on
different defect prediction models.
We present the results of our experiments in Figures 5-7. The x-axis is the number of features (i.e., the hyperparameter k) used for the explanation.
We find that the local explanation methods are usually more effective than random guessing. In most situations (except project xalan with the DBN-based model on all metrics, and poi with the DBN-based model on explanation precision), both LIME and Word Omission achieve better local faithfulness and explanation precision than random guessing.
We also find that the effectiveness of such methods depends on the adopted models. The results show that those methods are both locally faithful and precise for token frequency-based defect prediction models (TF in Figures 5-7). On the other hand, they may not be as effective as expected when applied to deep learning-based defect prediction models (DBN and CNN in Figures 5-7). In particular, all investigated local explanation methods perform poorly on the DBN model. Since deep semantic models may consider the interactions between code tokens, explanation methods based on individual tokens may not be able to capture such interactions. Thus, one should be careful when using those methods on deep semantic models, since the explanation may not always actually reflect the prediction, and it is preferable to avoid using explanations which do not cause a shift of the prediction.
Further, we find that the investigated local explanation methods are sensitive to the hyperparameter k, especially regarding local faithfulness. Generally, as Figure 5 and Figure 6 show, the explanation is more locally faithful to the prediction of the black-box model when more features are used for the explanation. Figure 7 also shows that for the CNN-based defect prediction model, the explanation becomes more precise as k increases. However, we notice that for some source code files, the corresponding token sequences contain only a few unique code tokens (which can be thought of as features here). If the hyperparameter k is too large, all the tokens will be used as the explanation, and such an explanation is meaningless. Thus, the hyperparameter k should be carefully tuned to get more precise and meaningful explanations.
5 THREATS TO VALIDITY
Here we give brief discussions of the threats to validity.
Dataset. We only study a limited number of open-source projects. This may introduce bias into our conclusions.
Defect Prediction Models. We only adopt DBN [10] and CNN [11] as deep learning-based models in this study. More deep learning approaches which use advanced techniques [12-14] are not evaluated. Moreover, source code-based defect prediction models are not limited to sequences of code tokens. For example, the ASTs of source code can be processed with Graph Neural Networks [26], and network embedding techniques can also be used on Class Dependency Networks extracted from source code [27]. Besides, AST n-grams from source code may also be captured as features for source code-based defect prediction [28]. Further, Just-In-Time defect prediction studies may also use both source code and commit messages (i.e., mixed input of both programming language and natural language) [29] and code changes [30] to train deep learning models. Thus, our conclusions may not generalize to all source code-based defect prediction models.
Local Explanation Methods. We only consider LIME [19] and Word Omission [21] in this study. More popular methods, such as BreakDown [31], and other methods which consider the interactions between words are not evaluated. The effectiveness of those methods is still unclear and needs to be explored in the future.
Implementation. No implementations of these defect prediction models are publicly available. We cannot ensure that our replication is identical to the originals even though we replicated them carefully.
6 CONCLUSIONS
In this study, we evaluate the local explanation methods LIME and Word Omission with automatic metrics that focus on local faithfulness and explanation precision on several popular source code-based defect prediction models. We find that those local explanation methods are much more effective than random guessing. We also find that their effectiveness depends on the adopted defect prediction models. They perform well on token frequency-based defect prediction models; however, their effectiveness on deep learning-based models such as DBN and CNN is not as good as expected. Their explanations cannot characterize all the factors contributing to the prediction, and thus they may not be capable of explaining all predictions of deep learning-based models due to poor local faithfulness. Besides, we also find that the hyperparameter k of local explanation methods heavily impacts their effectiveness.
Based on these findings, we suggest:
(1) It is feasible to apply local explanation methods based on individual code tokens, such as LIME and Word Omission, to token frequency-based models. However, when applying such local explanation methods to deep learning-based defect prediction models (such as DBN and CNN), it is necessary to evaluate the effectiveness of those local explanation methods first instead of using them directly.
(2) The hyperparameter k of local explanation methods should be carefully optimized to get meaningful explanations.
ACKNOWLEDGMENTS
This work was partly supported by National Natural Science
Foundation of China (No. 62077029), Future Network Scientific
Research Fund Project (No. FNSRFP-2021-YB-32), and the
Graduate Science Research Innovation Program of Jiangsu
Normal University (No. 2021XKT1394).
REFERENCES
[1] Zhiyuan Wan, Xin Xia, David Lo, Jianwei Yin, and Xiaohu Yang. 2020.
Perceptions, Expectations, and Challenges in Defect Prediction. IEEE
Transactions on Software Engineering, (Nov. 2020).
[2] Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, and John Grundy. 2021.
Practitioners’ Perceptions of the Goals and Visual Explanations of Defect
Prediction Models. In Proceedings of 2021 IEEE/ACM 18th International
Conference on Mining Software Repositories (MSR'21). IEEE, 432–443.
[3] Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, Hoa Khanh Dam, and John
Grundy. 2020. An Empirical Study of Model-Agnostic Techniques for Defect
Prediction Models. IEEE Transactions on Software Engineering, (Mar. 2020).
[4] Dilini Rajapaksha, Chakkrit Tantithamthavorn, Christoph Bergmeir, Wray
Buntine, Jirayus Jiarpakdee and John Grundy. 2021. SQAPlanner:
Generating Data-Informed Software Quality Improvement Plans. IEEE
Transactions on Software Engineering, (Apr. 2021).
[5] Chanathip Pornprasit, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee,
Michael Fu and Patanamon Thongtanunam. 2021. PyExplainer: Explaining
the Predictions of Just-In-Time Defect Models. In Proceedings of the 36th
International Conference on Automated Software Engineering (ASE'21). IEEE,
407–418.
[6] Supatsara Wattanakriengkrai, Patanamon Thongtanunam, Chakkrit
Tantithamthavorn, Hideaki Hata and Kenichi Matsumoto. 2020. Predicting
defective lines using a model-agnostic technique. CoRR.
https://arxiv.org/abs/2009.03612.
[7] Reem Aleithan. 2021. Explainable Just-In-Time Bug Prediction: Are We
There Yet? In Proceedings of 2021 IEEE/ACM 43rd International Conference on
Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 129–
131.
[8] Tung Thanh Nguyen, Hoan Anh Nguyen, Nam H. Pham, Jafar M. Al-Kofahi,
and Tien N. Nguyen. 2009. Graph-based mining of multiple object usage
patterns. In Proceedings of the 7th joint meeting of the European Software
Engineering Conference and the ACM SIGSOFT symposium on the Foundations
of Software Engineering (ESEC/FSE'09). Association for Computing
Machinery, New York, NY, USA, 383–392.
[9] Anh Tuan Nguyen and Tien N. Nguyen. 2015. Graph-based statistical
language model for code. In Proceedings of the 37th International Conference
on Software Engineering (ICSE'15). IEEE, 858–868.
[10] Song Wang, Taiyue Liu and Lin Tan. 2016. Automatically learning semantic
features for defect prediction. In Proceedings of the 38th International
Conference on Software Engineering (ICSE'16). Association for Computing
Machinery, New York, NY, USA, 297–308.
[11] Jian Li, Pinjia He, Jieming Zhu and Michael R. Lyu. 2017. Software Defect
Prediction via Convolutional Neural Network. In Proceedings of 2017 IEEE
International Conference on Software Quality, Reliability, and Security
(QRS'17). IEEE, 318–328.
[12] Guisheng Fan, Xuyang Diao, Huiqun Yu, Kang Yang and Liqiong Chen. 2019.
Deep Semantic Feature Learning with Embedded Static Metrics for Software
Defect Prediction. In Proceedings of the 26th Asia-Pacific Software
Engineering Conference (APSEC'19). IEEE, 244–251.
[13] Hao Wang, Weiyuan Zhuang, and Xiaofang Zhang. 2021. Software Defect
Prediction Based on Gated Hierarchical LSTMs. IEEE Transactions on
Reliability, 70, 2, (June, 2021), 711–727.
[14] Hao Li, Xiaohong Li, Xiang Chen, Xiaofei Xie, Yanzhou Mu and Zhiyong
Feng. 2019. Cross-project Defect Prediction via ASTToken2Vec and BLSTM-
based Neural Network. In Proceedings of 2019 International Joint Conference
on Neural Networks (IJCNN'19). IEEE, 1–8.
[15] Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2021. JITLine: A Simpler,
Better, Faster, Finer-grained Just-In-Time Defect Prediction. In
Proceedings of 2021 IEEE/ACM 18th International Conference on Mining
Software Repositories (MSR'21). IEEE, 369–379.
[16] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca
Giannotti, and Dino Pedreschi. 2018. A Survey of Methods for Explaining
Black Box Models. ACM Comput. Surv. 51, 5, Article 93 (Aug. 2018).
[17] Marian Jureczko and Lech Madeyski. 2010. Towards identifying software
project clusters with regard to defect prediction. In Proceedings of the 6th
International Conference on Predictive Models in Software Engineering
(PROMISE '10). Association for Computing Machinery, New York, NY, USA,
Article 9, 1–10.
[18] Chakkrit Tantithamthavorn, Ahmed E. Hassan, and Kenichi Matsumoto.
2020. The impact of class rebalancing techniques on the performance and
interpretation of defect prediction models. IEEE Transactions on Software
Engineering, 46, 11, (Nov. 2020), 1200–1219.
[19] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should
I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD '16). Association for Computing Machinery, New York,
NY, USA, 1135–1144.
[20] Dong Nguyen. 2018. Comparing Automatic and Human Evaluation of Local
Explanations for Text Classification. In Proceedings of NAACL-HLT 2018.
Association for Computational Linguistics, 1069–1078.
[21] Marko Robnik-Sikonja and Igor Kononenko. 2008. Explaining classifications
for individual instances. IEEE Transactions on Knowledge and Data
Engineering, 20, 5, (Mar, 2008), 589–600.
[22] Hanjie Chen, Guangtao Zhang, and Yanfeng Ji. 2020. Generating
hierarchical explanations on text classification via feature interaction
detection. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics (ACL'20). Association for Computational
Linguistics, 5578–5593.
[23] Wojciech Samek, Alexander Binder, Gregoire Montavon, Sebastian
Lapuschkin, and Klaus-Robert Muller. 2017. Evaluating the visualization of
what a deep neural network has learned. IEEE Transactions on Neural
Networks and Learning Systems, 28, 11, (Nov. 2017), 2660–2673.
[24] George Chrysostomou, and Nikolaos Aletras. 2021. Improving the
faithfulness of attention-based explanations with task-specific information
for text classification. In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics (ACL'21). Association for
Computational Linguistics, 477–488.
[25] Tao Lei, Regina Barzilay, Tommi Jaakkola. 2016. Rationalizing Neural
Predictions. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing (EMNLP'16). Association for Computational
Linguistics, USA, 107–117.
[26] Jiaxi Xu, Fei Wang, and Jun Ai. 2021. Defect Prediction with Semantics and
Context Features of Codes Based on Graph Representation Learning. IEEE
Transactions on Reliability, 70, 2, (June, 2021), 613–625.
[27] Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua
Zheng. 2018. node2defect: Using Network Embedding to Improve Software
Defect Prediction. In Proceedings of the 2018 33rd ACM/IEEE International
Conference on Automated Software Engineering (ASE '18). ACM, New York,
NY, USA, 844–849.
[28] Thomas Shippey, David Bowes, and Tracy Hall. 2019. Automatically
identifying code features for software defect prediction: Using AST N-grams.
Information and Software Technology.
[29] Thong Hoang, Hoa Khanh Dam, Yasutaka Kamei, David Lo, and Naoyasu
Ubayashi. 2019. DeepJIT: An End-To-End Deep Learning Framework for
Just-In-Time Defect Prediction. In Proceedings of 2019 IEEE/ACM 16th
International Conference on Mining Software Repositories (MSR'19). IEEE, 34–
45.
[30] Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. CC2Vec:
Distributed Representations of Code Changes. In Proceedings of 2020
IEEE/ACM 42nd International Conference on Software Engineering (ICSE'20).
Association for Computing Machinery, New York, NY, USA, 518–529.
[31] Alicja Gosiewska and Przemyslaw Biecek. 2020. Do Not Trust Additive
Explanations. CoRR. https://arxiv.org/abs/1903.11420.