Evaluating the effectiveness of local explanation methods on
source code-based defect prediction models
Yuxiang Gao
School of Computer Science
Jiangsu Normal University
Xuzhou, Jiangsu, China
gaoyx@jsnu.edu.cn
Yi Zhu*
School of Computer Science
Jiangsu Normal University
Xuzhou, Jiangsu, China
zhuy@jsnu.edu.cn
Qiao Yu
School of Computer Science
Jiangsu Normal University
Xuzhou, Jiangsu, China
yuqiao@jsnu.edu.cn
ABSTRACT
Interpretation has been considered one of the key factors for applying defect prediction in practice. As one form of interpretation, local explanation methods have been widely used to explain individual predictions on datasets of traditional features. There have also been attempts to apply local explanation methods to source code-based defect prediction models, but unfortunately, they yield poor results. Since it is unclear how effective these local explanation methods are, we evaluate them with automatic metrics that focus on local faithfulness and explanation precision. Based on our experimental results, we find that the effectiveness of local explanation methods depends on the adopted defect prediction models. They are effective on token frequency-based models, while they may not be effective enough to explain all predictions of deep learning-based models. Besides, we also find that the hyperparameter of local explanation methods should be carefully optimized to obtain more precise and meaningful explanations.
CCS CONCEPTS
• Software and its engineering → Maintaining software.
KEYWORDS
Software Defect Prediction, Local Explanation, Explainable
Machine Learning, LIME
1 INTRODUCTION
Interpretation of defect prediction models has been considered one of the key factors for applying defect prediction in practice [1, 2]. As an important way to explain defect prediction models, local explanation methods have been adopted in defect prediction studies to explain why a certain prediction is buggy [3-7]. However, many of these studies focus on defect prediction models based on traditional features (i.e., handcrafted features), whereas an explanation that can point to defects in the source code may be more useful in practice. Motivated by this, Wattanakriengkrai et al. [6] attempted to use LIME to explain token frequency-based defect prediction models and to predict defective lines according to the key tokens provided by LIME. Unfortunately, by manually investigating the relationship between local explanations and actual defects, Aleithan [7] found that, for models built on token frequency features in the Just-In-Time defect prediction scenario, the local explanations poorly represent the ground truth (i.e., the key tokens are not related to actual defects). Ineffective local explanation methods may lead to such poor results, and it is still unclear how effective these local explanation techniques are when they are applied to source code-based defect prediction models.
Thus, in this paper, we evaluate the effectiveness of local explanation methods when they are applied to source code-based defect prediction models. Specifically, we first apply several popular local explanation methods to different file-level source code-based defect prediction models, and then introduce automatic metrics focusing on local faithfulness and explanation precision to evaluate the effectiveness of these methods. We find that the local explanation methods are generally much more effective than random guessing, and that they are effective on token frequency-based defect prediction models. However, when they are applied to deep learning-based models (e.g., CNN and DBN), they may not be as effective as expected. We observe that although the explanations produced for the CNN model are precise (i.e., the explanation itself is predicted correctly), those explanations cannot characterize all the factors contributing to the prediction of deep learning-based models (i.e., the prediction does not shift when the code tokens in the explanation are removed). Besides, we also find that the hyperparameter k (i.e., the number of features used for explanation) of local explanation methods heavily impacts their effectiveness.
Based on the above findings, we suggest:
(1) It is feasible to apply explanation methods based on individual code tokens (such as LIME and Word Omission) to bag of word-based defect prediction models. However, when applying such local explanation methods to deep learning-based defect prediction models (such as DBN and CNN), it is necessary to evaluate the effectiveness of the local explanation methods first instead of using them directly.
(2) The hyperparameter k should be carefully optimized to obtain meaningful explanations.
We also share our implementation, which includes source code, data, and simple instructions, at https://github.com/gyx1997/msr22-defects.
The rest of this paper is organized as follows. Section 2 gives the background of source code-based defect prediction and local explanation methods. Section 3 presents the design of our case study. Section 4 gives the results of our experiments and brief discussions of our findings. We discuss threats to validity in Section 5. In Section 6 we present our conclusions.
2 BACKGROUND
2.1 Source Code-based Defect Prediction
Figure 1. Framework for constructing deep semantic defect
prediction models
Source code carries structural and semantic information that can be captured via its AST [8, 9]. To automatically capture the semantic features of programs, several studies use deep learning-based approaches to construct defect prediction models directly from source code based on the AST, with various deep learning techniques [10-14]. They share the common framework proposed by [10], which is depicted in Figure 1. First, source files are parsed into abstract syntax trees (ASTs). Next, the trees are traversed to obtain sequences of AST nodes of specific types (e.g., control flow nodes, method invocation nodes, etc.). Each node is represented by a string code token (e.g., the method name for method invocation nodes and the node type for control flow nodes), and each unique code token is assigned an integer identifier, so that sequences of code tokens are transformed into integer vectors. Finally, those vectors are fed into deep neural networks for semantic feature extraction and model construction. Various deep learning techniques have been adopted. For example, Wang et al. first used a deep belief network (DBN) to extract semantic features from the code token sequences obtained from AST nodes [10]. Based on their framework, more advanced techniques such as CNN [11], LSTM with attention mechanism [12], gated LSTM [13], and novel word embedding techniques [14] have been used to improve performance.
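As an illustration, the following is a minimal sketch of this token-extraction step, assuming the Python package javalang (the parser we name in Section 3.2); the selection of node types and the helper names are ours and simplified for illustration:

```python
import javalang

# A simplified selection of control-flow node types; the original
# studies use a fuller list (assumption for illustration).
CONTROL_FLOW_NODES = (
    javalang.tree.IfStatement, javalang.tree.WhileStatement,
    javalang.tree.ForStatement, javalang.tree.DoStatement,
    javalang.tree.SwitchStatement,
)

def extract_tokens(source):
    """Parse a Java source file and collect its sequence of code tokens."""
    tree = javalang.parse.parse(source)
    tokens = []
    for _, node in tree:  # depth-first traversal of the AST
        if isinstance(node, javalang.tree.MethodInvocation):
            tokens.append(node.member)          # method name as token
        elif isinstance(node, CONTROL_FLOW_NODES):
            tokens.append(type(node).__name__)  # node type as token
    return tokens

def encode(sequences):
    """Assign each unique code token an integer identifier and
    transform token sequences into integer vectors."""
    vocab = {}
    return [[vocab.setdefault(t, len(vocab) + 1) for t in seq]
            for seq in sequences]
```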
Besides the deep learning-based approaches, token frequency-based approaches, which use Bag of Word (BoW) feature vectors (vectors of frequencies for each code token) to represent source code, have also been used by prior studies since they are simple and computationally efficient [6, 15]. As Figure 2 shows, for token frequency-based approaches, irrelevant characters such as semicolons and comments are removed first. Then, vectors of token frequencies are built by counting the code tokens. Finally, models are trained on those vectors.
Figure 2. Framework for constructing token frequency-
based defect prediction models
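A corresponding sketch of the token frequency pipeline, using scikit-learn's CountVectorizer as one possible implementation (the logistic regression classifier here is illustrative, not the model used in [6]):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def train_tf_model(token_sequences, labels):
    """Train a token frequency-based (BoW) defect prediction model on
    code-token sequences with irrelevant characters already removed."""
    docs = [" ".join(seq) for seq in token_sequences]
    vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
    bow = vectorizer.fit_transform(docs)  # vectors of token frequencies
    model = LogisticRegression(max_iter=1000).fit(bow, labels)
    return vectorizer, model

# Hypothetical toy usage:
vec, model = train_tf_model([["if", "readFile", "close"],
                             ["while", "append"]], [1, 0])
print(model.predict(vec.transform(["if readFile"])))
```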
2.2 Local Explanation Methods
As one way of interpreting black-box machine learning models, local explanation methods are post-hoc methods which explain why a certain prediction is made [16]. For a certain instance $x$ in the test set $D_{test}$, a local explanation method generates a corresponding local explanation $e_l = \varepsilon(pred, x)$, where $\varepsilon$ is the explanation function and $pred$ is the prediction function of the black-box model. Compared with other explanation methods, a local explanation method can interpret any black-box model, since only the prediction function is required. It can also deal with input data in flexible forms of representation. For example, the local explanation of a black-box model built on tabular data is a set of features with their contributions to the prediction, while explanations of image or text models are usually the parts of the input (i.e., super-pixels of images, and words or phrases for texts) which mainly contribute to the prediction.
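In code, the only contract a local explanation method needs is the prediction function; a minimal sketch of the interface $e_l = \varepsilon(pred, x)$ in the notation above (the type and function names are ours):

```python
from typing import Callable, List, Sequence, Tuple

# The black-box model is exposed only through its prediction function,
# which maps an input instance to class probabilities.
PredictFn = Callable[[Sequence[str]], List[float]]

def epsilon(pred: PredictFn, x: Sequence[str], k: int) -> List[Tuple[str, float]]:
    """Explanation function: returns the k input features (here, code
    tokens) with their contribution scores for the prediction pred(x)."""
    raise NotImplementedError  # realized by LIME, Word Omission, etc.
```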
Recently, for tabular data with traditional features, local explanation methods such as LIME have been adopted and well evaluated on defect prediction models [3]. Further, rule-based methods that are more effective and actionable (i.e., providing clearer guidance for practitioners) have been proposed for file-level defect prediction [4] and just-in-time defect prediction [5]. However, those studies only focus on tabular data with traditional features. To the best of our knowledge, though LIME has been adopted on token frequency-based models to predict defective lines [6] and to explain just-in-time models [7], the effectiveness of local explanation methods has not been evaluated when they are applied to source code-based defect prediction models. This unclear effectiveness may result in pitfalls when applying them in practice, and thus we evaluate their effectiveness in our case study.
3 CASE STUDY DESIGN
The overall framework of our case study is shown in Figure 3. Following the practice of prior studies, we first select projects with consecutive versions from the PROMISE repository and clone the source code from GitHub. Then we construct source code-based defect prediction models on the old version of each project. Next, we use local explanation methods to explain each file that is predicted defective in the new version and calculate the automatic metrics. Finally, averaged results are reported after repeating these steps 5 times.
Figure 3. Framework of the case study.
3.1 Dataset
In this study, we select projects in the PROMISE repository [17] which satisfy: (1) the project has at least 2 consecutive versions; (2) the defect labels and source code of the corresponding versions are available; and (3) the training data are nearly balanced, since imbalanced data has a great negative impact on interpretation [18]. As a result, we pick lucene, poi, and xalan and clone the source code of the corresponding versions from GitHub. Details are shown in Table 1.
Table 1. Studied Projects

Project   Old version   New version   % Defective
Lucene    2.0           2.2           46.67
Poi       1.5           2.0           59.49
Xalan     2.5           2.6           48.19
3.2 Defect Prediction Models
Figure 4. Framework of adopted source code-based defect prediction models
We adopt DBN [10] and CNN [11] as deep learning-based models and TF [6] as the token frequency-based defect prediction model. The framework of the adopted models is shown in Figure 4. We first parse Java source files into ASTs with the Python package javalang, and then traverse the parsed ASTs to obtain sequences of code tokens. For the deep learning-based models, we map each code token to a unique integer identifier, obtaining vectors of integer identifiers that represent the sequences. For the token frequency-based model, we count the frequencies of code tokens, obtaining vectors of bag of word features. Finally, we train the defect prediction models.
3.3 Local Explanation Methods
In Section 3.2 we obtain a sequence of code tokens for each file, which can be treated as a sentence in text classification. Thus, we adopt several local explanation techniques from the NLP field.
LIME. LIME [19] is a model-agnostic approach that trains a local regression model on perturbed samples around the specific instance and uses the coefficients of the regression model as the interpretation. In this study, it first generates perturbed samples by randomly removing code tokens from the given sample. Then, the labels of the perturbed samples are generated by the black-box model. Finally, it trains the local regression model on those perturbed samples based on bag of word features. Since the number of perturbed samples highly impacts the explanation [20], we set the number of perturbed samples to 5,000 in all experiments with LIME.
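A sketch of this setup with the lime Python package; token_sequence and predict_proba (a wrapper that takes a list of whitespace-joined token strings and returns class probabilities from the black-box model) are assumptions:

```python
from lime.lime_text import LimeTextExplainer

k = 10  # hyperparameter k: number of features used for explanation

explainer = LimeTextExplainer(class_names=["clean", "defective"])
explanation = explainer.explain_instance(
    " ".join(token_sequence),  # the predicted-defective file as a token string
    predict_proba,             # prediction function of the black-box model
    num_features=k,
    num_samples=5000,          # number of perturbed samples, as set above
)
print(explanation.as_list())   # [(code token, contribution weight), ...]
```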
Word Omission. Word Omission [21] estimates the contribution of individual words by measuring the effect of omitting certain words on the black-box model. In this study, we first obtain the set of unique tokens appearing in the given sample. Then, for each unique code token $t$, we remove all of its occurrences from the given sample and calculate the difference in prediction probability $\hat{p}(\hat{y} \mid s) - \hat{p}(\hat{y} \mid s_{\setminus t})$ as the importance score of that code token. Here $\hat{y}$ is the predicted class, and $s$ and $s_{\setminus t}$ are the sequences of code tokens with and without the specific code token $t$, respectively. Finally, we output the k code tokens with the highest scores as the explanation, i.e., the code tokens that contribute most to the prediction.
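A direct implementation sketch of this scoring, assuming predict_proba maps a token sequence to a vector of class probabilities:

```python
import numpy as np

def word_omission(predict_proba, tokens, k):
    """Return the k code tokens whose omission most decreases the
    probability of the predicted class."""
    p_full = predict_proba(tokens)
    y_hat = int(np.argmax(p_full))                   # predicted class ŷ
    scores = {}
    for t in set(tokens):                            # each unique code token
        s_without_t = [x for x in tokens if x != t]  # s with t removed
        scores[t] = p_full[y_hat] - predict_proba(s_without_t)[y_hat]
    return sorted(scores, key=scores.get, reverse=True)[:k]
```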
We also adopt Random Guessing, i.e., randomly choosing k unique code tokens as the most important features, as the baseline for comparison.
3.4 Automatic Metrics
We adopt automatic metrics to evaluate the local explanation techniques from the perspectives of local faithfulness and explanation precision. Following the practice in [20], we assume the golden labels of instances in the test project are unknown, and we examine instances according to their predicted labels. For simplicity, we only investigate instances that are predicted defective, since the non-defective instances are usually trivial.
3.4.1 Local Faithfulness
Local faithfulness is one of the key criteria for local explanation techniques [20, 22]. Thus, we mainly consider automatic metrics that evaluate local faithfulness in this study. In detail, we evaluate the local faithfulness of the adopted local explanation methods with two deletion-based metrics: AOPC and percentage of decision flip.
Area over the perturbation curve (AOPC) estimates the average probability change when important features are removed [23], and it has been widely used to evaluate the quality of local explanations on text classification models [20, 22]. In our case study, it is defined as

$$AOPC(k) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{p}(\hat{y} \mid s_i) - \hat{p}(\hat{y} \mid s_i^{\setminus k}) \right), \qquad (1)$$

where $N$ is the number of investigated instances, $\hat{p}(\hat{y} \mid \cdot)$ is the probability of the predicted class (defective in this study), and $s_i$ and $s_i^{\setminus k}$ are the token sequence with and without the top-k features (i.e., the explanation), respectively. A larger AOPC indicates better local faithfulness.
Motivated by Decision Flip - Most Informative Token [24], we also propose a new metric, percentage of decision flip (PDF), to estimate the extent of prediction shifting when the features in an explanation are removed. It is the percentage of investigated instances whose prediction is shifted when the k features of the explanation are removed, defined as

$$PDF(k) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ I(s_i) \neq I(s_i^{\setminus k}) \right]. \qquad (2)$$

Here $I$ is the prediction function: $I(s) = 1$ indicates that the token sequence $s$ is predicted defective, while $I(s) = 0$ means that $s$ is predicted clean. A larger PDF means the explanation better reflects the features (code tokens in this study) which contribute to the prediction.
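A sketch of how the two deletion-based metrics can be computed, assuming a binary black-box model whose predict_proba returns [p(clean), p(defective)] and a 0.5 decision threshold:

```python
import numpy as np

def aopc_and_pdf(predict_proba, sequences, explanations):
    """Compute AOPC (Eq. 1) and percentage of decision flip (Eq. 2)
    over the investigated (predicted-defective) instances.

    sequences:    the token sequences s_i
    explanations: the top-k feature sets, one per sequence
    """
    drops, flips = [], []
    for s, expl in zip(sequences, explanations):
        s_removed = [t for t in s if t not in expl]  # s_i without explanation
        p_before = predict_proba(s)[1]               # p(defective | s_i)
        p_after = predict_proba(s_removed)[1]
        drops.append(p_before - p_after)
        flips.append((p_before >= 0.5) != (p_after >= 0.5))  # decision flip?
    return float(np.mean(drops)), float(np.mean(flips))
```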
3.4.2 Explanation Precision
Intuitively, an optimal explanation should be able to reflect the prediction independently, without any other features. In other words, the subset of the code token sequence picked by the explanation should itself result in the same prediction. Inspired by the evaluation of rationales in text classification tasks [25], we evaluate the precision of explanations by calculating the percentage of explanations that result in the same prediction, named explanation precision (EP), defined as follows.
For the $i$-th sequence of code tokens $s_i = (x_1, x_2, \ldots, x_n)$ and its explanation (with $k$ features) $Expl(s_i) = \{t_1, t_2, \ldots, t_k\}$, we replace every $x_j \notin Expl(s_i)$ with a special padding token $t_{pad}$, and obtain a new sequence $s_{i,Expl}$ whose elements are either $t_{pad}$ or in $Expl(s_i)$. Then, we feed $s_{i,Expl}$ into the black-box model and get the prediction $I(s_{i,Expl})$. The EP over all sequences is defined as

$$EP(k) = \frac{1}{N} \sum_{i=1}^{N} I(s_{i,Expl}). \qquad (3)$$

A larger EP indicates that the explanations are more likely to be classified as defective by the black-box model.
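A sketch of the EP computation, assuming a padding token the black-box model accepts (e.g., mapped to a reserved identifier) and a predict function returning $I(s) \in \{0, 1\}$:

```python
PAD = "<pad>"  # special padding token t_pad (assumed to map to a reserved id)

def explanation_precision(predict, sequences, explanations):
    """Compute EP (Eq. 3): the fraction of explanations that are still
    predicted defective when all other tokens are padded out."""
    hits = [predict([x if x in expl else PAD for x in s])  # I(s_{i,Expl})
            for s, expl in zip(sequences, explanations)]
    return sum(hits) / len(hits)
```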
4 RESULTS AND DISCUSSION
Figure 5. AOPC of different local explanation methods on
different defect prediction models.
Figure 6. PDF of different local explanation methods on
different defect prediction models.
Figure 7. EP of different local explanation methods on
different defect prediction models.
We present the results of our experiments in Figures 5-7. The x-axis is the number of features (i.e., the hyperparameter k) used for explanation.
We find that local explanation methods are usually more effective than random guessing. In most situations (except project xalan with the DBN-based model for all metrics, and poi with the DBN-based model for explanation precision), both LIME and Word Omission achieve better local faithfulness and explanation precision than random guessing.
We also find that the effectiveness of these methods depends on the adopted models. The results show that these methods are both locally faithful and precise for token frequency-based defect prediction models (TF in Figures 5-7). On the other hand, they may not be as effective as expected when applied to deep learning-based defect prediction models (DBN and CNN in Figures 5-7). In particular, all investigated local explanation methods perform poorly on the DBN model. Since deep semantic models may consider the interactions between code tokens, explanation methods based on individual tokens may not be able to capture such interactions. Thus, one should be careful when using these methods on deep semantic models, since the explanation may not always actually reflect the prediction, and it is preferable to avoid using explanations that do not cause a shift of the prediction.
Further, we find that the investigated local explanation methods are sensitive to the hyperparameter k, especially with respect to local faithfulness. Generally, as Figure 5 and Figure 6 show, the explanation becomes more locally faithful to the prediction of the black-box model as more features are used for explanation. Figure 7 also shows that for the CNN-based defect prediction model, the explanation becomes more precise as k increases. However, we notice that for some source code files, the corresponding token sequences contain only a few unique code tokens (which can be thought of as features here). If the hyperparameter k is too large, all tokens will be used as the explanation, and such an explanation is meaningless. Thus, the hyperparameter k should be carefully tuned to get more precise and meaningful explanations.
5 THREATS TO VALIDITY
Here we give brief discussions of threats to validity.
Dataset. We only study a limited number of open-source projects. This may introduce biases into our conclusions.
Defect Prediction Models. We only adopt DBN [10] and CNN [11] as deep learning-based models in this study. Deep learning approaches which use more advanced techniques [12-14] are not evaluated. Moreover, source code-based defect prediction models are not limited to sequences of code tokens. For example, the ASTs of source code can be processed with Graph Neural Networks [26], and network embedding techniques can also be applied to Class Dependency Networks extracted from source code [27]. Besides, AST n-grams from source code may also be captured as features for source code-based defect prediction [28]. Further, Just-In-Time defect prediction studies may also use both source code and commit messages (i.e., a mixed input of both programming language and natural language) [29] and code changes [30] to train deep learning models. Thus, our conclusions may not generalize to all source code-based defect prediction models.
Local Explanation Methods. We only consider LIME [19] and Word Omission [21] in this study. Other popular methods such as BreakDown [31], and methods which consider the interactions between words, are not evaluated. The effectiveness of those methods is still unclear and needs to be explored in the future.
Implementation. No implementations of the studied defect prediction models are publicly available. We cannot ensure that our replications are identical to the originals, even though we replicated them carefully.
6 CONCLUSIONS
In this study, we evaluate the local explanation methods LIME and Word Omission on several popular source code-based defect prediction models, using automatic metrics that focus on local faithfulness and explanation precision. We find that these local explanation methods are much more effective than random guessing. We also find that their effectiveness depends on the adopted defect prediction models. They perform well on token frequency-based defect prediction models; however, their effectiveness on deep learning-based models such as DBN and CNN is not as good as expected. Their explanations cannot characterize all the factors contributing to the prediction, and thus they may not be capable of explaining all predictions of deep learning-based models due to poor local faithfulness. Besides, we also find that the hyperparameter k of local explanation methods heavily impacts their effectiveness.
Based on these findings, we suggest:
(1) It is feasible to apply local explanation methods based on individual code tokens, such as LIME and Word Omission, to token frequency-based models. However, when applying such local explanation methods to deep learning-based defect prediction models (such as DBN and CNN), it is necessary to evaluate the effectiveness of those local explanation methods first instead of using them directly.
(2) The hyperparameter k of local explanation methods should be carefully optimized to get meaningful explanations.
ACKNOWLEDGMENTS
This work was partly supported by the National Natural Science Foundation of China (No. 62077029), the Future Network Scientific Research Fund Project (No. FNSRFP-2021-YB-32), and the Graduate Science Research Innovation Program of Jiangsu Normal University (No. 2021XKT1394).
REFERENCES
[1] Zhiyuan Wan, Xin Xia, David Lo, Jianwei Yin, and Xiaohu Yang. 2020.
Perceptions, Expectations, and Challenges in Defect Prediction. IEEE
Transactions on Software Engineering, (Nov. 2020).
[2] Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, and John Grundy. 2021.
Practitioners’ Perceptions of the Goals and Visual Explanations of Defect
Prediction Models. In Proceedings of 2021 IEEE/ACM 18th International
Conference on Mining Software Repositories (MSR'21). IEEE, 432–443.
[3] Jirayus Jiarpakdee, Chakkrit Tantithamthavorn, Hoa Khanh Dam, and John
Grundy. 2020. An Empirical Study of Model-Agnostic Techniques for Defect
Prediction Models. IEEE Transactions on Software Engineering, (Mar. 2020).
[4] Dilini Rajapaksha, Chakkrit Tantithamthavorn, Christoph Bergmeir, Wray
Buntine, Jirayus Jiarpakdee and John Grundy. 2021. SQAPlanner:
Generating Data-Informed Software Quality Improvement Plans. IEEE
Transactions on Software Engineering, (Apr. 2021).
[5] Chanathip Pornprasit, Chakkrit Tantithamthavorn, Jirayus Jiarpakdee,
Michael Fu and Patanamon Thongtanunam. 2021. PyExplainer: Explaining
the Predictions of Just-In-Time Defect Models. In Proceedings of the 36th
International Conference on Automated Software Engineering (ASE'21). IEEE,
407–418.
[6] Supatsara Wattanakriengkrai, Patanamon Thongtanunam, Chakkrit
Tantithamthavorn, Hideaki Hata and Kenichi Matsumoto. 2020. Predicting
defective lines using a model-agnostic technique. CoRR.
https://arxiv.org/abs/2009.03612.
[7] Reem Aleithan. 2021. Explainable Just-In-Time Bug Prediction: Are We
There Yet? In Proceedings of 2021 IEEE/ACM 43rd International Conference on
Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 129–
131.
[8] Tung Thanh Nguyen, Hoan Anh Nguyen, Nam H. Pham, Jafar M. Al-Kofahi,
and Tien N. Nguyen. 2009. Graph-based mining of multiple object usage
patterns. In Proceedings of the 7th joint meeting of the European Software
Engineering Conference and the ACM SIGSOFT symposium on the Foundations
of Software Engineering (ESEC/FSE'09). Association for Computing
Machinery, New York, NY, USA, 383–392.
[9] Anh Tuan Nguyen and Tien N. Nguyen. 2015. Graph-based statistical
language model for code. In Proceedings of the 37th International Conference
on Software Engineering (ICSE'15). IEEE, 858–868.
[10] Song Wang, Taiyue Liu and Lin Tan. 2016. Automatically learning semantic
features for defect prediction. In Proceedings of the 38th International
Conference on Software Engineering (ICSE'16). Association for Computing
Machinery, New York, NY, USA, 297–308.
[11] Jian Li, Pinjia He, Jieming Zhu and Michael R. Lyu. 2017. Software Defect
Prediction via Convolutional Neural Network. In Proceedings of 2017 IEEE
International Conference on Software Quality, Reliability, and Security
(QRS'17). IEEE, 318–328.
[12] Guisheng Fan, Xuyang Diao, Huiqun Yu, Kang Yang and Liqiong Chen. 2019.
Deep Semantic Feature Learning with Embedded Static Metrics for Software
Defect Prediction. In Proceedings of the 26th Asia-Pacific Software
Engineering Conference (APSEC'19). IEEE, 244–251.
[13] Hao Wang, Weiyuan Zhuang and, Xiaofang Zhang. 2021. Software Defect
Prediction Based on Gated Hierarchical LSTMs. IEEE Transactions on
Reliability, 70, 2, (June, 2021), 711–727.
[14] Hao Li, Xiaohong Li, Xiang Chen, Xiaofei Xie, Yanzhou Mu and Zhiyong
Feng. 2019. Cross-project Defect Prediction via ASTToken2Vec and BLSTM-
based Neural Network. In Proceedings of 2019 International Joint Conference
on Neural Networks (IJCNN'19). IEEE, 1–8.
[15] Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2021. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction. In Proceedings of 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR'21). IEEE, 369–379.
[16] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca
Giannotti, and Dino Pedreschi. 2018. A Survey of Methods for Explaining
Black Box Models. ACM Comput. Surv. 51, 5, Article 93 (Aug. 2018).
[17] Marian Jureczko and Lech Madeyski. 2010. Towards identifying software
project clusters with regard to defect prediction. In Proceedings of the 6th
International Conference on Predictive Models in Software Engineering
(PROMISE '10). Association for Computing Machinery, New York, NY, USA,
Article 9, 1–10.
[18] Chakkrit Tantithamthavorn, Ahmed E. Hassan, and Kenichi Matsumoto.
2020. The impact of class rebalancing techniques on the performance and
interpretation of defect prediction models. IEEE Transactions on Software
Engineering, 46, 11, (Nov. 2020), 1200–1219.
[19] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should
I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD '16). Association for Computing Machinery, New York,
NY, USA, 1135–1144.
[20] Dong Nguyen. 2018. Comparing Automatic and Human Evaluation of Local
Explanations for Text Classification. In Proceedings of NAACL-HLT 2018.
Association for Computational Linguistics, 1069–1078.
[21] Marko Robnik-Sikonja and Igor Kononenko. 2008. Explaining classifications
for individual instances. IEEE Transactions on Knowledge and Data
Engineering, 20, 5, (Mar, 2008), 589–600.
[22] Hanjie Chen, Guangtao Zhang, and Yanfeng Ji. 2020. Generating
hierarchical explanations on text classification via feature interaction
detection. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics (ACL'20). Association for Computational
Linguistics, 5578–5593.
[23] Wojciech Samek, Alexander Binder, Gregoire Montavon, Sebastian
Lapuschkin, and Klaus-Robert Muller. 2017. Evaluating the visualization of
what a deep neural network has learned. IEEE Transactions on Neural
Networks and Learning Systems, 28, 11, (Nov. 2017), 2660–2673.
[24] George Chrysostomou, and Nikolaos Aletras. 2021. Improving the
faithfulness of attention-based explanations with task-specific information
for text classification. In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics (ACL'21). Association for
Computational Linguistics, 477–488.
[25] Tao Lei, Regina Barzilay, Tommi Jaakkola. 2016. Rationalizing Neural
Predictions. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing (EMNLP'16). Association for Computational
Linguistics, USA, 107–117.
[26] Jiaxi Xu, Fei Wang and, Jun Ai. 2021. Defect Prediction with Semantics and
Context Features of Codes Based on Graph Representation Learning. IEEE
Transactions on Reliability, 70, 2, (June, 2021), 613–625.
[27] Yu Qu, Ting Liu, Jianlei Chi, Yangxu Jin, Di Cui, Ancheng He, and Qinghua
Zheng. 2018. node2defect: Using Network Embedding to Improve Software
Defect Prediction. In Proceedings of the 2018 33rd ACM/IEEE International
Conference on Automated Software Engineering (ASE '18). ACM, New York,
NY, USA, 844–849.
[28] Thomas Shippey, David Bowes, and Tracy Hall. 2019. Automatically
identifying code features for software defect prediction: Using AST N-grams.
Information and Software Technology.
[29] Thong Hoang, Hoa Khanh Dam, Yasutaka Kamei, David Lo, and Naoyasu
Ubayashi. 2019. DeepJIT: An End-To-End Deep Learning Framework for
Just-In-Time Defect Prediction. In Proceedings of 2019 IEEE/ACM 16th
International Conference on Mining Software Repositories (MSR'19). IEEE, 34–
45.
[30] Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. CC2Vec:
Distributed Representations of Code Changes. In Proceedings of 2020
IEEE/ACM 42nd International Conference on Software Engineering (ICSE'20).
Association for Computing Machinery, New York, NY, USA, 518–529.
[31] Alicja Gosiewska and Przemyslaw Biecek. 2020. Do Not Trust Additive
Explanations. CoRR. https://arxiv.org/abs/1903.11420.