Trusting Heart Failure Predictors
Abdulrezzak ZEKIYE1 [0000-0003-4974-9795] and Recep ALP KUT2 [0000-0002-5781-334X]
Dokuz Eylül University, Izmir, Turkey
Abstract. In recent years, there has been a significant rise in the use of artificial intelligence models for prediction and classification. Some of these models have reached human-level performance, while others have exceeded it. One of the challenges facing these models is that they are often considered "black boxes" because they do not provide any explanation for their decisions. In this paper, we tackle the problem of heart failure prediction using a range of methods, including logistic regression, decision trees, artificial neural networks, and random forests. To provide more transparency and understandability for the end-user, we employed several methods to explain the predictions of our models: SHapley Additive exPlanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), and decision path visualization for decision tree models. Our experimental results showed that the random forest model was the best predictor, with an f1-score of 92.68% after applying feature selection. However, we found that the decision tree model was more interpretable, allowing for better self-explainability. While SHAP and LIME were effective in explaining the models' predictions, additional processing was needed to make the explanations more easily readable for the end-user. We processed the output of SHAP and LIME and presented it in a more understandable format.
Keywords: Explainable Artificial Intelligence, SHAP, LIME, Heart failure prediction
1 Introduction
A knowledge-based system in the medical domain acquires its knowledge from medical experts in order to provide decisions and predictions. MYCIN [1] and PXDES [2] are examples of early medical expert systems from the 1970s and 1980s, while more recently modern AI models have been used, for example, to identify COVID-19 from chest X-rays [3]. Using AI in the medical field has many advantages, such as predicting future heart failure. To do so, we can use many AI models, such as Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees, Random Forests, Artificial Neural Networks (ANN), and many others. Using such models with the right data, we can achieve satisfying results. However, the user does not know the reason behind a decision, and this is a problem. The solution is to provide explanations to the user along with the decision or prediction. Explainable Artificial Intelligence (XAI) is
the field of AI that provides explanations for an AI model's decisions. Some models are self-explainable, like decision trees, and others can be modified to become self-explainable, for example by adding an attention layer to a long short-term memory (LSTM) model [4]. For models that are neither self-explainable nor modifiable, we can use an ad-hoc explainer such as LIME [5] or SHAP [6]. LIME stands for Local Interpretable Model-agnostic Explanations; it explains the predictions of any classifier by learning a local surrogate model around them [5]. SHAP, SHapley Additive exPlanations, calculates the contribution of each feature to the final decision; these values tell us which features caused the model to give its decision [6].
This work has the following contributions:
1. Using SHAP and LIME to explain heart failure predictors, and interpreting the results as readable sentences for the end-users.
2. Addressing data imbalance using the SMOTE oversampling technique.
3. Improving prediction quality and explanations by selecting the most useful features.
4. Standardizing the data with a standard scaler.
The rest of the paper is organized as follows: the next section briefly reviews related work on predicting heart failure and on using XAI in the medical field. The third section describes the dataset, its preprocessing, and the prediction and explanation methods we used. The fourth section presents our experiments and results, and the last section concludes.
2 Previous Studies
Researchers have run many experiments on predicting heart failure with different artificial intelligence models. In [7], the authors surveyed the algorithms used for predicting heart failure, covering papers that applied SVM, decision trees, K-means, and Naïve Bayes. U. Pawar et al. discussed the use of explainable AI in healthcare [8]. They argued that the lack of trust in AI models comes from their being black boxes, and that using XAI in healthcare will increase transparency, improve the models, and enable tracking of the results. They also proposed combining existing XAI models with clinical knowledge to make AI-based systems more useful: experts can check whether the explanations are correct, and when a prediction is wrong, the problem can be traced back to improve the model. Pedro A. Moreno-Sanchez [9] used ensemble-tree machine learning techniques to predict heart failure; the best result was obtained with XGBoost, with an accuracy of 83%. The author noted that a study of feature importance could improve explainability by removing unimportant features, which would also reduce overfitting and computing time. In [10], experiments on a medical transcriptions dataset were done using five different models for providing predictions, including self-explanatory decision trees,
neural network models with separate explainers, and a bidirectional LSTM model whose attention weights are used as explanations.
3 Material and Methods
3.1 Dataset
Heart Failure Clinical Records [11] is the dataset we used. It contains 12 features collected from 299 patients in 2015. The 13th column, DEATH_EVENT, indicates whether the patient died during the follow-up period (1) or not (0).
3.2 Preprocessing
The dataset does not contain null or empty values. The twelfth column, time, contributes heavily to DEATH_EVENT. However, it records the time at which the patient died or was censored, a value we cannot obtain from a new patient when the model is deployed. Therefore, we dropped that column. The second problem is imbalanced classes: for DEATH_EVENT, the number of deaths is approximately half the number of survivors. This is a problem for classification/prediction, and we can solve it by resampling. We used the Synthetic Minority Oversampling Technique (SMOTE) [12] to oversample the minority class, which yielded 406 samples. The distribution of the features after oversampling is shown in Figure 1. The last preprocessing step is scaling the data with a standard scaler, i.e., subtracting the mean of each feature from its values and then dividing by the feature's standard deviation.
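The following is a minimal sketch of this preprocessing pipeline using imbalanced-learn and scikit-learn; the CSV file name is a hypothetical placeholder, and the column names follow the dataset description above.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; any copy of the Heart Failure Clinical Records CSV works.
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Drop the target and the follow-up time, which is unavailable at prediction time.
X = df.drop(columns=["DEATH_EVENT", "time"])
y = df["DEATH_EVENT"]
feature_names = list(X.columns)

# Oversample the minority (death) class so both classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Standardize: subtract each feature's mean and divide by its standard deviation.
X_scaled = StandardScaler().fit_transform(X_res)
```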
Fig. 1. The distribution of the features after oversampling.
3.3 Feature Selection
Feature selection can be used to find the subset of features that yields the best predictions [13]. Feature projection, on the other hand, reduces the number of features by transforming the data, as in Principal Component Analysis (PCA) [14]. In other words, feature selection keeps a subset of the original features, while feature projection transforms the original features into a smaller set of projected features. Since feature projection produces new features that we cannot explain, we used feature selection only. As we have only 11 features, we exhaustively tried all possible subsets to find the one that gives the best performance.
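A minimal sketch of this exhaustive search is shown below, reusing X_scaled, y_res, and feature_names from the preprocessing sketch; the estimator configuration and the fixed split are illustrative assumptions.

```python
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

best_score, best_subset = 0.0, None
# With 11 features there are 2^11 - 1 = 2047 non-empty subsets to evaluate.
for k in range(1, len(feature_names) + 1):
    for idx in combinations(range(len(feature_names)), k):
        X_sub = X_scaled[:, idx]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_sub, y_res, test_size=0.2, random_state=0)
        clf = RandomForestClassifier(n_estimators=11, random_state=0)
        score = f1_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
        if score > best_score:
            best_score = score
            best_subset = [feature_names[i] for i in idx]

print(best_subset, best_score)
```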
3.4 Prediction
We used four different models in the prediction part. The first is logistic regression, a statistical model that learns the coefficients of a linear combination of the features and passes it through a logistic (sigmoid) function. The second model is a decision tree. Decision trees use a tree-like model that splits first on the most informative feature according to an impurity measure such as entropy. The third predictor is a random forest, which combines multiple decision trees to make the final decision. The last predictor is an artificial neural network (ANN). An ANN has an input layer, an output layer, and possibly one or more hidden layers between them. The connections between layers carry weights, and we train the ANN to learn those weights in order to classify or predict the output.
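A minimal sketch of the four predictors in scikit-learn follows; the hyperparameters mirror those reported in Section 4, while everything else is a reasonable default rather than the exact configuration used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    # Limited to 10 leaf nodes, as in the experiments below.
    "decision_tree": DecisionTreeClassifier(max_leaf_nodes=10, random_state=0),
    # 11 estimators, as in the experiments below.
    "random_forest": RandomForestClassifier(n_estimators=11, random_state=0),
    # Three hidden layers of six ReLU neurons; for binary classification
    # MLPClassifier applies a logistic (sigmoid) output activation.
    "ann": MLPClassifier(hidden_layer_sizes=(6, 6, 6), activation="relu",
                         max_iter=2000, random_state=0),
}
```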
3.5 Explanations
To explain the self-explainable decision tree, we plot the tree itself. Plotting the decision tree gives a global explanation, i.e., an understanding of the model as a whole. When predicting a case, we also plot the decision path in order to give a local explanation. In addition, we used the LIME and SHAP ad-hoc external explainers with all the models to obtain explanations, and we then processed the output of those explainers into readable sentences.
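A minimal sketch of querying LIME and SHAP for one test instance x is shown below, assuming a fitted model and training arrays like those produced in the experiment sketch of Section 4; KernelExplainer is used here because it is model-agnostic, though SHAP also offers specialized explainers for tree models.

```python
import shap
from lime.lime_tabular import LimeTabularExplainer

# LIME: fit a local surrogate model around the single instance x.
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["no death", "death"], mode="classification")
lime_exp = lime_explainer.explain_instance(
    x, model.predict_proba, num_features=len(feature_names))
lime_weights = lime_exp.as_list()  # [(feature condition, weight), ...]

# SHAP: per-feature contributions, estimated against a background sample.
background = shap.sample(X_train, 50)
shap_explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = shap_explainer.shap_values(x)  # per-class feature contributions
```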
4 Experiments and Results
We used the scikit-learn library to build and train all the models. For LIME and SHAP, we used the authors' official libraries. In all of the experiments, we randomly split the data into 80% for training and 20% for testing. After preprocessing the data as illustrated earlier, we applied each predictor and explainer to the data. We calculated four metrics (accuracy, precision, recall, and f1-score) and used the f1-score for comparison. Logistic regression gave an f1-score of 70.49%. The decision tree, with a maximum of 10 leaf nodes, gave 81.74%. The f1-score of the random forest with 11 estimators was 85.37%. Finally, our ANN model scored 80.49%.
The ANN model's structure, shown in Figure 2, has 11 input neurons, three hidden layers of six neurons each, and an output layer with one neuron that predicts the death event. The hidden layers use the ReLU activation function, whereas the output layer uses a sigmoid. The random forest gave the best result, with the decision tree and ANN coming next with f1-scores close to each other. Considering area under the curve (AUC) performance as well, the random forest model is still the best choice. The AUC performance of the four models is shown in Table 1.
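The evaluation loop might look like the following sketch, reusing the models dictionary and the preprocessed arrays from the earlier sketches; the split seed is an assumption.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_res, test_size=0.2, random_state=0)

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: acc={accuracy_score(y_test, y_pred):.4f} "
          f"prec={precision_score(y_test, y_pred):.4f} "
          f"rec={recall_score(y_test, y_pred):.4f} "
          f"f1={f1_score(y_test, y_pred):.4f} auc={auc:.4f}")
```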
Applying feature selection by testing all possible subsets of the features on the random forest, we concluded that the subset ('age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking') is the best among all possibilities. Using this subset, the random forest's f1-score increased from 85.37% to 92.68%. It is important to note that, because random forests are sensitive to the random training/testing split, training has to be repeated over several random splits in order to reach this result.
For the explanation part, Figure 3 shows the decision paths for two examples predicted with the decision tree, and Figure 4 shows the whole decision tree before feature selection. By looking at the decision tree, or by reading the rule extracted from the decision path, the end-user can understand the logic that led to the final decision; if the end-user is an expert, a physician in our case, they can judge whether the predicted output makes sense.
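In scikit-learn, the global and local views can be produced as in the following sketch; tree_model is assumed to be the fitted DecisionTreeClassifier from the earlier sketches, and x one standardized sample as a NumPy array.

```python
from sklearn.tree import export_text

# Global explanation: a textual rendering of the entire fitted tree.
print(export_text(tree_model, feature_names=feature_names))

# Local explanation: the nodes one sample visits on its way to a leaf,
# printed in the same node#id style as the examples in Figure 3.
path = tree_model.decision_path(x.reshape(1, -1))
for node_id in path.indices:
    feat = tree_model.tree_.feature[node_id]
    if feat >= 0:  # negative values mark leaf nodes
        print(f"node#{node_id}: {feature_names[feat]} <= "
              f"{tree_model.tree_.threshold[node_id]} (value: {x[feat]})")
```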
Table 2 shows the explanations provided by the LIME and SHAP explainers for logistic regression, the decision tree, the random forest, and the ANN on the same example before feature selection, while Table 3 shows the same explanations after feature selection. We notice that the explanations provided by LIME are the same even though we used different predictors. On the other hand, the explanations provided by SHAP differ slightly from one predictor to another. The numbers represent each feature's contribution to the final decision. For example, in the LIME explanation we obtained, the negative contributions of creatinine phosphokinase, age, serum creatinine, diabetes, and blood pressure caused the final decision to be no death. Comparing these with the example's actual values, we can say that the predictor predicted no death because the patient has neither diabetes nor high blood pressure, his age is 55, and his creatinine phosphokinase is at a normal level.
Fig. 2. ANN's architecture used to predict heart failure.
Table 1. The AUC performance of the four models before feature selection.

Model                  AUC
Logistic Regression    0.775684
Decision Tree          0.764742
Random Forest          0.843465
ANN                    0.729483
Example 1:
node#0: serum_creatinine <= -0.516019731760025 (example's value: -0.7360376662943322)
node#1: ejection_fraction <= -0.7994389533996582 (example's value: 0.277833179717483)

Example 2:
node#0: serum_creatinine <= -0.516019731760025 (example's value: 0.35884647755501464)
node#2: ejection_fraction <= -0.2441570907831192 (example's value: 0.9473230971207199)
node#4: serum_creatinine <= 0.27605095505714417 (example's value: 0.35884647755501464)

Fig. 3. Decision paths for two examples, showing the node id, the rule, and the example's value, before feature selection.
By comparing the explanations before and after feature selection, we can notice that the explanations provided after feature selection are clearer. We applied a further step to the explanations provided by SHAP and LIME for the random forest after feature selection in order to produce more human-readable explanations: we sorted the features by the magnitude of their weights in descending order and then created a sentence indicating how much each feature contributed toward a death or no-death prediction. Figure 5 shows the interpreted predictions obtained from SHAP when using the random forest predictor; they were produced by selecting the three largest SHAP values in absolute terms (negative or positive) and composing a sentence for each of them.
creatinine_phosphokinase contributed 0.545209 toward no death prediction.
age contributed 0.369599 toward no death prediction.
serum_creatinine contributed 0.256969 toward no death prediction.
Fig. 5. An example of interpreted predictions from SHAP values.
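A minimal sketch of this sentence-generation step is shown below; it assumes a one-dimensional array of SHAP values for one sample, with positive values taken to push toward the death class.

```python
import numpy as np

def interpret_shap(shap_row, feature_names, top_k=3):
    """Turn one sample's SHAP values into ranked, readable sentences."""
    # Indices of the top_k contributions, largest absolute value first.
    order = np.argsort(-np.abs(np.asarray(shap_row)))[:top_k]
    sentences = []
    for i in order:
        direction = "death" if shap_row[i] > 0 else "no death"
        sentences.append(f"{feature_names[i]} contributed "
                         f"{abs(shap_row[i]):.6f} toward {direction} prediction.")
    return sentences
```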
Fig. 4. The plotted decision tree for predicting heart failure.
Table 2. Explanations using the LIME and SHAP explainers for the same example.

                           Logistic Regression  Decision Tree  Random Forest  ANN
Label                      0                    0              0              0
Predicted Label            0                    0              0              0
LIME Explanation           [plot]               [plot]         [plot]         [plot]

SHAP Explanation (per-feature contributions):
age                        -0.330               -0.638         -0.359         -0.298
anaemia                     0.004                0.0           -0.070          0.002
creatinine_phosphokinase   -0.279                0.0            0.088         -0.262
diabetes                    0.000                0.0           -0.090         -0.006
ejection_fraction           0.095                0.0            0.084          0.108
high_blood_pressure        -0.004                0.0           -0.081         -0.002
platelets                  -0.497               -0.138         -0.164         -0.477
serum_creatinine           -0.088               -0.195         -0.272         -0.028
serum_sodium                0.097                0.0           -0.065          0.062
sex                        -0.001               -0.028         -0.024         -0.001
smoking                     0.003                0.0           -0.044         -0.004
Table 3. Explanations using the LIME and SHAP explainers for the same example after feature selection.

                           Logistic Regression  Decision Tree  Random Forest  ANN
Label                      0                    0              0              0
Predicted Label            0                    0              0              0
LIME Explanation           [plot]               [plot]         [plot]         [plot]

SHAP Explanation (per-feature contributions):
age                        -0.000               -0.000         -0.000         -0.046
anaemia                    -0.332               -0.332         -0.332         -0.305
creatinine_phosphokinase   -0.000               -0.000         -0.000         -0.056
diabetes                    0.000                0.000          0.000         +0.004
sex                         0.000                0.000          0.000         -0.002
serum_sodium               -0.000               -0.000         -0.000         +0.007
smoking                    -0.332               -0.332         -0.332         -0.305
serum_creatinine           -0.000               -0.000         -0.000         -0.000

5 Conclusion
In this paper, we explored different ways to predict heart failure while also providing explanations. We observed that the random forest was the best model; the decision tree and ANN gave close scores, and logistic regression was the worst. From the self-explainability point of view, the decision tree is the only model that achieves it. The logic behind logistic regression is simple, but understanding the resulting equation with
11 variables is hard. A random forest's logic is simple too; however, tracking all the decision paths in all the trees and displaying them to the user is not an option, since the number of trees is large. Lastly, an ANN works in a way that the user cannot understand at all. We discussed how explanations can be provided for any model by using either LIME or SHAP.
Applying feature selection played an important role in increasing predictor performance: the random forest's f1-score rose from 85.37% to 92.68%. Moreover, feature selection improved the provided explanations, since there are fewer features to interpret.
In conclusion, when there is a choice between a self-explainable model and a slightly better but non-explainable one, we believe it is better to choose the self-explainable model, because the user will trust it and be able to tell whether it is working correctly. When only non-explainable models are available, or their performance is much higher than that of self-explainable ones, external ad-hoc explainers can be used to provide acceptable explanations that may help the user decide whether to trust the prediction.
Acknowledgment
The researcher Abdulrezzak Zekiye was pursuing his master's degree as a scholarship student supported by the "Presidency for Turks Abroad and Related Communities (YTB)" when this work was carried out.
References
[1]
E. H. Shortliffe, "Mycin: A Knowledge-Based Computer Program Applied
to Infectious Diseases," p. 66–69, 1977 Oct 5.
[2]
B. Buchanan and E. Shortliffe., "Rule Based Expert Systems: The Mycin
Experiments of the Stanford Heuristic Programming Project (The Addison-
Wesley series in artificial intelligence)," 1984.
[3]
A. K. Das, S. Ghosh, S. Thunder, R. Dutta, S. Agarwal and A. Chakrabarti,
"Automatic COVID-19 detection from X-ray images using ensemble learning
with convolutional neural network," Pattern Analysis and Applications, 19
March 2021.
[4]
M. A. Clinciu and H. F. Hastie, "A Survey of Explainable AI Terminology,"
Edinburgh Centre for Robotics.
creatinine_phosphokinase contributed by 0.545209 toward no death prediction.
age contributed 0.369599 toward no death prediction.
serum_creatinine contributed 0.256969 toward no death prediction.
Fig 4. An example of an interpreted predictions of SHAP values
14. Tıp Bilişimi Kongresi Bildiriler Kitabı
Proceedings of 14th Turkish Congress of Medical Informatics
16-18 Mart 2023 / 16-18 March 2023
8
[5]
M. T. Ribeiro, S. Singh and C. Guestrin, ""Why Should I Trust You?":
Explaining the Predictions of Any Classifier," 2016.
[6]
S. Lundberg and S.-I. Lee, "A Unified Approach to Interpreting Model
Predictions," 25 Nov 2017.
[7]
B. Gnaneswar and M. E. Jebarani, "A review on prediction and diagnosis
of heart failure," 2017.
[8]
U. Pawar, D. O’Shea, S. Rea and R. O’Reilly, "Explainable AI in
Healthcare," 2020.
[9]
P. A. Moreno-Sanchez, "Development of an Explainable Prediction Model
of Heart Failure Survival by Using Ensemble Trees," 2020.
[10]
A. r. Zakieh and A. Alpkocak, "Classification of Medical Transcriptions
with Explanations," in 13th CONGRESS of MEDICAL INFORMATICS,
Turkey, 2021.
[11]
D. Chicco and G. Jurman, "Machine learning can predict survival of
patients with heart failure from serum creatinine and ejection fraction alone,"
BMC Medical Informatics and Decision Making volume 20, 03 February 2020.
[12]
N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, "SMOTE:
Synthetic Minority Over-sampling Technique," 2002.
[13]
J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang and H. Liu,
"Feature Selection: A Data Perspective," ACM computing surveys (CSUR),
vol. 50, no. 6, pp. 1-45, 2017.
[14]
K. Pearson, "LIII. On lines and planes of closest fit to systems of points in
space," The London, Edinburgh, and Dublin philosophical magazine and
journal of science, vol. 2, no. 11, pp. 559-572, 1901.
[15]
K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for
Large-Scale Image Recognition," 2014.
[16]
A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification
with Deep Convolutional Neural Network".
[17]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions," 2015.
[18]
K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image
Recognition," 2015.