Incorporating Explainable Artificial Intelligence
(XAI) to aid the Understanding of Machine
Learning in the Healthcare Domain
Urja Pawar, Donna O’Shea, Susan Rea, and Ruairi O’Reilly
Cork Institute of Technology
Urja.Pawar@mycit.ie, Donna.OShea@cit.ie, Susan.Rea@cit.ie, Ruairi.OReilly@cit.ie
Abstract. In the healthcare domain, Artificial Intelligence (AI) based systems are being increasingly adopted, with applications ranging from surgical robots to automated medical diagnostics. While a Machine Learning (ML) engineer might be interested in the parameters related to the performance and accuracy of these AI-based systems, it is postulated that a medical practitioner would be more concerned with the applicability and utility of these systems in the medical setting. However, medical practitioners are unlikely to have the prerequisite skills to enable reasonable interpretation of an AI-based system. This is a concern for two reasons.
Firstly, it inhibits the adoption of systems capable of automating routine analysis work and prevents the associated productivity gains. Secondly, and perhaps more importantly, it reduces the scope of expertise available to assist in the validation, iteration, and improvement of AI-based systems in providing healthcare solutions.
Explainable Artificial Intelligence (XAI) is a domain focused on techniques and approaches that facilitate the understanding and interpretation of the operation of ML models. Research interest in the domain of XAI is becoming more widespread due to the increasing adoption of AI-based solutions and the associated regulatory requirements [1]. Providing an understanding of ML models is typically approached from a Computer Science (CS) perspective [2], with limited research emphasis placed on supporting alternate domains [3].
In this paper, a simple yet powerful solution for increasing the explainability of AI-based solutions to individuals from non-CS domains (such as medical practitioners) is presented. The proposed solution enables the explainability of ML models and the underlying workflows to be readily integrated into a standard ML workflow.
Central to this solution are feature importance techniques that measure the impact of individual features on the outcomes of AI-based systems. It is envisaged that feature importance can enable a high-level understanding of a ML model and the workflow used to train the model. This could aid medical practitioners in comprehending AI-based systems and enhance their understanding of ML models' applicability and utility.
Keywords: Explainable Artificial Intelligence · Healthcare · Feature Importance · Decision trees · Explainable Underlying Workflow
1 Introduction
Interpretability is the degree to which the rationale of a decision can be observed within a system [4]. If a ML model's operation is readily understood, then the model is interpretable. Explainability is the extent to which the internal operation of a system can be explained in human terms. XAI comprises methodologies for making AI systems interpretable and explainable [5].
The context of interpretability and explainability is generally considered domain-specific in an applied setting. For instance, a ML engineer and a medical practitioner would have different perspectives on what is "explainable" when viewing the same system. Interpretability from the perspective of the ML engineer relates to understanding the internal working of a system so that the technical parameters can be tuned to improve the overall performance. Interpretability from the medical practitioner's perspective would relate to a higher-level understanding of the internal operation of a system as it relates to the medical function it provides. Explainability for a ML engineer may relate to presenting technical information in an understandable format that enables effective evaluation of a system, while explainability for medical practitioners may relate more to the rationale as to why a course of action is prescribed for a patient.
It is postulated that AI-based systems need to accommodate a medical practitioner's perspective to be considered explainable in a healthcare setting. This presents several challenges which are highlighted and addressed as part of this work:
Designing domain-agnostic systems with XAI while simultaneously accommodating multiple perspectives is a complex problem, because explanations require a context of the domain (engineering, medicine, or healthcare) and can be useful for a targeted perspective but trivial for others. For instance, presenting interactive visualisations to explain the layers of a neural network is beneficial for ML engineers but of less importance to the radiologists who use the neural network for analysing MRI scans.
The scope of interpretability and explainability for AI-based solutions is broader than the operation of a ML model. It also concerns the workflow adopted to train these models. The workflow can provide technical knowledge regarding the pre-processing steps, the ML models used, and the evaluation criteria (e.g. accuracy, precision) to the ML engineer. It can benefit medical practitioners with an overview of the underlying data, the model's interpretation of the data, and the performance metrics pertinent to medical diagnostics. For instance, a ML model that makes predictions based on a patient's medical record might be inappropriate if the underlying training data does not include records from similar demographics.
The subjective nature of XAI in the medical setting presents challenges such as the association of a trained model's knowledge with the medical features, the provisioning of explanations with regard to the underlying medical dataset [6], and an understanding of how the presence or absence of some medical features' information affects a model's performance and its interpretation of features.
There are several nuanced issues related to the challenges articulated. These include: (a) a lack of explainability in the underlying feature engineering processes to incorporate clinical expertise; (b) complexity in the integration of XAI approaches with existing ML workflows [1]; (c) a lack of high-level explainability of the data and the ML model [7]; and (d) a lack of explainability of a model's operation in different medical settings.
A standard ML workflow consists of several stages: data collection, data pre-processing, modelling, training, evaluation, tuning, and deployment. XAI approaches should endeavour to integrate interpretability and explainability into the standard ML workflow. Feature Importance (FI) is a set of techniques that assign weightings (scores) to each feature, indicating their relative importance in a prediction or classification made by a ML model [8]. FI techniques are typically used as part of data pre-processing to enhance feature selection.
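As an illustration of how an FI stage can be appended to such a workflow without altering the preceding stages, the following minimal Python sketch (using scikit-learn, with a hypothetical feature matrix X and label vector y prepared by earlier stages) trains a model and then derives impurity-based FI scores as a separate, post-training step:

```python
# Minimal sketch: appending an FI stage to a standard ML workflow.
# X (features) and y (labels) are assumed to be a pandas DataFrame/Series
# produced by the earlier data collection and pre-processing stages.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def run_workflow_with_fi(X: pd.DataFrame, y: pd.Series):
    # Standard workflow stages: split, train, evaluate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    f1 = f1_score(y_test, model.predict(X_test))

    # Additional FI stage: computed after training, no prior stage is modified.
    fi_scores = pd.Series(model.feature_importances_, index=X.columns)
    return model, f1, fi_scores.sort_values(ascending=False)
```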
Moving towards a solution: While addressing the challenges articulated in their totality is beyond the scope of this paper, addressing the nuanced issues outlined will provide the initial steps for a more complete solution to be derived and is the primary contribution of this work. In this paper, FI techniques are utilised as a means of enabling XAI. It is envisaged that FI will provide a simple but powerful means of integrating XAI into the standard ML workflow in a domain-agnostic manner. The approach can enable the explainability of a ML model as well as the underlying workflow whilst accommodating multiple perspectives. This is realised by three proposed approaches that utilise the associations between FI scores, FI techniques, the inclusion/exclusion of features, data augmentation techniques, and performance metrics. In doing so, it enables multiple levels of explainability encapsulating the operation of the ML model with different underlying datasets in different medical settings. The explainability derived is expected to enable the clinical validation of AI-based systems, as discussed in the following sections.
The remainder of the paper is organised as follows: Section 2 presents related work with regard to XAI, FI, and their utilisation in an applied ML setting. Section 3 outlines the proposed methodology for enabling XAI in a standard ML workflow. Section 4 outlines the results of the experimental work on the approaches proposed. Section 5 presents a discussion and concluding remarks arising from the work carried out to date.
2 Related Work
Preliminary work on making ML models used in clinical domains increasingly interpretable and explainable has been initiated in [1, 5, 9, 10]. The interpretability and explainability of ML models enable ML engineers to understand, and evaluate, a model's parameters (weights/coefficients) and hyper-parameters (input size, number of layers) in relation to the model's outcomes (predictions/classifications). It can also enable medical practitioners to effectively comprehend and validate the output derived from ML models as per their medical expertise [1, 5].
There exists a variety of XAI methods that are applicable to the medical domain. Ante-hoc XAI methods achieve interpretability without an additional step, which makes them easier to adopt in existing ML workflows. They include inherently interpretable ML models such as Decision trees [11], Random Forests [11] and Generalised Additive Models (GAMs) [9]. They are typically used to achieve interpretability at the cost of lower performance scores compared to complex ML models. However, their contribution towards enabling explainability for non-CS perspectives in different domains is not extensively discussed in the literature [1].
In [12], an XAI-enabled framework to include clinical expertise in AI-based systems is proposed in an abstract format. That work also discusses the use of FI to enable the inclusion of clinical expertise when building AI-based solutions. In [11], FI scores based on Decision trees were used to analyse the importance of features in classifying cervical cancer and to achieve interpretability in the model. However, interpretability in relation to the underlying dataset was not discussed, and the utilisation of FI scores to enable explainability from the perspective of medical practitioners was not addressed. In [13], Random Forests were used to classify arrhythmia from time-series ECG data and FI scores were presented as a means of achieving interpretability. However, as the time-series ECG data has a numeric value for each sampled record, the FI scores assigned to individual time-stamped values were of limited use, as effective conclusions cannot be drawn by associating a FI score with a single amplitude value in a time-series ECG wave.
Post-hoc XAI methods are specifically designed for explainability and are applied after a ML model is trained. This makes post-hoc methods more difficult to adopt, but they are advantageous as they typically support multiple non-interpretable but performant classifiers [5]. Local Interpretable Model-agnostic Explanations (LIME) is one of the most commonly used post-hoc XAI methods; it was developed to explain the predictions of any ML classifier by calculating FI scores based on assumptions that do not always hold true across different types of classifiers [14]. Shapley values are another post-hoc XAI technique, initially introduced in game theory to represent the average expected marginal contribution of a player towards achieving a payout when all possible combinations of players are considered [15]. In XAI, Shapley values are used to assign FI scores to features (players) in achieving the predictions (payout) made by a model. In [16], LIME and Shapley FI scores were compared and it was found that the Shapley FI scores were more consistent than LIME's. This consistency was assessed on the basis of objective criteria including similarity, identity, and separability, which are important considerations when generating and providing explanations in a healthcare setting.
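As an illustration of how such a post-hoc method can be applied to an already-trained classifier, the following minimal sketch uses the lime package to produce a local, FI-style explanation for a single patient record. The function name, data split, and class labels are hypothetical placeholders rather than the exact setup used in this work:

```python
# Minimal sketch: a post-hoc local explanation with LIME for an already-trained model.
import pandas as pd
from lime.lime_tabular import LimeTabularExplainer

def explain_one_record(model, X_train: pd.DataFrame, X_test: pd.DataFrame, row: int = 0):
    # The explainer only needs the training data and a prediction function;
    # the trained model and the prior workflow stages are left unchanged.
    explainer = LimeTabularExplainer(
        X_train.values,
        feature_names=list(X_train.columns),
        class_names=["negative", "positive"],
        mode="classification")
    explanation = explainer.explain_instance(
        X_test.values[row], model.predict_proba, num_features=5)
    return explanation.as_list()  # [(feature condition, local weight), ...]
```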
3 Methodology
The workflow adopted in this work is depicted in Figure 1. It follows a standard ML workflow with the addition of an FI stage to enable post-hoc explainability. The FI scores are calculated without modifying the prior stages of the workflow and are utilised to enable the explainability of the model and the inherent workflow.
Fig. 1. Integration of the FI stage in a standard ML workflow to enable the derivation of explainability using three proposed approaches: A1, A2, and A3.
Three approaches are proposed that utilise FI scores to enable the explainability and interpretability of a ML model and the underlying dataset:
A1. Relative feature ranking: There needs to be careful validation regarding the features that are considered more or less relevant by ML models in healthcare [9]. This approach derives FI scores using two distinct methods. The first is generated using Decision tree FI scores, where the FI score of a feature is based on its position in the conditional flow of a classification process; the second is generated using Shapley values, based on weighting the feature's impact on the model's outcome. The derived FI scores are collated and sorted in descending order. This provides a high-level understanding of how a ML model ranks the different features considered while deriving an outcome.
This enables a comparison between the features that are considered important by the classification model (realised by the first method) and the features that fluctuate the ML model's outcome (realised by the second method). This enables explainability to be derived as it provides a relative ranking of the features as interpreted by a ML model along with their impact on the outcome. In the applied setting, this can be used by medical practitioners to gain an understanding of the features that are critical in formulating a medical diagnosis, highlighting features whose values cannot be ignored due to their high impact on the model's output.
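A minimal sketch of this collation step is given below, assuming a trained scikit-learn Decision tree and the shap package. The mean absolute Shapley value per feature is used here as its Shapley-based FI score; that aggregation, and the function name, are assumptions for illustration rather than the exact procedure used in this work:

```python
# Minimal sketch of A1: collate Decision-tree FI and Shapley-based FI for comparison.
import numpy as np
import pandas as pd
import shap
from sklearn.tree import DecisionTreeClassifier

def relative_feature_ranking(model: DecisionTreeClassifier, X: pd.DataFrame) -> pd.DataFrame:
    # Impurity-based FI: what the tree itself considers important.
    tree_fi = pd.Series(model.feature_importances_, index=X.columns)

    # Shapley-based FI: impact of each feature on the model's outcome.
    shap_values = shap.TreeExplainer(model).shap_values(X)
    # shap's output format varies by version: a list per class, or a 3-D array.
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    elif getattr(shap_values, "ndim", 2) == 3:
        shap_values = shap_values[:, :, 1]
    shap_fi = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)

    ranking = pd.DataFrame({"tree_fi": tree_fi, "shapley_fi": shap_fi})
    return ranking.sort_values("tree_fi", ascending=False)
```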
A2. Feature importance in different medical settings: The availability of medical information in different medical settings is not uniform (e.g. a lack of advanced medical tests in small clinics) and, therefore, the approaches followed by medical practitioners in different medical settings differ. This reduces the associated utility of AI-based solutions. A gold-standard solution should be designed to include all the relevant data while providing multiple versions to acknowledge that different healthcare facilities will have different levels of access to this data. This realisation dramatically broadens the applicability and utility of the solution as it acknowledges the inclusion and exclusion of features in different settings.
This approach demonstrates the relative change of FI scores and performance metrics based on the inclusion/exclusion of features. This enables a broader understanding of a ML model and highlights its suitability to different medical settings (e.g. a general practitioner in a clinic and an emergency room doctor in a hospital will have access to significantly different levels of data regarding an individual's health). If a ML model is trained upon a set of n features, explainability can be derived by training the model on all possible subsets (2^n) of features, which can enhance the understanding of how features are re-ranked, and performance is affected, based on the inclusion/exclusion of features.
In an applied setting, this approach is useful to medical practitioners as it aids their understanding based on the inclusion/exclusion of clinical test results or medical information with an associated performance score. This enables an informed evaluation regarding the suitability of the AI-based solution on a per-actor basis.
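A minimal sketch of this subset enumeration is shown below, assuming a scikit-learn Decision tree and a modest feature set (exhaustive enumeration of all 2^n subsets is only tractable for small n):

```python
# Minimal sketch of A2: retrain on every feature subset and record FI and F-score.
from itertools import combinations
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def evaluate_feature_subsets(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    results = []
    features = list(X.columns)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    # Enumerate the 2^n - 1 non-empty subsets of features.
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            cols = list(subset)
            model = DecisionTreeClassifier(random_state=42)
            model.fit(X_train[cols], y_train)
            results.append({
                "features": cols,
                "f1": f1_score(y_test, model.predict(X_test[cols])),
                "fi": dict(zip(cols, model.feature_importances_)),
            })
    return pd.DataFrame(results)
```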
A3. Understanding the Data: The data on which a model was trained, and how it was pre-processed, can have significant consequences in a medical setting [6, 7]. As such, this approach demonstrates the association of FI scores and performance metrics with the data augmentation techniques applied to a dataset. This association provides explainability by enabling an understanding of how differences in the underlying data impact the performance metrics and the FI scores.
In an applied setting, medical practitioners can validate the ranking of features as interpreted by a ML model trained on data augmented using different techniques and associate it with the corresponding performance metrics. This furthers the interpretability of the underlying workflow used for processing the data and can enable a better selection of augmentation techniques by incorporating clinical expertise along with the expected performance metrics.
The dataset and the modelling technique utilised for the experimental work are discussed in Sections 3.1 and 3.2 respectively. In Section 3.3, the two FI techniques used in this work, one based on Decision trees and the other based on Shapley values, are discussed.
3.1 Dataset
In this work, the "Cervical Cancer Risk Factors" dataset available from the UCI data repository is used [17]. This dataset was used in [18] to train different ML models to predict the occurrence of cervical cancer based on a person's health record. The performance of the different models was compared based on accuracy, precision, recall, and F-score (the harmonic mean of precision and recall) values [18]. That work did not address the interpretability and explainability of the ML models or the underlying workflow.
The dataset contains 36 feature attributes representing risk factors responsible for causing cervical cancer and the results of some preliminary and advanced medical tests. In the dataset, 803 out of 858 records have a negative Biopsy result while 55 have a positive result. This class-imbalance problem is addressed using Imbalanced-learn, which offers a range of data sampling techniques to balance the number of majority- and minority-class samples [19]. Table 1 denotes the number of records corresponding to negative and positive biopsy results after each sampling technique is applied.
Resampling Method                                              Sam.   0:1 Ratio
Random Over sampling (ROS)                                     1606   803:803
Adaptive Synthetic Over sampling (ASS)                         1606   803:803
Random Under sampling (RUS)                                     110   55:55
Neighbourhood cleaning Under sampling (NCUS)                    725   670:55
SMOTETomek Combination sampling (S-TOM)                        1600   800:800
SMOTE edited nearest neighbours Combination sampling (S-ENN)   1429   652:777
Table 1. Number of samples under different data sampling techniques [18]. Legend: Number of Samples (Sam.), Biopsy results ratio - Negative (0) : Positive (1).
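As an indication of how these sampled versions can be produced, the following minimal sketch applies imbalanced-learn samplers to a feature matrix X and biopsy label vector y and reports the resulting class counts. The mapping of the named techniques to specific imbalanced-learn classes (e.g. ASS to ADASYN) and their parameters are assumptions here, not a record of the exact configuration used in [18]:

```python
# Minimal sketch: producing the resampled dataset versions with imbalanced-learn.
# X (features) and y (binary Biopsy labels, 0 = negative, 1 = positive) are assumed
# to be the pre-processed dataset from the earlier workflow stages.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, ADASYN
from imblearn.under_sampling import RandomUnderSampler, NeighbourhoodCleaningRule
from imblearn.combine import SMOTETomek, SMOTEENN

SAMPLERS = {
    "ROS": RandomOverSampler(random_state=42),
    "ASS": ADASYN(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
    "NCUS": NeighbourhoodCleaningRule(),
    "S-TOM": SMOTETomek(random_state=42),
    "S-ENN": SMOTEENN(random_state=42),
}

def resample_all(X, y):
    resampled = {}
    for name, sampler in SAMPLERS.items():
        X_res, y_res = sampler.fit_resample(X, y)
        print(name, Counter(y_res))  # class counts, comparable to Table 1
        resampled[name] = (X_res, y_res)
    return resampled
```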
3.2 ML Model
Decision trees are graphs in which nodes represent sets of data samples and edges represent conditions. Each node has an associated impurity factor indicating the diversity of classes/labels in that node. A node is pure if all the data samples present in it belong to the same class/label. In a classification problem, the conditions on the edges of a Decision tree are designed to decrease the impurity. Therefore, from the root node to the leaf nodes, the impurity factor decreases and each leaf node should contain data samples that are classified under a single class/label.
Decision trees are more interpretable than complex models such as Support Vector Machines or Neural Networks [5]. They also provide sufficient performance scores on the given dataset [18]. In this work, a Decision tree was chosen as it achieves sufficient performance while retaining interpretability [1].
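To make the notion of impurity concrete, the following small sketch computes the Gini impurity of a node and the weighted impurity decrease obtained by a candidate split. Gini is one common impurity measure; the exact splitting criterion used in this work is an assumption:

```python
# Minimal sketch: Gini impurity of a node and the impurity decrease of a split.
from collections import Counter
from typing import Sequence

def gini(labels: Sequence) -> float:
    # Gini impurity: 1 - sum over classes of (class proportion)^2.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent: Sequence, left: Sequence, right: Sequence) -> float:
    # Weighted decrease in impurity achieved by splitting `parent` into two children.
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Example: a pure split of a perfectly mixed node removes all of its impurity.
parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                                     # 0.5
print(impurity_decrease(parent, [0, 0, 0], [1, 1, 1]))  # 0.5
```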
3.3 Feature Importance (FI)
FI identifies the features of a dataset that a ML model considers important for making a classification or prediction. In this paper, FI based on Decision trees and FI based on Shapley values were used. When a Decision tree is trained, FI scores can be calculated by measuring how much each feature contributes towards a decrease in impurity [20]. The FI scores obtained represent the features considered important by the Decision tree model. Shapley values can be used to generate the FI score of a feature by first calculating a model's output including and excluding that feature to obtain the contribution of that feature alone. This contribution is then weighted in the presence of all subsets of features, and the process is summed over all the subsets of features to obtain a weighted and permuted FI score. The FI scores obtained represent the impact of the different features on the model's outcome.
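The Shapley weighting described above can be written out explicitly for a small feature set. The sketch below computes an exact Shapley-style FI score by enumerating feature subsets, using a placeholder value function that retrains the model on each subset and scores its test accuracy. Production tools such as SHAP use far more efficient approximations, so this is purely illustrative and the value function is an assumption:

```python
# Minimal sketch: exact Shapley-style FI via subset enumeration (illustrative only;
# the cost grows as 2^n, so this is feasible only for a handful of features).
from itertools import combinations
from math import factorial
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def shapley_feature_importance(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    def value(cols) -> float:
        # Value function: test accuracy of a model trained on the feature subset `cols`.
        if not cols:
            return max(y_te.mean(), 1 - y_te.mean())  # majority-class baseline
        model = DecisionTreeClassifier(random_state=42).fit(X_tr[list(cols)], y_tr)
        return model.score(X_te[list(cols)], y_te)

    features, n = list(X.columns), len(X.columns)
    scores = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(subset + (f,)) - value(subset))
        scores[f] = total
    return pd.Series(scores).sort_values(ascending=False)
```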
4 Results
The main challenge in achieving explainability using FI was to present the FI scores generated after training the model with different data augmentation techniques and different feature sets in an integrated manner, such that their association with the performance metrics (e.g. accuracy, F-scores) and the relative ranking of the features can be effectively utilised in a domain-agnostic manner. This integration of information enables the explainability of both the ML model and the underlying data, thereby broadening the scope of explainability.
The three approaches outlined in Section 3 have been implemented. The value of the approaches and the derived explainability is demonstrated in this section. This is considered a contribution towards a long-term generic workflow for simplifying the integration of XAI in an applied setting such as healthcare.
A1: Relative Feature Ranking When the ML model was trained on data sampled using Random Over Sampling, the FI scores assigned to the different features were plotted as depicted in Figure 2. Random Over Sampling provided higher accuracy than the other sampling techniques [18] and as such was selected for this approach.
Fig. 2. A1: Relative feature ranking using FI scores generated by the Decision tree and Shapley based FI techniques. The two bars represent the FI scores assigned to individual features by the two FI techniques, providing the basis for the comparison.
In Figure 2, the feature Schiller Test was omitted due to its high correlation with the biopsy results, as it is an advanced medical test conducted to diagnose cervical cancer [11]. It can be observed that the feature Hinselmann is the highest-ranked feature under both FI approaches. There is a similarity between the two sets of FI scores obtained, as features considered important by the model (represented by the Decision tree based FI) will automatically have a higher impact on its outcome (represented by the Shapley based FI). The value derived from these FI scores is that a medical practitioner can understand and validate the ranking of features. This enables the incorporation of clinical expertise to improve feature engineering processes and achieve improved models for future use.
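A plot in the style of Figure 2 can be produced directly from the ranking table sketched earlier for A1. The following minimal matplotlib snippet is one possible way to render it; the DataFrame layout (columns "tree_fi" and "shapley_fi") is the assumption carried over from that sketch:

```python
# Minimal sketch: grouped bar chart of the two FI score sets (in the style of Figure 2).
import numpy as np
import matplotlib.pyplot as plt

def plot_relative_ranking(ranking):
    # `ranking` is a DataFrame indexed by feature name with columns
    # "tree_fi" and "shapley_fi", sorted in descending order of tree_fi.
    x = np.arange(len(ranking))
    width = 0.4
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(x - width / 2, ranking["tree_fi"], width, label="Decision tree FI")
    ax.bar(x + width / 2, ranking["shapley_fi"], width, label="Shapley FI")
    ax.set_xticks(x)
    ax.set_xticklabels(ranking.index, rotation=90)
    ax.set_ylabel("FI score")
    ax.legend()
    fig.tight_layout()
    plt.show()
```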
A2: FI in Different Medical Settings The random over sampled data was used to train multiple instances of the model, each time on a subset of features that excluded the highest-ranked feature from the previous instance. The resulting FI scores assigned to the individual features are indicative of the impact that the omission of the highest-ranked feature from the prior instance has on the ML model, as depicted in Figure 3.
The change in the relative ranking of the different features can be observed upon the omission of the highest-ranked feature. For instance, when all the features were present (All features), Schiller (orange segment) was given the highest importance, followed by Age (yellow segment). When the feature Schiller is omitted (second bar), Hinselmann (grey segment) was given the highest importance instead of Age. Thus, the inclusion or exclusion of a feature does not re-rank the remaining features in the ordered fashion dictated by a gold-standard approach that includes all features.
Fig. 3. A2: Impact of excluding the highest-ranked features on FI scores, F-scores, and the ordering of features based on FI score.
Furthermore, the compounded omission of the highest-ranked features (left to right) significantly reduces the total sum of the FI scores assigned to the features in each instance (from 0.7 to 0.3). This is accompanied by a reduction in performance metrics such as the F-score (denoted at the top of each bar) and indicates less accurate models due to the absence of more important features or the presence of less important features. This approach warrants the derivation of multiple instances of a single model such that the relationships among features can be fully understood and validated with clinical expertise. Based on a threshold value of the performance metrics, a medical practitioner can select a ML model that is trained with the features accessible in his/her medical setting and that assigns appropriate importance scores to the available features while generating an outcome.
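The experiment behind Figure 3 can be approximated by the following minimal sketch, which repeatedly drops the currently highest-ranked feature, retrains, and records the F-score and per-feature FI of each model instance. The function name, number of drops, and model settings are assumptions for illustration:

```python
# Minimal sketch of the A2 experiment: repeatedly drop the top-ranked feature and retrain.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def drop_top_feature_iterations(X: pd.DataFrame, y: pd.Series, n_drops: int = 5) -> pd.DataFrame:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    cols, history = list(X.columns), []
    for _ in range(min(n_drops, len(X.columns) - 1) + 1):
        model = DecisionTreeClassifier(random_state=42).fit(X_tr[cols], y_tr)
        fi = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
        history.append({
            "n_dropped": len(X.columns) - len(cols),
            "f1": f1_score(y_te, model.predict(X_te[cols])),
            "fi": fi.round(3).to_dict(),  # per-feature FI for this model instance
        })
        cols = [c for c in cols if c != fi.index[0]]  # exclude the highest-ranked feature
    return pd.DataFrame(history)
```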
A3: Understanding the Underlying Data The model was trained on the data sampled using the different sampling techniques discussed in Section 3.1, and the FI scores were plotted for each of the sampled versions as depicted in Figure 4. FI scores relating to a particular type of sampled data are assigned a particular colour. The performance metrics corresponding to each of the sampling techniques are noted in the legend.
Fig. 4. A3: Ranking of features (by the Decision tree and Shapley based FI) when using different data sampling techniques, and the corresponding performance metrics. Performance metrics are noted as Accuracy (ACC), Precision (Prec) and Recall (Rec).
This approach enables the interpretability of the underlying dataset by presenting the difference in FI scores when using data augmented via different techniques. For instance, in Figure 4, there is a lack of similarity between the FI scores associated with the under-sampling techniques (e.g. NCUS, RUS) and those of the over/combination-sampling techniques (e.g. S-TOM, ROS). The smaller the volume of data generated by an under-sampling technique, the less diverse the values of each feature. This is evident when comparing the sorted ordering of FI scores under the under-sampling techniques to that of the over/combination-sampling techniques.
As depicted in Figure 4, the under-sampled data provided lower accuracy and recall values (approximately 70-90%) compared to the over/combination-sampled data (approximately 93-97%). The association of performance metrics with FI scores enables the explainability needed to validate the suitability of datasets from a domain-specific perspective. In contrast to the over/combination-sampled data, in the under-sampled data augmented using the NCUS technique (dark-red bars), Age is assigned a higher FI score than the Cytology test, which would be considered an invalid approach as a Cytology test is a diagnostic aid with a high level of efficacy when detecting cervical cancer [21]. A medical practitioner should therefore disregard the use of the NCUS data due to its invalid FI ranking along with the low performance scores.
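The comparison behind Figure 4 can be sketched by combining the resampling helper from Section 3.1 with a per-technique training loop. The function below, assuming the SAMPLERS mapping introduced earlier, records accuracy, precision, recall, and the sorted FI ranking for each sampled version; it is a simplified stand-in for the experimental setup rather than its exact implementation:

```python
# Minimal sketch of the A3 experiment: one model per sampling technique,
# with performance metrics and the resulting FI ranking.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

def compare_sampling_techniques(X: pd.DataFrame, y: pd.Series, samplers: dict) -> pd.DataFrame:
    rows = []
    for name, sampler in samplers.items():
        # Note: resampling the full dataset before splitting is a simplification;
        # resampling only the training split avoids evaluation leakage.
        X_res, y_res = sampler.fit_resample(X, y)
        X_res = pd.DataFrame(X_res, columns=X.columns)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_res, y_res, test_size=0.2, stratify=y_res, random_state=42)
        model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        fi = pd.Series(model.feature_importances_, index=X.columns)
        rows.append({
            "technique": name,
            "accuracy": accuracy_score(y_te, y_pred),
            "precision": precision_score(y_te, y_pred),
            "recall": recall_score(y_te, y_pred),
            "fi_ranking": fi.sort_values(ascending=False).index.tolist(),
        })
    return pd.DataFrame(rows)
```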
5 Conclusions and Future Work
XAI is a crucial tool for enabling medical practitioners to understand and evaluate AI-based solutions effectively in the healthcare domain. It provides additional benefits in the form of increased confidence amongst medical practitioners in the solutions being adopted and increased exposure to the operation of those solutions.
In this paper, an alternative perspective regarding how FI scores can be integrated into a ML workflow is adopted. FI scores are used to surface pertinent information relating to the associations between features, models, and data in order to provide explainability. This perspective is realised in three distinct approaches.
A1) A model/output-based perspective with regard to the relative ranking of features; this informs the medical practitioner which features the model considers most important and which features fluctuate the model's outcome.
A2) Relative feature ranking in different medical settings; this incorporates a hierarchical perspective which considers diagnostic capacity in the form of feature inclusion/exclusion, aligning it more closely to the real world. This informs the medical practitioner how the model will perform and rank features in different medical settings, enabling a more informed interpretation of a model's operation.
A3) The impact of data augmentation approaches on the performance of a model and the validity of its FI scores in a medical setting. This informs the medical practitioner how suitable the augmented data is and how valid it is in a medical setting.
The simple but powerful nature of FI enables the three proposed approaches to be applied in a domain-agnostic manner.
It is intended to extend this work by developing a framework that automates the training and validation of models appropriate to the intended level of a hierarchy, in order to enable explainability from a multi-level perspective. The workflow comprising that hierarchy will empirically evaluate the applicability of combining XAI and recommendations to increase operational efficacy.
Acknowledgement: This publication has emanated from research co-sponsored
by McKesson and Science Foundation Ireland under Grant number SFI CRT
18/CRT/6222.
References
1. Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell.
What do we need to build explainable AI systems for the medical domain? arXiv
preprint arXiv:1712.09923, pages 1–28, 2017.
2. Benjamin P. Evans, Bing Xue, and Mengjie Zhang. What’s inside the black-box? A
genetic programming method for interpreting complex machine learning models.
GECCO 2019 - Proceedings of the 2019 Genetic and Evolutionary Computation
Conference, pages 1012–1020, 2019.
3. Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. Designing Theory-
Driven User-Centric Explainable AI. Proceedings of the 2019 CHI Conference on
Human Factors in Computing Systems - CHI ’19, pages 1–15, 2019.
4. Tim Miller. Explanation in artificial intelligence: Insights from the social sciences,
2017.
5. Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelligence (XAI): Towards medical XAI. CoRR, abs/1907.07374, 2019.
6. D Douglas Miller. The medical ai insurgency: what physicians must know about
data to practice with intelligent machines. NPJ digital medicine, 2(1):1–5, 2019.
7. Namrata Vaswani, Yuejie Chi, and Thierry Bouwmans. Rethinking PCA for modern data sets: Theory, algorithms, and applications [scanning the issue]. Proceedings of the IEEE, 106(8):1274–1276, 2018.
8. Ben Hoyle, Markus Michael Rau, Roman Zitlau, Stella Seitz, and Jochen Weller. Feature importance for machine learning redshifts applied to SDSS galaxies. Monthly Notices of the Royal Astronomical Society, 449(2):1275–1283, 2015.
9. Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730, 2015.
10. Devam Dave, Het Naik, Smiti Singhal, and Pankesh Patel. Explainable AI meets healthcare: A study on heart disease dataset. arXiv preprint arXiv:2011.03195, 2020.
11. Xiaoyu Deng, Yan Luo, and Cong Wang. Analysis of risk factors for cervical cancer
based on machine learning methods. In 2018 5th IEEE International Conference
on Cloud Computing and Intelligence Systems (CCIS), pages 631–635. IEEE, 2018.
12. Urja Pawar, Donna O'Shea, Susan Rea, and Ruairi O'Reilly. Explainable AI in healthcare. In 2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), pages 1–2. IEEE, 2020.
13. P Nisha, Urja Pawar, and Ruairi O’Reilly. Interpretable machine learning models
for assisting clinicians in the analysis of physiological data. In AICS, pages 434–
445, 2019.
14. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
15. Scott M Lundberg and Su-In Lee. A unified approach to interpreting model pre-
dictions. In Advances in neural information processing systems, 2017.
16. R. El Shawi, Y. Sherif, M. Al-Mallah, and S. Sakr. Interpretability in healthcare: A comparative study of local machine learning interpretability techniques. In 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), pages 275–280, June 2019.
17. Kelwin Fernandes, Jaime S Cardoso, and Jessica Fernandes. Transfer learning with
partial observability applied to cervical cancer screening. In Iberian conference on
pattern recognition and image analysis, pages 243–250. Springer, 2017.
18. Sean Quinlan, Haithem Afli, and Ruairi O'Reilly. A comparative analysis of classification techniques for cervical cancer utilising at risk factors and screening test results. In AICS, pages 400–411, 2019.
19. Guillaume Lemaître, Fernando Nogueira, and Christos K Aridas. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563, 2017.
20. J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
21. Anita WW Lim, Rebecca Landy, Alejandra Castanon, Antony Hollingworth, Willie Hamilton, Nick Dudding, and Peter Sasieni. Cytology in the diagnosis of cervical cancer in symptomatic young women: A retrospective review. The British Journal of General Practice, 66(653):e871–e879, December 2016.