MLHOps: Machine Learning for Healthcare Operations
Faiza Khan Khattak (a), Vallijah Subasri (a,b,c), Amrit Krishnan (a), Elham Dolatabadi (a), Deval Pandya (a), Laleh Seyyed-Kalantari (d), Frank Rudzicz (a,c,e)
(a) Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
(b) Hospital for Sick Children, Toronto, Ontario, Canada
(c) University of Toronto, Toronto, Ontario, Canada
(d) York University, Toronto, Ontario, Canada
(e) Dalhousie University, Halifax, Nova Scotia, Canada
Abstract
Machine Learning Health Operations (MLHOps) is the combination of pro-
cesses for reliable, efficient, usable, and ethical deployment and maintenance
of machine learning models in healthcare settings. This paper provides both
a survey of work in this area and guidelines for developers and clinicians
to deploy and maintain their own models in clinical practice. We cover the
foundational concepts of general machine learning operations and describe the initial setup of MLHOps pipelines (including data sources, preparation, engineering, and tools). We then describe long-term monitoring and updating
(including data distribution shifts and model updating) and ethical consid-
erations (including bias, fairness, interpretability, and privacy). This work
therefore provides guidance across the full pipeline of MLHOps from concep-
tion to initial and ongoing deployment.
Keywords: MLOps, Healthcare, Responsible AI
1. Introduction
Over the last decade, efforts to use health data to solve complex medical problems have increased significantly. Academic hospitals are increasingly
dedicating resources to bring machine learning (ML) to the bedside, and to
address issues encountered by clinical staff. These resources are being uti-
lized across a range of applications including clinical decision support, early
warning, treatment recommendation, risk prediction, image informatics, tele-
diagnosis, drug discovery, and intelligent health knowledge systems.
There are various examples of ML being applied to medical data, including
prediction of sepsis [229], in-hospital mortality, prolonged length-of-stay, pa-
tient deterioration, and unplanned readmission [208]. In particular, sepsis is
one of the leading causes of in-hospital deaths. A large-scale study demonstrated the impact of an early warning system for detecting the onset of sepsis earlier, allowing more time for clinicians to prescribe antibiotics [8]. Similarly, deep convolutional neural networks
have been shown to achieve superior performance in detecting pneumonia
and other pathologies from chest X-rays, compared to practicing radiologists
[209]. These results highlight the potential of ML models when they are
strongly integrated into clinical workflows.
When deployed successfully, data-driven models can free up time for clinicians [101], improve clinical outcomes [207], reduce costs [24], and provide improved qual-
ity care for patients. However, most studies remain preliminary, limited to
small datasets, and/or implemented in select health sub-systems. Integrat-
ing with clinical workflows remains crucial [265, 253] but, despite recent
computational advances and an explosion of health data, deploying ML in
healthcare responsibly and reliably faces several operational and engineering
challenges, including:
•Standardizing data formats,
•Strengthening methodologies for evaluation, monitoring and updating,
•Building trust with clinicians and hospital staff,
•Adopting interoperability standards, and
•Ensuring that deployed models align with ethical considerations, do not exacerbate biases, and adhere to privacy and governance policies.
In this review, we articulate the challenges involved in implementing suc-
cessful Machine Learning Health Operations (MLHOps) pipelines, specific
to clinical use cases. We begin by outlining the foundations of model de-
ployment in general, and provide a comprehensive study of the emerging
discipline [239, 157]. We then provide a detailed review of the different com-
ponents of development pipelines specific to healthcare. We discuss data,
pipeline engineering, deployment, monitoring and updating models, and eth-
ical considerations pertaining to healthcare use cases. While MLHOps involves aspects specific to healthcare, best practices and concepts from other application domains remain relevant. The primary outcome of our review is a set of recommendations for implementing MLHOps pipelines in practice – i.e., a “how-to” guide for practitioners.
2. Foundations of MLOps
Machine learning operations (MLOps) provide engineering best practices to
standardize ML system development and operations [239]. This includes in-
creasing system quality, simplifying the management process, and automat-
ing the deployment of ML and deep learning models, especially in large-scale
production environments. Figure 1 illustrates a general MLOps pipeline.
Figure 1: MLOps pipeline
2.1. MLOps Pipeline
Pipelines are sequences of processing modules that streamline the ML workflow.
Once the project is defined, the MLOps pipeline begins with identifying in-
puts and outputs relevant to the problem, cleaning and transforming data
towards representations that are useful and efficient for learning, training and
analyzing models’ performance, and deploying selected models in production
while continuing to monitor their efficacy.
Pipelines can be of two types: automated pipelines and orchestrated pipelines.
In automated pipelines, each process is automated towards a single task and,
in orchestrated pipelines, multiple automated tasks are managed and coordi-
nated in a dynamic workflow.
Recently, MLOps have become more well-defined and widely utilized due to
their reusability and standardization benefits [219]. As a result, the structure
and definitions of different components are becoming quite well-established.
In the following, we describe a set of key concepts in MLOps:
•Stores encapsulate the tools designed to centralize building, manag-
ing, and sharing either features or models across different teams and
applications in an organization.
–A Feature Store is a feature-reusability tool that provides services such as scalable and robust data transformation, and the storage, registration, and serving of features for training and inference tasks.
–An ML Metadata Store helps record and retrieve metadata associated with an ML pipeline, including information about various pipeline components, their executions (e.g., training runs), and resulting artifacts (e.g., trained models).
•Serving is the task of hosting ML artifacts (usually models) either
on the cloud or on premises so that their functions are accessible to
multiple applications through remote function calls (i.e., application
programming interfaces (APIs)).
–In batch serving, the artifact is used by scheduled jobs.
–In online serving, the artifact processes requests in real time (a minimal serving sketch follows this list). Communication and access-point channels, traffic management, pre- and post-processing of requests, and performance monitoring should be considered while serving artifacts.
•Containerization involves packaging models with the components re-
quired to run them, including libraries and frameworks, so that they
can run in isolated user spaces with minimal configurations to the un-
derlying operating system [79]. Sometimes, source code is also included
in these containers.
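To make online serving concrete, the following is a minimal sketch (not a prescribed MLOps implementation) of exposing a serialized model artifact through a REST API. The web framework (FastAPI), artifact file name, and feature schema are illustrative assumptions rather than requirements.

```python
# Minimal online-serving sketch: a serialized model behind a REST endpoint.
# Framework, file name, and schema are hypothetical choices for illustration.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")          # previously trained artifact


class Features(BaseModel):
    values: list[float]                      # pre-processed feature vector


@app.post("/predict")
def predict(features: Features) -> dict:
    # The served artifact handles one request at a time, in real time.
    score = float(model.predict_proba([features.values])[0, 1])
    return {"risk_score": score}

# Run with, e.g.: uvicorn serve:app --port 8000
```

Such a service would typically be packaged into a container image together with its dependencies so that it can be deployed identically across environments.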
2.2. Levels of MLOps maturity
MLOps practices can be divided into different levels based on the maturity
of the ML system automation process [109, 239], as described below.
•Level 0 – Manual ML pipeline: Every step in the ML pipeline – including data processing, model building, evaluation, and deployment – is manual. In Level 0, the experimental and operational pipelines are
distinct and the data scientists provide a trained model as an artifact
to the engineering team to deploy on their infrastructure. Here, only
the trained model is served for deployment and there are infrequent
model updates. Level 0 processes usually lack rigorous and continuous
performance monitoring capabilities.
•Level 1 – Continuous Model Training and Delivery: Here, the
entire ML pipeline is automated to perform continuous training of the
model as well as continuous delivery of model prediction service. Soft-
ware orchestrates the execution and transition between the steps in
the pipeline, leading to rapid iteration over experiments and an auto-
matic process for deploying a selected model into production. Contrary
to Level 0, the entire training pipeline is automated, and the deployed
model can incorporate newer data based on pipeline triggers. Given the
automated nature of Level 1, it is necessary to continuously monitor,
evaluate, and validate models and data to ensure expected performance
during production.
•Level 2 – Continuous Integration and Continuous Delivery: This level involves the highest maturity in automation, enforcing the combined practice of continuous integration and continuous delivery, which enables rapid and reliable updates of the pipelines in production. Through automated testing and deployment of new pipeline implementations, rapid changes in data and the business environment can be addressed. In this level, the pipeline and its components are built, tested, and packaged when new code is committed or pushed to the source code repository. Moreover, the system continuously delivers new pipeline implementations to the target environment, which in turn deliver prediction services from the newly trained model.
3. MLHOps Setup
Operationalizing ML models in healthcare differs in important ways from other application
domains. Decisions made in clinical environments have a direct impact on
patient outcomes and, hence, the consequences of integrating ML models into
health systems need to be carefully controlled. For example, early warning
systems might enable clinicians to prescribe treatment plans with increased
lead time [101]; however, these systems might also suffer from a high false
alarm rate, which could result in alarm fatigue and possibly worse outcomes.
The requirements placed on such ML systems are therefore very high and,
if they are not adequately satisfied, the result is diminished adoption and
trust from clinical staff. Rigorous long-term evaluation is needed to validate
the efficacy and to identify and assess risks, and this evaluation needs to be
reported comprehensively and transparently [252].
While most MLOps best practices extend to healthcare settings, the data,
competencies, tools, and model evaluation differ significantly [169, 162, 243,
14]. For example, the performance metrics considered most informative (e.g., positive predictive value and F1-score) may differ between clinicians and engineers. Therefore, unlike
in other industries, it becomes necessary to evaluate physician experience
when predictions and model performance are presented to clinical staff [259].
In order to build trust in the clinical setting, the interpretability of ML
models is also exceptionally important. As more ML models are integrated
into hospitals, new legal frameworks and standards for evaluation need to be
adopted, and MLHOps tools need to comply with existing standards.
In the following sections, we explore the different components of MLHOps
pipelines.
3.1. Data
Successfully digitizing health data has resulted in a prodigious increase in the
volume and complexity of patient data collected [208]. These datasets are
now stored, maintained, and processed by hospital IT infrastructure systems
which in turn use specialized software systems.
3.1.1. Data sources
There could be multiple sources of data, which are categorized as follows:
Electronic health records (EHRs) record, analyze, and present information
to clinicians, including:
1. Patient demographic data: E.g., age and sex.
2. Administrative data: E.g., treatment costs and insurance.
3. Patient observation records: E.g., chart events such as lab tests and vitals. These include a multitude of physiological signals, such as heart rate, blood pressure, skin temperature, and respiratory rate, captured using various methods.
4. Interventions: These are steps that significantly alter the course of
patient care, such as mechanical ventilation, dialysis, or blood transfu-
sions.
5. Medications information: E.g., medications administered and their
dosage.
6. Waveform data: This digitizes physiological signals collected from
bedside patient monitors.
7. Imaging reports and metadata: E.g., CT scans, MRI, ultrasound,
and corresponding radiology reports.
8. Medical notes: These are made by clinical staff on patient condition.
These can also be transcribed text of recorded interactions between the
patient and clinician.
Other sources of health data include primary care data, wearable data (e.g.,
smartwatches), genomics data, video data, surveys, medical claims, billing
data, registry data, and other patient-generated data [206, 26, 40].
Figure 2 illustrates the heterogeneous nature of health data. The stratifica-
tion shown can be extended further to contain more specialized data. For
example, genomics data can be further stratified into different types of data
based on the method of sequencing; observational EHR data can be further
stratified to include labs, vital measurements, and other recorded observa-
tions.
With such large volumes and variability in data, standardization is key to
achieve scalability and interoperability. Figure 3 illustrates the different lev-
els of standardization that need to be achieved with respect to health data.
Figure 2: Stratification of health data into EHR data (demographic, waveform, administrative (billing, insurance), observations, notes, medications, interventions, and imaging such as X-rays, MRIs, CT scans, ultrasound, and pathology), omics data (genomics, transcriptomics, proteomics, epigenomics, metabolomics), sensor data (wearables, ambient sensors), and other data (pharmacological, immune, microbiome). Further levels of stratification can be added as the data becomes richer; for example, observational EHR data could include labs, vital measurements, and other recorded observations.
3.1.2. Common Data Model (CDM)
Despite the widespread adoption of EHR systems, clinical events are not cap-
tured in a standard format across observational databases [185]. For effective
research and implementation, data must be drawn from many sources and
compared and contrasted to be fully understood.
Databases must also support scaling to large numbers of records which can
be processed concurrently. Hence, efficient storage systems along with com-
putational techniques are needed to facilitate analyses. One of the first steps
towards scalability is to transform the data to a common data standard.
Once available in a common format, the process of extracting, transforming,
and loading (ETL) becomes simplified. In addition to scale, patient data
require a high level of protection, with strict data use agreements and access control.

Figure 3: The hierarchy of standardization that common data models and open standards for interoperability address. The lowest level concerns standardizing variable names, such as lab test names, medications, and diagnosis codes, as well as the data types used to store these variables (i.e., integer vs. character). The next level concerns abstract concepts under which data can be mapped and grouped. The top level concerns data exchange formats (e.g., JSON, XML) along with protocols for information exchange, such as supported RESTful API architectures; this level addresses interoperability and how data can be exchanged across sites and EHR systems.

A common data model addresses these challenges by allowing
for downstream functional access points to be designed independent of the
data model. Data that is available in a common format promotes collabora-
tion and mitigates duplicated effort. Specific implementations and formats
of data should be hidden from users, and only high-level abstractions need
to be visible.
The Systematized Nomenclature of Medicine (SNOMED) was among the
first efforts to standardize clinical terminology, and a corresponding dic-
tionary with a broad range of clinical terminology is available as part of
SNOMED-CT [61]. Several data models use SNOMED-CT as part of their
core vocabulary. Converting datasets to a common data model like the Ob-
servational Medical Outcomes Partnership (OMOP) model involves mapping
from a source database to the target common data model. This process
is usually time-consuming and involves a lot of manual effort undertaken
by data scientists. Tools to simplify the mapping and conversion process
can save time and effort and promote adoption. For OMOP, the ATLAS
tool [185] developed by Observational Health Data Sciences and Informatics
(OHDSI) provides such a feature through their web based interactive analysis
platform.
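As a concrete illustration of working with data in a common format, the following is a minimal sketch of extracting a cohort from an OMOP-formatted database. Table and column names follow the OMOP CDM conventions (person, condition_occurrence, concept); the database file and the condition filter are hypothetical.

```python
# Minimal sketch: query an OMOP CDM extract for a hypothetical sepsis cohort.
# sqlite3 stands in for whichever database driver the site actually uses.
import sqlite3
import pandas as pd

query = """
SELECT p.person_id,
       p.year_of_birth,
       c.concept_name           AS condition,
       co.condition_start_date
FROM   condition_occurrence co
JOIN   person  p ON p.person_id  = co.person_id
JOIN   concept c ON c.concept_id = co.condition_concept_id
WHERE  c.concept_name LIKE '%sepsis%'
"""

with sqlite3.connect("omop_cdm.db") as conn:     # hypothetical local extract
    cohort = pd.read_sql_query(query, conn)

print(cohort.head())
```

Because the tables and vocabularies are standardized, the same query can be reused across sites that have mapped their data to the OMOP CDM.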
3.1.3. Interoperability and open standards
As the volume of data grows in healthcare institutions and applications in-
gest data for different use cases, real-time performance and data management
is crucial. To enable real-time operation and easy exchange of health data
across systems, an interoperability standard for data exchange along with
protocols for accessing data through easy-to-use programming interfaces is
necessary. Some of the popular healthcare data standards include Health
Level 7 (HL7), Fast Healthcare Interoperability Resources (FHIR), Health
Level 7 v2 (HL7v2), and Digital Imaging and Communications in Medicine
(DICOM).
The FHIR standard [27] is a leading open standard for exchanging health
data. FHIR is developed by Health Level 7 (HL7), a not-for-profit stan-
dards development organization that was established to develop standards
for hospital information systems. FHIR defines the key entities involved
in healthcare information exchange as resources, where each resource is a
distinct identifiable entity. FHIR also defines APIs which conform to the
representational state transfer (REST) architectural style for exchanging re-
sources, allowing for stateless Hypertext Transfer Protocol (HTTP) methods,
and exposing directory-structure like URIs to resources. RESTful architec-
tures are light-weight interfaces that allow for faster transmission, which is
more suitable for mobile devices. RESTful interfaces also facilitate faster
development cycles because of their simple structure.
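As a brief illustration of the RESTful access pattern that FHIR defines, the sketch below reads a Patient resource and searches for related Observations using the Python requests library. The server base URL and patient id are placeholders; the resource and search-parameter names (Patient, Observation, patient, code) follow the FHIR specification, and LOINC code 8867-4 denotes heart rate.

```python
# Minimal sketch of RESTful FHIR reads; endpoint and patient id are placeholders.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/baseR4"   # hypothetical endpoint
HEADERS = {"Accept": "application/fhir+json"}

# Read a single Patient resource by its logical id.
patient = requests.get(f"{FHIR_BASE}/Patient/123", headers=HEADERS).json()

# Search for that patient's heart-rate Observations (LOINC 8867-4).
bundle = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "123", "code": "http://loinc.org|8867-4"},
    headers=HEADERS,
).json()

for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs.get("effectiveDateTime"), obs["valueQuantity"]["value"])
```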
DICOM is the standard for the communication and management of medical
imaging information and related metadata. The DICOM standard specifies
the format and protocol for exchange of digital information between medical
imaging equipment and other systems. Persistent information objects which
encode images are exchanged and an instance of such an information object
may be exchanged across many systems and many organizational contexts,
and over time. DICOM has enabled deep collaboration and standardization across disciplines such as radiology, cardiology, pathology, and ophthalmology.
3.1.4. Quality assurance and validation
Data collected in retrospective databases for analysis and ML use cases need
to be checked for quality and consistency. Data validation is an important
step towards ensuring that ML systems developed using the data are highly
performant, and do not incorporate biases from the data. Errors in data
propagate through the MLOps pipeline and hence specialized data quality
assurance tools and checks at various stages of the pipeline are necessary
[213]. A standardized data validation framework that includes i) data ele-
ment pre-processing, ii) checks for completeness, conformance, and plausi-
bility, and iii) a review process by clinicians and other stakeholders should
capture generalizable insight across various clinical investigations [228].
3.2. Pipeline Engineering
Data stored in raw formats need to be processed to create feature represen-
tations for ML models. Each transformation is a computation, and a chain of these processing elements, arranged so that the output of each element is the input of the next, constitutes a pipeline [125]; pipeline engineering is the use of software tools and workflow practices that enable such pipelines.
There are advantages to using such a pipeline approach, including:
•Modularization: By breaking the chain of transformations into small
steps, modularization is naturally achieved.
•Testing: Each transformation step can be tested independently, which
facilitates quality assurance and testing.
•Debugging: Version controlling the outputs at each step makes it eas-
ier to ensure reproducibility, especially when many steps are involved.
•Parallelism: If any step in the pipeline is easily parallelizable across
multiple compute nodes, the overall processing time can be reduced.
•Automation: By breaking a complex task into a series of smaller
tasks, the completion of each task can be used to trigger the start of
the next task, and this can be automated using continuous integration
tools such as Jenkins, GitHub Actions, and GitLab CI.
In health data processing, the following steps are crucial:
1. Cleaning: Formatting values, adjusting data types, checking and fix-
ing issues with raw data.
2. Encoding: Computing word embeddings for clinical text, encoding the
text and raw values into embeddings [118, 12]. Encoding is a general
transformation step that can be used to create vector representations of
raw data. For example, transforming images to numeric representations
can also be considered to be encoding.
3. Aggregation: Grouping values into buckets, e.g., for aggregating mea-
surements into fixed time-intervals, or grouping values by patient ID.
4. Normalization: Normalizing values into standard ranges or using
statistics of the data.
5. Imputation: Handling missing values in the data. For various clinical
data, ‘missingness’ can actually provide valuable contextual informa-
tion about the patient’s health and needs to be handled carefully [42].
Multiple data sources such as EHR data, clinical notes and text, imaging
data, and genomics data can be processed independently to create features
and they can be combined to be used as inputs to ML models. Hence, com-
posing pipelines of these tasks facilitates component reusability [106]. Fur-
thermore, since the ML development life-cycle constitutes a chain of tasks,
the pipelining approach becomes even more desirable. Some of the high
level tasks in the MLHOps pipeline include feature creation, feature selec-
tion, model training, evaluation, and monitoring. Evaluating models across
different slices of data, hyper-parameters, and other confounding variables is
necessary for building trust.
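As a minimal, hypothetical sketch of how the processing steps above (cleaning, encoding, aggregation, normalization, and imputation) might be composed into a reusable pipeline, the example below uses scikit-learn and pandas; the column names, bucket size, and imputation strategies are illustrative assumptions, not recommendations.

```python
# Minimal preprocessing-pipeline sketch; column names and choices are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["heart_rate", "systolic_bp", "resp_rate"]    # hypothetical vitals
categorical_cols = ["sex", "admission_type"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),            # Imputation
    ("scale", StandardScaler()),                             # Normalization
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),      # Encoding
])

preprocess = ColumnTransformer([
    ("numeric", numeric_steps, numeric_cols),
    ("categorical", categorical_steps, categorical_cols),
])


def aggregate_hourly(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregation: bucket chart events into hourly means per patient."""
    return (events.set_index("charttime")
                  .groupby("patient_id")
                  .resample("1h")
                  .mean(numeric_only=True)
                  .reset_index())
```

Each step can then be versioned and tested independently, and the fitted preprocessing transformer can be reused for both training and inference.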
Table 7 lists popular open-source tools and packages specific to health data
and ML processing. These tools are at different stages of development and
maturity. Some examples of popular tools include MIMIC-Extract [260],
Clairvoyance [106] and CheXstray [234].
3.3. Modelling
At this stage, the data has been collected, cleaned, and curated, ready to be
fed to the ML model to accomplish the desired task. The modelling phase in-
volves choosing the available models that fit the problem, training & testing
the models, and choosing the model with the best performance & reliabil-
ity guarantees. Given the existence of numerous surveys summarizing machine learning and deep learning algorithms for general healthcare scenarios [67, 1], as well as specific use cases such as brain tumor detection [15], COVID-19 prevention [23], and clinical text representation [118], we omit this
discussion and let the reader explore the surveys relevant to their prediction
problem.
3.4. Infrastructure and System
Hospitals typically use models developed by their EHR vendor which are
deployed through the native EHR vendor configuration. Often, inference is
run locally or in a cloud instance, and the model outputs are communicated
within the EHR [115]. Predominantly, these models are pre-trained and
sometimes fine-tuned on the specific site’s data.
A feature store is an ML-specific data system used to centralize storage, pro-
cessing, and access to frequently used features, making them available for
reuse in the development of future machine learning models. Feature stores
operationalize and streamline the input, tracking, and governance of the data
as part of feature engineering for machine learning [125].
To ensure reliability, the development, staging, and production environments
are separated and have different requirements. The staging and production
environments typically consist of independent virtual machines with ade-
quate compute and storage, along with reliable and secure connections to
the databases.
The infrastructure and software systems also have to follow and comply with
cybersecurity, medical software design and software testing standards [59].
3.4.1. Roles and Responsibilities
Efficient and successful MLHOps requires a collaborative, interdisciplinary
team across a range of expertise and competencies commonly found in data
science, ML, software, operations, production engineering, medicine, and pri-
vacy capabilities [125]. Similar to general MLOps practices, data and ML
scientists, data, DevOps, and ML engineers, solution and data architects,
ML and software fullstack developers, and project managers are needed. In
addition, the following roles, which are distinct to healthcare, are required (for more general MLOps roles see Table 5):
•Health AI Project Managers: Responsibilities include planning projects, establishing guidelines, milestone tracking, managing risk, supporting the teams, and governing partnerships with collaborators from other health organizations.
•Health AI Implementation Coordinator: Liaison who engages with key stakeholders to facilitate the implementation of clinical AI systems.
•Healthcare Operations Manager: Oversees and coordinates quality
management, resource management, process improvement, and patient
safety in clinical settings like hospitals.
•Clinical Researchers & Scientists: Domain experts that provide
critical domain-specific knowledge relevant to model development and
implementation.
•Patient-Facing Practitioners: Responsibilities include providing sys-
tem requirements, pipeline usage feedback, and perspective about the
patient experience (e.g. clinicians, nurses).
•Ethicists: Provide support regarding the ethical implications of clinical AI systems.
•Privacy Analysts: Provide assessments regarding privacy concerns pertaining to the usage of patient data.
•Legal Analysts: Work closely with privacy analysts and ethicists to evaluate the legal vulnerabilities of clinical AI systems.
3.5. Reporting Guidelines
Many clinical AI systems do not meet reporting standards because of a fail-
ure to assess for poor quality or unavailable input data, insufficient analysis
of performance errors, or a lack of information regarding code or algorithm
availability [198]. Systematic reviews of clinical AI systems suggest there is a
substantial reporting burden, and additions regarding reliability and fairness
can improve reporting [154]. As a result, guidelines informed by challenges
in existing AI deployments in health settings have become imperative [52].
Reporting guidelines including CONSORT-AI [148], DECIDE-AI [252], and
SPIRIT-AI [215] were developed by a multidisciplinary group of international
experts using the Delphi process to ensure complete and transparent report-
ing of randomized clinical trials (RCT) that evaluate interventions with an
AI model. Broadly these guidelines suggest inclusion of the following criteria
[59]:
•Intended use: Inclusion of the medical problem and context, current
standard practice, intended patient population(s), how the AI system
will be integrated into the care pathway, and the intended patient out-
comes.
•Patient and user recruitment: Well-defined inclusion and exclusion
criteria.
•Data and outcomes: The use of a representative patient population,
data coding and processing, missing- and low-quality data handling,
and sample size considerations.
•Model: Inclusion of inputs, outputs, training, model selection, param-
eter tuning, and performance.
•Implementation: Inclusion of user experience with the AI system,
user adherence to intended implementation, and changes to clinical
workflow.
•Modifications: A protocol for describing changes made, the timing and rationale for modifications, and outcome changes after each modification.
•Safety and errors: Identification of system errors and malfunctions,
anticipated risks and mitigation strategies, undesirable outcomes, and
worst-case scenarios.
•Ethics and fairness: Inclusion of subgroup analyses, and fairness
metrics.
•Human-computer agreement: Report of user agreement with the
AI system, reasons for disagreement, and cases of users changing their
mind based on the AI system.
•Transparency: Inclusion of data and code availability.
•Reliability: Inclusion of uncertainty measures, and performance against
realistic baselines.
•Generalizability: Inclusion of measures taken to reduce overfitting,
and external performance evaluations.
Table 1: MLOps tools

Category | Description | Tooling Examples
Model metadata storage and management | Section 3.1 | MLflow, Comet, Neptune
Data and pipeline versioning | Section 3.2 | DVC, Pachyderm
Model deployment and serving | Section 3.3 | DEPLOYR [54], Flyte, ZenML
Production model monitoring | Section 4 | Metaflow, Kedro, Seldon Core
Run orchestration and workflow pipelines | Orchestrating the execution of preprocessing, training, and evaluation pipelines; Sections 3.4 & 3.5 | Kubeflow, Polyaxon, MLRun
Collaboration tools | Setting up an MLOps pipeline requires collaboration between different people; Section 3.4.1 | ChatOps, Slack, Trello, GitLab, Rocket.Chat
3.5.1. Tools and Frameworks
Understanding the MLOps pipeline and required expertise is just the first
step to addressing the problem. Once this has been accomplished, it is nec-
essary to create and/or adopt appropriate tooling for transforming these
principles into practice. MLOps tools fall into several broad categories, with different tools automating different phases of the workflows involved in MLOps processes. A compiled list of tools within each category is shown in Table 1.
4. MLHOps Monitoring and Updating
Once an MLHOps pipeline and the required resources are set up and deployed,
robust monitoring protocols are crucial to the safety and longevity of clinical
AI systems. For example, inevitable updates to a model can introduce var-
ious operational issues (and vice versa), including bias (e.g., a new hospital
policy that shifts the nature of new data) and new classes (e.g., new subtypes
in a disease classifier) [274]. Incorporating expert labels can improve model
performance; however, the time, cost, and expertise required to acquire ac-
curate labels for very large imaging datasets like those used in radiology- or
histology-based classifiers makes this difficult [129].
As a result, there exist monitoring frameworks with policies to determine
when to query experts for labels [286]. These include:
•Periodic Querying, a non-adaptive policy whereby labels are period-
ically queried in batches according to a predetermined schedule;
•Request-and-Reverify which sets a predetermined threshold for drift
and queries a batch of labels whenever the drift threshold is exceeded
[275];
•MLDemon which follows a periodic query cycle and uses a linear
estimate of the accuracy based on changes in the data [83].
4.1. Time-scale windows
Monitoring clinical AI systems requires evaluating robustness to temporal
shifts. Since the time-scale used can change the types of shifts detected (i.e.,
gradual versus sudden shifts), multiple time windows should be considered
(e.g., week, month). Moreover, it is important to use both 1) cumulative statistics, which are computed over a single window that grows over time and are refreshed at the start of each new period, and 2) sliding statistics, which are computed over a moving window that retains recent data while incorporating new data.
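The following is a minimal sketch, on synthetic monitoring data, of the difference between sliding and cumulative statistics for a monitored metric such as AUROC; the window length and metric are illustrative choices.

```python
# Contrast sliding vs. cumulative evaluation windows on a synthetic prediction log.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
log = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=5000, freq="h"),
    "y_true": rng.integers(0, 2, 5000),
    "y_score": rng.random(5000),
}).set_index("timestamp")


def auroc(df: pd.DataFrame) -> float:
    return roc_auc_score(df["y_true"], df["y_score"])


window = "30D"   # one choice of time scale; repeat with "7D", "90D", etc.

# Sliding statistic: each window is evaluated on its own data only.
sliding = log.groupby(pd.Grouper(freq=window)).apply(auroc)

# Cumulative statistic: all data from deployment start up to each window's end.
window_ends = sliding.index + pd.Timedelta(window)
cumulative = pd.Series([auroc(log.loc[:end]) for end in window_ends],
                       index=sliding.index)

print(pd.DataFrame({"sliding": sliding, "cumulative": cumulative}))
```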
4.2. Appropriate metrics
It is critical to choose evaluation and monitoring metrics optimal for each
clinical context. The quality of labels is highly dependent on the data from
which they are derived and, as such, can possess inherent biases. For in-
stance, sepsis labels derived from incorrect billing codes will inherently have
a low positive predictive value (PPV). Moreover, clinical datasets are often
imbalanced, consisting of far fewer positive instances of a label than negative
ones. As a result, measures like accuracy that weigh positive and negative
labels equally can be detrimental to monitoring. For instance, in the context
of disease classification, it may be particularly important to monitor sensi-
tivity, in contrast to more time-sensitive clinical scenarios like the intensive
care unit (ICU) where false positives (FP) can have critical outcomes [17].
4.3. Detecting data distribution shift
Data distribution shift occurs when the underlying distribution of the train-
ing data used to build an ML model differs from the distribution of data
applied to the model during deployment [204]. When the difference between
the probability distributions of these data sets is sufficient to deteriorate the
model’s performance, the shift is considered malignant.
In healthcare, there are multiple sources of data distribution shifts, many of
which can occur concurrently [71]. Common occurrences of malignant shifts
include differences attributed to:
•Institution - These differences can arise when comparing teaching to
non-teaching hospitals, government-owned to private hospitals, or gen-
eral to specialized hospitals (e.g., paediatric, rehabilitation, trauma).
These institutions can have differing local clinical practices, resource
allocation schemes, medical instruments, and data-collection and pro-
cessing workflows that can lead to downstream variation. This has
previously been reported in Pneumothorax classifiers when evaluated
on external institutions [121].
•Behaviour - Temporal changes in behaviour at the systemic, physi-
cian and patient levels are unavoidable sources of data drift. These
changes include new healthcare reimbursement incentives, changes in
the standard-of-care in medical practice, novel therapies, and updates
to hospital operational processes. An example of this is the COVID-19
pandemic, which required changes in resource allocation to cope with
hospital bed shortages [123, 191].
•Patient demographics - Differences in factors like age, race, gender,
religion, and socioeconomic background can arise for various reasons
including epidemiological transitions, gentrification of neighbourhoods
around a health system, and new public health and immigration poli-
cies. Distribution shifts due to demographic differences can dispropor-
tionately deteriorate model performance in specific patient populations.
For instance, although Black women are more likely to develop breast
tumours with poor prognosis, many breast mammography ML classi-
fiers experience deterioration in performance on this patient population
[271]. Similarly, skin-lesion classifiers trained primarily on images of
lighter skin tones may show decreased performance when evaluated on
images of darker skin tones [9, 63].
•Technology - Data shifts can be attributed to changes in technology
between institutions or over time. This includes chest X-ray classifiers
trained on portable radiographs that are evaluated on stationary radiographs, or deterioration of clinical AI systems across EHR systems (e.g., Philips CareVue vs. MetaVision) [178].
Although evaluated differently, data shifts are present across various modal-
ities of clinical data such as medical images [91] and EHR data [64, 191].
To effectively prevent these malignant shifts, it is necessary to perform prospective evaluation of clinical AI systems [289] to understand the circumstances under which they arise, and to design strategies that mitigate model biases and improve models for future iterations [277]. Broadly, these data shifts can be categorized into three groups
which can co-occur or lead to one another:
4.3.1. Covariate Shift
Covariate shift is a difference in the distribution of input variables between
source and target data. It can occur due to a lack of randomness, inadequate
sampling, biased sampling, or a non-stationary environment. This can be
at the level of a single input variable (i.e. feature shift) or a group of input
features (i.e., dataset shift). Table 2 lists commonly used methods for covariate shift detection.
Feature Shift Detection: Feature shift refers to the change in distribu-
tion between the source and target data for a single input feature. Feature
shift detection can be performed using two-sample univariate tests such as
the Kolmogorov-Smirnov (KS) test [205]. Publicly available tools like Ten-
sorFlow Extended (TFX) apply univariate tests (i.e., L-infinity distance for
categorical variables, Jensen-Shannon divergence for continuous variables) to
perform feature shift detection between training and deployment data and
provide users with summary statistics (Table 3). It is also possible to de-
tect feature shift while conditioning on the other features in a model using
conditional distribution tests [126].
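As a minimal sketch of univariate feature-shift detection, the example below applies the two-sample Kolmogorov-Smirnov test to a synthetic reference (training) window and a deployment window; the feature, sample sizes, and significance threshold are illustrative.

```python
# Two-sample KS test between a reference window and a deployment window.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=1.8, scale=0.6, size=5000)   # e.g., training-era lab values
deployment = rng.normal(loc=2.3, scale=0.6, size=800)   # recent, shifted values

stat, p_value = ks_2samp(reference, deployment)
if p_value < 0.01:                                       # threshold for illustration only
    print(f"Feature shift detected (KS statistic={stat:.3f}, p={p_value:.2e})")
```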
Dataset Shift Detection: Dataset shift refers to the change in the joint
distribution between the source and target data for a group of input features.
Multivariate testing is crucial because the input to ML models typically consists of more than one variable and multiple modalities. To test whether the distribution of the target data has drifted from the source data, two main approaches exist: 1) two-sample testing and 2) classifiers. These approaches often work better on low-dimensional data than on high-dimensional data; therefore, dimensionality reduction is typically applied first [205]. For instance, variational autoencoders (VAE) have been used to reduce
chest X-ray images to a low-dimensional space prior to two-sample testing
[234]. In the context of medical images, including chest X-rays [201] [276],
diabetic retinopathies [36], and histology slides [235], classifier methods have
proven effective. For EHR data, dimensionality reduction using clinically
meaningful patient representations has improved model performance [178].
For clinically relevant drift detection, it is important to ensure that drift
metrics correlate well with ground truth performance differences.
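The following is a minimal sketch of the classifier-based approach on synthetic data: reduce dimensionality, then train a "domain classifier" to distinguish source from target samples. An AUROC near 0.5 indicates no detectable shift, while values well above 0.5 indicate dataset shift; the data, dimensionality, and models are illustrative.

```python
# Domain-classifier two-sample test after dimensionality reduction (synthetic data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
source = rng.normal(size=(2000, 50))               # training-time feature matrix
target = rng.normal(loc=0.3, size=(500, 50))       # deployment-time feature matrix

X = np.vstack([source, target])
domain = np.concatenate([np.zeros(len(source)), np.ones(len(target))])

X_reduced = PCA(n_components=10).fit_transform(X)  # dimensionality reduction first
auc = cross_val_score(LogisticRegression(max_iter=1000),
                      X_reduced, domain, cv=5, scoring="roc_auc").mean()
print(f"Domain-classifier AUROC: {auc:.3f}")       # ~0.5 means indistinguishable
```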
4.3.2. Concept Shift
Concept shift is a difference in the relationship between the input variables and the outcome (i.e., the conditional distribution of the outcome given the inputs) between the source and target data. In healthcare,
concept shift can arise due to changes in symptoms for a disease or antigenic
drift. This has been explored in the context of surgery prediction [28] and
medical triage for emergency and urgent care [103].
Concept Shift Detection: There are three broad categories of concept
shift detection based on their approach.
1. Distribution techniques which use a sliding window to divide the
incoming data streams into windows based on data size or time interval
and that compare the performance of the most recent observations
with a reference window [77]. ADaptive WINdowing (ADWIN), and its
extension ADWIN2, are windowing techniques which use the Hoeffding
bound to examine the change between the means of two sufficiently large subwindows [99].
2. Sequential Analysis strategies use the Sequential Probability Ratio Test (SPRT) as the basis for their change-detection algorithms. A well-known algorithm is CUSUM, which outputs an alarm when the mean of the incoming data deviates significantly from zero [25].
3. Statistical Process Control (SPC) methods track changes in the online error rate of classifiers and trigger an update process when there is a statistically significant change in error rate [153]. Some common SPC methods include: Drift Detection Method (DDM), Early Drift Detection Method (EDDM), and Local Drift Detection (LLDD) [20].

Method | Shift | Test Type
L-infinity distance | Feature (c) | 2-ST
Cramér-von Mises | Feature (c) | 2-ST
Fisher's Exact Test | Feature (c) | 2-ST
Chi-Squared Test | Feature (c) | 2-ST
Jensen-Shannon divergence | Feature (n) | 2-ST
Kolmogorov-Smirnov [164] | Feature (n) | 2-ST
Feature Shift Detector [126] | Feature | Model
Maximum Mean Discrepancy (MMD) [86] | Dataset | 2-ST
Least Squares Density Difference [32] | Dataset | 2-ST
Learned Kernel MMD [145] | Dataset | 2-ST
Context Aware MMD [51] | Dataset | 2-ST
MMD Aggregated [226] | Dataset | 2-ST
Classifier [151] | Dataset | Model
Spot-the-diff [108] | Dataset | Model
Model Uncertainty [230] | Dataset | Model
Mahalanobis distance [212] | Dataset | Model
Gram matrices [192, 224] | Dataset | Model
Energy Based Test [147] | Dataset | Model
H-Divergence [285] | Dataset | Model
Table 2: Covariate shift detection methods. c: categorical; n: numeric; 2-ST: two-sample test.
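As a concrete illustration of the windowing approach (category 1 above), the sketch below uses the ADWIN implementation from the open-source river library (listed in Table 3) to monitor a synthetic stream of per-prediction error indicators; the stream, error rates, and attribute names (which reflect recent river versions) are illustrative assumptions.

```python
# ADWIN concept-drift detection on a synthetic error stream using river.
import numpy as np
from river import drift

rng = np.random.default_rng(0)
# Error indicators: the error rate jumps from ~10% to ~40% halfway through.
stream = np.concatenate([rng.binomial(1, 0.1, 1000),
                         rng.binomial(1, 0.4, 1000)])

detector = drift.ADWIN()
for i, error in enumerate(stream):
    detector.update(int(error))
    if detector.drift_detected:      # called `change_detected` in older versions
        print(f"Drift detected at sample {i}")
```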
4.3.3. Label Shift
Label shift is a difference in the distribution of the outcome (class) variable between the source and target data. Label shift may appear when some con-
cepts are under-sampled or over-sampled in the target domain compared to
the source domain. Label shift arises when class proportions differ between
the source and target, but the feature distributions of each class do not. For
instance, in the context of disease diagnosis, a classifier trained to predict
disease occurrence is subject to drift due to changes in the baseline preva-
lence of the disease across various populations.
Label Shift Detection: Label shift can be detected using moment matching-
based estimator methods that leverage model predictions like Black Box
Shift Estimation (BBSE) [141] and Regularized Learning under Label Shift
(RLLS) [19]. Assuming access to a classifier that outputs the true source-distribution conditional probabilities p_s(y|x), Expectation Maximization (EM) algorithms like Maximum Likelihood Label Shift (MLLS) can also be used to detect label shift [80]. Furthermore, methods using bias-corrected calibration
show promise in correcting label shift [11].
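As a minimal sketch in the spirit of moment-matching estimators such as BBSE [141], the function below estimates the class-ratio weights q(y)/p(y) from a model's confusion matrix on labeled source validation data and its prediction distribution on unlabeled target data; the inputs are hypothetical arrays of hard predictions and labels.

```python
# Moment-matching label-shift estimation (BBSE-style), for hard predictions.
import numpy as np

def estimate_label_shift_weights(y_val, y_val_pred, y_target_pred, n_classes):
    # Joint confusion on source validation data: C[i, j] = P(y_hat = i, y = j).
    C = np.zeros((n_classes, n_classes))
    for pred, true in zip(y_val_pred, y_val):
        C[pred, true] += 1
    C /= len(y_val)

    # Distribution of the model's predictions on the unlabeled target data.
    mu = np.bincount(y_target_pred, minlength=n_classes) / len(y_target_pred)

    # Solve C w = mu for w_j = q(y = j) / p(y = j); clip to keep weights valid.
    w = np.linalg.lstsq(C, mu, rcond=None)[0]
    return np.clip(w, 0.0, None)

# Weights far from 1 for any class indicate label shift (e.g., a change in
# disease prevalence) and can be used to reweight or recalibrate the model.
```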
4.4. Model Updating and Retraining
As the implementation of ML-enabled tools is realized in the clinic, there is
a growing need for continuous monitoring and updating in order to improve
models over time and adapt to malignant distribution shifts. Retraining of
ML models has demonstrated improved model performance in clinical con-
texts like pneumothorax diagnosis [121]. However, proposed modifications
can also degrade performance and introduce bias [139]; as a result it may be
preferable to avoid making a prediction and defer the decision to a down-
stream expert [176]. When defining a model updating or retraining strategy for clinical AI models, there are several factors to consider [266]; we outline the key criteria in this section.
4.4.1. Quality and Selection of Model Update Data
When updating a model, it is important to consider the relevance and size of the data to be used. This is typically done by defining a window of data with which to update the model: i) Fixed window uses a window that remains constant across time; ii) Dynamic window uses a window that changes in size in response to data shift; iii) Representative subsample uses a subsample from a window that is representative of the entire window distribution.
Name of tool | Capabilities
Evidently | Interactive reports to analyze ML models during validation or production monitoring.
NannyML | Performance estimation and monitoring, data drift detection and intelligent alerting for deployment.
River [175] | Online metrics, drift detection and outlier detection for streaming data.
SeldonCore [249] | Serving, monitoring, explaining, and management of models using advanced metrics, explainers, and outlier detection.
TFX | Explore and validate data used for machine learning models.
TorchDrift | Covariate and concept drift detection.
deepchecks [49] | Testing for continuous validation of ML models and data.
EHR OOD Detection [246] | Uncertainty estimation, OOD detection and (deep) generative modelling for EHRs.
Avalanche [150] | Prototyping, training and reproducible evaluation of continual learning algorithms.
Giskard | Evaluation, monitoring and drift testing.
Table 3: List of open-source tools available on GitHub that can be used for ML monitoring and updating.
4.4.2. Updating Strategies
There are several ways to update a model: i) Model recalibration is the simplest type of update, in which the continuous scores (e.g., predicted risks) produced by the original model are mapped to new values [47]; common methods include Platt scaling [199], temperature scaling, and isotonic regression [181]. ii) Model updating makes changes to an existing model, for instance, fine-tuning with regularization [130] or model editing, where pre-collected errors are used to train hypernetworks that can edit a model’s behaviour by predicting new weights or building a new classifier [172]. iii) Model retraining involves retraining a model from scratch or fitting an entirely different model.
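As a minimal sketch of the recalibration option (i) on synthetic data, the example below keeps an existing model fixed and remaps its scores on a recent data window using Platt scaling (or isotonic regression) via scikit-learn; the data, model, and split are illustrative, and newer scikit-learn versions express the "prefit" behaviour through FrozenEstimator instead.

```python
# Recalibrate a fixed model on recent data with Platt scaling / isotonic regression.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_old, X_recent, y_old, y_recent = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

base_model = RandomForestClassifier(random_state=0).fit(X_old, y_old)

# cv="prefit" keeps the original model untouched; only its scores are remapped.
recalibrated = CalibratedClassifierCV(base_model, method="sigmoid",  # or "isotonic"
                                      cv="prefit").fit(X_recent, y_recent)
calibrated_risk = recalibrated.predict_proba(X_recent)[:, 1]
```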
4.4.3. Frequency of Model Updates
In practice, retraining procedures for clinical AI models have generally been
locked after FDA approval [131] or confined to ad-hoc one-time updates [248]
[97]. The timing of when it is necessary to update or retrain a model varies across use cases. As a result, it is imperative to evaluate the appropriate frequency with which to update a model. Strategies employed include: i) Periodic training on a regular schedule (e.g., weekly, monthly); ii) Performance-based trigger in response to a statistically significant change in performance; iii) Data-based trigger in response to a statistically significant data distribution shift; iv) Retraining on demand, which is not based on a trigger or regular schedule and is instead initiated based on user prompts.
4.4.4. Continual Learning
Continual learning is a strategy used to update models when there is a con-
tinuous stream of input data that may be subject to changes over time. Prior
to deployment, it is crucial to simulate the online learning procedure on ret-
rospective data to assess robustness to data shifts [46] [188]. When models
are retrained on only the most recent data, this can result in “catastrophic
forgetting” [254] [131], in which the integration of new data into the model
can overwrite knowledge learned in the past and interfere with what the
model has already learned [129]. Contrastingly, procedures that retrain mod-
els on all previously collected data can fail to adapt to important temporal
shifts and are computationally expensive. More recently, strategies leverag-
ing multi-armed bandits have been utilized to select important samples or
batches of data for retraining [85] [287]. This is an important consideration
in healthcare contexts like radiology, where the labelling of new data can be
a time-consuming bottleneck [93] [196].
To ensure continual learning satisfies performance guarantees, hypothesis
testing can be used for approving proposed modifications [56]. An effec-
tive approach for parametric models includes continual updating procedures like online recalibration/revision [69]. Strategies for continual learning can broadly be categorized into: 1) Parameter isolation, where changes to parameters that are important for previous tasks are forbidden, e.g., Local Winner Takes All (LWTA) and Incremental Moment Matching (IMM) [247]; 2) Regularization methods, which build on the observation that forgetting can be reduced by protecting parameters that are important for previous tasks, e.g., elastic weight consolidation (EWC) and Learning Without Forgetting (LWF); and 3) Replay-based approaches, which retain some samples from previous tasks and use them for training or as constraints to reduce forgetting, e.g., episodic representation replay (ERR) [60]. Evaluation of several continual learning methods on ICU data across a large sequence of tasks indicates that replay-based methods achieve more stable long-term performance than regularization and rehearsal-based methods [16]. In the context
of chest X-ray classification, Joint Training (JT) has demonstrated superior
model performance, with LWF as a promising alternative in the event that
training data is unavailable at deployment [132]. For sepsis prediction using
EHR data, a joint framework leveraging EWC and ERR has been proposed
[13]. More recently, continual model editing strategies have shown promise
in overcoming the limitations of continual fine-tuning methods by updating
model behavior with minimal influence on unrelated inputs and maintaining
upstream test performance [98].
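The sketch below illustrates the replay idea in its simplest form: keep a bounded buffer of past samples and mix them into each new update so the model does not forget earlier windows. The model choice, buffer sizes, and update cadence are hypothetical, and real deployments would use the dedicated strategies cited above.

```python
# Naive replay-based continual update: mix buffered past samples into each update.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
replay_X, replay_y = [], []                       # episodic replay buffer
BUFFER_SIZE, KEEP_PER_WINDOW = 2000, 200


def update_on_new_window(X_new, y_new):
    # Train on the new window plus replayed samples from previous windows.
    if replay_X:
        X_train = np.vstack([X_new] + replay_X)
        y_train = np.concatenate([y_new] + replay_y)
    else:
        X_train, y_train = X_new, y_new
    model.partial_fit(X_train, y_train, classes=[0, 1])

    # Retain a bounded random sample of the new window for future replay.
    keep = rng.choice(len(X_new), size=min(len(X_new), KEEP_PER_WINDOW),
                      replace=False)
    replay_X.append(X_new[keep])
    replay_y.append(y_new[keep])
    while sum(len(x) for x in replay_X) > BUFFER_SIZE:
        replay_X.pop(0)
        replay_y.pop(0)
```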
4.4.5. Domain Generalization and Adaptation
Broadly, domain generalization and adaptation methods are used to improve
clinical AI model stability and robustness to data shifts by reducing distri-
bution differences between training and test data [280] [88]. However, it is
critical to evaluate several methods over a range of metrics, as the effective-
ness of each method varies based on several factors including the type of shift
and data modality [272].
•Data-based methods perform manipulations based on the patient
data to minimize distribution shifts. This can be done by re-weighting
observations during training based on the target domain [124], upsampling informative training examples [143], or leveraging a combination of labeled and pseudo-labeled data [137].
•Representation-based methods focus on achieving a feature rep-
resentation such that the source classifier performs well on the target
domain. In clinical data this has been explored using strategies in-
cluding invariant risk minimization (IRM), distribution matching (e.g.
CORAL) and domain-adversarial adaptation networks (DANN). DANN
methods have demonstrated a reduction on the impact of data shift on
cross-institutional transfer performance for diagnostic prediction [283].
However, it has been shown that for clinical AI models subject to real
life data shifts, in contrast to synthetic perturbations, empirical risk
minimization outperforms domain generalization and unsupervised do-
main adaptation methods [90] [281].
•Inference-based methods introduce constraints on the optimization
procedure to reduce domain shift [124]. This can be done by estimating
a model’s performance on the “worst-case” distribution [237] or constraining the learning objective to enforce closeness between protected groups [227]. Batch normalization statistics can also be leveraged to
build models that are more robust to covariate shifts [225].
4.4.6. Data Deletion and Unlearning
In healthcare there are two primary reasons for wanting to remove data
from models. Firstly, with the growing concerns around privacy and ML in
healthcare, it may become necessary to remove patient data for privacy rea-
sons. Secondly, it may also be beneficial to a model’s performance to delete
noisy or corrupted training data [30]. The naive approach to data deletion is to exclude unwanted samples and retrain the model from scratch on the remaining data; however, this approach can quickly become time-consuming and resource-intensive [105]. As a result, more sophisticated approaches
have been proposed for unlearning in linear and logistic models [105], random
forest models [31], and other non-linear models [89].
4.4.7. Feedback Loops
Feedback loops that incorporate patient outcomes and clinician decisions
are critical to improving outcomes in future model iterations. However, re-
training feedback loops can also lead to error amplification, and subsequent
downstream increases in false positives [6]. As a result, it is important to
consider model complexity and choose an appropriate classification threshold
to ensure minimization of error amplification [7].
5. Responsible MLHOps
AI has surged in healthcare, out of necessity or otherwise [277, 189], but many issues still exist. For instance, many sources of bias exist in clinical data, large models are opaque, and malicious actors may damage or pollute AI/ML systems. In response, responsible AI and trustworthiness have together become a growing area of study [166, 251].
trustworthy MLOps, is defined as an ML pipeline that is fair and unbiased,
explainable and interpretable, secure, private, reliable, robust, and resilient
to attacks. In healthcare, trust is critical to ensuring a meaningful relation-
ship between the healthcare provider and patient [57]. In this section, we
discuss components of responsible and trustworthy AI [133], which can be
applied to the MLHOps pipeline. In Section 5.1, we review the main con-
cepts of responsible AI and in Section 5.2 we explore how these concepts can
be embedded in the MLHOps pipeline to enable safe deployment of clinical
AI systems.
5.1. Responsible AI in healthcare
Ethics in healthcare:
Ethics in healthcare primarily consists of the following criteria [250]:
1. Nonmaleficence: Do not harm the patient.
2. Beneficence: Act to the benefit of the patient.
3. Autonomy: The patient (when able) should have the freedom to make
decisions about his/her body. More specifically, the following aspects
should be taken care of:
•Informed Consent: The patient (when able) should give in-
formed consent for any medical or surgical procedure, or for re-
search.
•Truth-telling: The patient (when able) should receive full dis-
closure to his/her diagnosis and prognosis.
•Confidentiality: The patient’s medical information should not
be disclosed to any third party without the patient’s consent.
4. Justice: Ensure fairness to the patient.
To supplement these criteria, guiding principles drawn from surgical settings
[142, 218] include:
5. Rescue: A patient surrenders to the healthcare provider’s expertise to
be rescued.
6. Proximity: The emotional proximity to the patient should be limited
to maintain self-preservation and stability in case of any failure.
7. Ordeal: A patient may have to face an ordeal (i.e., go through painful
procedures) in order to be rescued.
8. Aftermath: The physical and psychological aftermath that may occur
to the patient due to any treatment must be acknowledged.
9. Presence: An empathetic presence must be provided to the patient.
While some of these criteria relate to the humanity of the healthcare provider,
others relate to the following topics in ML models:
■Fairness involves the justice component in the healthcare domain [45].
■Interpretability & explainability relate to explanations and better
understanding of the ML models’ decisions, which can help in achieving
nonmaleficence, beneficence, informed consent, and truth-telling prin-
ciples in healthcare. Interpretability can help identify the reasons for a
given model outcome, which can help inform healthcare providers and
patients on how to respond accordingly [169].
■Privacy and security relate to confidentiality [117].
■Reliability, robustness, and resilience address rescue [217].
We discuss these concepts further in Sections 5.1.1, 5.1.2, 5.1.3 and 5.1.4.
5.1.1. Bias & Fairness
The fairness of AI-based decision support systems has been studied generally in a variety of applications including occupation classifiers [58], criminal risk assessment algorithms [50], recommendation systems [65], facial recog-
nition algorithms [33], search engines [78], and risk score assessment tools in
hospitals [183]. In recent years, the topic of fairness in AI models in health-
care has received a lot of attention [183, 231, 128, 44, 265, 232]. Unfairness
in healthcare manifests as differences in model performance against or in
favour of a sub-population, for a given predictive task. For instance, dis-
proportionate performance differences for disease diagnosis in Black versus
White patients [231].
5.1.1.1. Causes
A lack of fairness in clinical AI systems may be a result of various contributing
causes:
•Objective:
– Unfair objective functions: The initial objective used in de-
veloping a ML approach may not consider fairness. This does not
mean that the developer explicitly (or implicitly) used an unfair
objective function to train the model, but the oversimplification
of that objective can lead to downstream issues. For example, a
model designed to maximize accuracy across all populations may
not inherently provide fairness across different sub-populations
even if it reaches state-of-the-art performance on average, across
the whole population [231, 232].
– Incorrect presumptions: In some instances, the objective function includes incorrect interpretations of features, which can lead to bias. For instance, a commercial algorithm used in the USA employed health costs as a proxy for health needs [183]; however, due to financial limitations, Black patients with the same need for care as White patients often spend less on healthcare and therefore have a lower health cost. As a result, the model falsely inferred that Black patients require less care compared to White patients because they spend less [183]. Additionally, patients may be charged differently for the same service based on their insurance, suggesting cost may not be representative of healthcare needs.
•Data:
– Inclusion and exclusion: It is important to clearly outline the
conditions and procedures utilized for patient data collection, in
order to understand patient inclusion criteria and any potential
selection biases that could occur. For instance, the Chest X-ray dataset [262] was gathered in a research hospital that does not routinely conduct diagnostic and treatment procedures25. This dataset therefore includes mostly critical cases, and few patients at the early stages of diagnosis. Moreover, as a specialized hospital, patient admission is selective and chosen solely by institute physicians based on whether a patient has an illness being studied by the given institute26. Such a dataset will not contain the diversity of disease cases that might be seen in hospitals specialized across different diseases, or account for patients visiting for routine treatment services at general hospitals.
25from https://clinicalcenter.nih.gov/about/welcome/faq.html.
– Insufficient sample size: Insufficient sample sizes of under-
represented groups can also result in unfairness [82]. For instance,
patients of low socioeconomic status may use healthcare services
less, which reduces their sample size in the overall dataset, re-
sulting in an unfair model [281, 33, 44]. In another instance, an
algorithm that can classify skin cancer [66] with high accuracy will
not be able to generalize to different skin colours if similar samples
have not been represented sufficiently in the training data [33].
– Missing essential representative features: Sometimes, es-
sential representative features are missed or not collected during
the dataset curation process, which prohibits downstream fairness
analyses. For instance, if the patient’s race has not been recorded,
it is not possible to analyze whether a model trained on that data
is fair with respect to that race [232]. Failure to include sensitive
features can generate discrimination and reduce transparency [43].
•Labels:
– Social bias reflection on labels: Biases in healthcare systems
widely reflect existing biases in society [158, 238, 256]. For in-
stance, race and sex biases exist in COPD underdiagnosis [158], in
medical risk score analysis (whereby there exists a higher thresh-
old for Black patients to gain access to clinical resources) [256],
and in the time of diagnosis for cardiovascular disease (whereby
female patients are diagnosed much later compared to the male
patients with similar conditions) [238]. These biases are reflected
in the labels used to train clinical AI systems and, as a result, the
model will learn to replicate this bias.
26from https://clinicalcenter.nih.gov/about/welcome/faq.html.
– Bias of automatic labeling: Due to the high cost and labour-
intensive process of acquiring labels for healthcare data, there has
been a shift away from hand-labelled data, towards automatic
labelling [34, 104, 111]. For instance, instead of expert-labeled
radiology images, natural language processing (NLP) techniques
are applied to radiology reports in order to extract labels. This
presents concerns as these techniques have shown racial biases,
even after they have been trained on clinical notes [282]. There-
fore, using NLP techniques for automatic labeling may sometimes
amplify the overall bias of the labels [232].
•Resources:
– Limited computational resources: Not all centers have enough
labeled data or computational resources to train ML models ‘from
scratch’ and must use pretrained models for inference or transfer
learning. If the original model has been trained on biased (or dif-
ferently distributed) data, it will unfairly influence the outcome,
regardless of the quality of the data at the host center.
5.1.1.2. Evaluation
To evaluate the fairness of a model, we need to decide which fairness metric
to use and what sensitive attributes to consider in our analysis.
•Fairness metric(s): There are many ways to define fairness metrics.
For instance, [50] and [96] discussed several fairness criteria and sug-
gested balancing the error rate between different subgroups [53, 279].
However, it is not always possible to satisfy multiple fairness constraints
concurrently [232]. Kleinberg et al. [122] showed that three widely used fairness conditions cannot be satisfied simultaneously. As a result, a trade-off between the different notions of fairness is required, or a single fairness metric can be chosen based on domain knowledge and the given clinical application.
•Sensitive attributes: Sensitive attributes define the protected groups that we want to consider when evaluating the fairness of an AI model. Sex and race are two commonly used sensitive attributes [279, 231, 232, 282]. However, a lack of fairness in an AI system with respect to other sensitive attributes, such as age [231, 232], socioeconomic status [231, 232, 282], and spoken language [282], is also important to consider.
Defining AI fairness is context- and problem-dependent. For instance, if we build an AI model to support decision making for disease diagnosis with the goal of using it in the clinic, then it is critical to ensure the model provides equal opportunity; i.e., patients from different races should have an equal opportunity to be accurately diagnosed [231]. However, if an AI model is to be used to triage patients, then ensuring the system does not underdiagnose unhealthy patients of a certain group may be of greater concern than identifying the specific disease itself, because underdiagnosed patients lose access to timely care [232].
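To make such an audit concrete, the following minimal sketch (in Python, using synthetic labels and predictions rather than real patient data, and a hypothetical binary sensitive attribute) computes per-group true-positive and false-negative rates and an equal-opportunity gap; the metric choice should of course follow the considerations above.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Per-group true-positive and false-negative rates for a binary task.

    y_true, y_pred: binary arrays; groups: sensitive-attribute values
    (e.g., self-reported race or sex). All inputs here are illustrative.
    """
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        positives = (y_true[mask] == 1)
        tp = np.sum((y_pred[mask] == 1) & positives)
        fn = np.sum((y_pred[mask] == 0) & positives)
        tpr = tp / max(tp + fn, 1)          # equal-opportunity component
        rates[g] = {"TPR": tpr, "FNR": 1 - tpr}
    return rates

# Toy example with synthetic predictions (not real patient data).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
groups = rng.choice(["A", "B"], 1000)

rates = group_rates(y_true, y_pred, groups)
tpr_gap = abs(rates["A"]["TPR"] - rates["B"]["TPR"])   # equal-opportunity gap
print(rates, "TPR gap:", round(tpr_gap, 3))
```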
5.1.2. Interpretability & Explainability
In recent years, interpretability has received a lot of interest from the ML
community [174, 241, 162]. In machine learning, interpretability is defined as the ability to explain the rationale for an ML model's predictions in terms that a human can understand [62], while explainability refers to a detailed understanding of the model's internal representations, prior to any particular decision. Following other research in this area [160], we use 'interpretability' and 'explainability' interchangeably.
Interpretability is not a prerequisite for all AI systems [62, 174]; for example, it may be unnecessary in low-risk environments (in which miscalculations have very limited consequences) and in well-studied problems (which have been tested and validated extensively according to robust MLOps methods). However, interpretability can be crucial in many cases, especially for systems deployed in the healthcare domain [81]. The need for interpretability arises from the incompleteness of the problem formalization, where system results require an accompanying rationale.
5.1.2.1. Importance of interpretability
Interpretability applied to an ML model can be useful for the following rea-
sons:
•Trust: Interpretability enhances trust when all components are well-
explained. This builds an understanding of the decisions made by a
model and may help integrate it into the overall workflow.
•Reliability & robustness: Interpretability can help in auditing ML
models, further increasing model reliability.
•Privacy & security: Interpretability can be used to assess whether any private information is leaked in the results. While some researchers claim that interpretability may hinder privacy [233, 95], as the interpretable features may leak sensitive information, others have shown that it can help make the system robust against adversarial attacks [136, 284].
•Fairness: Interpretability can help in identifying and reducing biases
discussed in Sec. 5.1.1. However, the quality of these explanations can
differ significantly between subgroups and, as such, it is important to
test various explanation models in order to carefully select an equitable
model with high overall fidelity [21].
•Better understanding and knowledge: A good interpretation of
the model can lead to the identification of the factors that most impact
the model. This can also result in a better understanding of the use
case itself and enhance knowledge in that particular area.
•Causality: Interpretability gives a better understanding of model decisions and features, and hence can help to identify causal relationships among features [38].
5.1.2.2. Types of approaches for interpretability in ML:
Many methods have been developed for better interpretability in ML, such
as explainable AI for trees [155], Tensorflow Lattice27, DeepLIFT [134],
InterpretML [182], LIME [214], and SHAP [156]. Some of these have been
applied to healthcare [2, 236]. The methods for interpretability are usually
categorized as:
•Model-based
– Model-specific: Model-specific interpretability can only be used for a particular model. Usually, this type of interpretability uses the model's internal structure to analyze the impact of features, for example.
27https://www.tensorflow.org/lattice
– Model-agnostic: Interpretability is not restricted to a specific machine learning model and can be applied more generally across several model types.
•Complexity-based
– Intrinsic: Relatively simple models, such as depth-limited decision trees, are interpretable by construction and easier for humans to understand.
– Post-hoc: Interpretation is applied after the model has produced its output, typically for more complex models.
•Scope-based
– Locally interpretable: Interprets individual or per-instance pre-
dictions of the model.
– Globally interpretable: Interprets the model’s overall predic-
tion set and provides insight into how the model works in general.
•Methodology-based
– Feature-based: Methods that interpret the models based on the
impact of the features on that model. E.g., weight plot, feature
selection, etc.
– Perturbation-based: Methods that interpret the model by perturbing the settings or features of the model, e.g., LIME [214], SHAP [156], and anchors (see the sketch after this list).
– Rule-based: Methods that apply rules on features to identify
their impact on the model e.g., BETA, MUSE, and decision trees.
– Image-based: Methods where important inputs are shown using
images superimposed over the input e.g., saliency maps [10].
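As an illustration of a post-hoc, model-agnostic, perturbation-based, globally interpretable method from the taxonomy above, the following sketch applies permutation importance (from scikit-learn) to a toy classifier trained on synthetic tabular data; the feature names are hypothetical and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for clinical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)
feature_names = ["age", "heart_rate", "creatinine", "wbc_count"]  # illustrative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Global, post-hoc summary: drop in score when each feature is shuffled.
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, imp in sorted(zip(feature_names, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```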
5.1.2.3. Interpretability in healthcare
In recent years, interpretability has become common in healthcare [2, 169,
210]. In particular, Abdullah et al. [2] reported that interpretability methods
(e.g., decision trees, LIME, SHAP) have been applied to extract insights into
different medical conditions including cardiovascular diseases, eye diseases,
cancer, influenza, infection, COVID-19, depression, and autism. Similarly,
Meng et al. [169] performed an interpretability and fairness analysis of deep learning mortality prediction models on the MIMIC-III dataset [110], showing connections between interpretability methods and fairness metrics.
5.1.3. Privacy & Security
While digitizing healthcare has led to centralized data and improved access
for healthcare professionals, it has also increased risks to data security and
privacy [179]. Following previous work [3], privacy is the individual's ability to control, interact with, and regulate their personal information, and security is the systemic protection of data from leaks or cyber-attacks.
5.1.3.1. Security & privacy requirements
In order to ensure privacy and security, the following requirements should be
met [179]:
•Authentication: Strong authentication mechanisms for accessing the
system.
•Confidentiality: Access to data and devices should be restricted to
authorized users.
•Integrity: Integrity-checking mechanisms should be applied to restrict
any modifications to the data or to the system.
•Non-repudiation: Logs should be maintained to monitor the system. Access to those logs should be restricted to prevent any tampering.
•Availability: Quick, easy, and fault-tolerant availability should be
ensured at all times.
•Anonymity: Anonymity of the device, data, and communication should
be guaranteed.
•Device unlinkability: An unauthorized person should not be able to
establish a connection between data and the sender.
•Auditability and accountability: It should be possible to trace back
the recording time, recording person, and origins of the data to validate
its authenticity.
5.1.3.2. Types of threats
Violation of privacy & security can occur either due to human error (uninten-
tional or non-malicious) or an adversarial attack (intentional or malicious).
1. Human error: Human error can cause data leakage through the care-
lessness or incompetence of authorized individuals. Most of the litera-
ture in this context [138, 68] divides human error into two types:
(a) Slip: the wrong execution of correct, intended actions; e.g., in-
correct data entry, forgetting to secure the data, giving access of
information to unauthorized persons using the wrong email ad-
dress.
(b) Mistake: the correct execution of an incorrect or ill-conceived plan; e.g., collecting data that is not required, using the same password for different systems to avoid password recovery, or giving access to information to unauthorized persons under the assumption that they are permitted access.
While people dealing with data should be trained to avoid such negli-
gence, some researchers have suggested policies, frameworks, and strate-
gies such as error avoidance, error interception, or error correction to prevent or mitigate these issues [138, 68].
2. Adversarial attacks: A primary risk for any digital data or system is
from adversarial attackers [92] who can damage, pollute, or leak infor-
mation from the system. An adversarial attacker can attack in many
ways; e.g., they can be remote or physically present, they can access the system through a third-party device, or they may impersonate a patient [179]. The most common types of attacks are listed below.
•Hardware or software attack: Modifying the hardware or soft-
ware to use it for malicious purposes.
•System unavailability: Making the device or data unavailable.
•Communication attack: Interrupting the communication or
forcing a device to communicate with unauthorized external de-
vices.
•Data sniffing: Illegally capturing the communication to get sen-
sitive information.
•Data modification: Maliciously modifying data.
•Information leakage: Retrieving sensitive information from the
system.
5.1.3.3. Healthcare components and security & privacy
Extra care needs to be taken to protect healthcare data [5]. Components
[184] include:
•Electronic health data: This data can be leaked due to human mistakes or malicious attacks, which can result in tampering or misuse of data. In order to overcome such risks, measures such as access control, cryptography, anonymization, blockchain, steganography, or watermarking can be used (a minimal pseudonymization sketch follows this list).
•Medical devices: Medical devices such as smartwatches and sensors
are also another source of information that can be attacked. Secure
hardware and software, authentication and cryptography can be used
to avoid such problems.
•Medical network: Data shared across medical professionals and or-
ganizations through a network may be susceptible to eavesdropping,
spoofing, impersonating, and unavailability attacks. These threats can
be reduced by applying encryption, authentication, access control, and
compressed sensing.
•Cloud storage: Cloud computing is becoming widely adopted in
healthcare. However, like any system, it is also prone to unavailability,
data breaches, network attacks, and malicious access. Similar to those
above, threats to cloud services can be avoided through authentica-
tion, cryptography, and decoying (i.e., a method to make an attacker
erroneously believe that they have acquired useful information).
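As a minimal illustration of one of the measures above, the following sketch pseudonymizes a hypothetical patient identifier with a keyed hash. This is only one small piece of a protection strategy: keyed pseudonymization of direct identifiers does not by itself anonymize quasi-identifiers such as dates or postcodes, and the key must be stored separately from the data.

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-key-from-a-secure-vault"  # illustrative only

def pseudonymize(patient_id: str) -> str:
    """Map a patient identifier to a stable pseudonym via HMAC-SHA256."""
    return hmac.new(SECRET_KEY, patient_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# Toy record with a hypothetical identifier and diagnosis code.
record = {"patient_id": "MRN-0012345", "diagnosis_code": "J18.9"}
record["patient_id"] = pseudonymize(record["patient_id"])
print(record)
```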
5.1.3.4. Healthcare privacy & security laws
Due to the sensitivity of healthcare data and communication, many coun-
tries have introduced laws and regulations such as the Personal Information
Protection and Electronic Documents Act (PIPEDA) in Canada, the Health
Insurance Portability and Accountability Act (HIPAA) in the USA, and the Data Protection Directive in the EU [267]. These acts mainly aim at protecting patient data from being shared or used without patient consent, while still allowing patients access to their own data.
5.1.3.5. Attacks on ML pipeline
Any ML model that learns from data can also leak information about the
data, even if it generalizes well; e.g., using membership inference (i.e.,
determining if a particular instance was used to train the model) [168, 102]
or using property inference (i.e., inferring properties of the training dataset
from a given model) [168, 190]. Adversarial attacks in the context of the
MLOps pipeline can occur in the following phases [92]:
•Data collection phase: At this phase, a poisoning attack results in
modified or polluted data, impacting the training of the model and
lowering performance on unmodified data.
•Modelling phase: Here, the Trojan AI attack can modify a model
to provide an incorrect response for specific trigger instances [258] by
changing the model architecture and parameters. Since it is now com-
mon to use pre-trained models, these models can be modified or re-
placed by attackers.
•Production and deployment phases: At these phases, both Trojan
AI attacks and evasion attacks can occur. Evasion attacks consist of, for example, modifying test inputs so that they are misclassified [197] (see the sketch below).
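To illustrate the mechanics of an evasion attack, the following minimal sketch applies the fast gradient sign method (FGSM) to a toy, untrained PyTorch classifier on a synthetic input; with a trained model and a suitable perturbation budget, such perturbations often flip the prediction. The model and data are placeholders, not any specific deployed system.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10, requires_grad=True)   # a single test instance
y = torch.tensor([1])                        # its true label

# Gradient of the loss with respect to the input, not the weights.
loss = loss_fn(model(x), y)
loss.backward()

epsilon = 0.1                                # attack budget
x_adv = x + epsilon * x.grad.sign()          # fast gradient sign method

# On an untrained toy model the label may not change; the point is the mechanics.
print("clean prediction:", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```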
5.1.4. Reliability, robustness and resilience
A trustworthy MLOps system should be reliable, robust, and resilient. These
terms are defined as follows [288]:
•Reliability: The system performs in a satisfactory manner under spe-
cific, unaltered operating conditions.
•Robustness: The system performs in a satisfactory manner despite
changes in operating conditions, e.g., data shift.
•Resilience: The system performs in a satisfactory manner despite a
major disruption in operating conditions; e.g., adversarial attacks.
These aspects have been studied in the healthcare domain [171, 203] and
different approaches such as interpretability, security, privacy, and methods
to deal with data shift (discussed in Sections 5.1.2 and 5.1.3) have been sug-
gested.
Trade-off between accuracy and trustworthiness: In Section 5.1, we
discussed different important components of trustworthy AI that should be
considered while designing an ML system; however, literature shows that
there can be a trade-off between accuracy, interpretability, and robustness
[210, 244]. A main reason for this trade-off is that robust models learn a different feature representation which, although better aligned with human perception, may decrease accuracy [244].
5.2. Incorporating Responsibility and Trust into MLHOps
In recent years, responsible and trustworthy AI have gained a lot of attention in general, as well as in healthcare, due to their implications for society [210].
There are several definitions of trustworthiness [210], and they are related
to making the system robust, unbiased, generalizable, reproducible, trans-
parent, explainable, and secure. However, the lack of standardized practices
for applying, explaining, and evaluating trustworthiness in AI for healthcare
makes this very challenging [210]. In this section, we discuss how we can
incorporate all these qualities at each step of the pipeline.
5.2.1. Data
The process of a responsible and trustworthy MLOps pipeline starts with
data collection and preparation. The impact of biased or polluted data prop-
agates through all the subsequent steps of the pipeline [75]. This can be even
more important and challenging in the healthcare domain due to the privacy
and sensitivity of the data [18]. If compromised, this information can be
tampered with or misused in various ways (e.g., identity theft, information sold to a third party) and introduce bias into the healthcare system. Such challenges
can also cause economic harm (such as job loss), psychological harm (e.g.,
causing embarrassment due to a medical issue), and social isolation (e.g.,
due to a serious illness such as HIV) [177, 4]. It can also impact ML model
performance and trustworthiness [45].
5.2.1.1. Data collection
In healthcare, data can be acquired through multiple sources [245], which
increases the chance of the data being polluted by bias. Bias can concern, for
example, race [271], gender, sexual orientation, gender identity, and disability. Bias in healthcare data can be mitigated by increasing diversity in
data, e.g., by including underrepresented minorities (URMs), which can lead
to better outcomes [159]. Debiasing during data collection can include:
1. Identifying & acknowledging potential real-world biases: Bias
in healthcare is introduced long before the data collection stage. Al-
though increasingly less common in many countries28, bias can still
occur in medical school admission, job interviews, patient care, disease
identification, research samples, and case studies. Such biases lead to
the dominance of people from certain communities [159] or in-group
vs. out-group bias [84], which can result in stereotyped and biased
data generation and hence biased data collection.
Bias can be unconscious or conscious [159, 72]. Unconscious bias stems
from implicit or unintentional associations outside conscious awareness
resulting from stereotypical perceptions and experiences. On the other
hand, conscious bias is explicit and intentional and has resulted in abuse
and criminal acts in healthcare; e.g., the Tuskegee study of untreated
syphilis in Black men demonstrated intentional racism [73]. Both con-
scious and unconscious biases damage the validity of the data. Since
conscious bias is relatively more visible, it is openly discouraged not
only in healthcare but also in all areas of society. However, uncon-
scious bias is more subtle and not as easy to identify. In most cases,
unconscious bias is not even known to the person suffering from it.
Different surveys, tests, and studies have found the following types of
biases (conscious or unconscious) common in healthcare [159]:
(a) Racial bias: e.g., Black, Hispanic, and Native American physi-
cians are underrepresented [187]. According to one study, white
males from the upper classes are preferred by the admission com-
mittees [37] (although some other sources suggest the opposite28).
28https://applymd.utoronto.ca/admission-stats
(b) Gender bias: e.g., professional women in healthcare are less likely to be invited to give talks [167] or to be introduced using professional titles [70], and more likely to experience harassment or exclusion, to receive insufficient support at work, to face negative comparisons with male colleagues, and to be perceived as weak and less competitive [140, 240].
(c) Gender minority bias: e.g., LGBTQ people receive lower-quality healthcare [216] and face challenges in getting jobs in healthcare [222].
(d) Disability bias: e.g., people with disabilities receive limited accessibility support across facilities and have to work harder to feel validated or recognized [165].
Various tests identify the existence of unconscious bias, such as the
Implicit Association Test (IAT), and have been reported to be useful.
For example, Race IAT results detected unintentional bias in 75% of
the population taking the test [22]. While debate continues regarding
the degree of usefulness of these tests [29], they may still capture some
subtle human behaviours. Other assessment tools (e.g., the Diversity Engagement Survey (DES) [193]) have also been built to measure inclusion and diversity in medical institutes.
According to Marcelin et al. [159], the following measures can help in
reducing unintentional bias:
(a) Using IAT to identify potential biases in admissions or hiring com-
mittee members in advance.
(b) Promoting equity, diversity, inclusion, and accessibility (EDIA) in
teams. Including more people from underrepresented minorities
(URM) in the healthcare profession, especially in admissions and
hiring committees.
(c) Conducting and analyzing surveys to keep track of the challenges
faced by URM individuals due to the biased perception of them.
(d) Training to highlight the existence and need for mitigation of bias.
(e) Self-monitoring of bias can be another way to incorporate inclusion and diversity.
2. Debiasing during data collection and annotation:
In addition to human factors, we can take steps to improve the data
collection process itself. In this regard, the following measures can be
taken [146]:
(a) Investigating exclusion criteria: In dataset creation, an important step is to carefully investigate which patients are included in the dataset. An exclusion criterion may be conscious and clinically motivated, but there are many unintentional exclusion criteria that are not readily visible and that introduce biases. For instance, a dataset gathered in a research hospital that does not routinely provide standard diagnostic and treatment services, and that selects patients only because they have an illness being studied by the institute, will have a different patient population compared to clinical hospitals without these limitations [232]. Alternatively, whether the service delivered to the patient is free or covered by insurance would change the distribution of patients and inject biases into the resulting AI model [231].
(b) Annotation with explanation: Asking human annotators to justify their choice of label not only helps them identify their own unconscious biases, but can also help set standards for unbiased annotation and avoid automatic associations and stereotyping (e.g., the high prevalence of HIV in gay men led to underdiagnosis of this disease in women and children [159]). Moreover, these explanations can be a good resource for training explainable AI models [264].
(c) Data provenance: This involves tracking data lineage through the data source, dependencies, and data collection process. Healthcare data can come from multiple sources, which increases the chances of it being biased [40]. Data provenance improves data quality, integrity, auditability, and transparency [270]. Different tools for data provenance are available, including Fast Healthcare Interoperability Resources (FHIR) [223] and Atmolytics [270, 161].
(d) Data security & privacy during data collection: Smart
healthcare technologies have become a common practice [40]. A
wide variety of smart devices is available, including wearable de-
vices (e.g., smartwatches, skin-based sensors), body area networks
(e.g., EEG sensors, blood pressure sensors), tele-healthcare (e.g.,
tele-monitoring, tele-treatment), digital healthcare systems (e.g.,
electronic health records (EHR), electronic medical records (EMR)),
and health analytics (e.g., medical big-data). While the digitiza-
tion of healthcare has improved access to medical facilities, it has
increased the risk of data leakage and malicious attacks. Extra
care should be taken while designing an MLOps pipeline to avoid
privacy and security risks, as it can lead to serious life-threatening
consequences. Other issues include the number of people involved
in using the data and proper storage for high volumes of data.
Chaudhry et al. [40] proposed an AI-based framework using 6G-
networks for secure data exchange in digital healthcare devices. In
the past decade, the blockchain has also emerged as a way of ensur-
ing data privacy and security. Blockchain is a distributed database
with unique characteristics such as immutability, decentralization,
and transparency. This is especially relevant in healthcare because
of security and privacy issues [94, 273, 180]. Using blockchain can
help in more efficient and secure management of patients' health records, transparency, identification of false content, patient monitoring, and maintenance of financial statements [94].
(e) Data-sheet: Often, creating a dataset that represents the full
diversity of a population is not feasible, especially for very multi-
cultural societies. Additionally, the prevalence of diseases among
different sub-populations may differ [232]. If it is not possible to build an ideal dataset with the above specifications, the data should be accompanied by a data-sheet: meta-data that specifies the characteristics of the data, clearly explains exclusion and inclusion criteria, details demographic features of the patients, and reports statistics of the data distribution over sub-populations, labels, and features (a minimal sketch follows this list).
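As a minimal sketch of the kind of statistics a data-sheet can report, the following snippet (with hypothetical column names and toy values) computes per-subgroup sample counts and label prevalence.

```python
import pandas as pd

# Toy records; in practice these would be the curated dataset's metadata columns.
df = pd.DataFrame({
    "sex":   ["F", "M", "F", "M", "F", "M", "F", "F"],
    "race":  ["A", "A", "B", "B", "A", "B", "B", "A"],
    "label": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Sample counts and label prevalence per sub-population, for the data-sheet.
summary = (df.groupby(["sex", "race"])["label"]
             .agg(n_samples="size", prevalence="mean")
             .reset_index())
print(summary)
```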
5.2.1.2. Data pre-processing
1. Data quality assurance: Sendak et al. [228] argued that clinical re-
searchers choose data for research very carefully but the machine learn-
ing community in healthcare does not follow this practice. To overcome
this gap, they suggest that data points are identified by the clinicians
and extracted into a project-specific data store. After this, a three-step
framework is applied: (1) use different measures for data pre-processing
to ensure the correctness of all data elements (e.g., converting each lab measurement to the same unit), (2) ensure completeness, conformance, and plausibility, and check for possible data shifts, and (3) adjudicate the data with the clinicians.
2. Data anonymization: Due to the sensitivity of healthcare data, data anonymization should minimize the chances of the data being de-anonymized. Olatunji et al. [186] provide a detailed overview of data anonymization models and techniques in healthcare, such as k-anonymity, k-map, l-diversity, t-closeness, δ-disclosure privacy, β-likeness, δ-presence, and (ϵ,δ)-differential privacy. To avoid data leakage, many anonymization and evaluation tools [255], such as SecGraph [107], ARX (a tool for anonymizing biomedical data) [202], Amnesia29 [242], PySyft [220], Synthea [257], and Anonimatron30 (an open-source data anonymization tool written in Java), can be incorporated into the MLHOps pipeline.
3. Removing subgroup indicators: Changing the race of a patient can have a dramatic impact on the output of an algorithm designed to complete a prompt [282]. Therefore, the existence of race attributes in text can dramatically decrease the fairness of the model. In some specific problems, removing subgroup indicators, such as the sex of a job candidate from their application, has been shown to have minimal influence on classifier accuracy while improving fairness [58]. This method is applicable mostly to text-based data where sensitive attributes are easily removable. As a preprocessing step, one can estimate the effect of keeping or removing such sensitive attributes on the overall accuracy and fairness of a developed model. At the same time, it is not always possible to remove sensitive attributes from the data. For example, AI models can predict patient race from medical images, but it is not yet clear how they do so [263]. In one study [263], researchers did not provide the patient race during model training, and they also could not find a particular patch or region whose removal prevented the AI from detecting race.
29https://www.openaire.eu/item/amnesia-data-anonymization-made-easy
30https://realrolfje.github.io/anonimatron/
4. Differential privacy: Differential privacy [55] aims to allow the release of aggregate, group-level information while withholding information about individuals. Many algorithms and tools have been developed for this, including CapC [48] and PySyft [220]; a minimal sketch of the Laplace mechanism follows.
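The sketch below shows the standard Laplace mechanism for a counting query (which has sensitivity 1): the released count is perturbed with noise whose scale is sensitivity divided by the privacy budget ϵ. The cohort count and ϵ are illustrative only.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a differentially private count via the Laplace mechanism."""
    rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g., number of patients in a cohort with a given diagnosis (toy value)
print(laplace_count(true_count=132, epsilon=0.5))
```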
5.2.2. Methodology
The following sections overview the steps to put these concepts into practice.
5.2.2.1. Algorithmic fairness
Algorithmic fairness [173, 269, 76] attempts to ensure unbiased model outputs across sub-populations. Here, we discuss how this challenge can be addressed at different stages of model training [173, 269].
1. Pre-processing
•Choice of sampling & data augmentation: Ensuring that the dataset is balanced (having approximately an equal number of instances from each class) and that all classes receive equal representation, using simple under- or over-sampling methods [269]. This can also be done with data augmentation [170, 74], e.g., improving counterfactual fairness by generating counterfactual text and using it to augment the data. Augmentation methods include the Synthetic Minority Oversampling Technique (SMOTE) [41] and Adaptive Synthetic Sampling (ADASYN) [100]. Since synthetic samples may not be universally beneficial in the healthcare domain, acquiring more data and undersampling may be the best strategy [269].
•Causal fairness using data pre-processing: Causal fairness is achieved
by reducing the impact of protected or sensitive attributes (e.g.,
race and gender) on predicted variables and different methods
have been developed to accomplish this [76, 268]. Kamiran et
al. [112] proposed “massaging the data” before using traditional
classification algorithms.
•Re-weighing: In a pre-processing approach, one may re-weight the training dataset samples or remove features with high correlation to sensitive attributes, as well as the sensitive attribute itself [113], or learn representations that are relatively invariant to the sensitive attribute [152]. One might also adjust representation rates of protected groups to achieve target fairness metrics [39], or use optimization to learn a data transformation that reduces discrimination [35] (a minimal sketch combining re-weighing with post-hoc thresholding appears at the end of this section).
2. In-processing
•Adversarial learning: It is also possible to enforce fairness dur-
ing model training, using adversarial debiasing [278, 211, 261].
Adversarial learning refers to the methods designed to intention-
ally confound ML models during training, through deceptive or
misleading inputs, to make those models more robust. This tech-
nique has been used in healthcare to create robust models [116],
and for bias mitigation, by intentionally inputting biased examples
[135, 194].
•Prejudice remover: Another important aspect is prejudice injected into the features [114]. Prejudice can be (a) direct prejudice: using a protected attribute as a prediction variable; (b) indirect prejudice: statistical dependence between protected attributes and prediction variables; and (c) latent prejudice: statistical dependence between protected attributes and non-protected attributes. Kamishima et al. [114] proposed a method to remove prejudice using regularization. Similarly, Grgic et al. [87] introduced a method that uses constraints on classifier optimization objectives to remove prejudice.
•Enforcing fairness in the model training: Fairness can also be
enforced by making changes to the model through constraint op-
timization [149], modifying loss functions to penalize deviation
from the general population for subpopulations [195], regularizing
loss function to minimize mutual information between feature em-
bedding and bias [119], or adding a regularizer to identify and treat latent discriminating features [114].
•Up-weighing: It is possible to improve the outcome for the worst-case group by up-weighting the groups with the largest loss [279, 221, 163]. However, all of these methods require knowledge of each instance's membership in sensitive groups. There are also group-unaware methods that weight each sample with an adversary that tries to maximize the weighted loss [127], or that train an additional classifier that up-weights samples classified incorrectly in the previous training step [144].
3. Post-processing: Post-processing fairness mitigation approaches may target post-hoc calibration of model predictions. This method has shown impact in bias mitigation in both non-healthcare [96, 200] and healthcare [120] applications.
There are software tools and libraries for checking algorithmic fairness, listed in [269], which can be used by developers and end users to evaluate the fairness of AI model outcomes.
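To make two of the mitigation stages above concrete, the following minimal sketch (on synthetic data, with an illustrative binary sensitive attribute) combines pre-processing re-weighing, by weighting training samples inversely to group frequency, with post-processing, by selecting per-group decision thresholds that target the same true-positive rate. It is a sketch of the general idea under stated assumptions, not an implementation of any specific published method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
group = rng.choice([0, 1], size=n, p=[0.8, 0.2])          # imbalanced groups
X = rng.normal(size=(n, 3)) + group[:, None] * 0.3
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# (1) Re-weighing: the rarer group gets proportionally larger sample weights.
weights = 1.0 / np.bincount(group)[group]
model = LogisticRegression().fit(X, y, sample_weight=weights)
scores = model.predict_proba(X)[:, 1]

# (2) Post-processing: per-group thresholds targeting the same TPR
# (assumes each group has positive examples).
def threshold_for_tpr(scores_g, y_g, target_tpr=0.8):
    pos_scores = np.sort(scores_g[y_g == 1])
    idx = int((1 - target_tpr) * len(pos_scores))
    return pos_scores[min(idx, len(pos_scores) - 1)]

thresholds = {g: threshold_for_tpr(scores[group == g], y[group == g])
              for g in (0, 1)}
y_hat = np.array([scores[i] >= thresholds[group[i]] for i in range(n)]).astype(int)

for g in (0, 1):
    mask = (group == g) & (y == 1)
    print(f"group {g} TPR:", round(np.mean(y_hat[mask]), 3))
```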
5.2.3. Development & evaluation
At this stage, the ML system is evaluated to ensure its trustworthiness, which includes evaluating the evaluation methods themselves [210, 14].
5.2.3.1. Model interpretability & explainability
At this stage, model evaluation can be done through interpretability and ex-
plainability methods to mitigate any potential issues such as possible anoma-
lies in the data or the model. However, it should be noted that the methods that perform interpretability and explainability should themselves be evaluated carefully before being relied upon, e.g., using human evaluation [160, 21].
6. Concluding remarks
Machine learning (ML) has been applied to many clinically-relevant tasks
and many relevant datasets in the research domain but, to fully realize the
promise of ML in healthcare, practical considerations that are not typically
necessary or even common in the research community must be carefully de-
signed and adhered to. We have provided a deep survey into a breadth
of these ML considerations, including infrastructure, human resources, data
sources, model deployment, monitoring and updating, bias,