Tsopraetal. BMC Med Inform Decis Mak (2021) 21:274
https://doi.org/10.1186/s12911-021-01634-3
RESEARCH ARTICLE
A framework forvalidating AI inprecision
medicine: considerations fromtheEuropean
ITFoC consortium
Rosy Tsopra1,2,3,10* , Xose Fernandez4, Claudio Luchinat5, Lilia Alberghina6, Hans Lehrach7,8, Marco Vanoni6,
Felix Dreher8, O.Ugur Sezerman9, Marc Cuggia10, Marie de Tayrac11, Edvins Miklasevics12, Lucian Mihai Itu13,
Marius Geanta14, Lesley Ogilvie7,8, Florence Godey15,16, Cristian Nicolae Boldisor13, Boris Campillo‑Gimenez17,
Cosmina Cioroboiu14, Costin Florian Ciusdel13, Simona Coman13, Oliver Hijano Cubelos4, Alina Itu13,
Bodo Lange8, Matthieu Le Gallo15,16, Alexandra Lespagnol18, Giancarlo Mauri19, H.Okan Soykam20,
Bastien Rance1,2,3, Paola Turano5, Leonardo Tenori5, Alessia Vignoli5, Christoph Wierling8, Nora Benhabiles21 and
Anita Burgun1,2,3,22
Abstract
Background: Artificial intelligence (AI) has the potential to transform our healthcare systems significantly. New AI
technologies based on machine learning approaches should play a key role in clinical decision‑making in the future.
However, their implementation in health care settings remains limited, mostly due to a lack of robust validation pro‑
cedures. There is a need to develop reliable assessment frameworks for the clinical validation of AI. We present here an
approach for assessing AI for predicting treatment response in triple‑negative breast cancer (TNBC), using real‑world
data and molecular ‑omics data from clinical data warehouses and biobanks.
Methods: The European “ITFoC (Information Technology for the Future Of Cancer)” consortium designed a frame‑
work for the clinical validation of AI technologies for predicting treatment response in oncology.
Results: This framework is based on seven key steps specifying: (1) the intended use of AI, (2) the target population,
(3) the timing of AI evaluation, (4) the datasets used for evaluation, (5) the procedures used for ensuring data safety
(including data quality, privacy and security), (6) the metrics used for measuring performance, and (7) the procedures
used to ensure that the AI is explainable. This framework forms the basis of a validation platform that we are building
for the “ITFoC Challenge”. This community‑wide competition will make it possible to assess and compare AI algorithms
for predicting the response to TNBC treatments with external real‑world datasets.
Conclusions: The predictive performance and safety of AI technologies must be assessed in a robust, unbiased and
transparent manner before their implementation in healthcare settings. We believe that the consideration of the
ITFoC consortium will contribute to the safe transfer and implementation of AI in clinical settings, in the context of
precision oncology and personalized care.

Keywords: Artificial intelligence, Precision medicine, Personalized medicine, Computerized decision support systems, Cancer, Oncology
*Correspondence: rosy.tsopra@nhs.net
Background
Artificial intelligence (AI) has the potential to transform
our healthcare systems considerably and will play a key
role in clinical decision-making in the future [1]. AI has
been in the spotlight since the 1980’s, when the first
“expert systems” simulating the clinical reasoning for
clinical decisions emerged [2]. With the huge increase in
medical data over the last few decades, new approaches
have been developed (principally machine learning (ML),
including neural networks). ML techniques trained on
clinical datasets [2] have already proved useful for diag-
nostic applications [3–5] and risk prediction [6].
Despite the enthusiasm surrounding AI, its use in
healthcare settings remains limited. AI technologies
require rigorous assessment before they can be used
in clinical practice [7]. For example, the first AI-based
device to receive market authorization from the FDA
was assessed with a large prospective comparative clini-
cal trial including 900 patients from multiple sites [4].
AI technologies must satisfy stringent regulations for
approval as medical devices, because (1) the decision
support provided is optimized and personalized con-
tinuously in real time, according to the phenotype of the
patient [7]; (2) the performance of AI depends strongly
on the training datasets used [8], resulting in a large risk
of AI performing less well in real practice [9–11] or on
another group of patients or institutions [9]. It is, there-
fore, essential to assess the performance and safety of AI
before its introduction into routine clinical use.
Robust evaluations are required for AI to be trans-
ferred to clinical settings, but, in practice, only a few
such systems have been validated with external datasets
[12, 13]. A recent literature review reported that most
studies assessing AI did not include the recommended
design features for the robust validation of AI [9]. There
is, therefore, a need to develop frameworks for the robust
validation of the performance and safety of AI with reli-
able external datasets [14, 15].
Finding, accessing and re-using reliable datasets is a
real challenge in medicine (contrasting with other FAIR
data collections [16]). However, with the development
of clinical data warehouses within hospitals, it should
become easier to obtain access to “real datasets”. The
benefit of using real-world data for research purposes
[17], and, particularly, for generating complementary evi-
dence during AI life cycles, has been highlighted by the
European Medicines Agency [18]. Real-world data from
clinical data warehouses may, therefore, constitute a
valuable source of reliable external datasets for validating
AI before its implementation in healthcare settings.
Guidelines on the regulation of AI technologies include
high-level directions, but not specific guidance on the
practical steps in AI evaluation [19]. Here, we propose
a framework for assessing the clinical performance and
safety of AI in the context of precision oncology. More
precisely, the objective is to use real-world data collected
from clinical data warehouses and biobanks to assess AI
technologies for predicting the response to anti-cancer
drugs. We developed this framework as part of the Euro-
pean Flag-Era project ‘ITFoC (Information Technology
for the Future of Cancer)’ [20], to validate AI algorithms
with -omics and clinical data for the prediction of treat-
ment response in triple-negative breast cancer (TNBC).
This framework could help AI developers and institutions
to design clinically trustworthy decision support systems,
and to assess them with a robust methodology.
Methods
Breast cancer is the most common cancer in women
worldwide [21, 22]. The most aggressive type is triple-
negative breast cancer (TNBC), characterized by a lack
of estrogen receptor, progesterone receptor and human
epidermal growth factor expression, together with a high
histologic grade and a high rate of mitosis [23]. TNBC
accounts for 10–20% of all breast cancers, and has a very
poor prognosis, with chemotherapy the main therapeutic
option [23, 24]. New targeted and personalized therapies
are, therefore, urgently required [23].
In recent decades, cancer treatment has followed a
“one-size-fits-all” approach based on a limited set of
clinical criteria. Recent advances, rendering sequenc-
ing techniques more widely available, are providing new
opportunities for precision oncology, the personaliza-
tion of treatment based on a combination of clinical and
molecular data, and improvements in drug efficacy, with
fewer side effects.
In this context, many AI models have been developed,
based on the detailed molecular characterization of indi-
vidual tumors and patients. They model the effects and
adverse effects of drugs in the context of TNBC treat-
ment [25, 26]. However, these AI models often lack clini-
cal validation, and require further external evaluation.
The ITFoC (Information Technology for the Future of
Cancer) consortium [20], a multidisciplinary group from
six European countries, has proposed a new approach
to the unbiased validation of these AI models. This
approach involves evaluating the performance and safety
of these AI models through robust clinical evaluation
with reliable and external real-world datasets, before
their implementation in healthcare settings. The ITFoC
consortium has designed a framework to meet this goal.
This framework is based on seven key steps specifying
(Fig.1): (1) the intended use of AI, (2) the target popula-
tion, (3) the timing of AI evaluation, (4) the datasets used
for evaluation, (5) the procedures used for ensuring data
safety (including data quality, privacy and security), (6)
the metrics used for measuring performance, and (7) the
procedures used to ensure that the AI is explainable.
Results
The framework designed by the “ITFoC consortium”
follows seven principles that we consider essential for
the assessment of AI technologies. is framework was
developed to support a community-based programming
contest to be held during “Pink October”. This “ITFoC
challenge” will open a platform enabling various teams
(academic, research, and MedTech organizations) to test
their AI-based approaches with TNBC datasets provided
by our partners for the purpose of this competition.
We describe here the framework and the paral-
lel actions planned for the setting up of the “ITFoC
challenge”.
Step 1: Specify theintended use ofAI
The first step in AI assessment is accurately defining its
intended use (for medical purposes) [7], together with
its input (i.e. the data required to run the AI), and out-
put (i.e. the results provided by AI) parameters.
Once the intended use of AI is clearly stated, it is
important to be sure that:
• AI is used only to address questions that are relevant
and meaningful for the medical community. Indeed,
AI may be irrelevant if it is used in a correct, but not
useful manner in healthcare settings [27]. It is, there-
fore, important to define clearly the benefits of AI for
a particular clinical scenario.
• AI complies with ethical, legal and social standards
[27, 28]. As stated by the High-Level Expert Group
on AI established by the European Commission [29],
AI should (1) comply with all applicable laws and
regulations, (2) adhere to ethical principles and val-
ues, (3) not disadvantage people from particular soci-
odemographic backgrounds or suffering from certain
conditions, (4) not increase discrimination based on
ethnicity or sex.
Fig. 1 The seven key steps needed for the clinical validation of AI technologies
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 4 of 14
Tsopraetal. BMC Med Inform Decis Mak (2021) 21:274
Planned actions
In the “ITFoC challenge”, we aim to assess AI with the fol-
lowing intended use: predicting the response of TNBC
patients to treatment, regardless of their origin or ethnic
background. More precisely, AI should be able to predict,
at the time of diagnosis, whether particular patients are
likely to respond to standard treatment, so that prob-
able non-responders can be offered alternative treatment
options.
The expected clinical impact is an improvement in
survival rates for TNBC patients, particularly those not
responding to standard treatment.
Step 2: Clearly specify the target population
The second step in AI assessment is accurately defining
the target population. AI must be evaluated on inde-
pendent datasets similar to the target population of the
AI technology. e population is defined during the
development phase, by specifying patient and disease
characteristics, in a similar manner to the definition
of eligibility criteria in conventional clinical trials. The
sets of patients selected for the assessment should be
representative of the target population, and consecu-
tive inclusion or random selection should be used for
patient recruitment, at multiple sites, to limit the risk
of spectrum bias (i.e. the risk of the patients selected
not reflecting the target population) [15], and to ensure
that the results can be generalized.
In contrast to the AI training and validation stages,
which require large datasets, AI evaluation does not
necessarily require ‘big data’ [15]. As in randomized
clinical trials, the study sample should be determined
according to the study hypothesis, expected effect (e.g.
superiority, non-inferiority) and degree of importance
(differences important or unimportant) [15].
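As an illustration of this point, the evaluation sample size can be derived with standard power-analysis tools. The sketch below is a minimal example, assuming a superiority hypothesis on sensitivity; all numbers are illustrative assumptions, not ITFoC specifications.

```python
# Minimal sketch: patients required per group for a superiority hypothesis
# on sensitivity. All numbers are illustrative, not ITFoC values.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_sens = 0.75  # assumed sensitivity of the standard procedure
target_sens = 0.85    # assumed sensitivity expected from the AI

effect_size = proportion_effectsize(target_sens, baseline_sens)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Patients required per group: {n_per_group:.0f}")
```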
Planned actions
In the “ITFoC challenge”, the target population is “women
who have been diagnosed with TNBC”. We need to assess
AI performance in terms of treatment response. We
must therefore select patients who have already received
first-line treatment (making it possible to compare the
predicted and observed responses in a retrospective mul-
ticentre cohort of TNBC patients).
Step 3: Specify thetiming ofAI evaluation
The third step in AI assessment is clearly defining the
timing of the evaluation. As in drug development, vari-
ous phases can be distinguished for AI evaluation (Fig. 2):

• The “fine-tuning” phase is an essential part of AI
development. It is equivalent to the “preclinical
phase” in drug development, when drugs are tested
in a laboratory setting. Here, AI is evaluated inter-
nally in three steps: training, internal validation, and
testing. e training step involves training the algo-
rithm on a subset of so-called “training” data. e
internal validation involves fine-tuning the algorithm
or selecting the most optimized parameters. e test
step corresponds to the final internal assessment of
the performance of the algorithm.
• The “clinical validation” phase follows the internal
validation and testing of AI. It is equivalent to phases
I and II of clinical trials, in which drug efficacy and
safety are assessed in a limited number of patients.
Here, the performance and safety of AI are assessed
with external data. The goal is to check that AI will
not result in lost opportunities for patients through
the generation of false-positive or false-negative pre-
dictions (i.e. for patients predicted to respond to a
treatment who do not in reality, and vice-versa).
• Finally, patient outcomes are assessed after clini-
cal validation with external datasets. This phase is
equivalent to phase III of clinical trials, in which
new drugs are compared to standard treatment in
randomized controlled trials (RCT). Here, AI is
implemented in healthcare settings, and its effect on
patient outcomes and the efficiency of the healthcare
system is assessed with real patients, via an RCT.

Fig. 2 Evaluation of AI: timing
Planned actions
In the “ITFoC challenge”, we will focus on the “clini-
cal validation” phase. Akin to early-phase drug trials,
the goal will be to determine whether the AI developed
is sufficiently accurate and safe for transfer into clinical
practice for further assessment in RCTs.
Step 4: Specify thedatasets used forAI evaluation
e fourth step in AI assessment is the selection of reli-
able and representative datasets:
• Publicly accessible datasets [1] are available through
public repositories (e.g. ArrayExpress [30], GEO [31])
or are released by research and/or medical institu-
tions (e.g. TCGA, or ICGC collections). However,
most are more suitable for bioinformatics than for
clinical informatics [1].
• Patient databases store retrospective or prospective
datasets generated by clinical trials or routine care
(real-world data).
– ‘Clinical trial’ datasets are collected in the con-
trolled environment of a specific clinical trial
(Table 1), from a restricted population that may
not be representative of the general population.
The data collection process is time-consuming
and costly, but the resulting data should be homo-
geneous, highly reliable and should have a well-
structured format. However, such datasets are not
generally made publicly available, for the follow-
ing reasons [32]: the potential loss of competitive
advantage for the organization funding the study;
the possibility of invalidating the results published
through secondary analyses; the costs associated
with data sharing and, finally, due to ethical and
scientific considerations. Moreover, data collec-
tion is usually limited to predefined sets of vari-
ables, and it may, therefore, be difficult to re-use
these data secondarily to address questions not
included in the initial protocol [32].
– Real-world datasets are usually stored in clinical
data warehouses (Table 1). These datasets are col-
lected throughout patient care and have various
clinical sources (structured and unstructured clin-
ical records, laboratory, pharmacy, and radiology
results, etc.) [17, 33]. e collection of these data
is less time-consuming and costly than that for
clinical trial datasets. However, their exploitation
requires careful data quality management, because
they are highly variable and were initially collected
for clinical purposes rather than for research [34–37].
Split-sample validation involves randomly splitting
datasets into separate parts, which are then used for
both the development and internal evaluation of AI [12,
15]. is method is relevant only during the develop-
ment phase, and cannot be used to validate the gen-
eralizability of AI. Indeed, there is a risk of overfitting
bias (i.e. the AI fits the training data too closely), and
spectrum bias (i.e. the internal dataset is not represent-
ative of the population on which the AI will be used).
Validation on completely independent external data-
sets is required to overcome these limitations and for
validation of the generalizability of AI [15]. Geographic
sampling (i.e. using datasets collected by independent
investigators from different sites) could considerably
limit both biases, and improve the estimation of AI
generalizability in healthcare settings [15].
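A minimal sketch of the two designs discussed above, assuming scikit-learn and toy data; the site labels stand in for geographic sampling:

```python
# Contrast split-sample validation with geographic (leave-one-site-out)
# validation. X, y and site are toy stand-ins for a real cohort.
import numpy as np
from sklearn.model_selection import train_test_split, LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # toy feature matrix
y = rng.integers(0, 2, size=300)     # toy binary outcome
site = rng.integers(0, 4, size=300)  # hospital of origin

# Split-sample validation: development phase only, prone to spectrum bias.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("Internal AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Geographic sampling: hold out each site in turn to probe generalizability.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=site):
    m = LogisticRegression().fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], m.predict_proba(X[test_idx])[:, 1])
    print(f"Held-out site {site[test_idx][0]}: AUC = {auc:.2f}")
```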
Planned actions
In the “ITFoC challenge”, we are working with retrospec-
tive real-world datasets collected from the clinical data
warehouses and biobanks of multiple hospitals, ensuring
that the TNBC population is broadly represented.
The inclusion criteria for datasets are:

• A follow-up period of at least three years, to ensure
the standardized evaluation of treatment response
• High-quality data extracted from a clinical data ware-
house or from a dedicated cancer database
• Biological samples must be available in biobanks for
additional -omics analyses, if required.
• Patients must have signed a consent form for the
reuse of their data and the reuse of their samples for
research purposes
The objective is not to acquire thousands of patient
datasets of variable quality, but to collect a representative
set of high-quality patient data.
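As a simple illustration, the inclusion criteria above could be applied to a cohort extract as follows; the file and column names are hypothetical, not the ITFoC data specification:

```python
# Illustrative inclusion filter; `followup_days`, `consent_signed` and
# `biobank_sample` are assumed column names, not ITFoC definitions.
import pandas as pd

cohort = pd.read_csv("tnbc_candidates.csv")  # hypothetical extract
eligible = cohort[
    (cohort["followup_days"] >= 3 * 365)  # at least three years of follow-up
    & (cohort["consent_signed"])          # consent for data and sample reuse
    & (cohort["biobank_sample"])          # sample available for -omics assays
]
print(f"{len(eligible)} of {len(cohort)} patients meet the inclusion criteria")
```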
Step 5: Specify theprocedures used toensure data safety
e fifth step in AI assessment is ensuring data safety,
including data quality, privacy and security, during the
evaluation phase.
Table 1 Clinical trial versus real-world datasets for AI evaluation

Characteristic                          Clinical trial datasets              Real-world datasets
Setting                                 Experimental                         Real world
Population: representativeness          Selective sample                     Large sample
Population: type                        Homogeneous                          Heterogeneous
Population: size                        +/−                                  ++++
Population: recruitment and follow-up   Limited                              Long
Data: type                              Clinical +/− -omics                  Clinical +/− -omics
Data: collected by                      Dedicated specialist professionals   Various healthcare professionals
Data: quality                           +++                                  +/−
Data: need for data management          +/−                                  +++
Data: need for anonymization            +                                    +
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 7 of 14
Tsopraetal. BMC Med Inform Decis Mak (2021) 21:274
Data quality
Standardization is strongly recommended, to guarantee
the quality, sharing, portability and reusability of data
for AI evaluation [38]. Standardization is defined as the
representation of heterogeneous data with consensual
specifications [38]. It includes specifications for both data
fields (i.e. variables) and their value sets (i.e. codes) [38].
Standardization is highly dependent on the type of data-
sets involved.
Clinical data
Clinical data are highly complex, for sev-
eral reasons: (1) they come from different sources (e.g.
electronic health records, reimbursement claims data),
(2) they have various formats (e.g. free text, numbers,
images), and representations (e.g. structured, semi-struc-
tured, unstructured); (3) the level of granularity is highly
variable, ranging from general to fine-grained concepts;
(4) datasets are not complete (e.g. missing data); (5) data-
set content varies within and between institutions.
Various common data models can be used to standard-
ize clinical datasets. These models include the CDISC
(Clinical Data Interchange Standards Consortium) model
for “clinical trial datasets”, which can be used to ensure
information system interoperability between healthcare
and clinical research, and the OMOP (Observational
Medical Outcomes Partnership) common data model for
real-world datasets. The data values must also be harmo-
nized by the use of terminologies ensuring interoperabil-
ity between AI systems, such as the ICD 10 (International
Classification of Diseases) for the standardization of
medical diagnoses, LOINC (Logical Observation Iden-
tifiers Names and Codes) for biological tests, Med-
DRA (Medical Dictionary for Regulatory Activities) for
adverse events, and so on. Most standard terminologies
are integrated into the UMLS (Unified Medical Language
System) metathesaurus, which can be used as a global
thesaurus in the biomedical domain.
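As a toy illustration of value-set harmonization, a local diagnosis label can be mapped to an ICD-10 code through a terminology lookup; in practice this would query a terminology server or the UMLS Metathesaurus rather than the hand-made dictionary assumed here:

```python
# Toy value-set standardization: local diagnosis labels -> ICD-10 codes.
# The mapping is hypothetical; real pipelines use terminology services.
ICD10_MAP = {
    "triple negative breast cancer": "C50.9",
    "malignant neoplasm of breast": "C50.9",
}

def standardize_diagnosis(local_label: str):
    """Return the ICD-10 code for a local label, or None if unmapped."""
    return ICD10_MAP.get(local_label.strip().lower())

print(standardize_diagnosis("Triple Negative Breast Cancer"))  # -> C50.9
```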
-Omics data
-Omics data are complex: (1) they are
generated by different techniques, with different bioin-
formatic tools; (2) they may be based on different types
of NGS (next-generation sequencing) data, such as
WGS (whole-genome sequencing), WES (whole-exome
sequencing), and RNA-sequencing, or on data from prot-
eomics and metabolomics platforms; (3) their integration
and interpretation remain challenging, due to their size
and complexity, and the possibility of experimental and
technical errors during sample preparation, sequencing
and data analysis [39].
-Omics data can be standardized at any stage from data
generation to data interpretation. For example, MIAME
(minimum information about a microarray experi-
ment) [40] and MAGE (microarray gene expression data
modeling and exchange standards) have been developed
for microarray experiments [41]. The most widely used
format for variant identification is VCF (variant call
format), which includes a number of fields for genomic
coordinates, reference nucleotide, and variant nucleotide,
for example, but also metadata adding meaningful infor-
mation relating to variants: e.g. gene symbol, location,
type, HGVS (human genome variation society) nomen-
clature, predicted protein sequence alterations and
additional resources, such as cross-references to cancer-
specific and general genomic databases and prior in silico
algorithm-based predictions.
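A minimal reader for the VCF fields mentioned above might look as follows; this is a sketch for standards-compliant files, and real pipelines would typically rely on a dedicated library such as pysam:

```python
# Minimal VCF reader: yields the core fields plus the INFO annotations.
def read_variants(path):
    """Yield (chrom, pos, ref, alt, info) tuples from a VCF file."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):  # skip meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, _id, ref, alt = fields[:5]
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            yield chrom, int(pos), ref, alt, info

# Usage (hypothetical file and INFO key):
# for chrom, pos, ref, alt, info in read_variants("tumor.vcf"):
#     print(chrom, pos, ref, alt, info.get("GENE"))
```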
Standardization ofclinical and-omics data Standardi-
zation makes it possible to combine data from multiple
institutions. It also ensures the consistency of datasets,
and improves the quality and reliability of clinical and
-omics data. ese aspects are crucial, to maximize the
chances of predicting the real impact of AI on the health-
care process. Indeed, the ultimate performance of AI
depends strongly on the quality of data used for evalua-
tion [12, 13].
Planned actions
In the “ITFoC” challenge, we will apply
a range of internationally accepted standards for breast
cancer data, to overcome issues of data heterogeneity
and variability associated with the use of data of different
provenances [34, 35] and to ensure access to high-quality
real-world datasets [38].
Clinical datasets will be standardized with the OMOP
common data model [42] for data structure and the OSI-
RIS model [43] for data content. The OMOP CDM is sup-
ported by the OHDSI consortium (Observational Health
Data Sciences and Informatics), and OSIRIS is supported
by the French National Institute of Cancer. Both stand-
ards include a list of concepts and source values, con-
sidered the minimal dataset necessary for the sharing of
clinical and biological data in oncology. Items and values
are structured and standardized according to interna-
tional medical terminologies, such as ICD 10, LOINC,
SNOMED CT. A standardized TNBC data model based
on these models will be used: items will be added,
removed and/or transformed, and values will be adapted
to TNBC data (e.g. the values of the “biomarker” item are
limited to ER, PR and HER2 receptors, and Ki67). The instan-
tiated model contains the dataset specifications provided
to participants in this challenge. The database will be
populated locally through dedicated extract-transform-
load pipelines.
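The transform step of such a pipeline can be pictured as follows; the field names are simplified for illustration and are not the official OMOP CDM or OSIRIS definitions:

```python
# Schematic extract-transform-load step: reshape one local record into
# OMOP-style person/condition rows. Field names are illustrative only.
from datetime import date

local_record = {  # hypothetical clinical-data-warehouse row
    "patient_id": "HOSP-042",
    "sex": "F",
    "birth_date": date(1968, 5, 3),
    "diagnosis_label": "triple negative breast cancer",
    "diagnosis_date": date(2018, 11, 20),
}

def to_omop_like(rec):
    """Transform a local record into OMOP-style person and condition rows."""
    person = {
        "person_source_value": rec["patient_id"],
        "gender_source_value": rec["sex"],
        "year_of_birth": rec["birth_date"].year,
    }
    condition = {
        "condition_source_value": rec["diagnosis_label"],
        "condition_start_date": rec["diagnosis_date"],
    }
    return person, condition

print(to_omop_like(local_record))
```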
It may not be possible to extract -omics data directly
from clinical data warehouses, because these data are
not widely collected in routine care. If not already pre-
sent in the electronic health record of the patient, -omics
data will be generated from patient samples stored in
biobanks. For the challenge, WES data, RNA-sequencing
data, microRNA expression levels and metabolomic data
will be obtained from primary tumor samples, and from
blood samples as a control. Data quality will be ensured
by using only freshly frozen tumors with a cell content of
more than 30% (as determined by a pathologist). Multi-
level -omics data contain a wealth of potentially relevant
information, including molecular variants (directly or
indirectly) affecting clinically significant pathways. Their
incorporation into the challenge dataset should greatly
increase the predictive power of the AI technologies
evaluated.
Data privacy
The patients’ right to privacy must be respected. Patients
must be informed about the storage and use of their
data, and must have signed a consent form authorizing
the collection and use of their data for research [44, 45].
Within Europe, data privacy is regulated by the General
Data Protection Regulation (GDPR) [45], which protects
patients against the inappropriate use of their data. Such
regulations ensure that (1) patients can choose whether
or not to consent to the collection of their data, (2)
patients are informed about the storage and use of their
data (principle of transparency), (3) data are stored in
an appropriate manner (principle of integrity), (4) data
are used only for certain well-defined purposes, and (5)
patients have the right to change their minds and to with-
draw consent at any time.
Planned actions
In the “ITFoC challenge”, data privacy
will be respected:
• Only datasets from patients who have signed a con-
sent form authorizing the reuse of their data and
samples for research will be included in the chal-
lenge.
• The clinical data will be pseudonymized by
state-of-the-art methods (and in accordance with the
GDPR), without altering the scientific content. Any
clinical information that could be used, directly or
indirectly, to identify the individual will be removed
(e.g. dates will be transformed into durations (com-
puted as a number of days)).
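The date-to-duration transformation mentioned above can be sketched in a few lines; the event names and index date are hypothetical:

```python
# Replace absolute dates by offsets from an index date (e.g. diagnosis),
# removing an indirect identifier while preserving the clinical timeline.
from datetime import date

def dates_to_durations(events, index_date):
    """Replace each event date by its offset (in days) from the index date."""
    return [(name, (d - index_date).days) for name, d in events]

timeline = [("surgery", date(2019, 1, 15)), ("relapse", date(2020, 3, 2))]
print(dates_to_durations(timeline, index_date=date(2018, 11, 20)))
# -> [('surgery', 56), ('relapse', 468)]
```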
Data security
AI evaluation should be hosted and managed on a secure
platform [46] that can ensure that the confidentiality, integ-
rity and availability of patient information are not
compromised deliberately or accidentally [44]. Any plat-
form used for AI evaluation should implement the strict-
est control over access, to ensure that data are available
only to authorized parties [44], only for the duration of
the evaluation [44], and that any personal data (including
both data directly linked to a patient, such as surname,
and indirectly linked to the patient, such as diagnosis
date) are removed [47].
Planned actions
In the “ITFoC challenge”, data security
will be ensured by using a dedicated ITFoC data space.
Workflows will be created between local clinical data
warehouses and the local ITFoC data space, for standardi-
zation of the datasets with respect to the standard TNBC
model. Each standardized dataset will be transferred to a
secure platform, on which it will be stored (Fig.3).
Participants will assess their AI technologies with the
same datasets hosted on a secure platform, but they will
not be allowed to access datasets directly. Clinical and
-omics data will be inaccessible throughout the duration
of the challenge, and participants will be provided only
with the specifications of the datasets.
Step 6: Specify themetrics used formeasuring AI
performance
e sixth step in AI assessment is defining the metrics
used to evaluate the performance of the AI algorithm.
The intrinsic performance of the AI itself is assessed
during the “fine-tuning” and the “clinical validation”
phases. Discrimination performance is measured in
terms of sensitivity and specificity for binary outputs [15].
By plotting the effects of different levels of sensitivity and
specificity for different thresholds, a ROC (receiver oper-
ating characteristics) curve can be generated [48]. This
ROC curve represents the discrimination performance of
a particular predictive algorithm [15]. The most common
metric used is the AUC (area under the ROC Curve), the
values of which lie between 0 and 1. Algorithms with high
levels of performance have a high sensitivity and specific-
ity, resulting in an AUC close to 1 [15, 48].
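A minimal sketch of these discrimination metrics, assuming scikit-learn and toy prediction scores:

```python
# Discrimination: ROC curve points and AUC for toy binary predictions.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # observed response
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]   # AI-predicted probability

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))         # close to 1 = good discrimination
```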
Calibration performance is measured for quantitative
outputs, such as probabilities [15]. It is used to determine
whether predicted probabilities agree with the real prob-
abilities [15]. e predicted probabilities are plotted on
the x-axis, and the observed real probabilities are plotted
on the y-axis, to generate a calibration plot [15]. This plot
can be used to estimate the goodness of fit between the
predicted and real probabilities [49]. Bland–Altman plots
can also be used to analyze the agreement between the
predicted and the observed probabilities [50].
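Calibration can be checked in the same spirit; the sketch below, assuming scikit-learn, bins predicted probabilities and compares each bin with the observed event rate:

```python
# Calibration: compare mean predicted probability with observed event rate
# in each bin. Data are synthetic and perfectly calibrated by construction.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
y_prob = rng.uniform(size=500)       # toy predicted probabilities
y_true = rng.binomial(1, y_prob)     # outcomes drawn at those rates

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
for pred, obs in zip(mean_pred, frac_pos):
    print(f"predicted {pred:.2f} vs observed {obs:.2f}")  # ideal: equal values
```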
A more detailed discussion of the statistical methods
used to measure AI performance is beyond the scope of
this article but can be found elsewhere [49].
Fig. 3 Data workflow for the ITFoC challenge
The clinical performance of AI in real clinical settings is
assessed during the “patient outcome assessment” phase.
AI metrics, such as AUC, are not always understood by
clinicians [51], and do not necessarily reflect clinical effi-
cacy [52]. ere is a need to determine the effect of AI
on patient outcomes in real-life conditions. Ideally, the
effects of AI should be compared to a gold standard [53]
or baseline (i.e. standard procedure) in an RCT using
standard statistical approaches [15].
Planned actions
In the “ITFoC challenge”, we will assess the performance
of AI itself with the binary criterion “predicted response
to treatment” during the clinical validation phase. For
each AI algorithm, various metrics will be reported,
including AUC, confusion matrix, sensitivity, specificity,
positive and negative predictive values.
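All of these metrics can be derived from the confusion matrix; a minimal sketch with illustrative predictions, assuming scikit-learn:

```python
# Confusion-matrix-derived metrics for toy binary predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # observed response to treatment
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # AI-predicted response

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)        # true-positive rate
specificity = tn / (tn + fp)        # true-negative rate
ppv = tp / (tp + fp)                # positive predictive value
npv = tn / (tn + fn)                # negative predictive value
print(sensitivity, specificity, ppv, npv)
```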
The evaluation will be carried out by a scientific com-
mittee, independent of the ITFoC organizational com-
mittee. is scientific committee will include members
from various disciplines (e.g. bioinformaticians, medical
doctors, data scientists, statistical and machine-learning
experts) and from various international institutions (aca-
demic, research and hospital institutions).
Step 7: Specify theprocedures toensure AI explainability
e seventh step in the assessment of AI is examining the
underlying algorithm [54, 55]. is step has two expected
benefits. First, it may prevent an inappropriate represen-
tation of the dataset used for training/validation. Sec-
ond, it may reveal the learning of unanticipated artifacts
instead of relevant inputs [54].
The input data must be analyzed first [54]. The type
(structured or unstructured), format (e.g. text, numbers,
images), and specifications (e.g. variables used) of the
data must be assessed. A better comprehension of the
input data should ensure that the data used by the AI are
comprehensive and relevant to clinical practice.
The underlying algorithm should also be analyzed [54].
The code, documented scripts, and the computer envi-
ronment should be evaluated by independent research-
ers. Ideally, independent researchers should even run the
pipeline, check the underlying AI methods and evaluate
the explainability of the outputs [54]. However, AI devel-
opers may be reluctant to share their code openly, for
scientific or economic reasons. In such cases, alternatives
can be found, such as a trusted neutral third party signing
a confidentiality form, or a virtual computing machine
running the code with new datasets [54], or the provision
of documentation about the AI.
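One possible, non-prescriptive probe of this kind is permutation importance, which estimates how strongly each input drives the predictions and can flag models that rely on artifacts; a minimal sketch with synthetic data, assuming scikit-learn:

```python
# Permutation importance: shuffle each feature and measure the score drop.
# Synthetic data; the outcome is driven by features 0 and 1 by construction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")  # artifacts show up as odd rankings
```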
Planned actions
In the “ITFoC challenge”, we aim to explain why some AI
technologies successfully predict treatment response, whereas oth-
ers fail. Each AI developer participating in the challenge
should provide the data specifications used by the AI.
We will encourage the AI developers to share their code
openly. Alternatively, they could opt for restricted code
sharing with the scientific committee (the scientific com-
mittee will sign a confidentiality agreement).
Discussion
We describe here the framework designed by the ITFoC
consortium for the assessment of AI technologies for
predicting treatment response in oncology. This frame-
work will be used to construct a validation platform for
the “ITFoC Challenge”, a community-wide competition
for assessing and comparing AI algorithms predicting
the response to treatments in TNBC patients, using real-
world datasets.
Use ofreal-world datasets forvalidating AI technologies
The systematic and rigorous validation of AI technologies
is essential before their integration into clinical practice.
Such evaluation is the only way to prevent unintentional
harm, such as misdiagnosis, inappropriate treatment or
adverse effects, potentially decreasing patient survival.
To date, only a few AI-based solutions have actually been
clinically validated [9], most of them exclusively on internal
datasets, with no external validation. RCTs in which AI
technologies are compared to the gold standard (i.e. rou-
tine care delivered by medical experts) are the strongest
and most reliable approach for assessing AI performance
and safety [56]. Such trials provide a more detailed evalu-
ation, including a range of relevant parameters, such as
patient benefits in terms of quality of life, acceptance by
physicians, integration into the clinical workflow, and
economic impact. However, RCTs are costly, both finan-
cially and in terms of time required, and should be pre-
ceded by early-phase studies [4].
Here, we support the idea that when AI technologies
reach a state of sufficient “maturity”, they should undergo
clinical validation with external real-world datasets. This
would make it possible to measure the performance
and safety of AI quickly and reliably in conditions close
to those encountered in real life. This validation pro-
cess would save both money and time, due to the use of
real-world datasets from clinical data warehouses. At the
end of this early validation step, if the performance of a
specific AI technology falls short of expectations (e.g. if
it fails to predict response to treatment, or is considered
unsafe), then it can be rejected (as in early-phase trials
for drugs), and no further evaluation in RCTs is required.
If an AI is validated clinically with these real-world
datasets, it can be considered a good candidate and
allowed to progress to the next stage in evaluation (i.e. an
RCT). e validation process outlined here (“validation
step with retrospective real-world datasets”) should thus
be an integral part of the entire AI evaluation process,
constituting the decisive step concerning whether or not
to perform a RCT.
Use ofacommunity-wide competition toassess AI
technologies
We propose here to organize the “validation step” in the
form of a community-wide competition. Competition-
based approaches are increasingly being seen as relevant
in the medical informatics domain, with participating
teams usually tackling a challenge over a limited time
period, with access to an anonymized dataset for the test-
ing of methods. For example, the i2b2 (Informatics for
Integrating Biology and the Bedside) project includes
a “Natural Language Processing” challenge for assess-
ing methods for understanding clinical narratives [57].
Competition-based approaches have also been developed
in oncology (e.g. the Sage Bionetworks—DREAM Breast
Cancer Prognosis Challenge, designed for developing
computational models that can predict breast cancer
survival [58, 59]; and the Prostate DREAM Challenge,
for identifying prognostic models capable of predicting
survival in patients with metastatic castration-resistant
prostate cancer [46]). The utility of these crowdsourced
challenges for the community has clearly been demon-
strated. They have multiple advantages: (1) they allow the
development of models that outperform those developed
with traditional research approaches [58, 60], (2) they
encourage collaboration between teams for the improve-
ment of models [60], and (3) they provide more trans-
parent results, because both favorable and unfavorable
results are published [58, 60].
We derived a framework from these competition-
based approaches. Our approach is based on the same
principles as these existing challenges, but focusing on
the combination of real-world data collected from clini-
cal data warehouses (rather than data collected through
RCTs), and -omics data generated by next-generation
sequencing techniques. The results of the “ITFoC chal-
lenge” will provide essential proof-of-principle evidence
for the use of real-world datasets for validating AI tech-
nologies in a competition setting, as an essential precur-
sor to RCTs.
Accelerating AI transfer to healthcare settings
We propose a framework for the clinical validation of AI
technologies before their transfer to clinical settings and
clear actions in the domain of TNBC treatment. Both the
framework and the planned actions can be generalized
to other questions in oncology, with minor adaptations.
For instance, for diagnosis, other datasets could be con-
sidered (e.g. images, signals). Likewise, we propose here
the use of real-world datasets from various healthcare
centres, to guarantee the volume and representativeness
of the dataset. Similarly, when dealing with rare can-
cers, the datasets may come from various centers, and
may even be extended to other sources, such as clinical
research data. Datasets from other sources have already
been successfully used for the assessment of AI in breast
and prostate cancers [46, 58]. Furthermore, the metrics
used to assess AI performance may also differ, depend-
ing on the type of cancer and the intended use of AI (e.g.
for diagnosis, the primary outcome could be compared to
the diagnosis made by an oncologist).
We believe that a platform, as described here, could
help to accelerate AI transfer to healthcare settings in
oncology. AI systems are currently considered to be
medical devices that can only be implemented in health
centers after the demonstration of their safety and effi-
cacy through a large prospective RCT [4]. However, this
is time-consuming and expensive, and there is a risk of
patient outcome studies becoming obsolete by the time
the results become available [15]. The use of a valida-
tion platform has several advantages: (1) several AI tech-
nologies can be assessed in parallel for the same price
(whereas an RCT is usually designed to assess a single
AI technology); (2) the platform can be re-used for fur-
ther AI evaluations; (3) new datasets can easily be added
to the platform; (4) transparency is guaranteed, as the
results are communicated even if unfavorable. For all
these reasons, validation platforms constitute a credible
route towards establishing a rigorous, unbiased, trans-
parent and durable approach to the assessment of AI
technologies.
Supporting precision medicine
Clinical care decisions are traditionally driven by patient
symptoms and disease characteristics. In precision
oncology, the scope is extended to the patient pheno-
type, preclinical symptoms, tumor characteristics and
the complex molecular mechanisms underlying disease
[61]. Recent advances in genetics and sequencing tech-
nologies are now enabling clinicians to include molec-
ular aspects of the disease in their clinical decision
processes, and advances in metabolomics have facili-
tated considerations of the functional activity of can-
cer cells [62, 63]. The use of -omics data in routine care
(e.g. genomic, metabolomic or proteomic data [64]), is
strongly supported by the European Medicines Agency
[18], and could lead to significant improvements in
patient care.
Here, we provide support for the idea that -omics anal-
ysis should be part of the clinical decision process. The
“ITFoC Challenge” aims to demonstrate the benefits of
integrating clinical data warehouses and biobanks into
the clinical care process, in accordance with the findings
of previous studies [65, 66]. By combining clinical and
-omics data, AI tools may facilitate the delivery of treat-
ments that are personalized according to the characteris-
tics of the patients and their tumors, thereby increasing
the chances of survival and decreasing side effects. By
designing the “ITFoC Challenge”, we aim to encourage
the development of AI based on clinical and -omics data
for the prediction of treatment response in cancer, and
the personalization of cancer treatment.
Conclusions
We hereby propose a framework for assessing AI tech-
nologies based on real-world data, before their use in
healthcare settings. is framework includes seven
key steps specifying: (1) the intended use of AI, (2)
the target population, (3) the timing for AI evaluation,
(4) the datasets selected for evaluation, (5) the proce-
dures used to ensure data safety, (6) the metrics used to
measure performance, and (7) the procedures used to
ensure that the AI is explainable. The proposed frame-
work has the potential to accelerate the transfer of AI
into clinical settings, and to boost the development of
AI solutions using clinical and -omics data to predict
treatment responses and to personalize treatment in
oncology. Here, we applied this framework to the estab-
lishment of a community-wide competition in the con-
text of predicting treatment responses in TNBC.
Abbreviations
AI: Artificial intelligence; CDISC: Clinical data interchange standards con‑
sortium; GDPR: General data protection regulation; HGVS: Human genome
variation society; ICD: International classification of diseases; LOINC: Logical
observation identifiers names and codes; MedDRA: Medical dictionary for
regulatory activities; OMOP: Observational medical outcomes partner‑
ship; MIAME: Minimum information about a microarray experiment; MAGE:
MicroArray gene expression; ML: Machine learning; NGS: Next‑generation
sequencing; OHDSI: Observational health data sciences and informatics; RCT:
Randomized controlled trials; ROC: Receiver operating characteristics; TNBC:
Triple‑negative breast cancer; UMLS: Unified medical language system; VCF:
Variant call format; WGS: Whole‑genome sequencing; WES: Whole‑exome
sequencing.
Authors’ contributions
Design: RT, XF, CL, LA, HL, MV, FD, OUS, MC, MDT, EM, LMI, MG, LO, FG, CNB,
BCG, CC, CFC, SC, OHC, AI, BL, MLG, AL, GM, HOS, BR, PT, LT, AV, CW, NB, AB.
Writing original manuscript: RT, AB. Agreement with all aspects of the work: RT,
XF, CL, LA, HL, MV, FD, OUS, MC, MDT, EM, LMI, MG, LO, FG, CNB, BCG, CC, CFC,
SC, OHC, AI, BL, MLG, AL, GM, HOS, BR, PT, LT, AV, CW, NB, AB. All authors read
and approved the final manuscript.
Funding
This work was supported by the ITFoC project (Information Technology for the
Future of Cancer), with FLAG‑ERA support.
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or
analyzed during this study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Hans Lehrach is a member of the board of Alacris Theranostics GmbH. Felix
Dreher is an employee of Alacris Theranostics GmbH. Lesley Ogilvie is an
employee of Alacris Theranostics GmbH. Bodo Lange is the CEO of Alacris
Theranostics GmbH. Christoph Wierling is an employee of Alacris Theranostics
GmbH. The other authors have no conflicts of interest to declare.
Author details
1 Centre de Recherche Des Cordeliers, Inserm, Université de Paris, Sorbonne
Université, 75006 Paris, France. 2 Inria, HeKA, Inria Paris, France. 3 Department
of Medical Informatics, Hôpital Européen Georges‑Pompidou, AP‑HP, Paris,
France. 4 Institut Curie, 25 Rue d’Ulm, 75005 Paris, France. 5 Centro Risonanze
Magnetiche ‑ CERM/CIRMMP and Department of Chemistry, University of Flor‑
ence, 50019 Sesto Fiorentino (Florence), Italy. 6 Department of Biotechnology
and Biosciences, University of Milano Bicocca and ISBE‑Italy/SYSBIO ‑ Candi‑
date National Node of Italy for ISBE, Research Infrastructure for Systems Biol‑
ogy Europe, Milan, Italy. 7 Max Planck Institute for Molecular Genetics, Berlin,
Germany. 8 Alacris Theranostics GmbH, Berlin, Germany. 9 School of Medicine
Biostatistics and Medical Informatics Dept., Acibadem University, Istanbul,
Turkey. 10 Univ Rennes, CHU Rennes, Inserm, LTSI ‑ UMR 1099, 35000 Rennes,
France. 11 Univ Rennes, Department of Molecular Genetics and Genomics, CHU
Rennes, IGDR‑UMR6290, CNRS, 35000 Rennes, France. 12 RSU Institute of Oncol‑
ogy, Dzirciema str. 16, Riga 1010, Latvia. 13 Transilvania University of Brasov,
Brasov, Romania. 14 Centre for Innovation in Medicine, Bucharest, Romania.
15 INSERM U1242 « Chemistry, Oncogenesis Stress Signaling », Université de
Rennes, 35042 CEDEX, Rennes, France. 16 Centre de Lutte Contre Le Cancer
Eugène Marquis, CRB Santé (BRIF Number: BB‑0033‑00056), 35042 CEDEX,
Rennes, France. 17 Univ Rennes, CLCC Eugène Marquis, INSERM, LTSI ‑ UMR
1099, 35000 Rennes, France. 18 Department of Molecular Genetics and Genom‑
ics, CHU Rennes, 35000 Rennes, France. 19 Department of Informatics, Systems
and Communication, University of Milano Bicocca and ISBE‑Italy/SYSBIO ‑
Candidate National Node of Italy for ISBE, Research Infrastructure for Systems
Biology Europe, Milan, Italy. 20 EPIGENETICS Inc. BUDOTEK, Istanbul, Turkey.
21 Direction de La Recherche Fondamentale (DRF), CEA, Université Paris‑Saclay,
91191 Gif‑sur‑Yvette, France. 22 PaRis Artificial Intelligence Research InstitutE
(Prairie), Paris, France.
Received: 18 June 2020 Accepted: 22 September 2021
References
1. Paton C, Kobayashi S. An open science approach to artificial intelligence
in healthcare. Yearb Med Inform. 2019;28:47–51.
2. Davenport T, Kalakota R. The potential for artificial intelligence in health‑
care. Future Healthc J. 2019;6:94–8.
3. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A,
et al. Development and validation of a deep learning algorithm for
detection of diabetic retinopathy in retinal fundus photographs. JAMA.
2016;316:2402–10.
4. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autono‑
mous AI‑based diagnostic system for detection of diabetic retinopathy in
primary care offices. NPJ Digit Med. 2018;1:39.
5. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatolo‑
gist‑level classification of skin cancer with deep neural networks. Nature.
2017;542:115–8.
6. Calvert JS, Price DA, Chettipally UK, Barton CW, Feldman MD, Hoffman JL,
et al. A computational approach to early sepsis detection. Comput Biol
Med. 2016;74:69–73.
7. FDA. Artificial Intelligence and Machine Learning in Software as a Medical
Device. FDA. 2019. http://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device. Accessed 15 Nov 2019.
8. Ding J, Li X. An approach for validating quality of datasets for machine
learning. IEEE Int Conf Big Data Big Data. 2018;2018:2795–803.
9. Kim DW, Jang HY, Kim KW, Shin Y, Park SH. Design characteristics of
studies reporting the performance of artificial intelligence algorithms for
diagnostic analysis of medical images: results from recently published
papers. Korean J Radiol. 2019;20:405–10.
10. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva‑Atanasova K.
Artificial intelligence, bias and clinical safety. BMJ Qual Saf. 2019;28:231–7.
11. Cabitza F, Rasoini R, Gensini GF. Unintended consequences of machine
learning in medicine. JAMA. 2017;318:517–8.
12. Park SH, Kressel HY. Connecting technological innovation in artificial
intelligence to real‑world medical practice through rigorous clinical
validation: what peer‑reviewed medical journals could do. J Korean Med
Sci. 2018. https://doi.org/10.3346/jkms.2018.33.e152.
13. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine
learning applications in cancer prognosis and prediction. Comput Struct
Biotechnol J. 2015;13:8–17.
14. The Lancet. Artificial intelligence in health care: within touching
distance. Lancet Lond Engl. 2018;390:2739.
15. Park SH, Han K. Methodologic guide for evaluating clinical performance
and effect of artificial intelligence technology for medical diagnosis and
prediction. Radiology. 2018;286:800–9.
16. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak
A, et al. The FAIR Guiding Principles for scientific data management and
stewardship. Sci Data. 2016. https://doi.org/10.1038/sdata.2016.18.
17. Jannot A‑S, Zapletal E, Avillach P, Mamzer M‑F, Burgun A, Degoulet P. The
Georges Pompidou University hospital clinical data warehouse: a 8‑years
follow‑up experience. Int J Med Inf. 2017;102:21–8.
18. European Medicines Agency. EMA Regulatory Science to 2025. Strategic
reflection. 2018.
19. Park Y, Jackson GP, Foreman MA, Gruen D, Hu J, Das AK. Evaluating
artificial intelligence in medicine: phases of clinical research. JAMIA Open.
2020;3:326–31.
20. IT Future of Cancer. https://itfoc.eu/. Accessed 30 Apr 2020.
21. Breast cancer statistics. World Cancer Research Fund. 2018. https://www.wcrf.org/dietandcancer/cancer-trends/breast-cancer-statistics. Accessed 13 Dec 2019.
22. Boyle P. Triple‑negative breast cancer: epidemiological considera‑
tions and recommendations. Ann Oncol Off J Eur Soc Med Oncol.
2012;23(Suppl 6):vi7–12.
23. Khosravi‑Shahi P, Cabezón‑Gutiérrez L, Aparicio Salcedo MI. State of art of
advanced triple negative breast cancer. Breast J. 2019;25:967–70.
24. Ovcaricek T, Frkovic SG, Matos E, Mozina B, Borstnar S. Triple nega‑
tive breast cancer—prognostic factors and survival. Radiol Oncol.
2010;45:46–52.
25. Ogilvie LA, Wierling C, Kessler T, Lehrach H, Lange BMH. Predictive
modeling of drug treatment in the area of personalized medicine. Cancer
Inform. 2015;14(Suppl 4):95–103.
26. Fröhlich F, Kessler T, Weindl D, Shadrin A, Schmiester L, Hache H, et al.
Efficient parameter estimation enables the prediction of drug response
using a mechanistic pan‑cancer pathway model. Cell Syst. 2018;7:567‑
579.e6.
27. Magrabi F, Ammenwerth E, McNair JB, De Keizer NF, Hyppönen H,
Nykänen P, et al. Artificial intelligence in clinical decision support: chal‑
lenges for evaluating AI and practical implications. Yearb Med Inform.
2019;28:128–34.
28. Meurier A‑L, Ghafoor Z, Foehrenbach C, Hartmann C, Herzog J, Madzou
L, et al. Mission assigned by the Prime Minister Édouard Philippe, p.
154.
29. AI HLEG (High‑Level Expert Group on Artificial Intelligence), set up by
the European Commission. Ethics Guidelines for Trustworthy AI. 2018.
30. Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E,
et al. ArrayExpress update—simplifying data submissions. Nucleic Acids
Res. 2015;43(Database issue):D1113–6.
31. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M,
et al. NCBI GEO: archive for functional genomics data sets—update.
Nucleic Acids Res. 2013;41(Database issue):D991–5.
32. Committee on Strategies for Responsible Sharing of Clinical Trial Data,
Board on Health Sciences Policy, Institute of Medicine. Sharing Clinical
Trial Data: Maximizing Benefits, Minimizing Risk. Washington (DC):
National Academies Press (US); 2015. http://www.ncbi.nlm.nih.gov/books/NBK269030/. Accessed 18 Sep 2019.
33. Delamarre D, Bouzille G, Dalleau K, Courtel D, Cuggia M. Semantic
integration of medication data into the EHOP Clinical Data Warehouse.
Stud Health Technol Inform. 2015;210:702–6.
34. Weiskopf NG, Weng C. Methods and dimensions of electronic health
record data quality assessment: enabling reuse for clinical research. J
Am Med Inform Assoc JAMIA. 2013;20:144–51.
35. Kim H‑S, Lee S, Kim JH. Real‑world evidence versus randomized
controlled trial: clinical research based on electronic medical records. J
Korean Med Sci. 2018. https:// doi. org/ 10. 3346/ jkms. 2018. 33. e213.
36. Tsopra R, Peckham D, Beirne P, Rodger K, Callister M, White H, et al.
The impact of three discharge coding methods on the accuracy of
diagnostic coding and hospital reimbursement for inpatient medical
care. Int J Med Inf. 2018;115:35–42.
37. Tsopra R, Wyatt JC, Beirne P, Rodger K, Callister M, Ghosh D, et al. Level
of accuracy of diagnoses recorded in discharge summaries: a cohort
study in three respiratory wards. J Eval Clin Pract. 2019;25:36–43.
38. Richesson RL, Krischer J. Data standards in clinical research: gaps, over
laps, challenges and future directions. J Am Med Inform Assoc JAMIA.
2007;14:687–96.
39. Manzoni C, Kia DA, Vandrovcova J, Hardy J, Wood NW, Lewis PA, et al.
Genome, transcriptome and proteome: the rise of omics data and their
integration in biomedical sciences. Brief Bioinform. 2018;19:286–302.
40. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P,
Stoeckert C, et al. Minimum information about a microarray experi
ment (MIAME)‑toward standards for microarray data. Nat Genet.
2001;29:365–71.
41. Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, et al.
Design and implementation of microarray gene expression markup lan‑
guage (MAGE‑ML). Genome Biol. 2002;3:research0046.1‑research0046.9.
42. OMOP Common Data Model – OHDSI. https:// www. ohdsi. org/ data‑ stand
ardiz ation/ the‑ common‑ data‑ model/. Accessed 13 Dec 2019.
43. OSIRIS : a national data sharing project—www. en. ecanc er. fr. https:// en.e‑
cancer. fr/ OSIRIS‑ a‑ natio nal‑ data‑ shari ng‑ proje ct. Accessed 4 Nov 2019.
44. Georgiou A, Magrabi F, Hyppönen H, Wong ZS‑Y, Nykänen P, Scott PJ, et al.
The safe and effective use of shared data underpinned by stakeholder
engagement and evaluation practice. Yearb Med Inform. 2018;27:25–8.
45. EU data protection rules. European Commission ‑ European Commission.
https:// ec. europa. eu/ commi ssion/ prior ities/ justi ce‑ and‑ funda mental‑
rights/ data‑ prote ction/ 2018‑ reform‑ eu‑ data‑ prote ction‑ rules/ eu‑ data‑
prote ction‑ rules_ en. Accessed 1 Nov 2019.
46. Guinney J, Wang T, Laajala TD, Winner KK, Bare JC, Neto EC, et al.
Prediction of overall survival for patients with metastatic castration‑
resistant prostate cancer: development of a prognostic model through
a crowdsourced challenge with open clinical trial data. Lancet Oncol.
2017;18:132–42.
47. Doel T, Shakir DI, Pratt R, Aertsen M, Moggridge J, Bellon E, et al. GIFT‑
Cloud: a data sharing and collaboration platform for medical imaging
research. Comput Methods Programs Biomed. 2017;139:181–90.
48. Handelman GS, Kok HK, Chandra RV, Razavi AH, Huang S, Brooks M, et al.
Peering into the black box of artificial intelligence: evaluation metrics of
machine learning methods. AJR Am J Roentgenol. 2019;212:38–43.
49. Steyerberg E. Clinical prediction models: a practical approach to develop‑
ment, validation, and updating. New York: Springer; 2009. https:// doi. org/
10. 1007/ 978‑0‑ 387‑ 77244‑8.
50. Altman DG, Bland JM. Measurement in medicine: the analysis of method
comparison studies. J R Stat Soc Ser Stat. 1983;32:307–17.
51. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key chal‑
lenges for delivering clinical impact with artificial intelligence. BMC Med.
2019;17:195.
52. Keane PA, Topol EJ. With an eye to AI and autonomous diagnosis. NPJ Digit Med. 2018;1:40.
53. Vollmer S, Mateen BA, Bohner G, Király FJ, Ghani R, Jonsson P, et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ. 2020. https://doi.org/10.1136/bmj.l6927.
54. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26:1320–4.
55. Haibe-Kains B, Adam GA, Hosny A, Khodakarami F, Board MS, Waldron L, et al. The importance of transparency and reproducibility in artificial intelligence research. Nature. 2020. https://doi.org/10.1038/s41586-020-2766-y.
56. Moja L, Polo Friz H, Capobussi M, Kwag K, Banzi R, Ruggiero F, et al. Effectiveness of a hospital-based computerized decision support system on clinician recommendations and patient outcomes: a randomized clinical trial. JAMA Netw Open. 2019;2:e1917094.
57. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc JAMIA. 2008;15:14–24.
58. Margolin AA, Bilal E, Huang E, Norman TC, Ottestad L, Mecham BH, et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Sci Transl Med. 2013;5:181re1.
59. Cheng W-Y, Ou Yang T-H, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci Transl Med. 2013;5:181ra50.
60. Bilal E, Dutkowski J, Guinney J, Jang IS, Logsdon BA, Pandey G, et al. Improving breast cancer survival analysis through competition-based multidimensional modeling. PLoS Comput Biol. 2013;9:e1003047.
61. Chen R, Snyder M. Promise of personalized omics to precision medicine. Wiley Interdiscip Rev Syst Biol Med. 2013;5:73–82.
62. Nielsen J. Systems biology of metabolism: a driver for developing personalized and precision medicine. Cell Metab. 2017;25:572–9.
63. Sun L, Suo C, Li S-T, Zhang H, Gao P. Metabolic reprogramming for cancer cells and their microenvironment: beyond the Warburg effect. Biochim Biophys Acta Rev Cancer. 2018;1870:51–66.
64. Wang F, Preininger A. AI in health: state of the art, challenges, and future directions. Yearb Med Inform. 2019;28:16–26.
65. Rance B, Canuel V, Countouris H, Laurent-Puig P, Burgun A. Integrating heterogeneous biomedical data for cancer research: the CARPEM infrastructure. Appl Clin Inform. 2016;7:260–74.
66. Danciu I, Cowan JD, Basford M, Wang X, Saip A, Osgood S, et al. Secondary use of clinical data: the Vanderbilt approach. J Biomed Inform. 2014;52:28–35.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.