Background
Artificial intelligence (AI) has the potential to transform our healthcare systems significantly. New AI technologies based on machine learning approaches should play a key role in clinical decision-making in the future. However, their implementation in health care settings remains limited, mostly due to a lack of robust validation procedures. There is a need to develop reliable assessment frameworks for the clinical validation of AI. We present here an approach for assessing AI for predicting treatment response in triple-negative breast cancer (TNBC), using real-world data and molecular -omics data from clinical data warehouses and biobanks.
Methods
The European “ITFoC (Information Technology for the Future Of Cancer)” consortium designed a framework for the clinical validation of AI technologies for predicting treatment response in oncology.
Results
This framework is based on seven key steps specifying: (1) the intended use of AI, (2) the target population, (3) the timing of AI evaluation, (4) the datasets used for evaluation, (5) the procedures used for ensuring data safety (including data quality, privacy and security), (6) the metrics used for measuring performance, and (7) the procedures used to ensure that the AI is explainable. This framework forms the basis of a validation platform that we are building for the “ITFoC Challenge”. This community-wide competition will make it possible to assess and compare AI algorithms for predicting the response to TNBC treatments with external real-world datasets.
Conclusions
The predictive performance and safety of AI technologies must be assessed in a robust, unbiased and transparent manner before their implementation in healthcare settings. We believe that the consideration of the ITFoC consortium will contribute to the safe transfer and implementation of AI in clinical settings, in the context of precision oncology and personalized care.
Tsopra et al. BMC Med Inform Decis Mak (2021) 21:274
A framework for validating AI in precision
medicine: considerations from the European
ITFoC consortium
Rosy Tsopra1,2,3,10* , Xose Fernandez4, Claudio Luchinat5, Lilia Alberghina6, Hans Lehrach7,8, Marco Vanoni6,
Felix Dreher8, O.Ugur Sezerman9, Marc Cuggia10, Marie de Tayrac11, Edvins Miklasevics12, Lucian Mihai Itu13,
Marius Geanta14, Lesley Ogilvie7,8, Florence Godey15,16, Cristian Nicolae Boldisor13, Boris Campillo‑Gimenez17,
Cosmina Cioroboiu14, Costin Florian Ciusdel13, Simona Coman13, Oliver Hijano Cubelos4, Alina Itu13,
Bodo Lange8, Matthieu Le Gallo15,16, Alexandra Lespagnol18, Giancarlo Mauri19, H.Okan Soykam20,
Bastien Rance1,2,3, Paola Turano5, Leonardo Tenori5, Alessia Vignoli5, Christoph Wierling8, Nora Benhabiles21 and
Anita Burgun1,2,3,22
Artificial intelligence (AI) has the potential to transform
our healthcare systems considerably and will play a key
role in clinical decision-making in the future [1]. AI has
been in the spotlight since the 1980’s, when the first
“expert systems” simulating the clinical reasoning for
clinical decisions emerged [2]. With the huge increase in
medical data over the last few decades, new approaches
have been developed (principally machine learning (ML),
including neural networks). ML techniques trained on
clinical datasets [2] have already proved useful for diagnostic
applications [3–5] and risk prediction [6].
Despite the enthusiasm surrounding AI technologies, their use in
healthcare settings remains limited. AI technologies
require rigorous assessment before they can be used
in clinical practice [7]. For example, the first AI-based
device to receive market authorization from the FDA
was assessed with a large prospective comparative clini-
cal trial including 900 patients from multiple sites [4].
AI technologies must satisfy stringent regulations for
approval as medical devices, because (1) the decision
support provided is optimized and personalized con-
tinuously in real time, according to the phenotype of the
patient [7]; (2) the performance of AI depends strongly
on the training datasets used [8], resulting in a large risk
of AI performing less well in real practice [9–11] or on
another group of patients or institutions [9]. It is, there-
fore, essential to assess the performance and safety of AI
before its introduction into routine clinical use.
Robust evaluations are required for AI to be trans-
ferred to clinical settings, but, in practice, only a few
such systems have been validated with external datasets
[12, 13]. A recent literature review reported that most
studies assessing AI did not include the recommended
design features for the robust validation of AI [9]. There
is, therefore, a need to develop frameworks for the robust
validation of the performance and safety of AI with reli-
able external datasets [14, 15].
Finding, accessing and re-using reliable datasets is a
real challenge in medicine (contrasting with other FAIR
data collections [16]). However, with the development
of clinical data warehouses within hospitals, it should
become easier to obtain access to “real datasets”. The
benefit of using real-world data for research purposes
[17], and, particularly, for generating complementary evi-
dence during AI life cycles, has been highlighted by the
European Medicines Agency [18]. Real-world data from
clinical data warehouses may, therefore, constitute a
valuable source of reliable external datasets for validating
AI before its implementation in healthcare settings.
Guidelines on the regulation of AI technologies include
high-level directions, but not specific guidance on the
practical steps in AI evaluation [19]. Here, we propose
a framework for assessing the clinical performance and
safety of AI in the context of precision oncology. More
precisely, the objective is to use real-world data collected
from clinical data warehouses and biobanks to assess AI
technologies for predicting the response to anti-cancer
drugs. We developed this framework as part of the Euro-
pean Flag-Era project ‘ITFoC (Information Technology
for the Future of Cancer)’ [20], to validate AI algorithms
with -omics and clinical data for the prediction of treat-
ment response in triple-negative breast cancer (TNBC).
This framework could help AI developers and institutions
to design clinically trustworthy decision support systems,
and to assess them with a robust methodology.
Breast cancer is the most common cancer in women
worldwide [21, 22]. e most aggressive type is triple-
negative breast cancer (TNBC), characterized by a lack
of estrogen receptor, progesterone receptor and human
epidermal growth factor expression, together with a high
histologic grade and a high rate of mitosis [23]. TNBC
accounts for 10–20% of all breast cancers, and has a very
poor prognosis, with chemotherapy the main therapeutic
option [23, 24]. New targeted and personalized therapies
are, therefore, urgently required [23].
In recent decades, cancer treatment has followed a
“one-size-fits-all” approach based on a limited set of
clinical criteria. Recent advances, rendering sequenc-
ing techniques more widely available, are providing new
opportunities for precision oncology, the personaliza-
tion of treatment based on a combination of clinical and
molecular data, and improvements in drug efficacy, with
fewer side effects.
In this context, many AI models have been developed,
based on the detailed molecular characterization of individual
tumors and patients. They model the effects and
adverse effects of drugs in the context of TNBC treat-
ment [25, 26]. However, these AI models often lack clini-
cal validation, and require further external evaluation.
Keywords: Artificial intelligence, Precision medicine, Personalized medicine, Computerized decision support systems,
Cancer, Oncology
The ITFoC (Information Technology for the Future of
Cancer) consortium [20], a multidisciplinary group from
six European countries, has proposed a new approach
to the unbiased validation of these AI models. This
approach involves evaluating the performance and safety
of these AI models through robust clinical evaluation
with reliable and external real-world datasets, before
their implementation in healthcare settings. e ITFoC
consortium has designed a framework to meet this goal.
is framework is based on seven key steps specifying
(Fig.1): (1) the intended use of AI, (2) the target popula-
tion, (3) the timing of AI evaluation, (4) the datasets used
for evaluation, (5) the procedures used for ensuring data
safety (including data quality, privacy and security), (6)
the metrics used for measuring performance, and (7) the
procedures used to ensure that the AI is explainable.
The framework designed by the “ITFoC consortium”
follows seven principles that we consider essential for
the assessment of AI technologies. This framework was
developed to support a community-based programming
contest to be held during “Pink October”. This “ITFoC
challenge” will open a platform enabling various teams
(academic, research, and MedTech organizations) to test
their AI-based approaches with TNBC datasets provided
by our partners for the purpose of this competition.
We describe here the framework and the parallel
actions planned for the setting up of the “ITFoC challenge”.
Step 1: Specify theintended use ofAI
e first step in AI assessment is accurately defining its
intended use (for medical purposes) [7], together with
its input (i.e. the data required to run the AI), and out-
put (i.e. the results provided by AI) parameters.
Once the intended use of AI is clearly stated, it is
important to be sure that:
• AI is used only to address questions that are relevant
and meaningful for the medical community. Indeed,
AI may be irrelevant if it is used in a correct, but not
useful manner in healthcare settings [27]. It is, there-
fore, important to define clearly the benefits of AI for
a particular clinical scenario.
• AI complies with ethical, legal and social standards
[27, 28]. As stated by the High-Level Expert Group
on AI established by the European Commission [29],
AI should (1) comply with all applicable laws and
regulations, (2) adhere to ethical principles and val-
ues, (3) not disadvantage people from particular soci-
odemographic backgrounds or suffering from certain
conditions, (4) not increase discrimination based on
ethnicity or sex.
Fig. 1 The seven key steps needed for the clinical validation of AI technologies
Planned actions
In the “ITFoC challenge”, we aim to assess AI with the fol-
lowing intended use: predicting the response of TNBC
patients to treatment, regardless of their origin or ethnic
background. More precisely, AI should be able to predict,
at the time of diagnosis, whether particular patients are
likely to respond to standard treatment, so that probable
non-responders can be offered alternative treatment.
The expected clinical impact is an improvement in
survival rates for TNBC patients, particularly those not
responding to standard treatment.
Step 2: Clearly specify the target population
The second step in AI assessment is accurately defining
the target population. AI must be evaluated on inde-
pendent datasets similar to the target population of the
AI technology. e population is defined during the
development phase, by specifying patient and disease
characteristics, in a similar manner to the definition
of eligibility criteria in conventional clinical trials. The
sets of patients selected for the assessment should be
representative of the target population, and consecu-
tive inclusion or random selection should be used for
patient recruitment, at multiple sites, to limit the risk
of spectrum bias (i.e. the risk of the patients selected
not reflecting the target population) [15], and to ensure
that the results can be generalized.
Contrary to the AI validation and training stages,
which require large datasets, AI evaluation does not
necessarily require ‘big data’ [15]. As in randomized
clinical trials, the study sample should be determined
according to the study hypothesis, expected effect (e.g.
superiority, non-inferiority) and degree of importance
(differences important or unimportant) [15].
Planned actions
In the “ITFoC challenge”, the target population is “women
who have been diagnosed with TNBC”. We need to assess
AI performance in terms of treatment response. We
must therefore select patients who have already received
first-line treatment (making it possible to compare the
predicted and observed responses in a retrospective mul-
ticentre cohort of TNBC patients).
Step 3: Specify thetiming ofAI evaluation
e third step in AI assessment is clearly defining the
timing of the evaluation. As in drug development, vari-
ous phases can be distinguished for AI evaluation (Fig.2):
e “fine-tuning” phase is an essential part of AI
development. It is equivalent to the “preclinical
phase” in drug development, when drugs are tested
in a laboratory setting. Here, AI is evaluated inter-
nally in three steps: training, internal validation, and
testing. e training step involves training the algo-
rithm on a subset of so-called “training” data. e
internal validation involves fine-tuning the algorithm
or selecting the most optimized parameters. e test
step corresponds to the final internal assessment of
the performance of the algorithm.
• The “clinical validation” phase follows the internal
validation and testing of AI. It is equivalent to phases
I and II of clinical trials, in which drug efficacy and
safety are assessed in a limited number of patients.
Here, the performance and safety of AI are assessed
with external data. e goal is to check that AI will
not result in lost opportunities for patients through
the generation of false-positive or false-negative pre-
dictions (i.e. for patients predicted to respond to a
treatment who do not in reality, and vice-versa).
• Finally, patient outcomes are assessed after clinical
validation with external datasets. This phase is
equivalent to the phase III of clinical trials, in which
new drugs are compared to standard treatment in
randomized controlled trials (RCT). Here, AI is
implemented in healthcare settings, and its effect on
patient outcomes and the efficiency of the healthcare
system is assessed with real patients, via a RCT.
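The three internal steps of the fine-tuning phase (training, internal validation, testing) can be illustrated by a simple random partition of a development dataset. The sketch below is only an illustration and is not part of the ITFoC framework: the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions chosen for the example.

```python
import random


def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly partition records into training, internal-validation and test subsets.

    The fractions and the seed are illustrative assumptions, not recommended values.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]                    # used to fit the algorithm
    val = shuffled[n_train:n_train + n_val]       # used to tune parameters
    test = shuffled[n_train + n_val:]             # used once, for the final internal assessment
    return train, val, test
```

All three subsets come from the same source, which is why this kind of split belongs to development and not to clinical validation.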
Planned actions
In the “ITFoC challenge”, we will focus on the “clini-
cal validation” phase. Akin to early-phase drug trials,
the goal will be to determine whether the AI developed
is sufficiently accurate and safe for transfer into clinical
practice for further assessment in RCTs.
Step 4: Specify thedatasets used forAI evaluation
e fourth step in AI assessment is the selection of reli-
able and representative datasets:
Publicly accessible datasets [1] are available through
public repositories (e.g. ArrayExpress [30], GEO [31])
or are released by research and/or medical institu-
tions (e.g. TCGA, or ICGC collections). However,
most are more suitable for bioinformatics than for
clinical informatics [1].
• Patient databases store retrospective or prospective
datasets generated by clinical trials or routine care
(real-world data).
‘Clinical trial’ datasets are collected in the con-
trolled environment of a specific clinical trial
Fig. 2 Evaluation of AI‑timing
(Table 1), from a restricted population that may
not be representative of the general population.
The data collection process is time-consuming
and costly, but the resulting data should be homo-
geneous, highly reliable and should have a well-
structured format. However, such datasets are not
generally made publicly available, for the follow-
ing reasons [32]: the potential loss of competitive
advantage for the organization funding the study;
the possibility of invalidating the results published
through secondary analyses; the costs associated
with data sharing and, finally, due to ethical and
scientific considerations. Moreover, data collec-
tion is usually limited to predefined sets of vari-
ables, and it may, therefore, be difficult to re-use
secondarily these data to address questions not
included in the initial protocol [32].
Real-world datasets are usually stored in clinical
data warehouses (Table1). ese datasets are col-
lected throughout patient care and have various
clinical sources (structured and unstructured clin-
ical records, laboratory, pharmacy, and radiology
results, etc.) [17, 33]. e collection of these data
is less time-consuming and costly than that for
clinical trial datasets. However, their exploitation
requires careful data quality management, because
they are highly variable and were initially collected
for clinical purposes rather than for research [34
Split-sample validation involves randomly splitting
datasets into separate parts, which are then used for
both the development and internal evaluation of AI [12,
15]. This method is relevant only during the development
phase, and cannot be used to validate the generalizability
of AI. Indeed, there is a risk of overfitting
bias (i.e. the AI fits too exactly to the training data), and
spectrum bias (i.e. the internal dataset is not represent-
ative of the population on which the AI will be used).
Validation on completely independent external data-
sets is required to overcome these limitations and for
validation of the generalizability of AI [15]. Geographic
sampling (i.e. using datasets collected by independent
investigators from different sites) could considerably
limit both biases, and improve the estimation of AI
generalizability in healthcare settings [15].
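Geographic sampling, as opposed to a random split, can be sketched as holding out every record from one or more sites as the external set. This is a minimal illustration under stated assumptions: the function name `geographic_split` and the `"site"` key are hypothetical and do not come from the ITFoC data model.

```python
def geographic_split(records, external_sites):
    """Hold out all records from the given sites as an external validation set.

    Each record is assumed to be a dict with a hypothetical "site" key.
    """
    external_sites = set(external_sites)
    development = [r for r in records if r["site"] not in external_sites]
    external = [r for r in records if r["site"] in external_sites]
    return development, external


# Made-up example: three sites, one of which is kept entirely for external validation.
patients = [
    {"id": 1, "site": "hospital_A"},
    {"id": 2, "site": "hospital_B"},
    {"id": 3, "site": "hospital_C"},
    {"id": 4, "site": "hospital_A"},
]
development_set, external_set = geographic_split(patients, {"hospital_C"})
```

Because no patient from the held-out site contributes to development, performance on the external set gives a less optimistic estimate than a split-sample evaluation.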
Planned actions
In the “ITFoC challenge”, we are working with retrospec-
tive real-world datasets collected from the clinical data
warehouses and biobanks of multiple hospitals, ensuring
that the TNBC population is broadly represented.
The inclusion criteria for datasets are:
• A follow-up period of at least three years, to ensure
the standardized evaluation of treatment response
• High-quality data extracted from a clinical data warehouse
or from a dedicated cancer database
• Biological samples must be available in biobanks for
additional -omics analyses, if required.
• Patients must have signed a consent form for the
reuse of their data and the reuse of their samples for
research purposes
The objective is not to acquire thousands of patient
datasets of variable quality, but to collect a representative
set of high-quality patient data.
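The inclusion criteria listed above can be expressed as a simple eligibility check applied to each candidate record. The sketch below is an assumption-laden illustration: the class `CandidateRecord` and all of its field names are hypothetical, and "high-quality data" is reduced here to a flag on the data source.

```python
from dataclasses import dataclass

MIN_FOLLOW_UP_YEARS = 3  # "at least three years" in the inclusion criteria


@dataclass
class CandidateRecord:
    # All field names are hypothetical; they are not taken from the ITFoC data model.
    follow_up_years: float
    from_cdw_or_cancer_db: bool      # extracted from a clinical data warehouse or cancer database
    biobank_sample_available: bool   # sample available for additional -omics analyses
    consent_signed: bool             # consent for reuse of data and samples


def is_eligible(record: CandidateRecord) -> bool:
    """Return True only if the record meets all four inclusion criteria."""
    return (
        record.follow_up_years >= MIN_FOLLOW_UP_YEARS
        and record.from_cdw_or_cancer_db
        and record.biobank_sample_available
        and record.consent_signed
    )
```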
Step 5: Specify theprocedures used toensure data safety
e fifth step in AI assessment is ensuring data safety,
including data quality, privacy and security, during the
evaluation phase.
Table 1 Clinical trial versus Real‑world datasets for AI evaluation
                                                          Clinical trial datasets             Real-world datasets
Setting                                                   Experimental                        Real world
Population   Representativeness                           Selective sample                    Large sample
             Type                                         Homogeneous                         Heterogeneous
             Size                                         +/−                                 ++++
             Time period for recruitment and follow-up    Limited                             Long
Data         Type                                         Clinical +/− -omics                 Clinical +/− -omics
             Collected by                                 Dedicated specialist professionals  Various healthcare professionals
             Quality                                      +++                                 +/−
             Need for data management                     +/−                                 +++
             Need for anonymization                       +                                   +
Data quality
Standardization is strongly recommended, to guarantee
the quality, sharing, portability and reusability of data
for AI evaluation [38]. Standardization is defined as the
representation of heterogeneous data with consensual
specifications [38]. It includes specifications for both data
fields (i.e. variables) and their value sets (i.e. codes) [38].
Standardization is highly dependent on the type of data-
sets involved.
Clinical data Clinical data are highly complex, for sev-
eral reasons: (1) they come from different sources (e.g.
electronic health records, reimbursement claims data),
(2) they have various formats (e.g. free text, numbers,
images), and representations (e.g. structured, semi-struc-
tured, unstructured); (3) the level of granularity is highly
variable, ranging from general to fine-grained concepts;
(4) datasets are not complete (e.g. missing data); (5) data-
set content varies within and between institutions.
Various common data models can be used to standardize
clinical datasets. These models include the CDISC
(Clinical Data Interchange Standards Consortium) model
for “clinical trial datasets”, which can be used to ensure
information system interoperability between healthcare
and clinical research, and the OMOP (Observational
Medical Outcomes Partnership) common data model for
real-world datasets. e data values must also be harmo-
nized by the use of terminologies ensuring interoperabil-
ity between AI systems, such as the ICD 10 (International
Classification of Diseases) for the standardization of
medical diagnoses, LOINC (Logical Observation Iden-
tifiers Names and Codes) for biological tests, Med-
DRA (Medical Dictionary for Regulatory Activities) for
adverse events, and so on. Most standard terminologies
are integrated into the UMLS (Unified Medical Language
System) metathesaurus, which can be used as a global
thesaurus in the biomedical domain.
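Harmonizing data values against a terminology amounts to mapping each local value to a standard code and flagging anything that cannot be mapped. The sketch below illustrates this for diagnoses with ICD-10; the local labels, the dictionary `LOCAL_TO_ICD10` and the function name are hypothetical, and a real mapping would be curated by terminology experts.

```python
# Hypothetical local labels mapped to an ICD-10 code. C50.9 is the ICD-10 code for
# "malignant neoplasm of breast, unspecified"; the labels themselves are made up.
LOCAL_TO_ICD10 = {
    "breast cancer": "C50.9",
    "cancer du sein": "C50.9",
}


def harmonize_diagnoses(local_labels):
    """Map free-text local diagnosis labels to ICD-10 codes.

    Returns the mapped codes and the labels that could not be mapped,
    so that unmapped values can be reviewed manually.
    """
    mapped, unmapped = [], []
    for label in local_labels:
        code = LOCAL_TO_ICD10.get(label.strip().lower())
        if code is None:
            unmapped.append(label)
        else:
            mapped.append(code)
    return mapped, unmapped
```

Keeping the unmapped labels, instead of silently dropping them, makes the completeness of the standardization step auditable.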
-Omics data -Omics data are complex: (1) they are
generated by different techniques, with different bioin-
formatic tools; (2) they may be based on different types
of NGS (next-generation sequencing) data, such as
WGS (whole-genome sequencing), WES (whole-exome
sequencing), and RNA-sequencing, or on data from prot-
eomics and metabolomics platforms; (3) their integration
and interpretation remain challenging, due to their size
and complexity, and the possibility of experimental and
technical errors during sample preparation, sequencing
and data analysis [39].
-Omics data can be standardized at any stage from data
generation to data interpretation. For example, MIAME
(minimum information about a microarray experi-
ment) [40] and MAGE (microarray gene expression data
modeling and exchange standards) have been developed
for microarray experiments [41]. e most widely used
format for variant identification is VCF (variant clinical
format), which includes a number of fields for genomic
coordinates, reference nucleotide, and variant nucleotide,
for example, but also metadata adding meaningful infor-
mation relating to variants: e.g. gene symbol, location,
type, HGVS (human genome variation society) nomen-
clature, predicted protein sequence alterations and
additional resources, such as cross-references to cancer-
specific and general genomic databases and prior in silico
algorithm-based predictions.
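To make the structure of a VCF record concrete, the sketch below parses one tab-delimited data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and splits the INFO metadata into key/value pairs. It is a simplified illustration that ignores header lines and per-sample columns; the example line and its INFO keys are invented for the example.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one VCF data line into a dict of its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])          # genomic coordinate
    record["ALT"] = record["ALT"].split(",")    # one or more variant alleles
    info = {}
    if record["INFO"] != ".":                   # "." means no metadata
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True  # keys without "=" are flags
    record["INFO"] = info
    return record


# Invented example line; the coordinates and INFO keys are not real annotations.
example = "17\t43000000\t.\tG\tA\t50\tPASS\tGENE=EXAMPLE;SOMATIC"
variant = parse_vcf_line(example)
```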
Standardization ofclinical and-omics data Standardi-
zation makes it possible to combine data from multiple
institutions. It also ensures the consistency of datasets,
and improves the quality and reliability of clinical and
-omics data. ese aspects are crucial, to maximize the
chances of predicting the real impact of AI on the health-
care process. Indeed, the ultimate performance of AI
depends strongly on the quality of data used for evalua-
tion [12, 13].
Planned actions In the “ITFoC” challenge, we will apply
a range of internationally accepted standards for breast
cancer data, to overcome issues of data heterogeneity
and variability associated with the use of data of different
provenances [34, 35] and to ensure access to high-quality
real-world datasets [38].
Clinical datasets will be standardized with the OMOP
common data model [42] for data structure and the OSIRIS
model [43] for data content. The OMOP CDM is supported
by the OHDSI consortium (Observational Health
Data Sciences and Informatics), and OSIRIS is supported
by the French National Institute of Cancer. Both stand-
ards include a list of concepts and source values, con-
sidered the minimal dataset necessary for the sharing of
clinical and biological data in oncology. Items and values
are structured and standardized according to interna-
tional medical terminologies, such as ICD 10, LOINC,
SNOMED CT. A standardized TNBC data model based
on these models will be used: items will be added,
removed and/or transformed, and values will be adapted
to TNBC data (e.g. the values of the “biomarker” item are
limited to RO, RP and HER2 receptors, Ki67). The instantiated
model contains the dataset specifications provided
to participants in this challenge. The database will be
populated locally through dedicated extract-transform-
load pipelines.
It may not be possible to extract -omics data directly
from clinical data warehouses, because these data are
not widely collected in routine care. If not already pre-
sent in the electronic health record of the patient, -omics
data will be generated from patient samples stored in
biobanks. For the challenge, WES data, RNA-sequencing
data, microRNA expression levels and metabolomic data
will be obtained from primary tumor samples, and from
blood samples as a control. Data quality will be ensured
by using only freshly frozen tumors with a cell content of
more than 30% (as determined by a pathologist). Multi-
level -omics data contain a wealth of potentially relevant
information, including molecular variants (directly or
indirectly) affecting clinically significant pathways. Their
incorporation into the challenge dataset should greatly
increase the predictive power of the AI technologies.
Data privacy
The patients’ right to privacy must be respected. Patients
must be informed about the storage and use of their
data, and must have signed a consent form authorizing
the collection and use of their data for research [44, 45].
Within Europe, data privacy is regulated by the General
Data Protection Regulation (GDPR) [45], which protects
patients against the inappropriate use of their data. Such
regulations ensure that (1) patients can choose whether
or not to consent to the collection of their data, (2)
patients are informed about the storage and use of their
data (principle of transparency), (3) data are stored in
an appropriate manner (principle of integrity), (4) data
are used only for certain well-defined purposes, and (5)
patients have the right to change their minds and to with-
draw consent at any time.
Planned actions In the “ITFoC challenge”, data privacy
will be respected:
• Only datasets from patients who have signed a consent
form authorizing the reuse of their data and
samples for research will be included in the challenge.
• The clinical data will be pseudo-anonymized by
state-of-the-art methods (and in accordance with the
GDPR), without altering the scientific content. Any
clinical information that could be used, directly or
indirectly, to identify the individual will be removed
(e.g. dates will be transformed into durations (com-
puted as a number of days)).
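The transformation of dates into durations can be sketched as follows. This is a minimal illustration under stated assumptions, not the consortium's actual pipeline: the field names, the set `DIRECT_IDENTIFIERS` and the choice of the diagnosis date as the reference point are all hypothetical.

```python
from datetime import date

# Hypothetical names of directly identifying fields to be dropped.
DIRECT_IDENTIFIERS = {"surname", "first_name"}


def pseudonymize(record, reference_field="diagnosis_date"):
    """Drop direct identifiers and replace dates by durations in days.

    Every date is expressed as a number of days since the reference date,
    and the reference date itself is removed from the output.
    """
    reference = record[reference_field]
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS or field == reference_field:
            continue
        if isinstance(value, date):
            cleaned[field.replace("_date", "_day")] = (value - reference).days
        else:
            cleaned[field] = value
    return cleaned
```

Intervals between clinical events are preserved, so the scientific content is unchanged, while the absolute calendar dates that could help identify a patient are no longer present.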
Data security
AI evaluation should be hosted and managed on a secure
platform [46], that can ensure that confidentiality, integ-
rity and/or the availability of patient information are not
compromised deliberately or accidentally [44]. Any plat-
form used for AI evaluation should implement the strict-
est control over access, to ensure that data are available
only to authorized parties [44], only for the duration of
the evaluation [44], and that any personal data (including
both data directly linked to a patient, such as surname,
and indirectly linked to the patient, such as diagnosis
date) are removed [47].
Planned actions In the “ITFoC challenge”, data security
will be ensured by using a dedicated ITFoC data space.
Workflows will be created between local clinical data
warehouses and the local ITFoC data space, for standardi-
zation of the datasets with respect to the standard TNBC
model. Each standardized dataset will be transferred to a
secure platform, on which it will be stored (Fig. 3).
Participants will assess their AI technologies with the
same datasets hosted on a secure platform, but they will
not be allowed to access datasets directly. Clinical and
-omics data will be inaccessible throughout the duration
of the challenge, and participants will be provided only
with the specifications of the datasets.
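One way to evaluate algorithms without giving participants direct access to the data is for the platform to run each submitted model on the hidden records and return only aggregate results. The sketch below illustrates that idea only; it is an assumption about how such a platform could work, not a description of the ITFoC platform, and the function name and the `"observed_response"` key are hypothetical.

```python
def evaluate_submission(predict, hidden_records):
    """Run a participant's model on hidden data and return aggregate results only.

    `predict` is the participant's function: it receives the input variables of
    one patient (without the outcome) and returns a predicted response (0 or 1).
    """
    if not hidden_records:
        raise ValueError("no records to evaluate")
    correct = 0
    for record in hidden_records:
        # The observed outcome is never passed to the participant's code.
        features = {k: v for k, v in record.items() if k != "observed_response"}
        if predict(features) == record["observed_response"]:
            correct += 1
    # Only summary figures leave the platform, never patient-level data.
    return {"n": len(hidden_records), "accuracy": correct / len(hidden_records)}
```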
Step 6: Specify themetrics used formeasuring AI
e sixth step in AI assessment is defining the metrics
used to evaluate the performance of the AI algorithm.
The intrinsic performance of the AI itself is assessed
during the “fine-tuning” and the “clinical validation”
phases. Discrimination performance is measured in
terms of sensitivity and specificity for binary outputs [15].
By plotting the effects of different levels of sensitivity and
specificity for different thresholds, a ROC (receiver operating
characteristic) curve can be generated [48]. This
ROC curve represents the discrimination performance of
a particular predictive algorithm [15]. The most common
metric used is the AUC (area under the ROC curve), the
values of which lie between 0 and 1. Algorithms with high
levels of performance have a high sensitivity and specific-
ity, resulting in an AUC close to 1 [15, 48].
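The construction of a ROC curve and its AUC can be sketched in a few lines. The code below is a plain-Python illustration with made-up labels and scores: each distinct score is used as a threshold, the resulting (1 − specificity, sensitivity) points form the curve, and the area is obtained with the trapezoidal rule. The function names are chosen for the example.

```python
def roc_points(y_true, scores):
    """Return the ROC curve as a list of (false-positive rate, sensitivity) points.

    y_true contains 0/1 labels; a case is predicted positive when its score
    is greater than or equal to the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both responders and non-responders are needed")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: 1 = responder, 0 = non-responder.
labels = [1, 1, 0, 1, 0, 0]
predicted_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, predicted_scores)
area_under_curve = auc(curve)
```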
Calibration performance is measured for quantitative
outputs, such as probabilities [15]. It is used to determine
whether predicted probabilities agree with the real probabilities
[15]. The predicted probabilities are plotted on
the x-axis, and the observed real probabilities are plotted
on the y-axis, to generate a calibration plot [15]. This plot
can be used to estimate the goodness of fit between the
predicted and real probabilities [49]. Bland–Altman plots
can also be used to analyze the agreement between the
predicted and the observed probabilities [50].
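The points of a calibration plot can be computed by grouping patients into bins of predicted probability and comparing, in each bin, the mean predicted probability (x-axis) with the observed response rate (y-axis). The sketch below is a simplified illustration; the equal-width binning and the default of five bins are assumptions, and other binning schemes are possible.

```python
def calibration_table(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed frequency, count) per non-empty bin.

    Bins are equal-width intervals of predicted probability on [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_predicted = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            table.append((mean_predicted, observed, len(members)))
    return table
```

For a well-calibrated algorithm, the two values in each row are close, so the points lie near the diagonal of the calibration plot.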
A more detailed discussion of the statistical methods
used to measure AI performance is beyond the scope of
this article but can be found elsewhere [49].
Fig. 3 Data workflow for the ITFoC challenge
The clinical performance of AI in real clinical settings is
assessed during the “patient outcome assessment” phase.
AI metrics, such as AUC, are not always understood by
clinicians [51], and do not necessarily reflect clinical efficacy
[52]. There is a need to determine the effect of AI
on patient outcomes in real-life conditions. Ideally, the
effects of AI should be compared to a gold standard [53]
or baseline (i.e. standard procedure) in an RCT using
standard statistical approaches [15].
Planned actions
In the “ITFoC challenge”, we will assess the performance
of AI itself with the binary criterion “predicted response
to treatment” during the clinical validation phase. For
each AI algorithm, various metrics will be reported,
including AUC, confusion matrix, sensitivity, specificity,
positive and negative predictive values.
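For a binary criterion such as "predicted response to treatment", the confusion matrix and the metrics derived from it can be computed as sketched below. The function names are chosen for the example, and returning `None` when a denominator is zero is an assumption about how undefined values might be reported.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels and predictions."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            cm["tp" if truth == 1 else "fp"] += 1
        else:
            cm["tn" if truth == 0 else "fn"] += 1
    return cm


def _ratio(numerator, denominator):
    # Undefined when the denominator is zero (e.g. no predicted responders).
    return numerator / denominator if denominator else None


def summary_metrics(cm):
    """Sensitivity, specificity, positive and negative predictive values."""
    return {
        "sensitivity": _ratio(cm["tp"], cm["tp"] + cm["fn"]),
        "specificity": _ratio(cm["tn"], cm["tn"] + cm["fp"]),
        "ppv": _ratio(cm["tp"], cm["tp"] + cm["fp"]),
        "npv": _ratio(cm["tn"], cm["tn"] + cm["fn"]),
    }
```

Unlike sensitivity and specificity, the predictive values depend on the proportion of responders in the evaluation dataset, which is one reason why a representative target population matters.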
The evaluation will be carried out by a scientific committee,
independent of the ITFoC organizational committee.
This scientific committee will include members
from various disciplines (e.g. bioinformaticians, medical
doctors, data scientists, statistical and machine-learning
experts) and from various international institutions (aca-
demic, research and hospital institutions).
Step 7: Specify theprocedures toensure AI explainability
e seventh step in the assessment of AI is examining the
underlying algorithm [54, 55]. is step has two expected
benefits. First, it may prevent an inappropriate represen-
tation of the dataset used for training/validation. Sec-
ond, it may reveal the learning of unanticipated artifacts
instead of relevant inputs [54].
The input data must be analyzed first [54]. The type
(structured or unstructured), format (e.g. text, numbers,
images), and specifications (e.g. variables used) of the
data must be assessed. A better comprehension of the
input data should ensure that the data used by the AI are
comprehensive and relevant to clinical practice.
The underlying algorithm should also be analyzed [54].
The code, documented scripts, and the computer environment
should be evaluated by independent researchers.
Ideally, independent researchers should even run the
pipeline, check the underlying AI methods and evaluate
the explainability of the outputs [54]. However, AI devel-
opers may be reluctant to share their codes openly, for
scientific or economic reasons. In such cases, alternatives
can be found, such as a trusted neutral third party signing
a confidentiality form, or a virtual computing machine
running the code with new datasets [54], or the provision
of documentations about the AI.
Planned actions
In the “ITFoC challenge”, we aim to explain why some AI
algorithms successfully predict treatment response, whereas
others fail. Each AI developer participating in the challenge
should provide the data specifications used by the AI.
We will encourage the AI developers to share their code
openly. Alternatively, they could opt for restricted code
sharing with the scientific committee, which will sign
a confidentiality agreement.
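As an illustration of what a "data specification" provided by a developer might look like, the following sketch records the type, format and name of each input variable and checks them against an evaluation dataset. All class names, field names and variables are hypothetical assumptions for illustration. The framework does not prescribe this structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariableSpec:
    name: str         # variable name as used by the AI (hypothetical)
    data_type: str    # "structured" or "unstructured"
    data_format: str  # e.g. "text", "number", "image"


@dataclass
class DataSpecification:
    algorithm: str
    variables: List[VariableSpec] = field(default_factory=list)

    def missing_from(self, available: Dict[str, str]) -> List[str]:
        # Declared variables that are absent from the evaluation dataset.
        return [v.name for v in self.variables if v.name not in available]

    def format_mismatches(self, available: Dict[str, str]) -> List[str]:
        # Declared variables present in the dataset under another format.
        return [v.name for v in self.variables
                if v.name in available
                and available[v.name] != v.data_format]


if __name__ == "__main__":
    spec = DataSpecification(
        algorithm="example-model",  # hypothetical submission name
        variables=[
            VariableSpec("age", "structured", "number"),
            VariableSpec("pathology_report", "unstructured", "text"),
            VariableSpec("brca1_variant", "structured", "text"),
        ],
    )
    # Hypothetical description of the evaluation dataset: name -> format.
    dataset = {"age": "number", "pathology_report": "image"}
    print(spec.missing_from(dataset))
    print(spec.format_mismatches(dataset))
```

A check of this kind would let the scientific committee see, before running a submission, whether the inputs it expects are actually available in the evaluation dataset.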
Discussion
We describe here the framework designed by the ITFoC
consortium for the assessment of AI technologies for
predicting treatment response in oncology. This framework
will be used to construct a validation platform for
the “ITFoC Challenge”, a community-wide competition
for assessing and comparing AI algorithms predicting
the response to treatments in TNBC patients from
real-world datasets.
Use ofreal-world datasets forvalidating AI technologies
The systematic and rigorous validation of AI technologies
is essential before their integration into clinical practice.
Such evaluation is the only way to prevent unintentional
harm, such as misdiagnosis, inappropriate treatment or
adverse effects, potentially decreasing patient survival.
To date, only a few AI-based solutions have actually been
clinically validated [9], and most of them only on internal
datasets, with no external validation. RCTs in which AI
technologies are compared to the gold standard (i.e. routine
care delivered by medical experts) are the strongest
and most reliable approach for assessing AI performance
and safety [56]. Such trials provide a more detailed evaluation,
including a range of relevant parameters, such as
patient benefits in terms of quality of life, acceptance by
physicians, integration into the clinical workflow, and
economic impact. However, RCTs are costly, in terms of
both money and time, and should be preceded by
early-phase studies [4].
Here, we support the idea that when AI technologies
reach a state of sufficient “maturity”, they should undergo
clinical validation with external real-world datasets. This
would make it possible to measure the performance
and safety of AI quickly and reliably in conditions close
to those encountered in real life. This validation process
would save both money and time, due to the use of
real-world datasets from clinical data warehouses. At the
end of this early validation step, if the performance of a
specific AI technology falls short of expectations (e.g. if
it fails to predict response to treatment, or is considered
unsafe), then it can be rejected (as in early-phase trials
for drugs), and no further evaluation in RCTs is required.
Tsopraetal. BMC Med Inform Decis Mak (2021) 21:274
Conversely, if the AI technology performs well on these external
datasets, it can be considered a good candidate and
allowed to progress to the next stage in evaluation (i.e. an
RCT). The validation process outlined here (“validation
step with retrospective real-world datasets”) should thus
be an integral part of the entire AI evaluation process,
constituting the decisive step concerning whether or not
to perform an RCT.
Use ofacommunity-wide competition toassess AI
We propose here to organize the “validation step” in the
form of a community-wide competition. Competition-
based approaches are increasingly being seen as relevant
in the medical informatics domain, with participating
teams usually tackling a challenge over a limited time
period, with access to an anonymized dataset for the test-
ing of methods. For example, the i2b2 (Informatics for
Integrating Biology and the Bedside) project includes
a “Natural Language Processing” challenge for assess-
ing methods for understanding clinical narratives [57].
Competition-based approaches have also been developed
in oncology (e.g. the Sage Bionetworks—DREAM Breast
Cancer Prognosis Challenge, designed for developing
computational models that can predict breast cancer
survival [58, 59]; and the Prostate DREAM Challenge,
for identifying prognostic models capable of predicting
survival in patients with metastatic castration-resistant
prostate cancer [46]). The utility of these crowdsourced
challenges for the community has clearly been demonstrated.
They have multiple advantages: (1) they allow the
development of models that outperform those developed
with traditional research approaches [58, 60], (2) they
encourage collaboration between teams for the improve-
ment of models [60], and (3) they provide more trans-
parent results, because both favorable and unfavorable
results are published [58, 60].
We derived a framework from these competition-based
approaches. Our approach is based on the same
principles as these existing challenges, but focuses on
the combination of real-world data collected from clinical
data warehouses (rather than data collected through
RCTs), and -omics data generated by next-generation
sequencing techniques. The results of the “ITFoC challenge”
will provide essential proof-of-principle evidence
for the use of real-world datasets for validating AI
technologies in a competition setting, as an essential
precursor to RCTs.
Accelerating AI transfer to healthcare settings
We propose a framework for the clinical validation of AI
technologies before their transfer to clinical settings, together
with clear planned actions in the domain of TNBC treatment. Both the
framework and the planned actions can be generalized
to other questions in oncology, with minor adaptations.
For instance, for diagnosis, other datasets could be considered
(e.g. images, signals). Likewise, we propose here
the use of real-world datasets from various healthcare
centers, to guarantee the volume and representativeness
of the dataset. Similarly, when dealing with rare cancers,
the datasets may come from various centers, and
may even be extended to other sources, such as clinical
research data. Datasets from other sources have already
been successfully used for the assessment of AI in breast
and prostate cancers [46, 58]. Furthermore, the metrics
used to assess AI performance may also differ, depend-
ing on the type of cancer and the intended use of AI (e.g.
for diagnosis, the primary outcome could be compared to
the diagnosis made by an oncologist).
We believe that a platform, as described here, could
help to accelerate AI transfer to healthcare settings in
oncology. AI systems are currently considered to be
medical devices that can only be implemented in health
centers after the demonstration of their safety and effi-
cacy through a large prospective RCT [4]. However, this
is time-consuming and expensive, and there is a risk of
patient outcome studies becoming obsolete by the time
the results become available [15]. The use of a validation
platform has several advantages: (1) several AI technologies
can be assessed in parallel for the same price
(whereas an RCT is usually designed to assess a single
AI technology); (2) the platform can be re-used for further
AI evaluations; (3) new datasets can easily be added
to the platform; (4) transparency is guaranteed, as the
results are communicated even if unfavorable. For all
these reasons, validation platforms constitute a credible
route towards establishing a rigorous, unbiased, transparent
and durable approach to the assessment of AI.
Supporting precision medicine
Clinical care decisions are traditionally driven by patient
symptoms and disease characteristics. In precision
oncology, the scope is extended to the patient pheno-
type, preclinical symptoms, tumor characteristics and
the complex molecular mechanisms underlying disease
[61]. Recent advances in genetics and sequencing tech-
nologies are now enabling clinicians to include molec-
ular aspects of the disease in their clinical decision
processes, and advances in metabolomics have facilitated
consideration of the functional activity of cancer
cells [62, 63]. The use of -omics data in routine care
(e.g. genomic, metabolomic or proteomic data [64]) is
strongly supported by the European Medicines Agency
[18], and could lead to significant improvements in
patient care.
Here, we provide support for the idea that -omics analysis
should be part of the clinical decision process. The
“ITFoC Challenge” aims to demonstrate the benefits of
integrating clinical data warehouses and biobanks into
the clinical care process, in accordance with the findings
of previous studies [65, 66]. By combining clinical and
-omics data, AI tools may facilitate the delivery of treatments
that are personalized according to the characteristics
of the patients and their tumors, thereby increasing
the chances of survival and decreasing side effects. By
designing the “ITFoC Challenge”, we aim to encourage
the development of AI based on clinical and -omics data
for the prediction of treatment response in cancer, and
the personalization of cancer treatment.
Conclusion
We hereby propose a framework for assessing AI technologies
based on real-world data, before their use in
healthcare settings. This framework includes seven
key steps specifying: (1) the intended use of AI, (2)
the target population, (3) the timing for AI evaluation,
(4) the datasets selected for evaluation, (5) the procedures
used to ensure data safety, (6) the metrics used to
measure performance, and (7) the procedures used to
ensure that the AI is explainable. The proposed framework
has the potential to accelerate the transfer of AI
into clinical settings, and to boost the development of
AI solutions using clinical and -omics data to predict
treatment responses and to personalize treatment in
oncology. Here, we applied this framework to the establishment
of a community-wide competition in the context
of predicting treatment responses in TNBC.
Abbreviations
AI: Artificial intelligence; CDISC: Clinical data interchange standards
consortium; GDPR: General data protection regulation; HGVS: Human genome
variation society; ICD: International classification of diseases; LOINC: Logical
observation identifiers names and codes; MedDRA: Medical dictionary for
regulatory activities; OMOP: Observational medical outcomes partnership;
MIAME: Minimum information about a microarray experiment; MAGE:
MicroArray gene expression; ML: Machine learning; NGS: Next-generation
sequencing; OHDSI: Observational health data sciences and informatics;
RCT: Randomized controlled trial; ROC: Receiver operating characteristics; TNBC:
Triple-negative breast cancer; UMLS: Unified medical language system; VCF:
Variant call format; WGS: Whole-genome sequencing; WES: Whole-exome
sequencing.
Authors’ contributions
Design: RT, XF, CL, LA, HL, MV, FD, OUS, MC, MDT, EM, LMI, MG, LO, FG, CNB,
Writing original manuscript: RT, AB. Agreement with all aspects of the work: RT,
SC, OHC, AI, BL, MLG; AL, GM, HOS, BR, PT, LT, AV, CW, NB, AB. All authors read
and approved the final manuscript.
Funding
This work was supported by the ITFoC project (Information Technology for the
Future of Cancer) – FLAG‑ERA support.
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or
analyzed during this study.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Hans Lehrach is a member of the board of Alacris Theranostics GmbH. Felix
Dreher is an employee of Alacris Theranostics GmbH. Lesley Ogilvie is an
employee of Alacris Theranostics GmbH. Bodo Lange is the CEO of Alacris
Theranostics GmbH. Christoph Wierling is an employee of Alacris Theranostics
GmbH. The other authors have no conflicts of interest to declare.
Author details
1 Centre de Recherche Des Cordeliers, Inserm, Université de Paris, Sorbonne
Université, 75006 Paris, France. 2 Inria, HeKA, Inria Paris, France. 3 Department
of Medical Informatics, Hôpital Européen Georges‑Pompidou, AP‑HP, Paris,
France. 4 Institut Curie, 25 Rue d’Ulm, 75005 Paris, France. 5 Centro Risonanze
Magnetiche ‑ CERM/CIRMMP and Department of Chemistry, University of Florence,
50019 Sesto Fiorentino (Florence), Italy. 6 Department of Biotechnology
and Biosciences, University of Milano Bicocca and ISBE‑Italy/SYSBIO ‑ Candi‑
date National Node of Italy for ISBE, Research Infrastructure for Systems Biol‑
ogy Europe, Milan, Italy. 7 Max Planck Institute for Molecular Genetics, Berlin,
Germany. 8 Alacris Theranostics GmbH, Berlin, Germany. 9 School of Medicine
Biostatistics and Medical Informatics Dept., Acibadem University, Istanbul,
Turkey. 10 Univ Rennes, CHU Rennes, Inserm, LTSI ‑ UMR 1099, 35000 Rennes,
France. 11 Univ Rennes, Department of Molecular Genetics and Genomics, CHU
Rennes, IGDR‑UMR6290, CNRS, 35000 Rennes, France. 12 RSU Institute of Oncol‑
ogy, Dzirciema str. 16, Riga 1010, Latvia. 13 Transilvania University of Brasov,
Brasov, Romania. 14 Centre for Innovation in Medicine, Bucharest, Romania.
15 INSERM U1242 « Chemistry, Oncogenesis Stress Signaling », Université de
Rennes, 35042 CEDEX, Rennes, France. 16 Centre de Lutte Contre Le Cancer
Eugène Marquis, CRB Santé (BRIF Number: BB‑0033‑00056), 35042 CEDEX,
Rennes, France. 17 Univ Rennes, CLCC Eugène Marquis, INSERM, LTSI ‑ UMR
1099, 35000 Rennes, France. 18 Department of Molecular Genetics and Genom‑
ics, CHU Rennes, 35000 Rennes, France. 19 Department of Informatics, Systems
and Communication, University of Milano Bicocca and ISBE‑Italy/SYSBIO ‑
Candidate National Node of Italy for ISBE, Research Infrastructure for Systems
Biology Europe, Milan, Italy. 20 EPIGENETICS Inc. BUDOTEK, Istanbul, Turkey.
21 Direction de La Recherche Fondamentale (DRF), CEA, Université Paris‑Saclay,
91191 Gif‑sur‑Yvette, France. 22 PaRis Artificial Intelligence Research InstitutE
(Prairie), Paris, France.
Received: 18 June 2020 Accepted: 22 September 2021