Sample-Size Determination Methodologies for Machine Learning in Medical Imaging Research: A Systematic Review


Critically Appraised Topic / Critical Evaluation
Indranil Balki, HBSc; Afsaneh Amirabadi, PhD; Jacob Levman, PhD; Anne L. Martel, PhD; Ziga Emersic, MEng; Blaz Meden, MEng; Angel Garcia-Pedrero, PhD; Saul C. Ramirez, MEng; Dehan Kong, PhD; Alan R. Moody, MBBS, FRCP, FRCR; Pascal N. Tyrrell, PhD

Department of Medical Imaging, University of Toronto, Toronto, Ontario, Canada
Department of Diagnostic Imaging, Hospital for Sick Children, Toronto, Ontario, Canada
Department of Mathematics, Statistics and Computer Science, St Francis Xavier University, Antigonish, Nova Scotia, Canada
Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts, USA
Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Department of Botany, Universidad de Valladolid, Castile and Leon, Spain
Computing School, Instituto Tecnológico de Costa Rica, Cartago, Costa Rica
Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
Purpose: The required training sample size for a particular machine learning (ML) model applied to medical imaging data is often unknown.
The purpose of this study was to provide a descriptive review of current sample-size determination methodologies in ML applied to medical
imaging and to propose recommendations for future work in the field.
Methods: We conducted a systematic literature search of articles using Medline and Embase with keywords including "machine learning,"
"image," and "sample size." The search included articles published between 1946 and 2018. Data regarding the ML task, sample size, and
train-test pipeline were collected.
Results: A total of 167 articles were identified, of which 22 were included for qualitative analysis. There were only 4 studies that discussed sample-
size determination methodologies, and 18 that tested the effect of sample size on model performance as part of an exploratory analysis. The observed
methods could be categorized as pre hoc model-based approaches, which relied on features of the algorithm, or post hoc curve-fitting approaches
requiring empirical testing to model and extrapolate algorithm performance as a function of sample size. Between studies, we observed great
variability in performance testing procedures used for curve-fitting, model assessment methods, and reporting of confidence in sample sizes.
Conclusions: Our study highlights the scarcity of research in training set size determination methodologies applied to ML in medical
imaging, emphasizes the need to standardize current reporting practices, and guides future work in development and streamlining of pre hoc
and post hoc sample size approaches.
* Address for correspondence: Pascal N. Tyrrell, PhD, University of Toronto, Department of Medical Imaging, 263 McCaul Street, 4th Floor Room 409, Toronto, Ontario M5T 1W7, Canada.
E-mail address: (P. N. Tyrrell).
0846-5371/$ - see front matter © 2019 Canadian Association of Radiologists. All rights reserved.
Canadian Association of Radiologists Journal xx (2019) 1–10
Key Words: Sample size; Machine learning; Medical imaging; Radiology
Establishing principled methods for assessing the value of machine learning (ML) algorithms used in medical imaging research remains a substantial challenge. In order to improve the accuracy of ML methods applied to the field of diagnostic radiology, there is a trend toward extracting more and more features; in the limit, every pixel in an image can be regarded as a feature. Under such circumstances, to avoid the "curse of dimensionality" (where a classifier fails to generalize as the number of features increases), enormous training sets become necessary [1–4]. The sample size of the training set has long been recognized as the single biggest influence on the design and performance of pattern recognition systems [5]. Finite training set size is potentially an important source of model bias, and both the training and test set size contribute to variance in a model's performance assessment [6–8].
In classical statistics, sample-size determination methodologies (SSDMs) estimate the optimum number of participants to arrive at scientifically valid results, often balancing an acceptable degree of precision with availability of resources [9]. Analogously, for ML in medical imaging, we define an SSDM as a procedure to estimate the number of images required for an ML algorithm to reach a particular threshold of performance, or a sufficiently low generalizability error. While sample size issues may affect many ML disciplines, this is a particularly pressing challenge in medical imaging, where access to large quantities of high-quality data is elusive [2,10]. In a recent influential review article [11], the importance of obtaining adequately sized, unbiased validation and training sets was identified as a crucial factor in the assessment and development of robust ML models. Recent studies in the field of neuroimaging have concluded that the generalizability of ML algorithms for classification tasks is heavily influenced by the size and quality of the training sample set on which they are trained [12,13]. A recent white paper by the Canadian Association of Radiologists also described the effect of data size on model performance and highlighted the importance of developing sound, clinically validated models for medical applications [14]. Methods of calculating the required sample size for a given model applied to medical imaging data remain unknown.
We hypothesized that there would be a paucity of research relating to SSDMs applied to ML applications in medical imaging and great scope for establishing standards of practice. Exploration of SSDMs will allow researchers to plan cost-effective experiments and assess confidence in the generalizability of a model trained on a limited number of samples. The purpose of our study was to investigate current sample-size determination practices in the literature. The objectives of our study were to (1) systematically identify studies pertaining to SSDMs for ML in medical imaging, (2) provide a descriptive review of the SSDMs employed, and (3) propose recommendations for future work in the field.
Materials and Methods
We conducted a systematic review of the literature to identify research involving SSDMs in medical imaging-specific ML. Relevant studies were identified by searching English-language scientific literature in OVID Medline (1946–November 21, 2018) and OVID Embase (1947–November 21, 2018) with keywords including but not limited to "machine learning," "imaging," and "training sample size." Duplicates were removed with the use of reference management software. A detailed search strategy may be found in Supplemental Tables S3 and S4. Manual searching of relevant ML journals and a review of references from collected studies were also conducted. Articles from both peer-reviewed and preprint sources (arXiv, bioRxiv) were considered for inclusion.
Studies were included if they met the following 2 inclusion criteria: (1) the study (a) proposed or tested the efficacy of an SSDM used in ML or (b) tested the effect of training sample size on model performance, and (2) the ML model was shown to be applicable to the field of medical imaging. Criterion 1b facilitated the additional inclusion of studies that would allow us to explore reporting of model performance under various sample size conditions, with the hope that this may inspire future development of SSDMs. Along with traditional radiologic modalities (magnetic resonance imaging, computed tomography, x-ray, ultrasound), studies using images from the broader domain of medical imaging, such as histopathological imaging and laser tomography imaging, were also considered for inclusion. This enabled us to explore the handling of sample-size considerations in domains that in many ways resemble the complexity, problem choice, rigorous validation standards, and algorithmic analysis of classical radiologic classification, regression, and segmentation tasks.
Articles were independently assessed for inclusion by
authors I.B. and A.A., both of whom had prior experience in
ML and evaluating scientific and medical literature. Any
discrepancies were resolved by consulting study author P.T.
Studies were excluded if they did not meet the inclusion
criteria or if the full text could not be obtained.
The standard and well-established Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol was followed throughout the study screening and evaluation process [15]. At the outcome level, SSDMs were described on their empirical validity (as demonstrated by measures of confidence in sample size estimates) and potential for application in medical imaging domains. The primary outcome was to evaluate methodologies for sample size determination in various medical ML studies in order to formulate proposals for future research in the field. The secondary outcome was to describe and evaluate how ML model performance was evaluated at
different sample sizes. In order to perform qualitative analysis, we extracted information from each included study pertaining to the medical imaging task, modality of images, ML model used, train-test pipeline, number and type of features, figure of merit for model evaluation, range of sample size conditions studied, range of observed model performance, and SSDMs employed, among other variables.

Results
A total of 167 articles were identified from Medline (n = 81), Embase (n = 81), and hand-searching (n = 5) (Figure 1). After the removal of duplicate articles, screening of title and abstract, and full-text appraisal based on the inclusion criteria, 22 articles were selected for qualitative analysis. Of the 22 articles included, only 4 studies discussed an SSDM (Table 1), while 18 studies evaluated an ML model under different sample size conditions (Supplemental Table S1). A distribution of the publication years of included articles is shown in Figure 2, demonstrating that half of the included articles were published in the past 4 years (2016 through 2019). The ML tasks studied included binary classification (n = 11), segmentation (n = 5), regression (n = 3), and multiclass classification (n = 3) problems. The algorithms assessed were: some form of neural network (n = 12), some form of linear regression (n = 10), support vector machine (n = 7), linear discriminant analysis (n = 3), decision tree (n = 3), Adaboost (n = 2), K-nearest neighbour
Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) study design flowchart. PRISMA flow diagram demonstrating study selection.
Table 1
Sample-size determination methodologies in machine learning applied to medical imaging (inclusion criteria 1a and 2)

Hitzl et al, 2003 [34]
Imaging task: classification of healthy vs visual-defect eyes; modality: scanning laser imaging. ML models: (1) LDA; (2) classification [tree]; (3) neural [network]. Train-test pipeline: (1) 10-fold CV; (2) 10-fold CV; (3) 6:3:1 split; 1130 training eyes (1020 healthy, 110 glaucomatous). Features: (1) 13; (2) 3; (3) 6 visual field [parameters]. FOM: 95% CI for Sen. and Sp. Sample size conditions: (1) 1017 total images only; (2) 1017 total images only; (3) 678 total images only. Observed performance: (1) Sen. 98-99, Sp. 4-16; (2) Sen. 99.3-100, Sp. 0-7.8; (3) Sen. 98.1-99.4, Sp. 6.9-19.9. SSDM: (1) Baum and Haussler [35]; (2) Haykin [36]; (3) Vapnik and Chervonenkis [50]. Sample size for given threshold: (1) 1580 images for a generalization error of 1/8 and a neural network with 12 units, 30 weights; (2) 240 images for a generalization error of 1/8 and a neural network with 12 units, 30 weights; (3) "inappropriate large sample size."

Rokem et al, 2017 [37]
Imaging task: classification of healthy vs AMD retinae. ML model: CNN. Train-test pipeline: separate test set of 10,000 images; 11× subsampling within the training/validation data of ~90,000 images. Features: pixels (192 × 124). FOM: OCA (range). Sample size conditions: 1300-35,000, or 4%-100% of training data. Observed performance: 73% (69-80) to 86% (84-87). SSDM: learning curve-fitting approach (graph of OCA vs N). Sample size for given threshold: 10,000 images/class for 82% OCA.

Cho et al, 2016 [38]
Imaging task: classification of 6 body parts; modality: axial computed tomography. ML model: CNN. Train-test pipeline: separate test set of 6000 images; 10× balanced subsampling within the training data. Features: pixels. FOM: OCA (sd), per class. Observed performance: 8% (sd ~25) to 97% (sd ~1). SSDM: learning curve-fitting approach (graph of OCA vs N). Sample size for given threshold: 4092 images/class for 99.5% OCA.

Sahiner et al, 2000 [30]
Imaging task: classification of malignant vs benign [masses]; modality: simulation of classes based on the [mean/covariance] matrix of 249 [mammograms]. ML model: LDA. Train-test pipeline: 500 samples/class generated 50×; 20× cross-validation within each group of generated [samples]. Features: 100 SGLD. FOM: AUC for hold-out and [resubstitution]. Observed performance: 0.7 to 0.82 (hold-out); 0.98 to 0.875 ([resubstitution]). SSDM: linear curve-fitting approach (graph of AUC vs 1/N). Sample size for given threshold: fitted AUC(∞) (by regression) within 0.02 of the theoretical AUC(∞) (based on the mean/covariance matrix).

AMD = age-related macular degeneration; AUC = area under the receiver operating characteristic curve; AUC(∞) = AUC at infinite sample size conditions; CI = confidence interval; CNN = convolutional neural network; CV = cross validation; FOM = figure of merit; LDA = linear discriminant analysis; N = number of training samples; OCA = overall classification accuracy; ROI = region of interest; sd = standard deviation; Sen. = sensitivity; SGLD = spatial grey level dependence; Sp. = specificity.
Only relevant model assessment methods and explicitly stated features used prior to feature selection are presented.
(n = 1), and logistic regression (n = 1). The feature-set size ranged from 3 to 2 × 10⁴, while the sample sizes studied ranged from 2 to 90,000/class (Supplemental Table S1). In general, increasing sample size beyond 1000/class demonstrated progressively smaller improvements in performance (Supplemental Table S1). We have also included an appendix in the Supplementary Materials that contains the definitions of key concepts and terminology used in this article (Supplemental Appendix S1).
Methods of Assessing Effect of Sample Size on Model Performance

There were 3 general methods used to assess the effect of sample size on model performance, which we term "performance-testing procedures" (PTPs), illustrated in Figure 3.
The N× Subsampling scheme (Figure 3A) was employed in 8 studies [16–23]. Typically, a random subsample consisting of Y images was drawn from a large image pool and used to train an ML model. The model was then evaluated on an independent test set. This process was repeated N times for each subsample size Y, with replacement, in order to allow construction of a mean and confidence interval for the observed performance. In a Balanced N× Subsampling scheme (3 out of 8 studies), each subsample contained an equal number of images from each class [18–20]. The N× Repeated Cross-Validation scheme (Figure 3B) and variations, employed by 6 studies, involved randomly splitting the dataset into a desired training set of size Y and a (remaining) test set on which the model was evaluated [24–29]. This random splitting was repeated N times for each desired training set size Y, and the mean and standard deviation of performance for each sample size could be obtained. Sahiner et al [30] and Way et al [6] used variations of these PTPs while simulating samples. In a No Repetition scheme (Figure 3C), images were subsampled only once for each desired training set size Y and evaluated on an independent test set [31–33].
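The N× Subsampling PTP described above can be sketched in a few lines. The snippet below is a minimal illustration, not the implementation of any reviewed study: the synthetic two-class Gaussian data, the nearest-class-mean classifier, and the sizes are all stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a pooled image dataset: two Gaussian classes in d dimensions.
n_pool, n_test, d = 4000, 1000, 20

def make_data(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + 0.6 * y[:, None]  # class-dependent mean shift
    return X, y

X_pool, y_pool = make_data(n_pool)
X_test, y_test = make_data(n_test)  # fixed independent test set

def nearest_mean_fit_predict(Xtr, ytr, Xte):
    """Tiny stand-in classifier: assign each test point to the closer class mean."""
    mu0 = Xtr[ytr == 0].mean(axis=0)
    mu1 = Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - mu0) ** 2).sum(axis=1)
    d1 = ((Xte - mu1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def nx_subsampling(train_size, n_repeats=10):
    """N-times subsampling PTP: draw a random subsample of size Y from the pool,
    train, evaluate on the fixed independent test set; repeat N times to obtain
    a mean and spread for the observed performance at that sample size."""
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(n_pool, size=train_size, replace=False)
        pred = nearest_mean_fit_predict(X_pool[idx], y_pool[idx], X_test)
        scores.append((pred == y_test).mean())
    return float(np.mean(scores)), float(np.std(scores))

for Y in (20, 100, 1000):
    mean, sd = nx_subsampling(Y)
    print(f"Y={Y:4d}: accuracy {mean:.3f} +/- {sd:.3f}")
```

A Balanced N× Subsampling variant would stratify each draw so that every class contributes an equal number of images to the subsample.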
The SSDMs identified were categorized as either pre hoc (model-based) or post hoc (curve-fitting) approaches. Model-based approaches provided sample size estimates based on characteristics of the algorithm, the allowed classification error upon generalization, and the acceptable confidence in that error. These methods were based on the assumption that training and test samples were chosen from the same distribution. Hitzl et al [34] explored the use of several model-based approaches, one of which was postulated by Baum and Haussler [35] for use with single hidden-layer feedforward neural networks with k units and d weights. This method predicted that for some classification error ε (0 < ε ≤ 1/8), a network trained on m images with the fraction 1 − ε/2 of the images correctly classified would approach a classification accuracy of 1 − ε on an unseen test set, with the condition that m ≥ O((d/ε) log₂(k/ε)). For the parameters in the study, ε = 1/8, k = 12, d = 30, this method suggested 1580 samples were required. A second method, proposed by Haykin [36], suggested valid generalization if the condition m = O((d + k)/ε) was satisfied. This method is similar to Widrow's rule of thumb, and in practice, m ≈ d/ε [35]. This method suggested that 240 training samples were required.
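Both pre hoc estimates reduce to a few lines of arithmetic. The sketch below assumes a base-2 logarithm inside the Baum-Haussler bound, which reproduces the 1580 figure reported for ε = 1/8, k = 12 units, d = 30 weights:

```python
import math

eps, k, d = 1 / 8, 12, 30  # allowed error, hidden units, weights (parameters from Hitzl et al)

# Haykin / Widrow rule of thumb: m ~ d / eps
m_haykin = d / eps
print(f"Haykin rule of thumb: {m_haykin:.0f} samples")  # 240

# Baum-Haussler-style bound: m >= (d/eps) * log2(k/eps)
# (base-2 log is an assumption here; it matches the reported 1580)
m_baum = (d / eps) * math.log2(k / eps)
print(f"Baum-Haussler bound: {m_baum:.0f} samples")  # 1580
```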
Curve-fitting SSDMs relied on empirically evaluating model performance at select sample sizes (using PTPs), with the goal of extrapolating algorithm performance as a function of training set size. Two major curve-fitting approaches were identified. The learning curve-fitting approach relied on modelling the relationship between training data size and classification accuracy using an inverse power law function to model the ML algorithm's learning curve (Figure 4A). Rokem et al [37] and Cho et al [38] employed this approach in their binary and multiclass classification tasks, respectively. Rokem et al [37] estimated that 10,000 images/class were required to obtain an overall classification accuracy (OCA) of 82%, while Cho et al [38] predicted 4092 images/class to acquire an OCA of 99.5%. The observed differences can be attributed to the particular ML model and task, subsampling procedures, and variations in learning curve fitting.
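The learning curve-fitting idea can be sketched as follows: fit an inverse power law acc(N) ≈ a − b·N^(−c) to accuracies measured at a few training-set sizes, then invert the fitted curve to estimate the N needed for a target accuracy. The accuracy values below are synthetic and purely illustrative, and the grid search over the exponent c is our implementation choice, not a method from any reviewed study:

```python
import numpy as np

# Accuracies measured at several training-set sizes via a PTP (synthetic,
# purely illustrative values -- not data from any reviewed study).
N = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
acc = np.array([0.71, 0.78, 0.83, 0.87, 0.895, 0.91])

def fit_inverse_power_law(N, acc):
    """Fit acc ~ a - b * N**(-c).  For a fixed exponent c the model is linear
    in (a, b), so grid-search c and solve the linear part by least squares."""
    best = None
    for c in np.linspace(0.05, 1.5, 300):
        X = np.column_stack([np.ones_like(N), -N ** (-c)])
        sol, _, _, _ = np.linalg.lstsq(X, acc, rcond=None)
        sse = float(((X @ sol - acc) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, sol[0], sol[1], c)
    return best[1], best[2], best[3]

a, b, c = fit_inverse_power_law(N, acc)

# Invert the fitted curve: N required for a target accuracy below the plateau a.
target = 0.90
n_required = (b / (a - target)) ** (1 / c)
print(f"plateau {a:.3f}, exponent {c:.2f}, estimated N for 90% accuracy: {n_required:.0f}")
```

The fitted plateau a is the accuracy the curve saturates toward; a target at or above a is unreachable under the fitted model, which is one reason reviewed studies report performance at a chosen threshold rather than at the asymptote.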
Sahiner et al [30] used a linear curve-fitting approach pioneered by Fukunaga and Hayes [39]. Empirically obtained area under the receiver operating characteristic curve (AUC) metrics (through PTPs) were plotted against their respective 1/N (N = number of training images) values, and performance at higher sample sizes was extrapolated by linear regression as N tended to infinity (Figure 4B). This linear relationship, in which higher order 1/N terms can be neglected, has been a subject of research in numerous theoretical texts and simulation studies [40–42]. The theoretical AUC at infinite sample size conditions, or AUC(∞), for the particular computer-aided diagnosis experiment was calculated based on the mean/covariance matrix of 249 available mammograms. Sahiner et al [30] followed by conducting a simulation of training data features and classes based on the observed matrix. The linear curve-fitting approach using the training data was then used to validate the accuracy of the original AUC(∞) metric.
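In this Fukunaga-Hayes style, the extrapolation amounts to an ordinary least-squares fit of AUC against 1/N, with the intercept giving AUC(∞). A minimal sketch, using synthetic AUC values constructed to lie exactly on a line in 1/N (illustrative only, not data from Sahiner et al):

```python
import numpy as np

# AUCs measured at several training-set sizes (synthetic, illustrative values).
N = np.array([50, 100, 200, 400], dtype=float)
auc = np.array([0.74, 0.78, 0.80, 0.81])

# Fit AUC = AUC(inf) + slope * (1/N); the intercept extrapolates to infinite
# training data, where the 1/N finite-sample penalty vanishes.
slope, intercept = np.polyfit(1.0 / N, auc, 1)
print(f"extrapolated AUC(inf) = {intercept:.3f}, slope = {slope:.2f}")
# -> extrapolated AUC(inf) = 0.820, slope = -4.00
```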
Figure 2. Publication years of included articles. Histogram displaying publication years of the 22 articles included in the analysis.
Discussion

Evaluating SSDMs

A systematic literature search revealed the scarcity of SSDM research in ML applied to medical imaging. To our knowledge, this is the first study to systematically assess this topic. Model performance at different sample sizes was assessed using N× Subsampling, N× Repeated Cross-Validation, and No Repetition schemes. The SSDMs identified relied on model-based considerations or on generating predictive functions of model strength based on empirical testing at select sample sizes.
Figure 3. Performance testing procedures. (A) An illustration of the N× Subsampling scheme. (B) An illustration of the N× Cross-Validation scheme. (C) An illustration of the No Repetition scheme.
Pre hoc approaches

Model-based SSDMs provide pre hoc sample size estimates for particular ML algorithms. Baum and Haussler's method [35] was shown to be accurate within 5% of the actual observed OCA in Hitzl et al's study [34]. Recently, domain-specific model-based SSDMs have been developed for DNA microarray classification tasks. These approaches are based on parametric probability models and rely on factors including standardized fold change, class prevalence, and the number of genes and features on the arrays [43]. However, these methods were not robust to high dimensionality, differential gene expression, and problems with great intraclass variability [44], which are characteristics of medical imaging data.
Figure 4. Post hoc approaches. (A) Learning curve-fitting approach. The overall classification accuracy of a machine learning algorithm can be modelled against training data set size, typically resulting in a saturating inverse power law curve. The sample size x required to reach 95% classification accuracy is shown extrapolated with dotted lines. (B) Linear curve-fitting approach. The area under the receiver operating characteristic curve of a machine learning algorithm can be modelled against the inverse of the training data set size, typically resulting in a negative linear relationship. The sample size x required to reach an area under the curve of 0.95 is shown extrapolated with dotted lines.
Post hoc approaches

A class of pseudo-SSDMs was identified, which we categorized as curve-fitting approaches providing post hoc sample size estimates. An ML problem where the underlying data structure is more complex and difficult to model (eg, a classification task with subtle interclass variance) is analogous to a low effect size in classical statistics; therefore, a greater number of samples is often required to adequately train the ML model [13]. Since the performance of models can vary greatly depending on various parameters (eg, algorithm, feature size, anatomy being imaged), an empirical approach has the advantage of accurately modelling performance for a specific task, without making distributional assumptions. In their simulation experiment, Sahiner et al [30] predicted (regressed) AUC(∞) using the linear curve-fitting approach, which was within 0.02 of the theoretical AUC(∞) calculated from the mean/covariance matrix of 249 mammograms. Meanwhile, Cho et al [38] validated the learning curve approach to be accurate within 0.75% of the observed OCA at 1000 samples/class. Learning curve-fitting approaches have recently been used in other fields, such as DNA microarray, text, waveform, and biospectroscopy classification; differences between actual and predicted classification errors ranged from 1% to 7% and were lower for higher sample sizes [43–47].
Recommendations for Future Work

The small number of studies included, restriction to English-language literature, and risk of publication bias (with lesser-impact studies going unpublished) remain limitations of our systematic review. However, our study highlights great scope for standardizing current computational and reporting practices. Efforts to create ML train-test pipelines [48] and international standardization documents still lack sample size considerations. Position statements by radiologic associations have highlighted that adequate sample size considerations are essential to ensure robust, unbiased clinical model development [11,14]. In our case, 18 of 22 studies provided a measure of variance for model performance at different sample sizes. However, variance of the curve parameters and estimates of required sample size were unreported in all studies employing curve-fitting procedures. Hitzl et al [34] reported that only 1 of 18 studies provided confidence intervals for sensitivity and specificity rates for binary classification tasks similar to their own. Measures of variance provide insight into a model's generalizability and can be used to perform statistical tests of significance between performance at different sample sizes to identify a point of diminishing returns [33].
There is scope for future work in streamlining and comparing various PTPs. There is a need to develop and test the efficacy of more model-based approaches (by comparing predicted versus observed performance) and to consider development of hybrid approaches in which model-based estimates are refined with limited empirical testing. These methods should aim to optimize different performance metrics, not merely OCA. Studying the response of different SSDMs and PTPs to modality, feature-set size, and algorithm choice remains an unexplored area of research. Moreover, many real-life datasets are highly class-imbalanced, and the effect of this skew in the data structure in the context of SSDMs and PTPs is yet to be elucidated. In cases where classes are not balanced, the smaller group may indeed be the constraining factor in acquiring enough samples to train a reliable ML model. As such, in unbalanced datasets, additional caution is warranted when applying SSDMs, and researchers should focus on the sample size needs of the group with the smallest number of samples. Finally, there is a need for consensus on domain-specific thresholds for reporting model performance so that studies can be easily compared. All SSDMs categorized in our study were applied to (binary or multiclass) classification tasks under the umbrella of supervised learning. Exploring sample size requirements in the growing field of unsupervised learning, in the context of classification, regression, and segmentation tasks, remains an issue to be investigated. Furthermore, while our study did not review sample size and error rate for all ML algorithms applied to imaging, the establishment of a repository containing standard models applied to various ML tasks in the field of medical imaging would be helpful in guiding the sample size requirements of future projects in the field.
Limitations to SSDMs
In Hitzl et al’s [34] experiment alone, model-based ap-
proaches had sample size estimates differing by greater than
1000 samples to achieve similar accuracy. Model-based
methods may lack the ability to capture intricacies of a
specific ML task and must be tuned to bias-variance trade-
offs [29] and function generating mechanisms [33] of
different algorithms. Curve-fitting approaches can be criti-
cized for their need to perform empirical testing. We
observed great variability in sample sizes and replicates used
in PTPs, functions used in fitting, saturating parameters,
balancing schemes employed, and thresholds chosen for
optimal performance. The NxSubsampling and NxCross
Validation schemes are limited by high variance at small
subsample sizes [38] and low test sample sizes [24],
respectively. Attempts to overcome these challenges
[19,45,49] remain key to widespread implementation.
Finally, validity of the linear curve-fitting approach is highly
dependent on the algorithm, number of features, and their
distributions [30,42]. While simulation is a powerful tool to
investigate these issues, its reliance on distributional as-
sumptions, inability to match subtle interclass and high
intraclass variability of real datasets, and computational
challenges in high dimensionality may render it unrepre-
sentative. Based on our systematic review, researchers should
attempt to estimate sample size requirements for their study
using both pre hoc and post hoc methods, keeping in mind
8I. Balki et al. / Canadian Association of Radiologists Journal xx (2019) 1e10
their unique advantages and disadvantages. Supplemental Table S2 provides a brief summary comparison between the two approaches based on the literature reviewed; however, there remains much scope for establishing definitive sample-size estimation practice guidelines.
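As a minimal illustration of the post hoc curve-fitting family, the sketch below fits an inverse power-law learning curve, acc(n) = a − b·n^(−c), to pilot measurements and inverts it to extrapolate the training-set size needed for a target accuracy. The pilot accuracies are fabricated placeholders and SciPy's curve_fit is an assumed tool; the numbers demonstrate the mechanics only, not results from any reviewed study.

```python
# Post hoc curve-fitting sketch: fit acc(n) = a - b * n**(-c) to pilot data,
# then solve for the n that reaches a target accuracy below the plateau a.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a - b * np.power(n, -c)

# Hypothetical pilot measurements (sample size, mean test accuracy).
n_pilot = np.array([25, 50, 100, 200, 400], dtype=float)
acc_pilot = np.array([0.70, 0.78, 0.84, 0.88, 0.905])

(a, b, c), _ = curve_fit(power_law, n_pilot, acc_pilot,
                         p0=[0.95, 1.0, 0.5], maxfev=10000)

def required_n(target, a, b, c):
    """Invert acc(n) = target; None if the target exceeds the plateau a."""
    if target >= a:
        return None
    return float((b / (a - target)) ** (1.0 / c))

n_needed = required_n(0.92, a, b, c)
```

Note that the extrapolated size is very sensitive to the fitted plateau parameter a, which is one reason the reviewed studies report such variable saturating parameters and thresholds.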
A systematic review of the literature enabled the identification and categorization of several procedures used to evaluate ML model performance at various sample sizes. Pre hoc model-based and post hoc curve-fitting SSDMs may help researchers in planning more cost-effective experiments. This study highlights the scarcity of research in SSDMs in medical ML, and guides further work in the field.
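For the pre hoc model-based family, a widely quoted rule of thumb derived from Baum and Haussler [35] and popularized in Haykin's text [36] suggests that a feed-forward network with W trainable weights needs on the order of W/ε training samples to reach an error fraction ε. The sketch below computes this bound; the layer sizes and ε are illustrative choices, and the bound is a crude planning heuristic, not a guarantee for any particular imaging task.

```python
# Pre hoc model-based sketch: Baum-Haussler-style rule of thumb, N ~ W / eps,
# for a fully connected network with W weights and tolerated error eps.
def n_weights(layer_sizes):
    """Total trainable parameters (weights plus biases) of a dense network."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

def baum_haussler_samples(layer_sizes, eps):
    """Rule-of-thumb training-set size: roughly W / eps samples."""
    return int(round(n_weights(layer_sizes) / eps))

# Illustrative architecture: 64 input features, 16 hidden units, 1 output.
W = n_weights([64, 16, 1])                            # (64+1)*16 + (16+1)*1
n_est = baum_haussler_samples([64, 16, 1], eps=0.10)  # ~10x the weight count
```

As the text notes, such model-based estimates can differ by orders of magnitude from empirical curve-fitting results, which is why using both families of SSDMs is advisable.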
Supplementary Data
Supplementary data related to this article can be found at

References
[1] Gorban AN, Tyukin IY. Blessing of dimensionality: mathematical foundations of the statistical physics of data. Philos Trans A Math Phys Eng Sci 2018;376:20170237.
[2] Ithapu VK, Singh V, Okonkwo O, Johnson SC. Randomized denoising autoencoders for smaller and efficient imaging based AD clinical trials. Med Image Comput Comput Assist Interv 2014;17:470–8.
[3] Rasmussen PM, Hansen LK, Madsen KH, Churchill NW, Strother SC. Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recogn 2012;45:2085–100.
[4] Obermeyer Z, Emanuel E. Predicting the future – big data, machine learning, and clinical medicine. N Engl J Med 2016;375:1216–9.
[5] Raudys S, Jain A. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 1991;13:252–64.
[6] Way TW, Sahiner B, Hadjiiski LM, Chan HP. Effect of finite sample size on feature selection and classification: a simulation study. Med Phys 2010;37:907–20.
[7] Beiden SV, Maloof MA, Wagner RF. A general model for finite-sample effects in training and testing of competing classifiers. IEEE Trans Pattern Anal Mach Intell 2003;25:1561–9.
[8] Chan HP, Sahiner B, Wagner RF, Petrick N. Classifier design for computer-aided diagnosis: effects of finite sample size on the mean performance of classical and neural network classifiers. Med Phys
[9] Biau DJ, Kernéis S, Porcher R. Statistics in brief: the importance of sample size in the planning and interpretation of medical research. Clin Orthop Relat Res 2008;466:2282–8.
[10] Moody A. Perspective: the big picture. Nature 2013;502:S95.
[11] Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 2018;286:800–9.
[12] Pellegrini E, Ballerini L, Hernandez M del CV, et al. Machine learning of neuroimaging for assisted diagnosis of cognitive impairment and dementia: a systematic review. Alzheimers Dement 2018;10:519–35.
[13] Schnack HG, Kahn RS. Detecting neuroimaging biomarkers for psychiatric disorders: sample size matters. Front Psychiatry 2016;7:50.
[14] Tang A, Tam R, Cadrin-Chenevert A, et al. Canadian Association of Radiologists white paper on artificial intelligence in radiology. Can Assoc Radiol J 2018;69:120–35.
[15] Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009;6:e1000097.
[16] Cui Z, Gong G. The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features. Neuroimage 2018;178:622–37.
[17] Cohen O, Zhu B, Rosen MS. MR fingerprinting Deep RecOnstruction NEtwork (DRONE). Magn Reson Med 2018;80:885–94.
scheme. Acad Radiol 1997;4:497e502.
[19] Abdulkadir A, Mortamet BB, Vemuri P, Jack Jr CR, Krueger G,
oppel S, Alzheimer’s Disease Neuroimaging Initiative. Effects of
hardware heterogeneity on the performance of SVM Alzheimer’s dis-
ease classifier. Neuroimage 2011;58:785e92.
[20] Samala RK, Chan HP, Hadjiiski L, Helvie MA, Richter CD, Cha KH.
Breast cancer diagnosis in digital breast tomosynthesis: effects of
training sample size on multi-stage transfer learning using deep neural
nets. IEEE Trans Med Imaging 2018;38:686e96.
[21] Dunnmon JA, Yi D, Langlotz CP, Re C, Rubin DL, Lungren MP.
Assessment of convolutional neural networks for automated classifi-
cation of chest radiographs. Radiology 2018;290:537e54.
[22] Looney P, Stevenson GN, Nicolaides KH, et al. Fully automated, real-
time 3D ultrasound segmentation to estimate first trimester placental
volume using deep learning. JCI Insight 2018;3:e120178.
[23] McKinley R, Hung F, Wiest R, Liebeskind DS, Scalzo F. A machine
learning approach to perfusion imaging with dynamic susceptibility
contrast MR. Front Neurol 2018;9:717.
[24] Tourassi GD, Floyd CE. The effect of data sampling on the perfor-
mance evaluation of artificial neural networks in medical diagnosis.
Med Decis Making 1997;17:186e92.
[25] Wang JY, Ngo MM, Hessl D, Hagerman RJ, Rivera SM. Robust ma-
chine learning-based correction on automatic segmentation of the
cerebellum and brainstem. PLoS One 2016;11:e0156123.
[26] Chang H, Borowsky A, Spellman P, Parvin B. Classification of tumor
histology via morphometric context. Proc IEEE Comput Soc Conf
Comput Vis Pattern Recognit 2013;2013:10.
[27] Chang H, Nayak N, Spellman PT, Bahram P. Characterization of
tissue histopathology via predictive sparse decomposition and
spatial pyramid matching. Med Image Comput Comput Assist Interv
[28] Dinh CV, Steenbergen P, Ghobadi G, et al. Multicenter validation of
prostate tumor localization using multiparametric MRI and prior
knowledge. Med Phys 2017;44:949e61.
[29] Juntu J, Sijbers J, De Backer S, Rajan J, Van Dyck D. Machine learning study of several classifiers trained with texture analysis features to differentiate benign from malignant soft-tissue tumors in T1-MRI images. J Magn Reson Imaging 2010;31:680–9.
[30] Sahiner B, Chan H-P, Petrick N, Wagner RF, Hadjiiski L. Feature selection and classifier performance in computer-aided diagnosis: the effect of finite sample size. Med Phys 2000;27:1509–22.
[31] Zhao G, Liu F, Oler JA, Meyerand ME, Kalin NH, Birn RM. Bayesian convolutional neural network based MRI brain extraction on nonhuman primates. Neuroimage 2018;175:32–44.
[32] Morra JH, Tu Z, Apostolova LG, Green AE, Toga AW, Thompson PM. Comparison of AdaBoost and support vector machines for detecting Alzheimer's disease through automated hippocampal segmentation. IEEE Trans Med Imaging 2010;29:30–43.
[33] Park SC, Sukthankar R, Mummert L, Satyanarayanan M, Zheng B. Optimization of reference library used in content-based medical image retrieval scheme. Med Phys 2007;34:4331–9.
[34] Hitzl W, Reitsamer HA, Hornykewycz K, Mistlberger A, Grabner G. Application of discriminant, classification tree and neural network analysis to differentiate between potential glaucoma suspects with and without visual field defects. J Theor Med 2003;5:161–70.
[35] Baum EB, Haussler D. What size net gives valid generalization? Neural Comput 1989;1:151–60.
[36] Haykin S. Multilayer perceptrons. In: Haykin S, editor. Neural Networks: A Comprehensive Foundation. 2nd ed. Upper Saddle River, NJ: Prentice Hall; 1998. p. 205–26.
[37] Rokem A, Wu Y, Lee A. Assessment of the need for separate test set and number of medical images necessary for deep learning: a subsampling study. bioRxiv 2017:196659. Available at: https://www. Accessed August 9, 2019.
[38] Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint 2015. arXiv:1511.06348. Available at: https:// Accessed August 9, 2019.
[39] Fukunaga K, Hayes RR. Effects of sample size in classifier design. IEEE Trans Pattern Anal Mach Intell 1989;11:873–85.
[40] Wagner RF, Chan H-P, Sahiner B, Petrick N, Mossoba JT. Finite-sample effects and resampling plans: applications to linear classifiers in computer-aided diagnosis. Med Imaging 1997;3034:467–77.
[41] Chan H-P, Sahiner B, Wagner RF, Petrick N. Effects of sample size on classifier design for computer-aided diagnosis. Proc SPIE Conf Medical Imaging 1998;3338:845–58.
[42] Chan HP, Sahiner B, Wagner RF, Petrick N, Mossoba J. Effects of sample size on classifier design: quadratic and neural network classifiers. Image Process Med Imaging 1997;3034(Pts 1-2):1102–13.
[43] Dobbin KK, Zhao Y, Simon RM. How large a training set is needed to develop a classifier for microarray data? Clin Cancer Res 2008;14:108–14.
[44] Dobbin KK, Simon RM. Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 2007;
[45] Mukherjee S, Tamayo P, Rogers S, et al. Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol 2003;
[46] Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classification models. Anal Chim Acta 2013;760:
[47] Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Med Inform Decis Mak 2012;12:8.
[48] Samper-González J, Burgos N, Fontanella S, et al. Yet another ADNI machine learning paper? Paving the way towards fully-reproducible research on classification of Alzheimer's disease. Proc Machine Learning in Medical Imaging MLMI 2017, MICCAI Workshop, Lecture Notes in Computer Science 2017;10541:53–60.
[49] Sanchez BN, Wu M, Song PX, Wang W. Study design in high-dimensional classification analysis. Biostatistics 2016;17:722–36.
[50] Vapnik VN, Chervonenkis AY. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Appl 1971;
I. Balki et al. / Canadian Association of Radiologists Journal xx (2019) 1–10
... Therefore, training an effective deep learning solution typically requires a considerable amount of labelled data. In the context of medical imaging analysis, high quality labelled data can be expensive to obtain, leading to a paucity of labelled data settings [16,17]. ...
... In this chapter we discussed the impact of using target datasets with scarce labelled data in the performance of SSDL and supervised deep learning models for detection of malign cases using mammogram images. As presented in [16], or self-supervision [250] and/or domain adaptation [251], along with more complex data augmentation approaches as in [198], might improve the overall model Saúl Calderón Ramírez properties of deep learning models such as robustness and predictive uncertainty, as recommended in [26], is also a future work-line to develop. This is discussed in the next chapter, for COVID-19 detection using chest X-ray images. ...
... Implementing the tested DeDiMs requires no model training, as a generic pre-trained ImageNet model seems to be good enough to estimate the benefit of using a specific unlabelled dataset D (u) s , according to our results. Data quality metrics for deep learning models as argued in [16,262] is an interesting path to develop further, as it might help to narrow the gap between research Saúl Calderón Ramírez and real-world implementation of deep learning systems. For instance, building high quality datasets for training a semi-supervised model, or assess the safety of using a deep learning model before hand, can benefit from quantitative data quality measures. ...
Full-text available
Deep learning methodologies have shown outstanding success in different image analysis applications. They rely on the abundance of labelled observations to build the model. However, frequently it is expensive to gather labelled observations of data, making the usage of deep learning models imprudent. Different practical examples of this challenge can be found in the analysis of medical images. For instance, labelling images to solve medical imaging problems require expensive labelling efforts, as experts (i.e., radiologists) are required to produce reliable labels. Semi-supervised learning is an increasingly popular alternative approach to deal with small labelled datasets and increase model test accuracy, by leveraging unlabelled data. However, in real-world usage settings, an unlabelled dataset might present a different distribution than the labelled dataset (i.e., the labelled dataset was sampled from a target clinic and the unlabelled dataset from a source clinic). There are different causes for a distribution mismatch between the labelled and the unlabelled dataset: a prior probability shift, a set of observations from unseen classes in the labelled dataset, and a covariate shift of the features. In this work, we assess the impact of this phenomena, for the state of the art semi-supervised model known as MixMatch. We evaluate both label and feature distribution mismatch impact in MixMatch in a real-world application: the classification of chest X-ray images for COVID-19 detection. We also test the performance gain of using MixMatch for malignant cancer detection using mammograms. For both study cases we managed to build new datasets from a private clinic in Costa Rica. We propose different approaches to address different causes of a distribu�tion mismatch between the labelled and unlabelled datasets. First, regarding the prior probability shift, a simple model-oriented approach to deal with this challenge, is proposed. 
According to our experiments, the proposed method yielded accuracy gains of up to 14% statistical significance. As for more challenging distribution mismatch settings caused by a covariate shift in the feature space and sampling unseen classes in the unlabelled dataset we propose a data-oriented approach to deal with such challenges. As an assessment tool, we propose a set of dataset dissimilarity metrics designed to measure how much perfor�mance benefit a semi-supervised training regime can get from using a specific unlabelled dataset over another. Also, two techniques designed to score each unlabelled observation according to how much accuracy might bring including such observation into the unlabelled dataset for semi-supervised training are proposed. These scores can be used to discard harmful unlabelled observations. The novel methods use a generic feature extractor to build a feature space where the metrics and scores are computed. The dataset dissimilarity metrics yielded a linear correlation of up to 90% to the performance of the state-of-the-art Mix�Match semi-supervised training algorithm, suggesting that such metrics can be used to assess the quality of an unlabelled dataset. As for the scoring methods for unlabelled data, according to our tests, using them to discard harmful unla�belled data, was able to increase the performance of MixMatch to around 20%. This in the context of medical image analysis applications.
... The recently introduced artificial intelligence (AI)-based machine learning (ML) techniques have been suggested to be very promising and potentially able to result in a breakthrough in LBP (non-)recovery prediction [26,27]. ML -in comparison to traditional regression analysis -is considered to be more flexible and pragmatic in handling complex datasets with large number of predictors (and their interactions), without strict rules regarding sample sizes and missing values [28]. ...
... In ML, a sample size calculation is generally not performed as there is no consensus regarding sample sizes for ML [28]. However, we aimed a priori at including at least 300 participants. ...
Full-text available
Background While low back pain occurs in nearly everybody and is the leading cause of disability worldwide, we lack instruments to accurately predict persistence of acute low back pain. We aimed to develop and internally validate a machine learning model predicting non-recovery in acute low back pain and to compare this with current practice and ‘traditional’ prediction modeling. Methods Prognostic cohort-study in primary care physiotherapy. Patients ( n = 247) with acute low back pain (≤ one month) consulting physiotherapists were included. Candidate predictors were assessed by questionnaire at baseline and (to capture early recovery) after one and two weeks. Primary outcome was non-recovery after three months, defined as at least mild pain (Numeric Rating Scale > 2/10). Machine learning models to predict non-recovery were developed and internally validated, and compared with two current practices in physiotherapy (STarT Back tool and physiotherapists’ expectation) and ‘traditional’ logistic regression analysis. Results Forty-seven percent of the participants did not recover at three months. The best performing machine learning model showed acceptable predictive performance (area under the curve: 0.66). Although this was no better than a’traditional’ logistic regression model, it outperformed current practice. Conclusions We developed two prognostic models containing partially different predictors, with acceptable performance for predicting (non-)recovery in patients with acute LBP, which was better than current practice. Our prognostic models have the potential of integration in a clinical decision support system to facilitate data-driven, personalized treatment of acute low back pain, but needs external validation first.
... In contrast to hypothesis-driven clinical studies, in which the required sample size to detect a certain effect size at a certain alpha level with certain power can be calculated a priori, 9 such calculations are not straightforward for machine learning approaches. 10 The amount of data depends on a variety of unknown factors, including the signalto-noise ratio, the relative contribution of different sensors and complexity of the classification model, and the incidence of cardiac arrests in our patient population. Based on a previous study in which The open access data platform will allow connection of a variety of smartwatches (and will be available for other wearables capable of detecting OHCA should such devices be developed by other research groups in the future), and the user of a connected smartwatch will be continuously monitored for OHCA. ...
Full-text available
Out-of-hospital cardiac arrest (OHCA) is a leading cause of mortality. Immediate detection and treatment are of paramount importance for survival and good quality of life. The first link in the ‘chain of survival’ after OHCA – the early recognition and alerting of emergency medical services – is at the same time the weakest link as it entirely depends on witnesses. About one half of OHCA cases are unwitnessed, and victims of unwitnessed OHCA have virtually no chance of survival with good neurologic outcome. Also in case of a witnessed cardiac arrest, alerting of emergency medical services is often delayed for several minutes. Therefore, a technological solution to automatically detect cardiac arrests and to instantly trigger an emergency response has the potential to save thousands of lives per year and to greatly improve neurologic recovery and quality of life in survivors. The HEART-SAFE consortium, consisting of two academic centres and three companies in the Netherlands, collaborates to develop and implement a technical solution to reliably detect OHCA based on sensor signals derived from commercially available smartwatches using artificial intelligence. In this manuscript, we describe the rationale, the envisioned solution, as well as a protocol outline of the work packages involved in the development of the technology.
... While deep learning-based solutions [1] have achieved great success in medical image analysis, such as X-ray image classification and segmentation tasks for diseases in bones, lungs and heads, it might require an extremely large number of images with fine annotations to train the deep neural network (DNN) models and deliver decent performance [2] in a supervised learning manner. To lower the sample size required, the self-supervised learning (SSL) paradigm has been recently proposed to boost the performance of DNN models through learning visual features from images [3] without using labels. ...
Full-text available
While self-supervised learning (SSL) algorithms have been widely used to pre-train deep models, few efforts [11] have been done to improve representation learning of X-ray image analysis with SSL pre-trained models. In this work, we study a novel self-supervised pre-training pipeline, namely Multi-task Self-super-vised Continual Learning (MUSCLE), for multiple medical imaging tasks, such as classification and segmentation, using X-ray images collected from multiple body parts, including heads, lungs, and bones. Specifically, MUSCLE aggregates X-rays collected from multiple body parts for MoCo-based representation learning, and adopts a well-designed continual learning (CL) procedure to further pre-train the backbone subject various X-ray analysis tasks jointly. Certain strategies for image pre-processing, learning schedules, and regularization have been used to solve data heterogeneity, over-fitting, and catastrophic forgetting problems for multi-task/dataset learning in MUSCLE. We evaluate MUSCLE using 9 real-world X-ray datasets with various tasks, including pneumonia classification, skeletal abnormality classification, lung segmentation, and tuberculosis (TB) detection. Comparisons against other pre-trained models [7] confirm the proof-of-concept that self-supervised multi-task/dataset continual pre-training could boost the performance of X-ray image analysis.KeywordsX-ray images (X-ray)Self-supervised learning
... 26 Although there is no established minimum 23 Patients with chronic stroke: n = 24 number of training samples, 27 there are experiments that have indicated that increasing the size of the dataset improves performance. [27][28][29] Therefore, once the algorithm starts to detect patterns, it is best to increase the sample size. Of the included studies, 4 did not report the number of samples used for training, 18,21,22,24 while the rest reported using 992, 19 1080, 17 1680, 23 and 1960 20 samples; however, they did not justify the sample size used. ...
Full-text available
Background: The assessment of motor function is vital in post-stroke rehabilitation protocols, and it is imperative to obtain an objective and quantitative measurement of motor function. There are some innovative machine learning algorithms that can be applied in order to automate the assessment of upper extremity motor function. Objectives: To perform a systematic review and meta-analysis of the efficacy of machine learning algorithms for assessing upper limb motor function in post-stroke patients and compare these algorithms to clinical assessment. Material and methods: The protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) database. The review was carried out according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and the Cochrane Handbook for Systematic Reviews of Interventions. The search was performed using 6 electronic databases. The meta-analysis was performed with the data from the correlation coefficients using a random model. Results: The initial search yielded 1626 records, but only 8 studies fully met the eligibility criteria. The studies reported strong and very strong correlations between the algorithms tested and clinical assessment. The meta-analysis revealed a lack of homogeneity (I2 = 85.29%, Q = 48.15), which is attributable to the heterogeneity of the included studies. Conclusion: Automated systems using machine learning algorithms could support therapists in assessing upper extremity motor function in post-stroke patients. However, to draw more robust conclusions, methodological designs that minimize the risk of bias and increase the quality of the methodology of future studies are required.
Sleep posture monitoring is an essential assessment for obstructive sleep apnea (OSA) patients. The objective of this study is to develop a machine learning-based sleep posture recognition system using a dual ultra-wideband radar system. We collected radiofrequency data from two radars positioned over and at the side of the bed for 16 patients performing four sleep postures (supine, left and right lateral, and prone). We proposed and evaluated deep learning approaches that streamlined feature extraction and classification, and the traditional machine learning approaches that involved different combinations of feature extractors and classifiers. Our results showed that the dual radar system performed better than either single radar. Predetermined statistical features with random forest classifier yielded the best accuracy (0.887), which could be further improved via an ablation study (0.938). Deep learning approach using transformer yielded accuracy of 0.713.
Full-text available
Applications based on artificial intelligence (AI) and deep learning (DL) are rapidly being developed to assist in the detection and characterization of lesions on medical images. In this study, we developed and examined an image-processing workflow that incorporates both traditional image processing with AI technology and utilizes a standards-based approach for disease identification and quantitation to segment and classify tissue within a whole-body [18F]FDG PET/CT study.Methods One hundred thirty baseline PET/CT studies from two multi-institutional preoperative clinical trials in early-stage breast cancer were semi-automatically segmented using techniques based on PERCIST v1.0 thresholds and the individual segmentations classified as to tissue type by an experienced nuclear medicine physician. These classifications were then used to train a convolutional neural network (CNN) to automatically accomplish the same tasks.ResultsOur CNN-based workflow demonstrated Sensitivity at detecting disease (either primary lesion or lymphadenopathy) of 0.96 (95% CI [0.9, 1.0], 99% CI [0.87,1.00]), Specificity of 1.00 (95% CI [1.0,1.0], 99% CI [1.0,1.0]), DICE score of 0.94 (95% CI [0.89, 0.99], 99% CI [0.86, 1.00]), and Jaccard score of 0.89 (95% CI [0.80, 0.98], 99% CI [0.74, 1.00]).Conclusion This pilot work has demonstrated the ability of AI-based workflow using DL-CNNs to specifically identify breast cancer tissue as determined by [18F]FDG avidity in a PET/CT study. The high sensitivity and specificity of the network supports the idea that AI can be trained to recognize specific tissue signatures, both normal and disease, in molecular imaging studies using radiopharmaceuticals. Future work will explore the applicability of these techniques to other disease types and alternative radiotracers, as well as explore the accuracy of fully automated and quantitative detection and response assessment.
Objective: Machine learning (ML) algorithms have emerged as powerful predictive tools in the field stroke. Here, we examine the predictive accuracy of ML models for predicting functional outcomes using 24-hour post-treatment characteristics in the Systematic Evaluation of Patients Treated With Neurothrombectomy Devices for Acute Ischemic Stroke (STRATIS) Registry. Methods: ML models, adaptive boost, random forest (RF), classification and regression tree (CART), C5.0 decision tree (C5.0), support vector machine (SVM), least absolute shrinkage and selection operator (LASSO), and logistic regression (LR), and traditional LR models were used to predict 90-day functional outcome (modified Rankin Scale score 0-2). Twenty-four-hour National Institutes of Health Stroke Scale (NIHSS) was examined as a continuous or dichotomous variable in all models. Model accuracy was assessed using the area under characteristic curve (AUC). Results: The 24-Hour NIHSS score was a top-predictor of functional outcome in all models. ML models using the continuous 24-hour NIHSS scored showed moderate-to-good predictive performance (range mean AUC:0.76-0.92); however, RF (AUC:0.92±0.028) outperformed all ML models, except LASSO (AUC:0.89±0.023, p=0.0958). Importantly, RF demonstrated a significantly higher predictive value than LR (AUC:0.87±0.031, p=0.048) and traditional LR (AUC:85±0.06, p=0.035) when utilizing the 24-hour continuous NIHSS score. Predictive accuracy was similar between the 24-hour NIHSS score dichotomous and continuous ML models. Interpretation: In this substudy, we found similar predictive accuracy for functional outcome when utilizing the 24-hr NIHSS score as a continuous or dichotomous variable in ML models. ML models had moderate-to-good predictive accuracy, with RF outperforming LR models. External validation of these ML models is warranted. This article is protected by copyright. All rights reserved.
Full-text available
Position statements by radiologic associations have highlighted that adequate sample-size considerations are essential to ensure robust, unbiased clinical model development for machine learning (ML) studies. The generalizability of ML algorithms for classification tasks are highly dependent on the size and quality of the training data set on which they are trained. This is particularly relevant for ML methods applied to the field of diagnostic radiology where access to large quantities of high-quality data is challenging. The authors sought to provide a systematic review of sample-size determination methodologies in ML applied to medical imaging based on articles published between 1946 and 2018. The authors found a paucity of research in training set size determination methodologies applied to ML in medical imaging, highlighting the need for additional research in this area. Balki et al is a meaningful contribution to the body of literature that informs efforts to standardize reporting practices and methodologies at the intersection of computer science, machine learning and clinical research. Congratulations to the authors on their achievement and contribution to radiology research and applied informatics.
Full-text available
Background: Dynamic susceptibility contrast (DSC) MR perfusion is a frequently-used technique for neurovascular imaging. The progress of a bolus of contrast agent through the tissue of the brain is imaged via a series of T2*-weighted MRI scans. Clinically relevant parameters such as blood flow and Tmax can be calculated by deconvolving the contrast-time curves with the bolus shape (arterial input function). In acute stroke, for instance, these parameters may help distinguish between the likely salvageable tissue and irreversibly damaged infarct core. Deconvolution typically relies on singular value decomposition (SVD): however, studies have shown that these algorithms are very sensitive to noise and artifacts present in the image and therefore may introduce distortions that influence the estimated output parameters. Methods: In this work, we present a machine learning approach to the estimation of perfusion parameters in DSC-MRI. Various machine learning models using as input the raw MR source data were trained to reproduce the output of an FDA approved commercial implementation of the SVD deconvolution algorithm. Experiments were conducted to determine the effect of training set size, optimal patch size, and the effect of using different machine-learning models for regression. Results: Model performance increased with training set size, but after 5,000 samples (voxels) this effect was minimal. Models inferring perfusion maps from a 5 by 5 voxel patch outperformed models able to use the information in a single voxel, but larger patches led to worse performance. Random Forest models produced had the lowest root mean squared error, with neural networks performing second best: however, a phantom study revealed that the random forest was highly susceptible to noise levels, while the neural network was more robust. 
Conclusion: The machine learning-based approach produces estimates of the perfusion parameters invariant to the noise and artifacts that commonly occur as part of MR acquisition. As a result, better robustness to noise is obtained, when evaluated against the FDA approved software on acute stroke patients and simulated phantom data.
Introduction Advanced machine learning methods might help to identify dementia risk from neuroimaging, but their accuracy to date is unclear. Methods We systematically reviewed the literature, 2006 to late 2016, for machine learning studies differentiating healthy aging from dementia of various types, assessing study quality, and comparing accuracy at different disease boundaries. Results Of 111 relevant studies, most assessed Alzheimer's disease versus healthy controls, using AD Neuroimaging Initiative data, support vector machines, and only T1-weighted sequences. Accuracy was highest for differentiating Alzheimer's disease from healthy controls and poor for differentiating healthy controls versus mild cognitive impairment versus Alzheimer's disease or mild cognitive impairment converters versus nonconverters. Accuracy increased using combined data types, but not by data source, sample size, or machine learning method. Discussion Machine learning does not differentiate clinically relevant disease categories yet. More diverse data sets, combinations of different types of data, and close clinical integration of machine learning would help to advance the field.
We present a new technique to fully automate the segmentation of an organ from 3D ultrasound (3D-US) volumes, using the placenta as the target organ. Image analysis tools to estimate organ volume do exist but are too time consuming and operator dependent. Fully automating the segmentation process would potentially allow the use of placental volume to screen for increased risk of pregnancy complications. The placenta was segmented from 2,393 first-trimester 3D-US volumes using a semiautomated technique. This was quality controlled by three operators to produce the "ground-truth" data set. A fully convolutional neural network (OxNNet) was trained using this ground-truth data set to automatically segment the placenta. OxNNet delivered state-of-the-art automatic segmentation. The effect of training set size on the performance of OxNNet demonstrated the need for large data sets. The clinical utility of placental volume was tested by examining predictions of small-for-gestational-age (SGA) babies at term. The receiver operating characteristic curves demonstrated almost identical results between OxNNet and the ground truth. Our results demonstrated good similarity to the ground truth and almost identical clinical results for the prediction of SGA.
Artificial intelligence (AI) is rapidly moving from an experimental phase to an implementation phase in many fields, including medicine. The combination of improved availability of large datasets, increasing computing power, and advances in learning algorithms has created major performance breakthroughs in the development of AI applications. In the last 5 years, AI techniques known as deep learning have delivered rapidly improving performance in image recognition, caption generation, and speech recognition. Radiology, in particular, is a prime candidate for early adoption of these techniques. It is anticipated that the implementation of AI in radiology over the next decade will significantly improve the quality, value, and depth of radiology's contribution to patient care and population health, and will revolutionize radiologists' workflows. The Canadian Association of Radiologists (CAR) is the national voice of radiology committed to promoting the highest standards in patient-centered imaging, lifelong learning, and research. The CAR has created an AI working group with the mandate to discuss and deliberate on practice, policy, and patient care issues related to the introduction and implementation of AI in imaging. This white paper provides recommendations for the CAR derived from deliberations between members of the AI working group. This white paper on AI in radiology will inform CAR members and policymakers on key terminology, educational needs of members, research and development, partnerships, potential clinical applications, implementation, structure and governance, role of radiologists, and potential impact of AI on radiology in Canada.
The concentration of measure phenomena were discovered as the mathematical background to statistical mechanics at the end of the nineteenth/beginning of the twentieth century and have been explored in mathematics ever since. At the beginning of the twenty-first century, it became clear that the proper utilization of these phenomena in machine learning might transform the curse of dimensionality into the blessing of dimensionality. This paper summarizes recently discovered phenomena of measure concentration which drastically simplify some machine learning problems in high dimension and allow us to correct legacy artificial intelligence systems. The classical concentration of measure theorems state that i.i.d. random points are concentrated in a thin layer near a surface (a sphere or equators of a sphere, an average or median-level set of energy or another Lipschitz function, etc.). The new stochastic separation theorems describe the fine structure of these thin layers: the random points are not only concentrated in a thin layer but are all linearly separable from the rest of the set, even for exponentially large random sets. The linear functionals for separation of points can be selected in the form of the linear Fisher's discriminant. All artificial intelligence systems make errors. Non-destructive correction requires separation of the situations (samples) with errors from the samples corresponding to correct behaviour by a simple and robust classifier. The stochastic separation theorems provide us with such classifiers and determine a non-iterative (one-shot) procedure for their construction. This article is part of the theme issue ‘Hilbert’s sixth problem’.
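The separability claim in this abstract is easy to see numerically: in high dimension, a single random point can be split off from a large i.i.d. sample by a simple linear functional. A toy sketch, where the dimension, sample count, and Gaussian distribution are illustrative choices rather than the exact setting of the theorems:

```python
import numpy as np

# Toy demonstration of stochastic separation: a single random point x is
# linearly separable from a large i.i.d. random sample by the simple
# Fisher-style functional f(y) = <y, x> with threshold <x, x>/2.
rng = np.random.default_rng(42)
d, N = 200, 10_000
points = rng.standard_normal((N, d))

x = points[0]                        # the point to separate from everything else
others = points[1:]

threshold = 0.5 * (x @ x)            # half of x's squared norm
projections = others @ x             # <y, x> for every other point
violations = int(np.sum(projections >= threshold))

print(f"{violations} of {N - 1} points fall on the same side as x")
```

Because `x @ x` concentrates around `d` while `others @ x` has standard deviation of order `sqrt(d)`, the hyperplane separates `x` from all 9,999 other points with overwhelming probability, which is the mechanism behind the one-shot error-correction procedure the abstract describes.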
Purpose To assess the ability of convolutional neural networks (CNNs) to enable high-performance automated binary classification of chest radiographs. Materials and Methods In a retrospective study, 216,431 frontal chest radiographs obtained between 1998 and 2012 were procured, along with associated text reports and a prospective label from the attending radiologist. This data set was used to train CNNs to classify chest radiographs as normal or abnormal before evaluation on a held-out set of 533 images hand-labeled by expert radiologists. The effects of development set size, training set size, initialization strategy, and network architecture on end performance were assessed by using standard binary classification metrics; detailed error analysis, including visualization of CNN activations, was also performed. Results Average area under the receiver operating characteristic curve (AUC) was 0.96 for a CNN trained with 200,000 images. This AUC value was greater than that observed when the same model was trained with 2,000 images (AUC = 0.84, P < .005) but was not significantly different from that observed when the model was trained with 20,000 images (AUC = 0.95, P > .05). Averaging the CNN output score with the binary prospective label yielded the best-performing classifier, with an AUC of 0.98 (P < .005). Analysis of specific radiographs revealed that the model was heavily influenced by clinically relevant spatial regions but did not reliably generalize beyond thoracic disease. Conclusion CNNs trained with a modestly sized collection of prospectively labeled chest radiographs achieved high diagnostic performance in the classification of chest radiographs as normal or abnormal; this function may be useful for automated prioritization of abnormal chest radiographs. © RSNA, 2018. Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.
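The three AUC values reported in this abstract (0.84, 0.95, and 0.96 at 2,000, 20,000, and 200,000 training images) trace exactly the kind of saturating learning curve that post hoc sample-size methods fit and extrapolate. A minimal sketch, assuming the commonly used inverse power law err(n) = a·n^(−b) + c, which is the modeling assumption of such curve-fitting approaches and not of the study itself:

```python
import numpy as np

# Post hoc learning-curve extrapolation from the AUC values reported above.
# We fit err(n) = a * n**-b + c exactly through the three (n, 1 - AUC) points.
n = np.array([2_000.0, 20_000.0, 200_000.0])
auc = np.array([0.84, 0.95, 0.96])
err = 1.0 - auc

# The sample sizes are log-spaced by a factor of 10, so the fit has a closed
# form: (err0 - err1) / (err1 - err2) = 10**b.
b = np.log10((err[0] - err[1]) / (err[1] - err[2]))
a = (err[0] - err[1]) / (n[0] ** -b - n[1] ** -b)
c = err[0] - a * n[0] ** -b

auc_plateau = 1.0 - c                          # extrapolated asymptotic AUC
n_for_auc95 = (a / (0.05 - c)) ** (1.0 / b)    # n predicted to reach AUC 0.95
print(f"b = {b:.2f}, plateau AUC = {auc_plateau:.3f}, "
      f"n for AUC 0.95 = {n_for_auc95:,.0f}")
```

Because the curve is fit exactly through the reported points, it reproduces AUC 0.95 at n = 20,000 and predicts a plateau just above 0.96, consistent with the study's finding that the gain from 20,000 to 200,000 images was not significant.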
In this work we developed a deep convolutional neural network (CNN) for classification of malignant and benign masses in digital breast tomosynthesis (DBT) using a multistage transfer learning approach that utilized data from similar auxiliary domains for intermediate-stage fine-tuning. Breast imaging data from DBT, digitized screen-film mammography (SFM), and digital mammography (DM), totaling 4,039 unique ROIs (1,797 malignant and 2,242 benign), were collected. Using cross-validation, we selected the best transfer network from six transfer networks by varying the level up to which the convolutional layers were frozen. In a single-stage transfer learning approach, knowledge from a CNN trained on ImageNet data was fine-tuned directly with DBT data. In a multistage transfer learning approach, knowledge learned from ImageNet was first fine-tuned with the mammography data and then fine-tuned with the DBT data. Two transfer networks were compared for the second-stage transfer learning by freezing most of the CNN structure versus freezing only the first convolutional layer. We studied the dependence of the classification performance on training sample size for various transfer learning and fine-tuning schemes by varying the training data from 1% to 100% of the available sets. The area under the receiver operating characteristic curve (AUC) was used as the performance measure. The view-based AUC on the test set for single-stage transfer learning was 0.85±0.05 and improved significantly (p<0.05) to 0.91±0.03 for multistage learning. This study demonstrated that, when the training sample size from the target domain is limited, an additional stage of transfer learning using data from a similar auxiliary domain is advantageous.
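The layer-freezing strategy compared in this study can be illustrated framework-agnostically: during fine-tuning, only the unfrozen layers receive gradient updates while the transferred weights stay fixed. A toy numpy sketch with a two-layer network on synthetic data; the architecture, task, and hyperparameters are invented for illustration and are not the study's actual CNN:

```python
import numpy as np

# Sketch of "freezing" a transferred layer during fine-tuning: gradient updates
# are applied only to the unfrozen task head, never to the pretrained layer.
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 8))
y = np.sin(X @ rng.standard_normal(8))            # synthetic regression target

W1 = rng.standard_normal((8, 16)) * 0.5           # "pretrained" layer -> frozen
W2 = rng.standard_normal((16, 1)) * 0.1           # new task head -> trainable
W1_before = W1.copy()

def mse():
    return float(np.mean(((np.tanh(X @ W1) @ W2).ravel() - y) ** 2))

initial_mse = mse()
lr = 0.05
for _ in range(200):
    h = np.tanh(X @ W1)                           # frozen feature extractor
    pred = (h @ W2).ravel()
    grad_out = 2.0 * (pred - y) / len(y)          # d(MSE)/d(pred)
    W2 -= lr * (h.T @ grad_out[:, None])          # update only the unfrozen head
    # W1 is frozen: no gradient step is ever applied to it

final_mse = mse()
print(f"MSE before fine-tuning {initial_mse:.3f}, after {final_mse:.3f}")
```

Freezing earlier layers trades adaptability for protection against overfitting, which is why the study's comparison of freezing levels interacts with training sample size.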
Individualized behavioral/cognitive prediction using machine learning (ML) regression approaches is becoming increasingly applied. The specific ML regression algorithm and the sample size are two key factors that non-trivially influence prediction accuracy. However, the effects of the ML regression algorithm and sample size on individualized behavioral/cognitive prediction performance have not been comprehensively assessed. To address this issue, the present study included six commonly used ML regression algorithms: ordinary least squares (OLS) regression, least absolute shrinkage and selection operator (LASSO) regression, ridge regression, elastic-net regression, linear support vector regression (LSVR), and relevance vector regression (RVR), to perform specific behavioral/cognitive predictions based on different sample sizes. Specifically, the publicly available resting-state functional MRI (rs-fMRI) dataset from the Human Connectome Project (HCP) was used, and whole-brain resting-state functional connectivity (rsFC) or rsFC strength (rsFCS) were extracted as prediction features. Twenty-five sample sizes (ranging from 20 to 700) were studied by sub-sampling from the entire HCP cohort. The analyses showed that rsFC-based LASSO regression performed remarkably worse than the other algorithms, and rsFCS-based OLS regression performed markedly worse than the other algorithms. Regardless of the algorithm and feature type, both the prediction accuracy and its stability exponentially increased with increasing sample size. The specific patterns of the observed algorithm and sample size effects were well replicated in predictions using retest fMRI data, data processed by different imaging preprocessing schemes, and different behavioral/cognitive scores, indicating excellent robustness/generalization of the effects. The current findings provide critical insight into how the selected ML regression algorithm and sample size influence individualized predictions of behavior/cognition and offer important guidance for choosing the ML regression algorithm or sample size in relevant investigations.
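The pattern this study reports (accuracy rising steeply and then saturating as training samples are added) can be reproduced in miniature with one of the listed algorithms, ridge regression, which has a closed-form solution. The synthetic features, noise level, and regularization strength below are illustrative choices, not the HCP analysis:

```python
import numpy as np

# Miniature sample-size sweep with closed-form ridge regression on synthetic
# features; test error is measured on a fixed held-out set at increasing n.
rng = np.random.default_rng(7)
d, n_test, lam = 50, 500, 1.0
w_true = rng.standard_normal(d)

def make_data(n):
    X = rng.standard_normal((n, d))
    return X, X @ w_true + rng.normal(0.0, 1.0, n)

X_test, y_test = make_data(n_test)

def ridge_test_mse(n_train):
    X, y = make_data(n_train)
    # Closed-form ridge solution: w = (X'X + lam*I)^-1 X'y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(np.mean((X_test @ w - y_test) ** 2))

sizes = [20, 60, 200, 700]                # echoes the study's 20-to-700 range
errors = [ridge_test_mse(n) for n in sizes]
for n, e in zip(sizes, errors):
    print(f"n = {n:4d}  test MSE = {e:.2f}")
```

With n below the feature count the model is badly underdetermined and test error is dominated by estimation variance; as n grows the error decays toward the irreducible noise floor, mirroring the exponential accuracy gains the abstract describes.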
Purpose Demonstrate a novel fast method for reconstruction of multi‐dimensional MR fingerprinting (MRF) data using deep learning methods. Methods A neural network (NN) is defined using the TensorFlow framework and trained on simulated MRF data computed with the extended phase graph formalism. The NN reconstruction accuracy for noiseless and noisy data is compared to conventional MRF template matching as a function of training data size and is quantified in simulated numerical brain phantom data and International Society for Magnetic Resonance in Medicine/National Institute of Standards and Technology phantom data measured on 1.5T and 3T scanners with an optimized MRF EPI and MRF fast imaging with steady state precession (FISP) sequences with spiral readout. The utility of the method is demonstrated in a healthy subject in vivo at 1.5T. Results Network training required 10 to 74 minutes; once trained, data reconstruction required approximately 10 ms for the MRF EPI and 76 ms for the MRF bSSFP sequence. Reconstruction of simulated, noiseless brain data using the NN resulted in a RMS error (RMSE) of 2.6 ms for T1 and 1.9 ms for T2. The reconstruction error in the presence of noise was less than 10% for both T1 and T2 for SNR greater than 25 dB. Phantom measurements yielded good agreement (R² = 0.99/0.99 for MRF EPI T1/T2 and 0.94/0.98 for MRF bSSFP T1/T2) between the T1 and T2 estimated by the NN and reference values from the International Society for Magnetic Resonance in Medicine/National Institute of Standards and Technology phantom. Conclusion Reconstruction of MRF data with a NN is accurate, 300‐ to 5000‐fold faster, and more robust to noise and dictionary undersampling than conventional MRF dictionary‐matching.
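The conventional template matching that this network replaces amounts to a normalized inner-product search over a precomputed dictionary of signal evolutions. A toy sketch, with plain exponential decays standing in for the EPG-simulated fingerprints; the echo times, T2 grid, and noise level are invented for illustration:

```python
import numpy as np

# Toy MRF-style dictionary matching. Real MRF matches EPG-simulated fingerprints;
# here each dictionary atom is a simple exponential decay parameterized by T2,
# an illustrative stand-in for the true signal model.
rng = np.random.default_rng(3)
t = np.linspace(5.0, 300.0, 40)                   # echo times in ms (illustrative)
t2_grid = np.arange(10.0, 201.0, 2.0)             # candidate T2 values in ms

dictionary = np.exp(-t[None, :] / t2_grid[:, None])
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

t2_true = 87.0
signal = 1.7 * np.exp(-t / t2_true) + rng.normal(0.0, 0.01, t.size)

# Matching step: maximize the normalized inner product, which is insensitive
# to the unknown overall scaling of the measured signal.
scores = dictionary @ (signal / np.linalg.norm(signal))
t2_est = float(t2_grid[np.argmax(scores)])
print(f"true T2 = {t2_true} ms, matched T2 = {t2_est} ms")
```

The matching cost grows with the dictionary size and parameter dimensionality, which is why the abstract's trained network, which maps a measured signal directly to parameter estimates, can be orders of magnitude faster and less sensitive to dictionary undersampling.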