Critically Appraised Topic
Sample-Size Determination Methodologies for Machine Learning in
Medical Imaging Research: A Systematic Review
Indranil Balki, HBSc, Afsaneh Amirabadi, PhD, Jacob Levman, PhD, Anne L. Martel, PhD,
Žiga Emeršič, MEng, Blaž Meden, MEng, Angel Garcia-Pedrero, PhD, Saul C. Ramirez, MEng,
Dehan Kong, PhD, Alan R. Moody, MBBS, FRCP, FRCR, Pascal N. Tyrrell, PhD
Department of Medical Imaging, University of Toronto, Toronto, Ontario, Canada
Department of Diagnostic Imaging, Hospital for Sick Children, Toronto, Ontario, Canada
Department of Mathematics, Statistics and Computer Science, St Francis Xavier University, Antigonish, Nova Scotia, Canada
Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts, USA
Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Department of Botany, Universidad de Valladolid, Castile and Leon, Spain
Computing School, Instituto Tecnológico de Costa Rica, Cartago, Costa Rica
Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
Purpose: The required training sample size for a particular machine learning (ML) model applied to medical imaging data is often unknown. The purpose of this study was to provide a descriptive review of current sample-size determination methodologies in ML applied to medical imaging and to propose recommendations for future work in the field.
Methods: We conducted a systematic literature search of articles using Medline and Embase with keywords including "machine learning," "image," and "sample size." The search included articles published between 1946 and 2018. Data regarding the ML task, sample size, and train-test pipeline were collected.
Results: A total of 167 articles were identified, of which 22 were included for qualitative analysis. There were only 4 studies that discussed sample-size determination methodologies, and 18 that tested the effect of sample size on model performance as part of an exploratory analysis. The observed methods could be categorized as pre hoc model-based approaches, which relied on features of the algorithm, or post hoc curve-fitting approaches requiring empirical testing to model and extrapolate algorithm performance as a function of sample size. Between studies, we observed great variability in performance testing procedures used for curve-fitting, model assessment methods, and reporting of confidence in sample sizes.
Conclusions: Our study highlights the scarcity of research in training set size determination methodologies applied to ML in medical imaging, emphasizes the need to standardize current reporting practices, and guides future work in development and streamlining of pre hoc and post hoc sample size approaches.
* Address for correspondence: Pascal N. Tyrrell, PhD, University of Toronto, Department of Medical Imaging, 263 McCaul Street, 4th Floor Room 409, Toronto, Ontario M5T 1W7, Canada. E-mail address: firstname.lastname@example.org (P. N. Tyrrell).
Key Words: Sample size; Machine learning; Medical imaging; Radiology
Establishing principled methods for assessing the value of machine learning (ML) algorithms used in medical imaging research remains a substantial challenge. To improve the accuracy of ML methods applied to the field of diagnostic radiology, there is a trend toward extracting ever more features; in the limit, every pixel in an image can be regarded as a feature. Under such circumstances, to avoid the "curse of dimensionality" (where a classifier fails to generalize as the number of features increases), enormous training sets become necessary [1-4]. The sample size of the training set has long been recognized as the single biggest influence on the design and performance of pattern recognition systems [5]. Finite training set size is potentially an important source of model bias, and both the training and test set size contribute to variance in a model's performance [6-8].
In classical statistics, sample-size determination methodologies (SSDMs) estimate the optimum number of participants needed to arrive at scientifically valid results, often balancing an acceptable degree of precision with the availability of resources [9]. Analogously, for ML in medical imaging, we define an SSDM as a procedure to estimate the number of images required for an ML algorithm to reach a particular threshold of performance, or a sufficiently low generalization error. While sample size issues may affect many ML disciplines, this is a particularly pressing challenge in medical imaging, where access to large quantities of high-quality data is elusive [2,10]. In a recent influential review article [11], the importance of obtaining adequately sized, unbiased validation and training sets was identified as a crucial factor in the assessment and development of robust ML models. Recent studies in the field of neuroimaging have concluded that the generalizability of ML algorithms for classification tasks is heavily subject to the influence of the size and quality of the training sample set on which they are trained [12,13]. A recent white paper by the Canadian Association of Radiologists also described the effect of data size on model performance and highlighted the importance of developing sound, clinically validated models for medical applications [14]. Methods of calculating the required sample size for a given model applied to medical imaging data remain unknown.
We hypothesized that there would be a paucity of research relating to SSDMs for ML applications in medical imaging, and great scope for establishing standards of practice. Exploration of SSDMs will allow researchers to plan cost-effective experiments and assess confidence in the generalizability of a model trained on a limited number of samples. The purpose of our study was to investigate current sample size determination practices in the literature.
The objectives of our study were to (1) systematically identify studies pertaining to SSDMs for ML in medical imaging, (2) provide a descriptive review of the SSDMs employed, and (3) propose recommendations for future work in the field.
Materials and Methods
We conducted a systematic review of the literature to identify research involving SSDMs in medical imaging–specific ML. Relevant studies were identified by searching English-language scientific literature in OVID Medline (1946–November 21, 2018) and OVID Embase (1947–November 21, 2018) with keywords including but not limited to "machine learning," "imaging," and "training sample size." Duplicates were removed with the use of reference management software. A detailed search strategy may be found in Supplemental Tables S3 and S4. Manual searching of relevant ML journals and a review of references from collected studies were also conducted. Articles from both peer-reviewed journals and preprint servers (arXiv, bioRxiv) were considered for inclusion.
Studies were included if they met the following 2 inclusion criteria: (1) the study (a) proposed or tested the efficacy of an SSDM used in ML or (b) tested the effect of training sample size on model performance; and (2) the ML model was shown to be applicable to the field of medical imaging. Criterion 1b facilitated the additional inclusion of studies that would allow us to explore reporting of model performance under various sample size conditions, with the hope that this may inspire future development of SSDMs. Along with traditional radiologic modalities (magnetic resonance imaging, computed tomography, x-ray, ultrasound), studies using images from the broader domain of medical imaging, such as histopathological imaging and laser tomography imaging, were also considered for inclusion. This enabled us to explore the handling of sample-size considerations in domains that in many ways resemble the complexity, problem choice, rigorous validation standards, and algorithmic analysis of classical radiologic classification, regression, and segmentation tasks.
Articles were independently assessed for inclusion by authors I.B. and A.A., both of whom had prior experience in ML and evaluating scientific and medical literature. Any discrepancies were resolved by consulting study author P.T. Studies were excluded if they did not meet the inclusion criteria or if the full text could not be obtained.
The standard and well-established Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol was followed throughout the study screening and evaluation process [15]. At the outcome level, SSDMs were described in terms of their empirical validity (as demonstrated by measures of confidence in sample size estimates) and their potential for application in medical imaging domains. The primary outcome was to evaluate methodologies for sample size determination in various medical ML studies in order to formulate proposals for future research in the field. The secondary outcome was to describe and evaluate how ML model performance was evaluated at different sample sizes. In order to perform qualitative analysis, we extracted information from each included study pertaining to the medical imaging task, modality of images, ML model used, train-test pipeline, number and type of features, figure of merit for model evaluation, range of sample size conditions studied, range of observed model performance, and SSDMs employed, amongst other details.

Results
A total of 167 articles were identified from Medline (n = 81), Embase (n = 81), and hand-searching (n = 5) (Figure 1). After the removal of duplicate articles, screening of title and abstract, and full-text appraisal based on the inclusion criteria, 22 articles were selected for qualitative analysis. Of the 22 articles included, only 4 studies discussed an SSDM (Table 1), while 18 studies evaluated an ML model under different sample size conditions (Supplemental Table S1). A distribution of the publication years of included articles is shown in Figure 2, demonstrating that half of the included articles were published in the past 4 years (2016 through 2019). The ML tasks studied included binary classification (n = 11), segmentation (n = 5), regression (n = 3), and multiclass classification (n = 3) problems. The algorithms assessed were: some form of neural network (n = 12), some form of linear regression (n = 10), support vector machine (n = 7), linear discriminant analysis (n = 3), decision tree (n = 3), AdaBoost (n = 2), K-nearest neighbour (n = 1), and logistic regression (n = 1).
Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) study design flowchart. PRISMA flow diagram demonstrating study selection.
Table 1. Sample-size determination methodologies in machine learning applied to medical imaging (inclusion criteria 1a and 2). For each of the 4 SSDM studies (Hitzl et al, 2003 [34]; Rokem et al, 2017 [37]; Cho et al, 2016 [38]; Sahiner et al, 2000 [30]), the table reports the ML task and model, train-test pipeline, features, figure of merit, sample sizes studied, range of observed performance, SSDM employed, and estimated sample size for a given performance threshold.
Abbreviations: AMD = age-related macular degeneration; AUC = area under the receiver operating characteristic curve; AUC(∞) = AUC at infinite sample size conditions; CI = confidence interval; CNN = convolutional neural network; CV = cross-validation; FOM = figure of merit; LDA = linear discriminant analysis; N = number of training samples; OCA = overall classification accuracy; ROI = region of interest; sd = standard deviation; Sen. = sensitivity; SGLD = spatial grey level dependence; Sp. = specificity.
Only relevant model assessment methods and explicitly stated features used prior to feature selection are presented.
The feature-set size ranged from 3 to 2 × 10⁶, while the sample sizes studied ranged from 2 to 90,000/class (Supplemental Table S1). In general, increasing sample size beyond 1000/class demonstrated progressively smaller performance improvements (Supplemental Table S1). We have also included an appendix in the Supplementary Materials that contains the definition of key concepts and terminology used in this article (Supplemental Appendix S1).
Methods of Assessing Effect of Sample Size on Model Performance
There were 3 general methods used to assess the effect of sample size on model performance, which we term "performance-testing procedures" (PTPs), illustrated in Figure 3. The NxSubsampling scheme (Figure 3A) was employed in 8 studies [16-23]. Typically, a random subsample consisting of Y images was drawn from a large image pool and used to train an ML model. The model was then evaluated on an independent test set. This process was repeated N times for each subsample size Y, with replacement between repetitions, in order to allow construction of a mean and confidence interval for the observed performance. In a Balanced NxSubsampling scheme (used in 3 of the 8 studies), each subsample contained an equal number of images from each class [18-20]. The NxRepeated Cross-Validation scheme (Figure 3B) and variations, employed by 6 studies, involved randomly splitting the dataset into a desired training set of size Y and a test set of the remaining samples on which the model was evaluated [24-29]. This random splitting was repeated N times for each desired training set size Y, and the mean and standard deviation of performance for each sample size could be obtained. Sahiner et al [30] and Way et al [6] used variations of these PTPs while simulating samples. In a No Repetition scheme (Figure 3C), images were subsampled only once for each desired training set size Y and evaluated on an independent test set [31-33].
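To make the distinction between these PTPs concrete, the sketch below outlines a Balanced NxSubsampling loop; it is a minimal illustration of the scheme as described above, and the classifier, subsample sizes, and number of repeats are our own illustrative assumptions rather than choices taken from any reviewed study.

```python
# A minimal, hypothetical sketch of a Balanced NxSubsampling PTP;
# classifier, subsample sizes, and n_repeats are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def balanced_nx_subsampling(X_pool, y_pool, X_test, y_test,
                            sizes=(50, 100, 200, 400), n_repeats=20):
    """Mean and approximate 95% CI of accuracy per training size Y."""
    classes = np.unique(y_pool)
    results = {}
    for y_size in sizes:                   # desired images per class (Y)
        scores = []
        for _ in range(n_repeats):         # repeat N times per size
            idx = np.concatenate([         # equal draw from each class
                rng.choice(np.flatnonzero(y_pool == c), y_size, replace=False)
                for c in classes
            ])
            model = LogisticRegression(max_iter=1000)
            model.fit(X_pool[idx], y_pool[idx])
            scores.append(accuracy_score(y_test, model.predict(X_test)))
        scores = np.asarray(scores)
        half_width = 1.96 * scores.std(ddof=1) / np.sqrt(n_repeats)
        results[y_size] = (scores.mean(), half_width)
    return results
```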
The SSDMs identified were categorized as either pre hoc (model-based) or post hoc (curve-fitting) approaches. Model-based approaches provided sample size estimates based on characteristics of the algorithm, the allowed classification error upon generalization, and the acceptable confidence in that error. These methods were based on the assumption that training and test samples were drawn from the same distribution. Hitzl et al [34] explored the use of several model-based approaches, one of which was postulated by Baum and Haussler [35] for use with single hidden-layer feedforward neural networks with k units and d weights. This method predicted that for some classification error ε (0 < ε ≤ 1/8), a network trained on m images, with the fraction 1 − ε/2 of those images correctly classified, will correctly classify a fraction 1 − ε of an unseen test set, provided that m ≥ O((d/ε) log₂(k/ε)). For the parameters in the study (ε = 1/8, k = 12, d = 30), this method suggested 1580 samples were required. A second method, proposed by Haykin [36], suggested valid generalization if the condition m = O((d + k)/ε) was satisfied. This method is similar to Widrow's rule of thumb, and in practice, m ≈ d/ε. This method suggested that 240 training samples were required.
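As a worked illustration, the following sketch reproduces both estimates from the study parameters; the proportionality constants hidden inside the O(·) notation are assumed to be 1, an assumption we make so that the published figures of 1580 and 240 are recovered.

```python
# Hypothetical worked example of the two pre hoc estimates above.
# The constants hidden in O(.) are taken as 1 (our assumption), which
# reproduces the 1580 and 240 figures reported for Hitzl et al's setup.
import math

d, k, eps = 30, 12, 1 / 8   # weights, hidden units, target error

m_baum = (d / eps) * math.log2(k / eps)   # Baum-Haussler: (d/eps)*log2(k/eps)
m_haykin = d / eps                        # Haykin / Widrow rule: ~d/eps

print(f"Baum-Haussler: {m_baum:.0f} images")    # -> 1580
print(f"Haykin:        {m_haykin:.0f} images")  # -> 240
```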
Curve-fitting SSDMs relied on empirically evaluating model performance at select sample sizes (using PTPs), with the goal of extrapolating algorithm performance as a function of training set size. Two major curve-fitting approaches were identified. The learning curve-fitting approach relied on modelling the relationship between training data size and classification accuracy using an inverse power law function to model the ML algorithm's learning curve (Figure 4A). Rokem et al [37] and Cho et al [38] employed this approach in their binary and multiclass classification tasks, respectively. Rokem et al [37] estimated that 10,000 images/class were required to obtain an overall classification accuracy (OCA) of 82%, while Cho et al [38] predicted 4092 images/class to acquire an OCA of 99.5%. The observed differences can be attributed to the particular ML model and task, subsampling procedures, and variations in learning curve fitting.
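A minimal sketch of the learning curve-fitting idea follows, assuming the commonly used saturating inverse power law form acc(n) = a − b·n⁻ᶜ; the exact parameterization varies between the reviewed studies, and the measurement points below are invented for illustration. Note that inverting the fitted curve only makes sense when the target accuracy lies below the fitted asymptote a, hence the guard in the sketch.

```python
# Hypothetical learning curve-fitting sketch: fit acc(n) = a - b*n**(-c)
# to PTP measurements, then invert the curve for a target accuracy.
# The data points below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, b, c):
    return a - b * n ** (-c)

n_obs = np.array([50, 100, 200, 400, 800, 1600])           # images/class
acc_obs = np.array([0.71, 0.76, 0.80, 0.83, 0.85, 0.865])  # mean OCA from a PTP

(a, b, c), _ = curve_fit(inverse_power_law, n_obs, acc_obs, p0=(0.9, 1.0, 0.5))

target = 0.88
if target < a:  # target must lie below the fitted asymptote a
    n_required = (b / (a - target)) ** (1 / c)
    print(f"Estimated images/class for {target:.0%} OCA: {n_required:,.0f}")
else:
    print(f"Fitted asymptote {a:.3f} never reaches {target:.0%} OCA")
```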
Sahiner et al [30] used a linear curve-fitting approach pioneered by Fukunaga and Hayes [39]. Empirically obtained area under the receiver operating characteristic curve (AUC) metrics (obtained through PTPs) were plotted against their respective 1/N (N = number of training images) values, and performance at higher sample sizes was extrapolated by linear regression as N tended to infinity (Figure 4B). This linear relationship, in which higher-order 1/N terms can be neglected, has been a subject of research in numerous theoretical texts and simulation studies [40-42]. The theoretical AUC at infinite sample size, or AUC(∞), for the particular computer-aided diagnosis experiment was calculated based on the mean/covariance matrix of 249 available mammograms. Sahiner et al [30] followed by conducting a simulation of training data features and classes based on the observed matrix. The linear curve-fitting approach using the training data was then used to validate the accuracy of the original AUC(∞) metric.
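This extrapolation can be sketched as an ordinary linear regression of AUC on 1/N, with the fitted intercept read off as AUC(∞); the AUC values below are invented for illustration and do not come from Sahiner et al's data.

```python
# Hypothetical sketch of the linear curve-fitting approach: regress AUC
# on 1/N and read the intercept as AUC(inf). Data invented for illustration.
import numpy as np

n_train = np.array([100, 200, 400, 800, 1600])       # training images N
auc = np.array([0.780, 0.820, 0.845, 0.860, 0.868])  # AUC measured via a PTP

slope, intercept = np.polyfit(1.0 / n_train, auc, 1)  # AUC ~ intercept + slope/N
print(f"AUC(inf) estimate: {intercept:.3f}")

target = 0.86  # must lie below the intercept for a finite estimate
n_required = slope / (target - intercept)
print(f"Estimated N for AUC {target}: {n_required:,.0f}")
```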
Figure 2. Publication years of included articles. Histogram displaying publication years of the 22 articles included in the analysis.
Discussion

A systematic literature search revealed the scarcity of SSDM research in ML applied to medical imaging. To our knowledge, this is the first study to systematically assess this topic. Model performance at different sample sizes was assessed using NxSubsampling, NxRepeated Cross-Validation, and No Repetition schemes. The SSDMs identified relied on model-based considerations or on generating predictive functions of model strength based on empirical testing at select sample sizes.
Figure 3. Performance testing procedures. (A) An illustration of the NxSubsampling scheme. (B) An illustration of the NxRepeated Cross-Validation scheme. (C) An illustration of the No Repetition scheme.
Pre hoc approaches
Model-based SSDMs provide pre hoc sample size estimates for particular ML algorithms. Baum and Haussler's method [35] was shown to be accurate within 5% of the actual observed OCA in Hitzl et al's study [34]. Recently, domain-specific model-based SSDMs have been developed for DNA microarray classification tasks. These approaches are based on parametric probability models and rely on factors including standardized fold change, class prevalence, and the number of genes and features on the arrays [43,44]. However, these methods were not robust to high dimensionality, differential gene expression, and problems with great intraclass variability, which are characteristics of medical imaging data.
Figure 4. Post hoc approaches. (A) Learning curve-fitting approach. Overall classification accuracy of a machine learning algorithm can be modelled against training data set size, typically resulting in a saturating inverse power law curve. The sample size x required to reach 95% classification accuracy is shown extrapolated with dotted lines. (B) Linear curve-fitting approach. The area under the receiver operating characteristic curve of a machine learning algorithm can be modelled against the inverse of the training data set size, typically resulting in a negative linear relationship. The sample size x required to reach an area under the curve of 0.95 is shown extrapolated with dotted lines.
Post hoc approaches
A class of pseudo-SSDMs was identified, which we categorized as curve-fitting approaches providing post hoc sample size estimates. An ML problem where the underlying data structure is more complex and difficult to model (eg, a classification task with subtle interclass variance) is analogous to low effect size in classical statistics; therefore, a greater number of samples is often required to adequately train the ML model. Since the performance of models can vary greatly depending on various parameters (eg, algorithm, feature size, anatomy being imaged), an empirical approach has the advantage of accurately modelling performance for a specific task, without making distributional assumptions. In their simulation experiment, Sahiner et al [30] predicted (regressed) AUC(∞) using the linear curve-fitting approach to within 0.02 of the theoretical AUC(∞) calculated from the mean/covariance matrix of 249 mammograms. Meanwhile, Cho et al [38] validated the learning curve approach to be accurate within 0.75% of the observed OCA at 1000 samples/class. Learning curve-fitting approaches have recently been used in other fields, such as DNA microarray, text, waveform, and biospectroscopy classification, where differences between actual and predicted classification errors ranged from 1% to 7% and were lower for higher sample sizes [43-47].
Recommendations for Future Work
The small number of studies included, the restriction to English-language literature, and the risk of publication bias (with lesser-impact studies going unpublished) remain limitations of our systematic review. However, our study highlights great scope for standardizing current computational and reporting practices. Efforts to create ML train-test pipelines [48] and international standardization documents still lack sample size considerations. Position statements by radiologic associations have highlighted that adequate sample size considerations are essential to ensure robust, unbiased clinical model development [11,14]. In our case, 18 of 22 studies provided a measure of variance for model performance at different sample sizes. However, the variance of the curve parameters and estimates of required sample size went unreported in all studies employing curve-fitting procedures. Hitzl et al [34] reported that only 1 of 18 studies provided confidence intervals for sensitivity and specificity rates for binary classification tasks similar to their own. Measures of variance provide insight into a model's generalizability and can be used to perform statistical tests of significance between performance at different sample sizes to identify a point of diminishing returns.
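For example, scores from repeated PTP runs permit a simple two-sample comparison between adjacent sample sizes. The sketch below uses Welch's t-test, which is our own illustrative choice rather than a procedure prescribed by any reviewed study, and it ignores the dependence between resampled training sets.

```python
# Hypothetical sketch: Welch's t-test between PTP scores at two adjacent
# training sizes; a non-significant difference suggests diminishing
# returns. The scores below are invented for illustration.
from scipy.stats import ttest_ind

oca_1000 = [0.842, 0.851, 0.848, 0.839, 0.845, 0.850]  # 1000 images/class
oca_2000 = [0.849, 0.853, 0.846, 0.851, 0.855, 0.848]  # 2000 images/class

stat, p = ttest_ind(oca_2000, oca_1000, equal_var=False)
print(f"t = {stat:.2f}, p = {p:.3f}")  # large p -> diminishing returns
```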
There is scope for future work in streamlining and comparing various PTPs. There is a need to develop and test the efficacy of more model-based approaches (by comparing predicted versus observed performance) and to consider development of hybrid approaches in which model-based estimates are refined with limited empirical testing. These methods should aim to optimize different performance metrics, not merely OCA. Studying the response of different SSDMs and PTPs to modality, feature set size, and algorithm choice remains an unexplored area of research. Moreover, many real-life datasets are highly class-imbalanced, and the effect of this skew in the data structure in the context of SSDMs and PTPs is yet to be elucidated. In cases where classes are not balanced, the smaller group may indeed be the constraining factor in acquiring enough samples to train a reliable ML model. As such, in unbalanced datasets, additional caution is warranted when applying SSDMs, and researchers should focus on the sample size needs of the class with the smallest number of samples. Finally, there is a need for consensus on domain-specific thresholds for reporting model performance so that studies can be easily compared. All SSDMs categorized in our study were applied to (binary or multiclass) classification tasks under the umbrella of supervised learning. Exploring sample size requirements in the growing field of unsupervised learning, in the context of classification, regression, and segmentation tasks, remains an issue to be investigated. Furthermore, while our study did not review sample size and error rate for all ML algorithms applied to imaging, the establishment of a repository containing standard models applied to various ML tasks in medical imaging would be helpful in guiding the sample size requirements of future projects in the field.
Limitations to SSDMs
In Hitzl et al's [34] experiment alone, model-based approaches produced sample size estimates differing by more than 1000 samples to achieve similar accuracy. Model-based methods may lack the ability to capture intricacies of a specific ML task and must be tuned to the bias-variance trade-offs and function-generating mechanisms of different algorithms. Curve-fitting approaches can be criticized for their need to perform empirical testing. We observed great variability in the sample sizes and replicates used in PTPs, the functions used in fitting, saturating parameters, balancing schemes employed, and thresholds chosen for optimal performance. The NxSubsampling and NxRepeated Cross-Validation schemes are limited by high variance at small subsample sizes and low test sample sizes, respectively. Attempts to overcome these challenges [19,45,49] remain key to widespread implementation. Finally, the validity of the linear curve-fitting approach is highly dependent on the algorithm, the number of features, and their distributions [30,42]. While simulation is a powerful tool to investigate these issues, its reliance on distributional assumptions, inability to match the subtle interclass and high intraclass variability of real datasets, and computational challenges in high dimensionality may render it unrepresentative. Based on our systematic review, researchers should attempt to estimate sample size requirements for their study using both pre hoc and post hoc methods, keeping in mind their unique advantages and disadvantages. Supplemental Table S2 provides a brief summary comparison between the two approaches based on the literature reviewed; however, there is yet much scope for establishing definitive sample size estimation practice guidelines.
Conclusion

A systematic review of the literature enabled the identification and categorization of several procedures used to evaluate ML model performance at various sample sizes. Pre hoc model-based and post hoc curve-fitting SSDMs may help researchers plan more cost-effective experiments. This study highlights the scarcity of research in SSDMs in medical ML and guides further work in the field.
Supplementary data related to this article can be found online.
References

1. Gorban AN, Tyukin IY. Blessing of dimensionality: mathematical foundations of the statistical physics of data. Philos Trans A Math Phys Eng Sci 2018;376:20170237.
2. Ithapu VK, Singh V, Okonkwo O, Johnson SC. Randomized denoising autoencoders for smaller and efficient imaging based AD clinical trials. Med Image Comput Comput Assist Interv 2014;17:470–8.
3. Rasmussen PM, Hansen LK, Madsen KH, Churchill NW, Strother SC. Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recogn 2012;45:2085–100.
4. Obermeyer Z, Emanuel E. Predicting the future: big data, machine learning, and clinical medicine. N Engl J Med 2016;375:1216–9.
5. Raudys S, Jain A. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 1991;13:252–64.
6. Way TW, Sahiner B, Hadjiiski LM, Chan HP. Effect of finite sample size on feature selection and classification: a simulation study. Med Phys 2010;37:907–20.
7. Beiden SV, Maloof MA, Wagner RF. A general model for finite-sample effects in training and testing of competing classifiers. IEEE Trans Pattern Anal Mach Intell 2003;25:1561–9.
8. Chan HP, Sahiner B, Wagner RF, Petrick N. Classifier design for computer-aided diagnosis: effects of finite sample size on the mean performance of classical and neural network classifiers. Med Phys 1999;26:2654–68.
9. Biau DJ, Kernéis S, Porcher R. Statistics in brief: the importance of sample size in the planning and interpretation of medical research. Clin Orthop Relat Res 2008;466:2282–8.
10. Moody A. Perspective: the big picture. Nature 2013;502:S95.
11. Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 2018;286:800–9.
12. Pellegrini E, Ballerini L, Hernandez M del CV, et al. Machine learning of neuroimaging for assisted diagnosis of cognitive impairment and dementia: a systematic review. Alzheimers Dement 2018;10:519–35.
13. Schnack HG, Kahn RS. Detecting neuroimaging biomarkers for psychiatric disorders: sample size matters. Front Psychiatry 2016;7:50.
14. Tang A, Tam R, Cadrin-Chenevert A, et al. Canadian Association of Radiologists white paper on artificial intelligence in radiology. Can Assoc Radiol J 2018;69:120–35.
15. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009;6:e1000097.
16. Cui Z, Gong G. The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features. Neuroimage 2018;178:622–37.
17. Cohen O, Zhu B, Rosen MS. MR fingerprinting Deep RecOnstruction NEtwork (DRONE). Magn Reson Med 2018;80:885–94.
18. Zheng B, Chang Y-HH, Good WF, et al. Adequacy testing of training set sample sizes in the development of a computer-assisted diagnosis scheme. Acad Radiol 1997;4:497–502.
19. Abdulkadir A, Mortamet B, Vemuri P, Jack CR Jr, Krueger G, Klöppel S, Alzheimer's Disease Neuroimaging Initiative. Effects of hardware heterogeneity on the performance of SVM Alzheimer's disease classifier. Neuroimage 2011;58:785–92.
20. Samala RK, Chan HP, Hadjiiski L, Helvie MA, Richter CD, Cha KH. Breast cancer diagnosis in digital breast tomosynthesis: effects of training sample size on multi-stage transfer learning using deep neural nets. IEEE Trans Med Imaging 2018;38:686–96.
21. Dunnmon JA, Yi D, Langlotz CP, Re C, Rubin DL, Lungren MP. Assessment of convolutional neural networks for automated classification of chest radiographs. Radiology 2018;290:537–54.
22. Looney P, Stevenson GN, Nicolaides KH, et al. Fully automated, real-time 3D ultrasound segmentation to estimate first trimester placental volume using deep learning. JCI Insight 2018;3:e120178.
23. McKinley R, Hung F, Wiest R, Liebeskind DS, Scalzo F. A machine learning approach to perfusion imaging with dynamic susceptibility contrast MR. Front Neurol 2018;9:717.
24. Tourassi GD, Floyd CE. The effect of data sampling on the performance evaluation of artificial neural networks in medical diagnosis. Med Decis Making 1997;17:186–92.
25. Wang JY, Ngo MM, Hessl D, Hagerman RJ, Rivera SM. Robust machine learning-based correction on automatic segmentation of the cerebellum and brainstem. PLoS One 2016;11:e0156123.
26. Chang H, Borowsky A, Spellman P, Parvin B. Classification of tumor histology via morphometric context. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 2013;2013:10.
27. Chang H, Nayak N, Spellman PT, Bahram P. Characterization of tissue histopathology via predictive sparse decomposition and spatial pyramid matching. Med Image Comput Comput Assist Interv 2013.
28. Dinh CV, Steenbergen P, Ghobadi G, et al. Multicenter validation of prostate tumor localization using multiparametric MRI and prior knowledge. Med Phys 2017;44:949–61.
29. Juntu J, Sijbers J, De Backer S, Rajan J, Van Dyck D. Machine learning study of several classifiers trained with texture analysis features to differentiate benign from malignant soft-tissue tumors in T1-MRI images. J Magn Reson Imaging 2010;31:680–9.
30. Sahiner B, Chan H-P, Petrick N, Wagner RF, Hadjiiski L. Feature selection and classifier performance in computer-aided diagnosis: the effect of finite sample size. Med Phys 2000;27:1509–22.
31. Zhao G, Liu F, Oler JA, Meyerand ME, Kalin NH, Birn RM. Bayesian convolutional neural network based MRI brain extraction on nonhuman primates. Neuroimage 2018;175:32–44.
32. Morra JH, Tu Z, Apostolova LG, Green AE, Toga AW, Thompson PM. Comparison of AdaBoost and support vector machines for detecting Alzheimer's disease through automated hippocampal segmentation. IEEE Trans Med Imaging 2010;29:30–43.
33. Park SC, Sukthankar R, Mummert L, Satyanarayanan M, Zheng B. Optimization of reference library used in content-based medical image retrieval scheme. Med Phys 2007;34:4331–9.
34. Hitzl W, Reitsamer HA, Hornykewycz K, Mistlberger A, Grabner G. Application of discriminant, classification tree and neural network analysis to differentiate between potential glaucoma suspects with and without visual field defects. J Theor Med 2003;5:161–70.
35. Baum EB, Haussler D. What size net gives valid generalization? Neural Comput 1989;1:151–60.
36. Haykin S. Multilayer perceptrons. In: Haykin S, editor. Neural Networks: A Comprehensive Foundation. 2nd ed. Upper Saddle River, NJ: Prentice Hall; 1998. p. 205–26.
37. Rokem A, Wu Y, Lee A. Assessment of the need for separate test set and number of medical images necessary for deep learning: a sub-sampling study. bioRxiv 2017:196659. Accessed August 9, 2019.
38. Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint 2015. arXiv:1511.06348. Available at: https://arxiv.org/pdf/1511.06348.pdf. Accessed August 9, 2019.
39. Fukunaga K, Hayes RR. Effects of sample size in classifier design. IEEE Trans Pattern Anal Mach Intell 1989;11:873–85.
40. Wagner RF, Chan H-P, Sahiner B, Petrick N, Mossoba JT. Finite-sample effects and resampling plans: applications to linear classifiers in computer-aided diagnosis. Med Imaging 1997;3034:467–77.
41. Chan H-P, Sahiner B, Wagner RF, Petrick N. Effects of sample size on classifier design for computer-aided diagnosis. Proc SPIE Conf Medical Imaging 1998;3338:845–58.
42. Chan HP, Sahiner B, Wagner RF, Petrick N, Mossoba J. Effects of sample size on classifier design: quadratic and neural network classifiers. Image Process Med Imaging 1997;3034(Pts 1 2):1102–13.
43. Dobbin KK, Zhao Y, Simon RM. How large a training set is needed to develop a classifier for microarray data? Clin Cancer Res 2008;14:108–14.
44. Dobbin KK, Simon RM. Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 2007;8:101–17.
45. Mukherjee S, Tamayo P, Rogers S, et al. Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol 2003;10:119–42.
46. Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classification models. Anal Chim Acta 2013;760:25–33.
47. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Med Inform Decis Mak 2012;12:8.
48. Samper-González J, Burgos N, Fontanella S, et al. Yet another ADNI machine learning paper? Paving the way towards fully-reproducible research on classification of Alzheimer's disease. Proc Machine Learning in Medical Imaging MLMI 2017, MICCAI Workshop, Lecture Notes in Computer Science 2017;10541:53–60.
49. Sanchez BN, Wu M, Song PX, Wang W. Study design in high-dimensional classification analysis. Biostatistics 2016;17:722–36.
50. Vapnik VN, Chervonenkis AY. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Appl 1971;16:264–80.