REVIEW
published: xx September 2017
doi: 10.3389/fnins.2017.00543
Edited by:
Yaroslav O. Halchenko,
Dartmouth College, United States
Reviewed by:
Matthew Brett,
University of Cambridge,
United Kingdom
Jean-Baptiste Poline,
University of California, Berkeley,
United States
*Correspondence:
Danilo Bzdok
danilo.bzdok@rwth-aachen.de
Specialty section:
This article was submitted to
Brain Imaging Methods,
a section of the journal
Frontiers in Neuroscience
Received: 12 April 2017
Accepted: 19 September 2017
Published: xx September 2017
Citation:
Bzdok D (2017) Classical Statistics
and Statistical Learning in Imaging
Neuroscience.
Front. Neurosci. 11:543.
doi: 10.3389/fnins.2017.00543
Classical Statistics and Statistical
Learning in Imaging Neuroscience
Danilo Bzdok 1, 2, 3*
1Department of Psychiatry, Psychotherapy and Psychosomatics, Medical Faculty, RWTH Aachen, Aachen, Germany, 2JARA,
Jülich-Aachen Research Alliance, Translational Brain Medicine, Aachen, Germany, 3Parietal Team, INRIA, Gif-sur-Yvette,
France
Brain-imaging research has predominantly generated insight by means of classical
statistics, including regression-type analyses and null-hypothesis testing using t-test
and ANOVA. Throughout recent years, statistical learning methods enjoy increasing
popularity especially for applications in rich and complex data, including cross-validated
out-of-sample prediction using pattern classification and sparsity-inducing regression.
This concept paper discusses the implications of inferential justifications and algorithmic
methodologies in common data analysis scenarios in neuroimaging. It is retraced how
classical statistics and statistical learning originated from different historical contexts,
build on different theoretical foundations, make different assumptions, and evaluate
different outcome metrics to permit differently nuanced conclusions. The present
considerations should help reduce current confusion between model-driven classical
hypothesis testing and data-driven learning algorithms for investigating the brain with
imaging techniques.
Keywords: neuroimaging, data science, epistemology, statistical inference, machine learning, p-value, Rosetta stone
“The trick to being a scientist is to be open to using a wide variety of tools.”
Breiman (2001)
INTRODUCTION
Among the greatest challenges humans face are cultural misunderstandings between individuals,
groups, and institutions (Hall, 1976). The topic of the present paper is the culture clash between
knowledge generation based on null-hypothesis testing and out-of-sample pattern generalization
(Friedman, 1998; Breiman, 2001; Shmueli, 2010; Donoho, 2015). These statistical paradigms are
now increasingly combined in brain-imaging studies (Kriegeskorte et al., 2009; Varoquaux and
Thirion, 2014). Ensuing inter-cultural misunderstandings are unfortunate because the invention
and application of new research methods has always been a driving force in the neurosciences
(Greenwald, 2012; Yuste, 2015). Here the goal is to disentangle the contexts underlying classical
statistical inference and out-of-sample generalization by providing a direct comparison of their
historical trajectories, modeling philosophies, conceptual frameworks, and performance metrics.
During recent years, neuroscience has transitioned from qualitative reports of few patients with
neurological brain lesions to quantitative lesion-symptom mapping on the voxel level in hundreds
of patients (Gläscher et al., 2012). We have gone from manually staining and microscopically
inspecting single brain slices to 3D models of neuroanatomy at micrometer scale (Amunts
et al., 2013). We have also gone from experimental studies conducted by a single laboratory
to automatized knowledge aggregation across thousands of
previously isolated neuroimaging findings (Yarkoni et al., 2011;
Fox et al., 2014). Rather than laboriously collecting in-house
data published in a single paper, investigators are now routinely
reanalyzing multi-modal data repositories (Derrfuss and Mar,
2009; Markram, 2012; Van Essen et al., 2012; Kandel et al., 2013;
Poldrack and Gorgolewski, 2014). The detail of neuroimaging
datasets is hence growing in terms of information resolution,
sample size, and complexity of meta-information (Van Horn and
Toga, 2014; Eickhoff et al., 2016; Bzdok and Yeo, 2017). As a
consequence of the data demand of many pattern-recognition
algorithms, the scope of neuroimaging analyses has expanded
beyond the predominance of regression-type analyses combined
with null-hypothesis testing (Figure 1). Applications of statistical
learning methods (i) are more data-driven due to particularly
flexible models, (ii) have scaling properties compatible with
high-dimensional data with myriads of input variables, and (iii)
follow a heuristic agenda by prioritizing useful approximations to
patterns in data (Jordan and Mitchell, 2015; LeCun et al., 2015;
Blei and Smyth, 2017). Statistical learning (Hastie et al., 2001)
henceforth comprises the umbrella of “machine learning,” “data
mining,” “pattern recognition,” “knowledge discovery,” “high-
dimensional statistics,” and bears close relation to “data science.”
From a technical perspective, one should make a note
of caution that holds across application domains such as
neuroscience: While the research question often precedes the
choice of statistical model, perhaps no single criterion exists
that alone allows for a clear-cut distinction between classical
statistics and statistical learning in all cases. For decades, the
two statistical cultures have evolved in partly independent
sociological niches (Breiman, 2001). There is currently a scarcity
of scientific papers and books that would provide an explicit
account on how concepts and tools from classical statistics and
statistical learning are exactly related to each other. Efron and
Hastie are perhaps among the first to discuss the issue in their
book “Computer-Age Statistical Inference” (2016). The authors
cautiously conclude that statistical learning inventions, such as
support vector machines, random-forest algorithms, and "deep"
neural networks, cannot be easily situated in the classical
theory of twentieth century statistics. They go on to say that
“pessimistically or optimistically, one can consider this as a
bipolar disorder of the field or as a healthy duality that is bound to
improve both branches” (Efron and Hastie, 2016, p. 447). In the
current absence of a commonly agreed-upon theoretical account
from the technical literature, the present concept paper examines
applications of classical statistics vs. statistical learning in the
concrete context of neuroimaging analysis questions.
More generally, ensuring that a statistical effect discovered in
one set of data extrapolates to new observations in the brain can
take different forms (Efron, 2012). As one possible definition,
"the goal of statistical inference is to say what we have learned
about the population X from the observed data x" (Efron and
Tibshirani, 1994). In a similar spirit, a committee report to
the National Academies of the USA stated (Committee on the
Analysis of Massive Data et al., 2013, p. 8): “Inference is the
problem of turning data into knowledge, where knowledge often
is expressed in terms of variables [...] that are not present in the
data per se, but are present in models that one uses to interpret
the data.” According to these definitions, statistical inference
can be understood as encompassing not only the classical null-
hypothesis testing framework but also Bayesian model inversion
to compute posterior distributions as well as more recently
emerged pattern-learning algorithms relying on out-of-sample
generalization (cf. Gigerenzer and Murray, 1987; Cohen, 1990;
Efron, 2012; Ghahramani, 2015). The important consequence for
the present considerations is that classical statistics and statistical
learning can give rise to different categories of inferential
thinking (Chamberlin, 1890; Platt, 1964; Efron and Tibshirani,
1994)—an investigator may ask an identical neuroscientific
question in different mathematical contexts.
For a long time, knowledge generation in psychology,
neuroscience, and medicine has been dominated by classical
statistics with estimation of linear-regression-like models and
subsequent statistical significance testing of whether an effect
exists in the sample. In contrast, computation-intensive pattern
learning methods have always had a strong focus on prediction
in frequently extensive data with more modest concern for
interpretability and the “right” underlying question (Hastie
et al., 2001; Ghahramani, 2015). In many statistical learning
applications, it is standard practice to quantify the ability of
a predictive pattern to extrapolate to other samples, possibly
in individual subjects. In a two-step procedure, a learning
algorithm is fitted on a bigger amount of available data (training
data) and the ensuing fitted model is empirically evaluated
on a smaller amount of independent data (test data). This
stands in contrast to classical statistical inference where the
investigator seeks to reject the null hypothesis by considering the
entirety of a data sample (Wasserstein and Lazar, 2016), typically
all available subjects. In this case, the desired relevance of a
statistical relationship in the underlying population is ensured
by formal mathematical proofs and is not commonly ascertained
by explicit evaluations on new data (Breiman, 2001; Wasserstein
and Lazar, 2016). As such, generating insight according to
classical statistics and statistical learning serves rather distinct
modeling purposes. Classical statistics and statistical learning
therefore do not judge data on the same aspects of evidence
(Breiman, 2001; Shmueli, 2010; Arbabshirani et al., 2017; Bzdok
and Yeo, 2017). The two statistical cultures perform different
types of principled assessment for successful extrapolation of
a statistical relationship beyond the particular observations
at hand.
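To make this contrast concrete, the following minimal sketch (in Python, on synthetic data with hypothetical variable names, using scipy and scikit-learn) places the two assessment styles side by side: a null-hypothesis test computed on the entire sample versus a classifier fitted on a training split and scored on an independent test split.

```python
# Minimal sketch contrasting in-sample hypothesis testing with out-of-sample
# evaluation on synthetic data; all variable names are illustrative placeholders.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_subjects = 40
effect = rng.normal(0.3, 1.0, n_subjects)              # e.g., per-subject activity difference
labels = rng.integers(0, 2, n_subjects)                 # e.g., two experimental groups
features = rng.normal(size=(n_subjects, 50)) + labels[:, None] * 0.5

# Classical statistics: reject (or not) the null hypothesis using the whole sample.
t_value, p_value = stats.ttest_1samp(effect, popmean=0.0)

# Statistical learning: fit on training data, evaluate on independent test data.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
out_of_sample_accuracy = classifier.score(X_test, y_test)

print(f"in-sample p-value: {p_value:.3f}")
print(f"out-of-sample accuracy: {out_of_sample_accuracy:.2f}")
```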
Taking an epistemological perspective helps in appreciating
that scientific research is rarely an entirely objective process
but deeply depends on the beliefs and expectations of the
investigator. A new "scientific fact" about the brain is probably
not established in vacuo (Fleck et al., 1935; terms in quotes
taken from source). Rather, a research “object” is recognized
and accepted by the “subject” according to socially conditioned
“thought styles” that are cultivated among members of
“thought collectives.” A witnessed and measured neurobiological
phenomenon tends to only become “true” if not at odds with
the constructed “thought history” and “closed opinion system”
shared by that subject. The present paper will revisit and
reintegrate two such thought milieus in the context of imaging
FIGURE 1 | Application areas of two statistical paradigms. Lists examples of research domains which apply relatively more classical statistics (blue) or learning
algorithms (red). The co-occurrence of increased computational resources, growing data repositories, and improving pattern-learning techniques has initiated a shift
toward less hypothesis-driven and more algorithmic methodologies. As a broad intuition, researchers in the empirical sciences on the left tend to use statistics to
evaluate a pre-assumed model on the data. Researchers in the application domains on the right tend to derive a model directly from the data: a new function with
potentially many parameters is created that can predict the output from the input alone, without explicit programming of the model. One of the key differences becomes
apparent when thinking of the neurobiological phenomenon under study as a black box (Breiman, 2001). ClSt typically aims at modeling the black box by making a
set of formal assumptions about its content, such as the nature of the signal distribution. Gaussian distributional assumptions have been very useful in many instances
to enhance mathematical convenience and, hence, computational tractability. Instead, StLe takes a brute-force approach to model the output of the black box (e.g.,
telling healthy and schizophrenic people apart) from its input (e.g., volumetric brain measurements) while making a possible minimum of assumptions (Abu-Mostafa et al.,
2012). In ClSt the stochastic processes that generated the data are therefore treated as partly known, whereas in StLe the phenomenon is treated as complex, largely
unknown, and partly unknowable.
neuroscience: classical statistics (ClSt) and statistical learning
(StLe).
DIFFERENT HISTORIES: THE ORIGINS OF
CLASSICAL HYPOTHESIS TESTING AND
PATTERN-LEARNING ALGORITHMS
One of many possible ways to group statistical methods is by
framing them along the lines of ClSt and StLe. The incongruent
historical developments of the two statistical communities are
even evident from their basic terminology. Inputs to statistical
models are usually called independent variables,explanatory
variables, or predictors in the ClSt community, but are typically
called features collected in a feature space in the StLe community.
The model outputs are typically called dependent variables,
explained variables, or responses in ClSt, while these are often
called target variables in StLe. It follows a summary of
characteristic events in the development of what can today be
considered as ClSt and StLe (Figure 2).
Around 1900, the notions of standard deviation, goodness
of fit, and the p < 0.05 threshold emerged (Cowles and
Davis, 1982). This was also the period when William S. Gosset
published the t-test under the incognito name “Student” to
quantify production quality in Guinness breweries. Motivated
by concrete problems such as the interaction between potato
varieties and fertilizers, Ronald A. Fisher invented the analysis
of variance (ANOVA) and null-hypothesis testing, promoted p-values,
and devised principles of proper experimental conduct (Fisher
and Mackenzie, 1923; Fisher, 1925, 1935). Another framework
by Jerzy Neyman and Egon S. Pearson proposed the alternative
hypothesis, which allowed for the statistical notions of power,
false positives and false negatives, but left out the concept
of p-values (Neyman and Pearson, 1933). This was a time
before electrical calculators emerged after World War II (Efron
and Tibshirani, 1991; Gigerenzer, 1993). Student’s t-test and
Fisher’s inference framework were institutionalized by American
psychology textbooks widely read in the 40s and 50s, while
Neyman and Pearson’s framework only became increasingly
known in the 50s and 60s. Today’s applied statistics textbooks
FIGURE 2 | Developments in the history of classical statistics and statistical
learning. Examples of important inventions in statistical methodology. Roughly,
a number of statistical methods taught in today's textbooks in psychology and
medicine have emerged in the first half of the twentieth century (blue). Instead,
many algorithmic techniques and procedures have emerged in the second half
of the twentieth century (red). "The postwar era witnessed a massive
expansion of statistical methodology, responding to the data-driven demands
of modern scientific technology" (Efron and Hastie, 2016).
have inherited a mixture of the Fisher and Neyman-Pearson
approaches to statistical inference.
It is a topic of current debate1,2,3 whether ClSt is a discipline
that is separate from StLe (e.g., Chambers, 1993; Breiman,
2001; Friedman, 2001; Bishop and Lasserre, 2007; Shalev-Shwartz
and Ben-David, 2014; Efron and Hastie, 2016) or if "statistics"
denotes a broader methodological class that includes both ClSt
and StLe tools as its members (e.g., Tukey, 1962; Cleveland,
2001; Jordan and Mitchell, 2015; Blei and Smyth, 2017). StLe
methods may be more often adopted by computer scientists,
physicists, engineers, and others who typically have less formal
statistical background and may be more frequently working in
industry rather than academia. In fact, John W. Tukey foresaw
many of the developments that led up to what one might
today call statistical learning (Tukey, 1962). Early on, he proposed
a "peaceful collision of computing and statistics." A modern
reformulation of the same idea states (Efron and Hastie, 2016):
"If the inference/algorithm race is a tortoise-and-hare affair, then
modern electronic computation has bred a bionic hare.” Indeed,
1"Data Science and Statistics: different worlds?" (Panel at Royal Statistical Society
UK, March 2015) (https://www.youtube.com/watch?v=C1zMUjHOLr4)
2“50 years of Data Science” (David Donoho, Tukey Centennial workshop, USA,
September 2015)
3“Are ML and Statistics Complementary?” (Max Welling, 6th IMS-ISBA meeting,
December 2015)
kernel methods, decision trees, nearest-neighbor algorithms,
graphical models, and various other statistical tools actually
emerged in the ClSt community, but largely continued to develop
in the StLe community (Friedman, 2001).
As often cited beginnings of statistical learning approaches,
the perceptron was an early brain-inspired computing algorithm
(Rosenblatt, 1958), and Arthur Samuel created a checker
board program that succeeded in beating its own creator
(Samuel, 1959). Such studies toward artificial intelligence (AI)
led to enthusiastic optimism and subsequent periods of
disappointment during the so-called “AI winters” in the late 70s
and around the 90s (Russell and Norvig, 2002; Kurzweil, 2005;
Cox and Dean, 2014), while the increasingly available computers
in the 80s encouraged a new wave of statistical algorithms
(Efron and Tibshirani, 1991). Later, the use of StLe methods
increased steadily in many quantitative scientific domains as
they underwent an increase in data richness from classical "long
data" (samples n > variables p) to increasingly encountered
"wide data" (n << p) (Tibshirani, 1996; Hastie et al., 2015).
The emerging field of StLe has received conceptual consolidation
by the seminal book “The Elements of Statistical Learning”
(Hastie et al., 2001). The coincidence of changing data properties,
increasing computational power, and cheaper memory resources
encouraged a still ongoing resurgence in StLe research and
applications approximately since 2000 (Manyika et al., 2011;
UK House of Common S.a.T, 2016). For instance, over the last
15 years, sparsity assumptions gained increasing relevance for
statistical and computational tractability as well as for domain
interpretability when using supervised and unsupervised learning
algorithms (i.e., with and without target variables) in the high-
dimensional "n << p" setting (Bühlmann and Van De Geer,
2011; Hastie et al., 2015). More recently, improvements in
training very "deep" (i.e., many non-linear hidden layers) neural-
network architectures (Hinton and Salakhutdinov, 2006) have
much improved automatized feature selection (Bengio et al.,
2013) and have exceeded human-level performance in several
application domains (LeCun et al., 2015).
In sum, "the biggest difference between pre- and post-war
statistical practice is the degree of automation” (Efron and
Tibshirani, 1994) up to a point where “almost all topics in twenty-
first-century statistics are now computer-dependent” (Efron and
Hastie, 2016). ClSt has seen many important inventions in the
first half of the twentieth century, which have often developed
at statistical departments of academic institutions and remain in
nearly unchanged form in current textbooks of psychology and
other empirical sciences. The emergence of StLe as a coherent
field has mostly taken place in the second half of the twentieth
century as a number of disjoint developments in industry
and often non-statistical departments in academia (e.g., AT&T
Bell Laboratories), which led, for instance, to artificial neural
networks, support vector machines, and boosting algorithms
(Efron and Hastie, 2016). Today, systematic education in StLe
is still rare at the large majority of universities, in contrast to
the many consistently offered ClSt courses (Cleveland, 2001;
Vanderplas, 2013; Burnham and Anderson, 2014; Donoho, 2015).
In neuroscience, the advent of brain-imaging techniques,
including positron emission tomography (PET) and functional
magnetic resonance imaging (fMRI), allowed for the in-vivo
characterization of the neural correlates underlying sensory,
cognitive, or affective tasks. Brain scanning enabled quantitative
brain measurements with many variables per observation
(analogous to the advent of high-dimensional microarrays in
genetics; Efron, 2012). Since the inception of PET and fMRI,
deriving topographical localization of neural activity changes
was dominated by analysis approaches from ClSt, especially
the general linear model (GLM; Scheffé, 1959; Poline and Brett,
2012). The classical approach to neuroimaging analysis is
probably best exemplified by the statistical parametric mapping
(SPM) software package that implements the GLM to provide a
mass-univariate characterization of regionally specific effects.
As distributed information over voxels is less well captured by
many ClSt approaches, including common GLM applications,
StLe models were proposed early on for neuroimaging
investigations. For instance, principal component analysis
was used to distinguish globally distributed neural activity
changes (Moeller et al., 1987) as well as to study Alzheimer's
disease (Grady et al., 1990). Canonical correlation analysis
was used to quantify complex relationships between task-free
neural activity and schizophrenia symptoms (Friston et al.,
1992). However, these first approaches to “multivariate” brain-
behavior associations did not ignite a major research trend
(cf. Worsley et al., 1997; Friston et al., 2008). As a seminal
contribution, Haxby and colleagues devised an innovative
across-voxel correlation analysis to provide evidence against
the widely assumed face-specificity of neural responses in
the ventral temporal cortex (2001). This ClSt realization of
one-nearest neighbor classification based on correlation distance
foreshadowed several important developments, including (i)
joint analysis of sets of brain locations to capture “distributed and
overlapping representations", (ii) repeated analysis in different
splits of the data sample to compare against chance performance,
and (iii) analysis across multiple stimulus categories to assess
the specificity of neural responses. The finding of distributed
face representation was confirmed in independent, similar data
(Cox and Savoy, 2003) and based on neural network algorithms
(Hanson et al., 2004).
The application of StLe methods in neuroimaging increased
further after rebranding as "mind-reading," "brain decoding,"
and “MVPA” (Haynes and Rees, 2005; Kamitani and Tong,
2005). Note that "MVPA" initially referred to "multivoxel pattern
analysis” (Kamitani and Tong, 2005; Norman et al., 2006)
and later changed to “multivariate pattern analysis” (Haynes
and Rees, 2005; Hanke et al., 2009; Haxby, 2012). Up to that
point, the term prediction had less often been used by imaging
neuroscientists in the sense of out-of-sample generalization of
a learning algorithm and more often in the incompatible sense
of (in-sample) linear correlation such as using Pearson’s or
Spearman’s method (Shmueli, 2010; Gabrieli et al., 2015). While
there was scarce discussion of the position of “decoding” models
in formal statistical terms, growing interest was manifested in
first review publications and tutorial papers on applying StLe
methods to neuroimaging data (Haynes and Rees, 2006; Mur
et al., 2009; Pereira et al., 2009). The interpretational gains of
this new access to the neural representation of behavior and
its disturbances in disease were flanked by the availability of
necessary computing power and memory resources. Although
challenging to realize, “deep” neural network algorithms have
recently been introduced to neuroimaging research (Plis et al.,
2014; de Brebisson and Montana, 2015; Güçlü and van Gerven,
2015). These computation-intensive models might help in
approximating and deciphering the nature of neural processing
in brain circuits (Cox and Dean, 2014; Yamins and DiCarlo,
2016). As the dimensionality and complexity of neuroimaging
datasets are constantly increasing, neuroscientific investigations
will be ever more likely to benefit from StLe methods given
their natural scaling to large-scale data analysis (Efron, 2012;
Efron and Hastie, 2016; Blei and Smyth, 2017).
From a conceptual viewpoint (Figure 3), a large majority of
statistical methods can be situated somewhere on a continuum
between the two poles of ClSt and StLe (Committee on the
Analysis of Massive Data et al., 2013; Efron and Hastie,
2016; p. 61). ClSt was mostly fashioned for problems with
small samples that can be grasped by plausible models with
a small number of parameters chosen by the investigator in
an analytical fashion. StLe was mostly fashioned for problems
with many variables in potentially large samples with little
knowledge of the data-generating process that gets emulated by
a mathematical function derived from data in a heuristic fashion.
Tools from ClSt therefore typically assume that the data behave
according to certain known mechanisms, whereas StLe exploits
algorithmic techniques to avoid many a-priori specifications of
data-generating mechanisms. Neither ClSt nor StLe nor any of the
other categories of statistical models can be considered generally
superior. This relativism is captured by the so-called no free
lunch theorem4(Wolpert, 1996): no single statistical strategy can
consistently do better in all circumstances (cf. Gigerenzer, 2004).
As a very general rule of thumb, ClSt preassumes and formally
tests a model for the data, whereas StLe extracts and empirically
evaluates a model from the data.
CASE STUDY ONE: COGNITIVE
CONTRAST ANALYSIS AND DECODING
MENTAL STATES
Vignette: A neuroimaging investigator wants to reveal the neural
correlates underlying face processing in humans. 40 healthy,
right-handed adults are recruited and undergo a block design
experiment run in a 3T MRI scanner with whole-brain coverage.
In a passive viewing paradigm, 60 colored stimuli of unfamiliar
faces are presented, which have forward head and gaze position.
The control condition presents colored pictures of 60 different
houses to the participants. In the experimental paradigm, a
picture of a face or a house is presented for 2 s in each trial and
the interval between trials within each block is randomly jittered
varying from 2 to 7 s. The picture stimuli are presented in pseudo-
randomized fashion and are counterbalanced in each passively
4In the supervised setting, there is no a priori distinction between learning
algorithms evaluated by out-of-sample prediction error. In the optimization
setting of finite spaces, all algorithms searching an extremum perform identically
when averaged across possible cost functions. (http://www.no-free-lunch.org/)
FIGURE 3 | Key differences in the modeling philosophy of classical statistics and statistical learning. Ten modeling intuitions that tend to be relatively more
characteristic for classical statistical methods (blue) or pattern-learning methods (red). In comparison to ClSt, StLe "is essentially a form of applied statistics with
increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these
functions" (Goodfellow et al., 2016). Broadly, ClSt tends to be more analytical by imposing mathematical rigor on the phenomenon, whereas StLe tends to be more
heuristic by finding useful approximations. In practice, ClSt is probably more often applied to experimental data, where a set of target variables are systematically
controlled by the investigator and the brain system under study has been subject to experimental perturbation. Instead, StLe is probably more often applied to
observational data without such structured influence and where the studied system has been left unperturbed. ClSt fully specifies the statistical model at the beginning
of the investigation, whereas in StLe there is a bigger emphasis on models that can flexibly adapt to the data (e.g., learning algorithms creating decision trees).
watching participant. Despite the blocked presentation of stimuli,
each experiment trial is modeled separately. The fMRI data are
analyzed using a GLM as implemented in the SPM software
package. Two task regressors are included in the model for the
face and house conditions based on the stimulus onsets and
viewing durations and using a canonical hemodynamic response
function. In the GLM design matrix, the face column and house
column are hence set to 1 for brain scans from the corresponding
task condition and set to 0 otherwise. Separately in each brain
voxel, the GLM parameters are estimated, which fits beta_face
and beta_house regression coefficients to explain the contribution
of each experimental task to the neural activity increases and
decreases observed in that voxel. A t-test can then formally assess
whether the fMRI signal in the current voxel is significantly
more involved in viewing faces as opposed to the house control
condition.
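As a rough illustration of the mass-univariate procedure just described, the following sketch fits the same two-regressor GLM independently in every voxel and computes a face > house t-contrast. The design matrix and fMRI time series are synthetic stand-ins rather than SPM output, and all names are hypothetical placeholders.

```python
# Minimal per-voxel GLM sketch: Y = X * beta + error, followed by a t-contrast.
# X would hold the face and house regressors convolved with a canonical
# hemodynamic response function; here both arrays are random placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_scans, n_voxels = 200, 1000
X = rng.random((n_scans, 2))                       # columns: face, house regressors
Y = rng.normal(size=(n_scans, n_voxels))           # preprocessed voxel time series

beta = np.linalg.pinv(X) @ Y                       # beta_face, beta_house per voxel
residuals = Y - X @ beta
dof = n_scans - np.linalg.matrix_rank(X)
sigma_squared = (residuals ** 2).sum(axis=0) / dof # residual variance per voxel

contrast = np.array([1.0, -1.0])                   # face > house
contrast_variance = contrast @ np.linalg.pinv(X.T @ X) @ contrast
t_map = (contrast @ beta) / np.sqrt(sigma_squared * contrast_variance)
p_map = stats.t.sf(t_map, dof)                     # one-sided p-value in each voxel
```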
Question: What is the statistical difference between
subtracting the neural activity from the face vs. house conditions
and decoding the neural activity during face vs. house processing?
Computing cognitive contrasts is a ClSt approach that was and
still is routinely performed in the mass-univariate regime: it fits a
separate GLM model for each individual voxel in the brain scans
and then tests for significant differences between the obtained
condition coecients (Friston et al., 1994). Instead, decoding
cognitive processes from neural activity is a StLe approach that is
typically performed in a multivariate regime: a learning algorithm
is trained on a large number of voxel observations in brain scans
and then the model’s prediction accuracy is evaluated on sets of
new brain scans. These ClSt and StLe approaches to identifying
the neural correlates underlying cognitive processes of interest
are closely related to the notions of encoding models and decoding
models, respectively (Kriegeskorte, 2011; Naselaris et al., 2011;
Pedregosa et al., 2015; but see Güçlü et al., 2015).
Encoding models regress the brain data against a design
matrix with indicators of the face vs. house condition and
formally test whether the difference is statistically significant.
Decoding models typically aim to predict these indicators by
training and empirically evaluating classification algorithms on
different splits from the whole dataset. In ClSt parlance, the
model explains the neural activity, the dependent or explained
variable, measured in each separate brain voxel, by the beta
coefficients according to the experimental condition indicators
in the design matrix columns, the independent or explanatory
variables. That is, the GLM can be used to explain neural activity
changes by a linear combination of experimental variables
(Naselaris et al., 2011). Answering the same neuroscientific
question with decoding models in StLe jargon, the model
weights of a classifier are fitted on the training set of the
input data to predict the class labels, the target variables, and
are subsequently evaluated on the test set by cross-validation
to obtain their out-of-sample generalization performance. Here,
classification algorithms are used to predict entries of the
design matrix by identifying a linear or more complicated
combination between the many simultaneously considered brain
voxels (Pereira et al., 2009). More broadly, ClSt applications
in functional neuroimaging tend to estimate properties of neural
activity from different cognitive tasks, whereas many StLe
applications estimate the presence of cognitive processes from
neural activity.
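A decoding analysis of the same question might look like the following sketch: a linear classifier is trained on part of the multi-voxel patterns and scored on held-out scans by cross-validation. The arrays and the choice of a linear support vector machine are illustrative assumptions, not the pipeline of any particular study.

```python
# Minimal decoding sketch with cross-validated out-of-sample accuracy.
# X holds one multi-voxel activity pattern per scan, y the face/house labels;
# both are synthetic placeholders here.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5000))                  # scans x voxels
y = rng.integers(0, 2, 120)                       # 0 = house, 1 = face

decoder = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
fold_accuracies = cross_val_score(decoder, X, y, cv=5)   # 5-fold cross-validation
print(fold_accuracies.mean(), fold_accuracies.std())
```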
A key difference between many ClSt-mediated encoding
models and StLe-mediated decoding models thus pertains to
the direction of statistical estimation between brain space and
behavior space (Friston et al., 2008; Varoquaux and Thirion,
2014). It was noted (Friston et al., 2008) that the direction of
brain-behavior association is related to the question whether the
stimulus indicators in the model act as causes by representing
deterministic experimental variables of an encoding model or
consequences by representing probabilistic outputs of a decoding
model. Such considerations also reveal the intimate relationship
of ClSt models to the notion of forward inference, while StLe
methods are probably more often used for formal reverse
inference in functional neuroimaging (Poldrack, 2006; Eickhoff
et al., 2011; Yarkoni et al., 2011; Varoquaux and Thirion, 2014).
On the one hand, forward inference relates to encoding models
by testing the probability of observing activity in a brain location
given knowledge of a psychological process. On the other hand,
reverse inference relates to brain decoding to the extent that
classification algorithms can learn to distinguish experimental
fMRI data to belong to two psychological conditions and
subsequently be used to estimate the presence of specific
cognitive processes based on new neural activity observations (cf.
Poldrack, 2006). Finally, establishing a brain-behavior association
has been argued to be more important than the actual direction
of the mapping function (Friston, 2009). This author stated
that “showing that one can decode activity in the visual
cortex to classify [...] a subject's percept is exactly the same as
demonstrating significant visual cortex responses to perceptual
changes” and, conversely, “all demonstrations of functionally
specialized responses represent an implicit mindreading.”
Conceptually, GLM-based encoding models follow a
localization agenda by testing hypotheses on regional effects of
functional specialization in the brain (where?). A t-test is used
to compare pairs of neural activity estimates to statistically
distinguish the target face and the non-target house condition
(Friston et al., 1996). Essentially, this tests for significant
differences between the fitted beta coefficients corresponding to
the two stimulus indicators, based on well-founded arguments from
cognitive theory. This statistical approach assumes that cognitive
subtraction is possible, that is, the regional brain responses of
interest can be isolated by contrasting two sets of brain scans that
are believed to differ in the cognitive facet of interest (Friston
et al., 1996; Stark and Squire, 2001). For one voxel location
at a time, an attempt is made to reject the null hypothesis of
no difference between the averaged neural activity level of a
target brain state and the averaged neural activity of a control
brain state. It is important to appreciate that the localization
agenda thus emphasizes the relative difference in fMRI signal
during tasks and may neglect the individual neural activity
information of each particular task (Logothetis et al., 2001).
Note that the univariate GLM analysis can be extended to more
than one output (dependent or explained) variable within the
ClSt regime by performing a multivariate analysis of covariance
(MANCOVA). This allows for tests of more complex hypotheses
but incurs multivariate normality assumptions (Kriegeskorte,
2011).
More generally, it is seldom mentioned that the standard
GLM would not have been solvable for unique solutions in the
high-dimensional "n << p" regime, instead of fitting one model
for each voxel in the brain scans. This is because the number
of brain voxels p exceeds by far the number of data samples
n (i.e., leading to an under-determined system of equations),
which incapacitates many statistical estimators from ClSt (cf.
Giraud, 2014; Hastie et al., 2015). Regularization by sparsity-
inducing norms, such as in modern penalized regression analysis
using the LASSO and ElasticNet, emerged only later (Tibshirani,
1996; Zou and Hastie, 2005) as a principled StLe strategy to de-
escalate the need for dimensionality reduction or preliminary
filtering of important voxels and to enable the tractability of the
high-dimensional analysis setting.
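The following sketch shows sparsity-inducing regression in an "n << p" setting using scikit-learn's Lasso and ElasticNet estimators; the synthetic data, the penalty strengths, and the assumption that only a few variables are informative are illustrative choices, not recommendations.

```python
# Minimal sketch of penalized regression when samples are far fewer than variables.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
n_samples, n_voxels = 60, 10000                   # n << p
X = rng.normal(size=(n_samples, n_voxels))
true_coefficients = np.zeros(n_voxels)
true_coefficients[:5] = 2.0                       # only a handful of relevant voxels
y = X @ true_coefficients + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)                # l1 penalty drives most weights to zero
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero LASSO weights:", int(np.sum(lasso.coef_ != 0)))
```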
Because hypothesis testing for significant differences between
beta coefficients of fitted GLMs relies on comparing the means of
neural activity measurements, the results from statistical tests are
not corrupted by the conventionally applied spatial smoothing
with a Gaussian filter. On the contrary, this image preprocessing
step even helps the correction for multiple comparisons based
on random fields theory (cf. below), alleviates inter-individual
neuroanatomical variability, and can thus increase sensitivity.
Spatial smoothing however discards fine-grained neural activity
patterns spatially distributed across voxels that potentially carry
information associated with mental operations (cf. Kamitani and
Sawahata, 2010; Haynes, 2015). Indeed, some authors believe that
sensory, cognitive, and motor processes manifest themselves as
"neuronal population codes" (Averbeck et al., 2006). Relevance
of such population codes in human neuroimaging was for
instance suggested by revealing subject-specific neural responses
in the fusiform gyrus to facial stimuli (Saygin et al., 2012).
In applications of StLe models, the spatial smoothing step
is therefore often skipped because the “decoding” algorithms
precisely exploit the locally varying structure of the salt-and-
pepper patterns in fMRI signals.
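For reference, the conventional smoothing step amounts to convolving each volume with a Gaussian kernel whose width is specified as a full width at half maximum (FWHM). A minimal sketch with scipy follows, using an assumed 8 mm kernel and 3 mm isotropic voxels as placeholder values.

```python
# Minimal spatial smoothing sketch: convert FWHM in millimeters to the Gaussian
# standard deviation in voxel units, then filter one synthetic volume.
import numpy as np
from scipy.ndimage import gaussian_filter

volume = np.random.default_rng(4).normal(size=(64, 64, 40))   # one fMRI volume
fwhm_mm, voxel_size_mm = 8.0, 3.0
sigma_in_voxels = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / voxel_size_mm

smoothed_volume = gaussian_filter(volume, sigma=sigma_in_voxels)
```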
In so doing, decoding models use learning algorithms in an
information agenda by showing generalization of robust patterns
to new brain activity acquisitions (Kriegeskorte et al., 2006; Mur
et al., 2009; de-Wit et al., 2016). Information that is weak in one
voxel but spatially distributed across voxels can be effectively
harvested in a structure-preserving fashion (Haynes and Rees,
2006; Haynes, 2015). This modeling agenda is focused on the
whole neural activity pattern, in contrast to the localization
agenda dedicated to separate increases or decreases in neural
activity level. For instance, the default mode network typically
exhibits activity decreases at the onset of many psychological
tasks with visual or other sensory stimuli, whereas the induced
activity patterns in that less activated network may nevertheless
functionally subserve task execution (Bzdok et al., 2016; Christoff
et al., 2016). Some brain-behavior associations might only emerge
when simultaneously capturing neural activity in a group of
voxels but disappear in single-voxel approaches, such as mass-
univariate GLM analyses (cf. Davatzikos, 2004). Note that,
analogous to multivariate variants of the GLM, decoding could
also be replaced by classical statistical approaches (cf. Haxby
et al., 2001; Brodersen et al., 2011a). For a linear classification
algorithm trained to predict face vs. house stimuli based on many
brain voxels, model fitting typically searches iteratively through
the hypothesis space (= function space) of the chosen learning
model. In our case, the final hypothesis selected by the linear
classifier commonly corresponds to one specific combination of
model weights (i.e., a weighted contribution of individual brain
measurements) that equates with one mapping function from the
neural activity features to the face vs. house target variable.
Among other views, it has previously been proposed
(Brodersen, 2009) that four types of neuroscientific questions
become readily quantifiable through StLe applications to
neuroimaging: (i) Where is an information category neurally
processed? This can extend the interpretational spectrum from
increase and decrease of neural activity to the existence of
complex combinations of activity variations distributed across
voxels. For instance, across-voxel linear correlation could decode
object categories from the ventral temporal cortex even after
excluding the fusiform gyrus, which is known to be responsive
to object stimuli (Haxby et al., 2001). (ii) Is a given
information category reflected by neural activity? This
can extend the interpretational spectrum to topographically
similar but neurally distinct processes that potentially underlie
different cognitive facets. For instance, linear classifiers could
successfully distinguish whether a subject is attending to
the first or second of two simultaneously presented stimuli
(Kamitani and Tong, 2005). (iii) When is an information
category generated (i.e., onset), processed (i.e., duration), and
bound (i.e., alteration)? When applying classifiers to neural time
series, the interpretational spectrum can be extended to the
beginning, evolution, and end of distinct cognitive facets. For
instance, different classifiers have been demonstrated to map
the decodability time structure of mental operation sequences
(King and Dehaene, 2014). (iv) More controversially, how is an
information category neurally processed? The interpretational
spectrum can be extended to computational properties of the
neural processes, including processing in brain regions vs. brain
networks or isolated vs. partially shared processing facets. For
instance, a classifier trained for evolutionarily conserved eye
gaze movement was able to decode evolutionarily more recent
mathematical calculation processes as a possible case of “neural
recycling” in the human brain (Knops et al., 2009; Anderson,
2010). As an important caveat in interpreting StLe models, the
particular technical properties of a chosen learning algorithm
(e.g., linear vs. non-linear support vector machines) can probably
seldom serve as a convincing argument for reverse-engineering
mechanisms of neural information processing as measured by
fMRI scanning (cf. Misaki et al., 2010).
In sum, the statistical properties of ClSt and StLe methods
have characteristic consequences in neuroimaging analysis and
interpretation. They hence offer different access routes and
complementary answers to identical neuroscientific questions.
CASE STUDY TWO: SMALL VOLUME
CORRECTION AND SEARCHLIGHT
ANALYSIS
Vignette: The neuroimaging experiment from case study 1
successfully identified the fusiform gyrus of the ventral visual
stream to be more responsive to face stimuli than house
stimuli. However, the investigator’s initial hypothesis of also
observing face-responsive neural activity in the ventromedial
prefrontal cortex could not be confirmed in the whole-brain
analyses. The investigator therefore wants to follow up with a
topographically focused approach that examines differences in
neural activity between the face and house conditions exclusively
in the ventromedial prefrontal cortex.
Question: What are the statistical implications of delineating
task-relevant neural responses in a spatially constrained search
space rather than analyzing brain measurements of the entire
brain?
A popular ClSt approach to corroborate less pronounced
neural activity findings is small volume correction. This region
of interest (ROI) analysis involves application of the mass-
univariate GLM approach only to the ventromedial prefrontal
cortex as a preselected biological compartment, rather than
considering the gray-matter voxels of the entire brain in a naïve,
topographically unconstrained fashion. Small volume correction
allows for significant findings in the ROI that would remain sub-
threshold after accounting for the tens of thousands of multiple
comparisons in the whole-brain GLM analysis. Small volume
correction is therefore a simple means to alleviate the multiple-
comparisons problem that motivated more than two decades of
still ongoing methodological developments in the neuroimaging
domain (Worsley et al., 1992; Smith et al., 2001; Friston,
2006; Nichols, 2012). Whole-brain GLM results were initially
reported as uncorrected findings without accounting for multiple
comparisons, then with Bonferroni’s family wise error (FWE)
correction, later by random field theory correction using neural
activity height (or clusters), followed by false discovery rate
(FDR) (Genovese et al., 2002) and slowly increasing adoption
of cluster-thresholding for voxel-level inference via permutation
testing (Smith and Nichols, 2009). Rather than the isolated
voxel, it was discussed early on that a possibly better unit
of interest should be spatially neighboring voxel groups (see
here for an overview: Chumbley and Friston, 2009). The setting
of high regional correlation of neural activity was successfully
addressed by random field theory, which provides inferences not
about individual voxels but topological features in the underlying
(spatially continuous) effects. This topological inference is used
to identify clusters of relevant neural activity changes from their
peak, size, or mass (Worsley et al., 1992). Importantly, the spatial
dependencies of voxel observations were not incorporated into
the GLM estimation step, but instead taken into account during
the subsequent model inference step to alleviate the multiple-
comparisons problem.
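The statistical leverage of restricting the search space can be illustrated with the simplest correction scheme, Bonferroni, where the corrected threshold scales inversely with the number of tests. The voxel counts below are illustrative placeholders rather than values from the vignette, and Bonferroni stands in for the more refined random-field or permutation procedures mentioned above.

```python
# Minimal sketch of why a small search volume loosens the corrected threshold.
alpha = 0.05
n_tests_whole_brain = 100_000        # e.g., all gray-matter voxels
n_tests_roi = 500                    # e.g., a ventromedial prefrontal mask

bonferroni_whole_brain = alpha / n_tests_whole_brain   # 5e-07
bonferroni_small_volume = alpha / n_tests_roi          # 1e-04

p_voxel = 1e-05                      # a hypothetical uncorrected voxel p-value
print("survives small volume correction:", p_voxel < bonferroni_small_volume)
print("survives whole-brain correction:", p_voxel < bonferroni_whole_brain)
```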
A related cousin of small volume correction in the StLe
world would be to apply classification algorithms to a subset
of voxels to be considered as input to the model (i.e., feature
selection). In particular, searchlight analysis is an increasingly
popular learning technique that can identify locally constrained
multivariate patterns in neural activity (Friman et al., 2001;
Kriegeskorte et al., 2006). For each voxel in the ventromedial
prefrontal cortex, the brain measurements of the immediate
neighborhood are first collected (e.g., radius of 10 mm voxels).
In each such searchlight, a classification algorithm, for instance
linear support vector machines, is then trained on one part
of the brain scans (training set) and subsequently applied to
determine the prediction accuracy in the remaining, unseen brain
scans (test set). In this StLe approach, the excess of brain voxels
is handled by performing pattern recognition analysis in only
dozens of locally adjacent voxel neighborhoods at a time. Finally,
the mean classification accuracy of face vs. house stimuli across
all permutations over the brain data is mapped to the center of
each considered sphere. The searchlight is then moved through
the ROI until each seed voxel has once been the center voxel
of the searchlight. This yields a voxel-wise classification map of
accuracy estimates for the entire ventromedial prefrontal cortex.
Consistent with the information agenda (cf. above), searchlight
analysis quantifies the extent to which (local) neural activity
patterns can predict the difference between the house and face
conditions. This contrasts with small volume correction, which determines
whether one experimental condition exhibited a significant
neural activity increase or decrease relative to a particular
other experimental condition, consistent with the localization
agenda. Further, searchlight analysis alleviates the burden of
abundant input variables by fitting learning algorithms restricted
to the voxels in small sphere neighborhoods. However, the
searchlight procedure thus yields many prediction performances
for many brain locations, which motivates correction for multiple
comparisons across the considered neighborhoods.
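A bare-bones version of the searchlight procedure described above might look like the following sketch: for every center voxel, the surrounding sphere of voxels is extracted, a linear classifier is cross-validated on those features, and the mean accuracy is written back to the center. The array shapes, the radius, and the classifier choice are illustrative assumptions.

```python
# Minimal searchlight sketch over a small synthetic ROI grid.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
data = rng.normal(size=(120, 10, 10, 6))      # scans x (x, y, z) voxels in the ROI
labels = rng.integers(0, 2, 120)              # face vs. house label per scan
radius = 2                                    # searchlight radius in voxel units

accuracy_map = np.zeros(data.shape[1:])
grid_x, grid_y, grid_z = np.indices(data.shape[1:])
for cx, cy, cz in np.ndindex(*data.shape[1:]):
    sphere = ((grid_x - cx) ** 2 + (grid_y - cy) ** 2 + (grid_z - cz) ** 2) <= radius ** 2
    sphere_patterns = data[:, sphere]         # scans x voxels in this neighborhood
    scores = cross_val_score(LinearSVC(max_iter=10000), sphere_patterns, labels, cv=3)
    accuracy_map[cx, cy, cz] = scores.mean()  # accuracy mapped to the center voxel
```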
When considering high-dimensional brain scans through
the ClSt lens, the statistical challenge resides in solving the
multiple-comparisons problem (Nichols and Hayasaka, 2003;
Nichols, 2012). From the StLe stance, however, it is the curse
of dimensionality and overfitting that statistical analyses need to
tackle (Friston et al., 2008; Domingos, 2012). Many neuroimaging
analyses based on ClSt methods can be viewed as testing a
particular hypothesis (i.e., the null hypothesis) repeatedly in a
large number of separate voxels. In contrast, testing whether a
learning algorithm extrapolates to new brain data can be viewed as
searching through thousands of different hypotheses in a single
process (i.e., walking through the hypothesis space; cf. above)
(Shalev-Shwartz and Ben-David, 2014).
As common brain scans offer measurements of >100,000
brain locations, a mass-univariate GLM analysis typically entails
the same statistical test to be applied >100,000 times. The more
often the investigator tests a hypothesis of relevance for a brain
location, the more locations will be falsely detected as relevant
(false positive, Type I error), especially in the noisy neuroimaging
data. All dimensions in the brain data (i.e., voxel variables) are
implicitly treated as equally important and no neighborhoods of
most expected variation are statistically exploited (Hastie et al.,
2001). Hence, the absence of restrictions on observable structure
in the set of data variables during the statistical modeling of
neuroimaging data takes a heavy toll at the final inference
step. This is where random field theory comes to the rescue.
As noted above, this form of topological inference dispenses
with the problem of inferring which voxels are significant and
tries to identify significant topological features in the underlying
distributed responses. By definition, topological features like
maxima are sparse events and can be thought of as a form of
dimensionality reduction—not in data space but in the statistical
characterization of where neural responses occur.
This is contrasted by the high-dimensional StLe regime, where
the initial model family chosen by the investigator determines
the complexity restrictions to all data dimensions (i.e., all voxels,
not single voxels) that are imposed explicitly or implicitly by
the model structure. Model choice predisposes existing but
unknown low-dimensional neighborhoods in the full voxel space
to achieve the prediction task. Here, the toll is taken at the
beginning of the investigation because there are so many different
alternative model choices that would impose a dierent set of
complexity constraints to the high-dimensional measurements
in the brain. For instance, signals from “brain regions” are
likely to be well approximated by models that impose discrete,
locally constant compartments on the data (e.g., k-means or
spatially constrained Ward clustering). Instead, tuning model
choice to signals from macroscopical “brain networks” should
impose overlapping, locally continuous data compartments (e.g.,
independent component analysis or sparse principal component
analysis) (Yeo et al., 2014; Bzdok and Yeo, 2017; Bzdok et al.,
2017).
Exploiting such effective dimensions in the neuroimaging
data (i.e., coherent brain-behavior associations involving many
distributed brain voxels) is a rare opportunity to simultaneously
reduce the model bias and model variance, despite their typical
inverse relationship (Hastie et al., 2001). Model bias relates
to prediction failures incurred because the learning algorithm
is systematically unable to represent certain parts of the underlying
relationship between brain scans and experimental conditions
(formally, the deviation between the target function and the
average function space of the model). Model variance relates
to prediction failures incurred by noise in the estimation of
the optimal brain-behavior association (formally, the difference
between the best-choice input-output relation and the average
function space of the model). A model that is too simple to
capture a brain-behavior association probably underfits due to
high bias. Yet, an overly complex model probably overfits due to
high variance. Generally, high-variance approaches are better at
approximating the “true” brain-behavior relation (i.e., in-sample
model estimation), while high-bias approaches have a higher
chance of generalizing the identified pattern to new observations
(i.e., out-of-sample model evaluation). The bias-variance tradeoff
can be useful in explaining why applications of statistical models
intimately depend on (i) the amount of available data, (ii) the
typically not known amount of noise in the data, and (iii) the
unknown complexity of the target function in nature (Abu-
Mostafa et al., 2012).
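The following toy sketch (simulated one-dimensional data, with polynomial regression standing in for more elaborate models) illustrates the tradeoff: an overly rigid model underfits both data splits, whereas an overly flexible model fits the training split well but degrades on held-out observations.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(3 * x).ravel() + 0.3 * rng.standard_normal(80)   # noisy target
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for degree in (1, 4, 15):   # high bias -> balanced -> high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree,
          "train R2 = %.2f" % model.score(x_tr, y_tr),
          "test R2 = %.2f" % model.score(x_te, y_te))
```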
Learning algorithms that overcome the curse of
dimensionality—extracting coherent patterns from all brain
voxels at once—typically incorporate an implicit bias for
anisotropic neighborhoods in the data (Hastie et al., 2001;
Bach, 2014; Bzdok et al., 2015). Put differently, prediction
models successful in the high-dimensional setting have an
in-built specialization to representing types of functions that
are compatible with the structure to be uncovered in the brain
data. Knowledge embodied in a learning algorithm suited to
a particular application domain can better calibrate the sweet
spot between underfitting and overfitting. When applying a
model without any complexity restrictions to high-dimensional
data generalization becomes dicult to impossible because all
directions in the data (i.e., individual brain voxels) are treated
equally with isotropic structure. At the root of the problem, all
data samples look virtually identical to the learning algorithm in
high-dimensional data scenarios (Bellman, 1961). The learning
algorithm will not be able to see through the idiosyncrasies in
the data, will tend to overfit, and thus be unlikely to generalize
to new observations. Such considerations provide insight into
why the multiple-comparisons problem is more often an issue
in encoding studies, while overfitting is more closely related
to decoding studies (Friston et al., 2008). The juxtaposition of
ClSt and StLe views offers insights into why restricting neural
data analysis to an ROI with fewer voxels, rather than the whole
brain, simultaneously alleviates both the multiple-comparisons
problem (ClSt) and the curse of dimensionality (StLe).
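A small numerical sketch (random Gaussian data, arbitrary sample and dimension counts) illustrates this "all samples look alike" effect: as the number of voxel-like dimensions grows, the relative gap between the closest and the farthest pair of observations shrinks.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
for n_dims in (2, 100, 10_000, 100_000):
    X = rng.standard_normal((50, n_dims))       # 50 hypothetical observations
    d = pdist(X)                                # all pairwise Euclidean distances
    contrast = (d.max() - d.min()) / d.min()    # relative spread of distances
    print(f"{n_dims:>7} dimensions: relative distance contrast = {contrast:.2f}")
```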
As a practical summary, classical inference on
neuroimaging data has largely been drawn by considering
each voxel independently and by massive simultaneous testing
of a same null hypothesis in all observed voxels. This has
incurred a multiple-comparisons problem difficult enough
that common approaches may still be prone to incorrect
results (Efron, 2012). In contrast, aiming for generalization
of a pattern in high-dimensional neuroimaging data to new
observations in the brain incurs the equally challenging curse of
dimensionality. Successfully accounting for the high number of
input dimensions will probably depend on learning models that
impose neurobiologically justified bias and keep the variance
under control by dimensionality reduction and regularization
techniques.
More broadly, asking at what point new neurobiological
knowledge is arising during ClSt and StLe investigations relies on
largely distinct theoretical frameworks that revolve around null-
hypothesis testing and statistical learning theory (Figure 4). Both
ClSt and StLe methods share the common goal of demonstrating
relevance of a given effect in the data beyond the sample
brain scans at hand. However, the attempt to show successful
extrapolation of a statistical relationship to the general population
is embedded in different mathematical contexts. Knowledge
generation in ClSt and StLe is hence rooted in different notions
of statistical inference.
ClSt laid down its most important inferential framework in
the Popperian spirit of critical empiricism (Popper, 1935/2005):
scientific progress is to be made by continuous replacement of
current hypotheses by ever more pertinent hypotheses using
falsification. The rationale behind hypothesis falsification is
that one counterexample can reject a theory by deductive
reasoning, while any quantity of evidence cannot confirm a
given theory by inductive reasoning (Goodman, 1999). The
investigator verbalizes two mutually exclusive hypotheses by
domain-informed judgment. The alternative hypothesis should
be conceived as the outcome intended by the investigator and
should contradict the state of the art of the research topic. The
null hypothesis represents the devil’s advocate argument that
the investigator wants to reject (i.e., falsify), and it should
follow automatically from the newly articulated alternative
hypothesis. A conventional 5%-threshold (i.e., equating with
roughly two standard deviations) guards against rejection due to
the idiosyncrasies of the sample that are not representative of the
general population. If the data have a probability of 5% or less given
the null hypothesis [P(result|H0)], the result is evaluated as significant.
Such a test for statistical significance indicates a difference between
two means with a 5% chance of being a false positive finding.
If the null hypothesis cannot be rejected (which depends on
power), then the test yields no conclusive result, rather than
a null result (Schmidt, 1996). In this way, classical hypothesis
testing continuously replaces currently embraced hypotheses
explaining a phenomenon in nature by better hypotheses with
more empirical support in a Darwinian selection process. Finally,
Fisher, Neyman, and Pearson intended hypothesis testing as a
marker for further investigation, rather than an off-the-shelf
decision-making instrument (Cohen, 1994; Nuzzo, 2014).
In StLe instead, answers to how neurobiological conclusions
can be drawn from a dataset at hand are provided by the Vapnik-
Chervonenkis dimensions (VC dimensions) from statistical
learning theory (Vapnik, 1989, 1996). The VC dimensions of
a pattern-learning algorithm quantify the probability at which
the distinction between the neural correlates underlying the
face vs. house conditions can be captured and used for correct
predictions in new, possibly later acquired brain scans from the
same cognitive experiment (i.e., out-of-sample generalization).
Such statistical approaches implement the inductive strategy to
learn general principles (i.e., the neural signature associated
with given cognitive processes) from a series of exemplary
brain measurements, which contrasts with the deductive strategy of
rejecting a certain null hypothesis based on counterexamples
(cf. Tenenbaum et al., 2011; Bengio, 2014; Lake et al., 2015).
The VC dimensions measure how complicated the examined
relationship between brain scans and experimental conditions
could become—in other words, the richness of the representation
which can be instantiated by the used model, the complexity
capacity of its hypothesis space, the “wiggliness” of the decision
boundary used to distinguish examples from several classes, or,
more intuitively, the “currency” of learnability. VC dimensions
are derived from the maximal number of different brain scans
that can be correctly detected to belong to either the house
condition or the face condition by a given model. The VC
dimensions thus provide a theoretical guideline for the largest set
of brain scan examples fed into a learning algorithm such that
this model is able to guarantee zero classification errors.
As one of the most important results from statistical learning
theory, in any intelligent learning system, the opportunity to
derive abstract patterns in the world by reducing the discrepancy
between prediction error from training data (in-sample estimate)
and prediction error from independent test data (out-of-sample
estimate) decreases with higher model capacity and increases
with the number of available training observations (Vapnik
and Kotz, 1982; Vapnik, 1996).

FIGURE 4 | Key concepts in classical statistics and statistical learning. Schematic with statistical notions that are relatively more associated with classical statistical methods (left column) or pattern-learning methods (right column). As there is a smooth transition between the classical statistical toolkit and learning algorithms, some notions may be closely associated with both statistical cultures (middle column).

In brain imaging, a learning
algorithm is hence theoretically backed up to successfully predict
outcomes in future brain scans with high probability if the
chosen model ignores structure that is overly complicated, such
as higher-order non-linearities between many brain voxels, and
if the model is provided with a sufficient number of training
brain scans. Hence, VC dimensions provide explanations for why
increasing the number of considered brain voxels as input
features (i.e., entailing an increased number of model parameters)
or using a more sophisticated prediction model requires more
training data for successful generalization. Notably, the VC
dimensions (analogous to null-hypothesis testing) are unrelated
to the target function, that is, the "true" mechanism underlying the
studied phenomenon in nature. Nevertheless, the VC dimensions
provide justification that a certain learning model can be used to
approximate that target function by fitting a model to a collection
of input-output pairs. In short, the VC dimensions are among the
best frameworks to derive theoretical error bounds for predictive
models (Abu-Mostafa et al., 2012).
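For orientation, one standard textbook form of this result (following, e.g., Abu-Mostafa et al., 2012; the notation is not taken from the present article) bounds the out-of-sample error E_out in terms of the in-sample error E_in, the number of training examples N, the VC dimension d_VC, and a tolerance delta:

```latex
E_{\text{out}}(h) \;\le\; E_{\text{in}}(h)
  + \sqrt{\frac{8}{N}\,\ln\!\frac{4\bigl((2N)^{d_{\mathrm{VC}}}+1\bigr)}{\delta}}
  \qquad \text{with probability at least } 1-\delta
```

The bound tightens as the number of training observations N grows and loosens as the model capacity d_VC increases, in line with the intuition above.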
Further, some common invalidations of the ClSt and StLe
statistical frameworks are conceptually related. An often-raised
concern in neuroimaging studies performing classical inference is
double dipping or circular analysis (Kriegeskorte et al., 2009). This
occurs when, for instance, first correlating a behavioral measure
with brain activity and then using the identified subset of brain
voxels for a second correlation analysis with that same behavioral
measurement (Vul et al., 2008; Lieberman et al., 2009). In this
scenario, voxels are submitted to two statistical tests with the
same goal in a nested, non-independent fashion5 (Freedman,
1983). This corrupts the validity of the null hypothesis on which
the reported test results conditionally depend. Importantly, this
case of repeating a same statistical estimation with iteratively
pruned data selections (on the training data split) is a valid
routine in the StLe framework, such as in recursive feature
elimination (Guyon et al., 2002; Hanson and Halchenko, 2008).
However, double-dipping or circular analysis in ClSt applications
to neuroimaging data has an analog in StLe analyses aiming at
out-of-sample generalization: data-snooping or peeking (Pereira
et al., 2009; Abu-Mostafa et al., 2012; Fithian et al., 2014).
This occurs, for instance, when performing simple (e.g., mean-
centering) or more involved (e.g., k-means clustering) target-
variable-dependent or -independent preprocessing on the entire
dataset, when it should instead be applied separately to the training sets
and test sets. Data-snooping can lead to overly optimistic cross-
validation estimates and a trained learning algorithm that fails
on fresh data drawn from the same distribution (Abu-Mostafa
et al., 2012). Rather than a corrupted null hypothesis, it is
the error bounds of the VC dimensions that are loosened and,
ultimately, invalidated because information from the concealed
test set influences model selection on the training set.

5 "If you torture the data enough, nature will always confess." (Coase, R. H., 1982, How should economists choose?)
In sum, statistical inference in ClSt is drawn by using the
entire data at hand to formally test for theoretically guaranteed
extrapolation of an effect to the general population. In stark
contrast, inferential conclusions in StLe are typically drawn by
fitting a model on a larger part of the data at hand (i.e., in-
sample model selection) and empirically testing for successful
extrapolation to an independent, smaller part of the data (i.e.,
out-of-sample model evaluation). As such, ClSt has a focus on
in-sample estimates and explained-variance metrics that measure
some form of goodness of fit, while StLe has a focus on out-of-
sample estimates and prediction accuracy.
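A minimal sketch of how this separation is typically enforced in practice (placeholder data, with a standard scaler and logistic regression chosen only for illustration): all preprocessing is wrapped in a pipeline so that it is re-estimated on each training split and merely applied to the corresponding test split, which avoids the peeking scenario described above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.standard_normal((80, 20_000))      # hypothetical scans x voxels
y = np.repeat([0, 1], 40)                  # e.g., house vs. face trials

# Discouraged: fitting the scaler (or clustering, feature selection, ...)
# on the full dataset lets test-set information leak into training.
# X = StandardScaler().fit_transform(X)

# Preferred: each preprocessing step is refit inside every training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
out_of_sample_acc = cross_val_score(pipe, X, y, cv=5)
print("accuracy per fold:", np.round(out_of_sample_acc, 2))
```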
CASE STUDY THREE: SIGNIFICANT
GROUP DIFFERENCES AND PREDICTING
THE GROUP OF PARTICIPANTS
Vignette: After isolating the neural correlates underlying face
processing, the neuroimaging investigator wants to examine
their relevance in psychiatric disease. In addition to the 40
healthy participants, 40 patients diagnosed with schizophrenia
are recruited and administered the same experimental paradigm
and set of face and house pictures. In this clinical fMRI study
on group dierences, the investigator wants to explore possible
imaging-derived markers that index deficits in social-affective
processing in patients carrying a diagnosis of schizophrenia.
Question: Can metrics of statistical relevance from ClSt and
StLe be combined to corroborate a given candidate biomarker?
Many investigators in imaging neuroscience share a
background in psychology, biology, or medicine, which
includes training in traditional “textbook” statistics. Many
neuroscientists have thus adopted a natural habit of assessing
the quality of statistical relationships by means of p-values,
effect sizes, confidence intervals, and statistical power. These
are ubiquitously taught and used at many universities, although
they are not the only coherent set of statistical diagnostics
(Figure 5). These outcome metrics from ClSt may for instance
be less familiar to some scientists with a background in
computer science, physics, engineering, or philosophy. As
an equally legitimate and internally coherent, yet less widely
known diagnostic toolkit from the StLe community, prediction
accuracy, precision, recall, confusion matrices, F1 score, and
learning curves can also be used to measure the relevance of
statistical relationships (Abu-Mostafa et al., 2012; Yarkoni and
Westfall, 2016).
On a general basis, applications of ClSt and StLe methods
may not judge findings on identical grounds (Breiman, 2001;
Shmueli, 2010; Lo et al., 2015). There is an often-overlooked
misconception that models with high explanatory performance
do necessarily exhibit high predictive performance (Wu et al.,
2009; Lo et al., 2015; Yarkoni and Westfall, 2016). For instance,
brain voxels in ventral visual stream found to well explain the
difference between face processing in healthy and schizophrenic
participants based on an ANOVA may not in all cases be the
best brain features to train a support vector machine to predict
this group effect in new participants. An important outcome
measure in ClSt is the quantified significance associated with a
statistical relationship between few variables given a pre-specified
model. ClSt tends to test for a particular structure in the brain
data based on analytical guarantees, in the form of mathematical
convergence theorems about approximating the population
properties with increasing sample size. The outcome measure
for StLe is the quantified generalization of patterns between
many variables or, more generally, the robustness of special
structure in the data (Hastie et al., 2001). In the neuroimaging
literature, reports of statistical outcomes have previously been
noted to confuse diagnostic measures from classical statistics and
statistical learning (Friston, 2012).
For neuroscientists adopting a ClSt culture, computing p-
values takes a central position. The p-value denotes the
probability of observing a result at least as extreme as a
test statistic, assuming the null hypothesis is true. Results are
considered significant when it is equal to or below a pre-specified
threshold, such as p = 0.05 (Anderson et al., 2000). Under the condition
of suciently high power (cf. below), it quantifies the strength
of evidence against the null hypothesis as a continuous function
(Rosnow and Rosenthal, 1989). Counterintuitively, it is not an
immediate judgment on the alternative hypothesis H1 preferred
by the investigator (Cohen, 1994; Anderson et al., 2000). P-
values also do not quantify the probability of replication. It is
another important caveat that a finding in the brain becomes
more statistically significant (i.e., lower p-value) with increasing
sample size (Berkson, 1938; Miller et al., 2016).
The essentially binary p-value (i.e., significant vs. not
significant) is therefore often complemented by continuous effect
size measures for the importance of rejecting H0. The effect
size allows the identification of marginal effects that pass the
statistical significance threshold but are not practically relevant
in the real world. The p-value is a deductive inferential measure,
whereas the effect size is a descriptive measure that follows neither
inductive nor deductive reasoning. The (normalized) effect size
can be viewed as the strength of a statistical relationship—how
much H0 deviates from H1, or the likely presence of an effect
in the general population (Chow, 1998; Ferguson, 2009; Kelley
and Preacher, 2012). This diagnostic measure is often unit-
free, sample-size independent, and typically standardized. As a
property of the actual statistical test, the effect size can be essential
to report for biological understanding, but has different names
and takes various forms, such as rho in Pearson correlation,
eta-squared in explained variances, and Cohen's d in differences between
group averages.
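As a compact numerical companion to the two preceding paragraphs, the sketch below (hypothetical per-participant activity estimates for two groups) computes a two-sample t-test p-value together with Cohen's d as one common standardized effect size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
healthy = 1.0 + 0.8 * rng.standard_normal(40)    # hypothetical contrast estimates
patients = 0.6 + 0.8 * rng.standard_normal(40)   # e.g., face > house activity

# classical two-sample test: P(result at least this extreme | H0 is true)
t_stat, p_val = stats.ttest_ind(healthy, patients)

# Cohen's d: standardized, sample-size-independent effect size (equal group sizes)
pooled_sd = np.sqrt((healthy.var(ddof=1) + patients.var(ddof=1)) / 2)
cohens_d = (healthy.mean() - patients.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_val:.3f}, d = {cohens_d:.2f}")
```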
Additionally, the certainty of a point estimate (i.e., the
outcome is a value) can be expressed by an interval estimate
(i.e., the outcome is a value range) using confidence intervals
(Casella and Berger, 2002). These variability diagnostics indicate
a range of values between which the true value will fall
a given proportion of the time (Estes, 1997; Nickerson,
2000; Cumming, 2009).

FIGURE 5 | Key differences between measuring outcomes in classical statistics and statistical learning. Ten intuitions on quantifying statistical modeling outcomes that tend to be relatively more true for classical statistical methods (blue) or pattern-learning methods (red). ClSt typically yields point estimates and interval estimates (e.g., p-values, variances, confidence intervals), whereas StLe frequently outputs a function or a program that can yield point and interval estimates on new observations (e.g., the k-means centroids or a trained classifier's decision function can be applied to new data). In many cases, classical inference is a judgment about an entire data sample, whereas a trained predictive model can obtain quantitative answers from a single data point.

Typically, a 95% confidence interval is constructed so that it
contains the population mean in 19 out of 20 of the observed
samples. The tighter the confidence
interval, the smaller the variance of the point estimate of the
population parameter in each drawn sample. The estimation of
confidence intervals is influenced by sample size and population
variability. Confidence intervals may be asymmetrical (ignored
by Gaussianity assumptions; Efron, 2012), can be reported for
dierent statistics and with dierent percentage borders. Notably,
they can be used as a viable surrogate for formal tests of statistical
significance in many scenarios (Cumming, 2009).
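A minimal sketch (simulated per-participant effect estimates) of a conventional 95% confidence interval around a group mean, based on the t distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
effect = 0.4 + rng.standard_normal(40)       # hypothetical subject-level effects

# 95% interval for the group mean, using the standard error of the mean
ci_low, ci_high = stats.t.interval(0.95,
                                   df=len(effect) - 1,
                                   loc=effect.mean(),
                                   scale=stats.sem(effect))
print(f"mean = {effect.mean():.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```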
Some confidence intervals can be computed in various data
scenarios and statistical regimes, whereas the power may be
especially meaningful within the culture of classical hypothesis
testing (Cohen, 1977, 1992; Oakes, 1986). To estimate power, the
investigator needs to specify the true effect size and variance
under H1. The ClSt-minded investigator can then estimate the
probability for rejecting null hypotheses that should be rejected,
at the given threshold alpha and given that H1 is true. A
high power thus ensures that statistically significant and non-
significant tests indeed reflect a property of the population
(Chow, 1998). Intuitively, a small confidence interval around
a relevant effect suggests high statistical power. False negatives
(i.e., Type II errors, beta error) become less likely with higher
power (= 1 − beta error) (cf. Ioannidis, 2005). Concretely, an
underpowered investigation means that the investigator is less
likely to be able to distinguish between H0 and H1 at the
specified significance threshold alpha. Power calculations depend
on several factors, including significance threshold alpha, the
effect size in the population, variation in the population, sample
size n, and experimental design (Cohen, 1992).
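A small sketch of such a calculation using the statsmodels power module (the assumed effect size of d = 0.5 and the target power of 0.8 are arbitrary placeholders):

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# participants needed per group to detect d = 0.5 at alpha = 0.05 with 80% power
n_needed = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# power actually achieved with 40 participants per group at the same settings
power_at_40 = power_analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=40)

print(round(n_needed), round(power_at_40, 2))
```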
While neuroimaging studies based on classical statistical
inference ubiquitously report p-values and confidence intervals,
there have however been few reports of effect size in the
neuroimaging literature (Kriegeskorte et al., 2010). Effect sizes
are however necessary to compute power estimates. This
explains the even rarer occurrence of power calculations in
the neuroimaging literature (Yarkoni and Braver, 2010;butsee
Poldrack et al., 2017). Given the importance of p-values and effect
sizes, the goal of computing both these useful statistics, such as
for group differences in the neural processing of face stimuli,
can be achieved based on two independent samples of these
experimental data (especially if some selection process has been
used). One sample would be used to perform statistical inference
on the neural activity change yielding a p-value and one sample
to obtain unbiased effect sizes. Further, it has been previously
emphasized (Friston, 2012) that p-values and effect sizes reflect
in-sample estimates in a retrospective inference regime (ClSt).
These metrics find an analog in out-of-sample estimates issued
from cross-validation in a prospective prediction regime (StLe).
In-sample effect sizes are typically an optimistic estimate of
the "true" effect size (inflated by high significance thresholds),
whereas out-of-sample effect sizes are unbiased estimates of the
"true" effect size.
In the high-dimensional scenario, the StLe-minded
investigator analyzing “wide” neuroimaging data in our
case, computing and judging statistical significance by p-
values can become challenging (Bühlmann and Van De Geer,
2011; Efron, 2012; James et al., 2013). Instead, classification
accuracy on fresh data is a frequently reported performance
metric in neuroimaging studies using learning algorithms. The
classification accuracy is a simple summary statistic that captures
the fraction of correct prediction instances among all performed
applications of a fitted model. Basing interpretation on accuracy
alone can be an insufficient diagnostic because it is frequently
influenced by the number of samples, the local characteristics
of hemodynamic responses, efficiency of experimental design,
data folding into train and test sets, and differences in the
feature number p (Haynes, 2015). A potentially under-exploited
data-driven tool in this context is bootstrapping. The archetypical
example of a computer-intensive statistical method enables
population-level inference of unknown distributions largely
independent of model complexity by repeated random draws
from the neuroimaging data sample at hand (Efron, 1979; Efron
and Tibshirani, 1994). This opportunity to equip various point
estimates with an interval estimate of certainty (e.g., the possibly
asymmetrical interval for the “true” accuracy of a classifier) is
unfortunately seldom embraced in neuroimaging today (but
see Bellec et al., 2010; Pernet et al., 2011; Vogelstein et al.,
2014). Besides providing confidence intervals, bootstrapping
can also perform non-parametric null hypothesis testing. This
may be one of few examples of a direct connection between
ClSt and StLe methodology. Alternatively, binomial tests have
been used to obtain a p-value estimate of statistical significance
from accuracies and other performance scores (Pereira et al.,
2009; Brodersen et al., 2013; Hanke et al., 2015) in the binary
classification setting. It has frequently been employed to reject
the null hypothesis that two categories occur equally often.
There are however increasing concerns about the validity of this
approach if statistical independence between the performance
estimates (e.g., prediction accuracies from each cross-validation
fold) is in question (Pereira and Botvinick, 2011; Noirhomme
et al., 2014; Jamalabadi et al., 2016). Yet another option to derive
p-values from classification performances of two groups is label
permutation based on non-parametric resampling procedures
(Nichols and Holmes, 2002; Golland and Fischl, 2003). This
algorithmic significance-testing tool can serve to reject the null
hypothesis that the neuroimaging data do not contain relevant
information about the group labels in many complex data
analysis settings.
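The sketch below (placeholder features and diagnoses, with logistic regression as an arbitrary classifier) illustrates the two non-parametric routes just mentioned: a label-permutation p-value for the cross-validated accuracy and a bootstrap percentile interval around it, computed from the out-of-sample predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, permutation_test_score

rng = np.random.default_rng(7)
X = rng.standard_normal((80, 500))        # hypothetical voxel features
y = np.repeat([0, 1], 40)                 # e.g., healthy vs. schizophrenia labels

clf = LogisticRegression(max_iter=1000)

# label permutation: empirical null distribution of accuracies, and a p-value
accuracy, _, p_perm = permutation_test_score(clf, X, y, cv=5,
                                             n_permutations=1000)

# bootstrap over participants' out-of-sample predictions: interval for accuracy
y_pred = cross_val_predict(clf, X, y, cv=5)
boot_acc = []
for _ in range(10_000):
    idx = rng.choice(len(y), size=len(y), replace=True)
    boot_acc.append(np.mean(y_pred[idx] == y[idx]))
ci_low, ci_high = np.percentile(boot_acc, [2.5, 97.5])

print(f"accuracy = {accuracy:.2f}, permutation p = {p_perm:.3f}, "
      f"bootstrap 95% interval = ({ci_low:.2f}, {ci_high:.2f})")
```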
The neuroscientist who adopted a StLe culture is in the habit
of corroborating prediction accuracies using cross-validation: the
de facto standard to obtain an unbiased estimate of a model’s
capacity to generalize beyond the brain scans at hand (Hastie
et al., 2001; Bishop, 2006). Model assessment is commonly done
by training on a bigger subset of the available data (i.e., training
set for in-sample performance) and subsequent application of the
trained model to the typically smaller remaining part of data
(i.e., test set for out-of-sample performance), both assumed to
be drawn from the same distribution. Cross-validation typically
divides the sample into data splits such that the class label
(i.e., healthy vs. schizophrenic) of each data point is to be
predicted once. The pairs of model-predicted label and the
corresponding true label for each data point (i.e., brain scan)
in the dataset can then be submitted to the quality measures
(Powers, 2011), including prediction accuracy (inversely related
to prediction error), precision, recall, and F1 score. Accuracy and
the other performance metrics are often computed separately
on the training set and the test set. Additionally, the measures
from training and testing can be expressed by their inverse
(e.g., training error as in-sample error and test error as out-
of-sample error) because the positive and negative cases are
interchangeable.
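A minimal cross-validation sketch (placeholder data, with a linear support vector machine as an arbitrary choice) that reports the in-sample and out-of-sample accuracies side by side:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(8)
X = rng.standard_normal((80, 500))           # hypothetical voxel features
y = np.repeat([0, 1], 40)                    # healthy vs. patient labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(SVC(kernel="linear"), X, y, cv=cv,
                     return_train_score=True)

print("in-sample (train) accuracy:   %.2f" % res["train_score"].mean())
print("out-of-sample (test) accuracy: %.2f" % res["test_score"].mean())
```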
The classification accuracy can be further decomposed into
group-wise metrics based on the so-called confusion matrix, the
juxtaposition of the true and predicted group memberships. The
precision measures (Table 1) how many of the labels predicted
from brain scans are correct, that is, how many participants
predicted to belong to a certain class really belong to that class.
Put differently, among the participants predicted to suffer from
schizophrenia, how many have really been diagnosed with that
disease? On the other hand, the recall measures how many labels
are correctly predicted, that is, how many members of a class
were predicted to really belong to that class. Hence, among the
participants known to be affected by schizophrenia, how many
were actually detected as such? Precision can be viewed as a
measure of "exactness" and recall as a measure of "completeness"
(Powers, 2011).
Neither accuracy, precision, nor recall allows injecting subjective
importance into the evaluation process of the learning algorithm.
This disadvantage is addressed by the Fbeta score: a weighted
combination of the precision and recall prediction scores.
Concretely, the F1 score would equally weigh precision and recall
of class predictions, while the F0.5 score puts more emphasis on
precision and the F2 score more on recall. Moreover, applications
of recall, precision, and Fbeta scores have been noted to ignore the
true negative cases as well as to be highly susceptible to estimator
bias (Powers, 2011). Needless to say, no single outcome metric
can be equally optimal in all contexts.
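For concreteness, the sketch below computes these group-wise metrics for a small set of invented true and predicted diagnoses; the numbers serve only to show how the confusion matrix, precision, recall, and Fbeta scores relate to each other.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, fbeta_score)

# hypothetical true diagnoses (1 = schizophrenia) and model predictions
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

print(confusion_matrix(y_true, y_pred))                 # rows: true, columns: predicted
print("precision:", precision_score(y_true, y_pred))    # 3 / (3 + 1) = 0.75
print("recall:   ", recall_score(y_true, y_pred))       # 3 / (3 + 2) = 0.60
print("F1:       ", fbeta_score(y_true, y_pred, beta=1))
print("F2:       ", fbeta_score(y_true, y_pred, beta=2))  # weighs recall more
```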
Extending from the setting of healthy-diseased classification
to the multi-class setting (e.g., comparing healthy, schizophrenic,
bipolar, and autistic participants) injects ambiguity into the
interpretation of accuracy scores. Rather than reporting mere
better-than-chance findings in StLe analyses, it becomes more
important to evaluate the F1, precision and recall scores for
each class to be predicted in the brain scans (e.g., Brodersen
et al., 2011b; Schwartz et al., 2013). It is important to appreciate
that the sensitivity/specificity metrics, perhaps more frequently
reported in ClSt communities, and the precision/recall metrics,
probably more frequently reported in StLe communities, tell
slightly different stories about identical neuroscientific findings.
TABLE 1 |

Notion               Formula
Specificity          true negative / (true negative + false positive)
Sensitivity/Recall   true positive / (true positive + false negative)