

REVIEW

published: xx September 2017

doi: 10.3389/fnins.2017.00543

Frontiers in Neuroscience | www.frontiersin.org | September 2017 | Volume 11 | Article 543

Edited by:

Yaroslav O. Halchenko,

Dartmouth College, United States

Reviewed by:

Matthew Brett,

University of Cambridge,

United Kingdom

Jean-Baptiste Poline,

University of California, Berkeley,

United States

*Correspondence:

Danilo Bzdok

danilo.bzdok@rwth-aachen.de


Specialty section:

This article was submitted to

Brain Imaging Methods,

a section of the journal

Frontiers in Neuroscience

Received: 12 April 2017

Accepted: 19 September 2017

Published: xx September 2017

Citation:

Bzdok D (2017) Classical Statistics

and Statistical Learning in Imaging

Neuroscience.

Front. Neurosci. 11:543.

doi: 10.3389/fnins.2017.00543

Classical Statistics and Statistical

Learning in Imaging Neuroscience


Danilo Bzdok 1,2,3*

1 Department of Psychiatry, Psychotherapy and Psychosomatics, Medical Faculty, RWTH Aachen, Aachen, Germany, 2 JARA, Jülich-Aachen Research Alliance, Translational Brain Medicine, Aachen, Germany, 3 Parietal Team, INRIA, Gif-sur-Yvette, France

Brain-imaging research has predominantly generated insight by means of classical

statistics, including regression-type analyses and null-hypothesis testing using t-test

and ANOVA. Throughout recent years, statistical learning methods enjoy increasing

popularity especially for applications in rich and complex data, including cross-validated

out-of-sample prediction using pattern classiﬁcation and sparsity-inducing regression.

This concept paper discusses the implications of inferential justiﬁcations and algorithmic

methodologies in common data analysis scenarios in neuroimaging. It is retraced how

classical statistics and statistical learning originated from different historical contexts,

build on different theoretical foundations, make different assumptions, and evaluate

different outcome metrics to permit differently nuanced conclusions. The present

considerations should help reduce current confusion between model-driven classical

hypothesis testing and data-driven learning algorithms for investigating the brain with

imaging techniques.

Keywords: neuroimaging, data science, epistemology, statistical inference, machine learning, p-value, rosetta stone

“The trick to being a scientist is to be open to using a wide variety of tools.”

Breiman (2001)

INTRODUCTION

Among the greatest challenges humans face are cultural misunderstandings between individuals,

groups, and institutions (Hall, 1976). The topic of the present paper is the culture clash between

knowledge generation based on null-hypothesis testing and out-of-sample pattern generalization

(Friedman, 1998; Breiman, 2001; Shmueli, 2010; Donoho, 2015). These statistical paradigms are

now increasingly combined in brain-imaging studies (Kriegeskorte et al., 2009; Varoquaux and

Thirion, 2014). Ensuing inter-cultural misunderstandings are unfortunate because the invention

and application of new research methods has always been a driving force in the neurosciences

(Greenwald, 2012; Yuste, 2015). Here the goal is to disentangle the contexts underlying classical

statistical inference and out-of-sample generalization by providing a direct comparison of their

historical trajectories, modeling philosophies, conceptual frameworks, and performance metrics.

During recent years, neuroscience has transitioned from qualitative reports of few patients with

neurological brain lesions to quantitative lesion-symptom mapping on the voxel level in hundreds

of patients (Gläscher et al., 2012). We have gone from manually staining and microscopically

inspecting single brain slices to 3D models of neuroanatomy at micrometer scale (Amunts

et al., 2013). We have also gone from experimental studies conducted by a single laboratory


Bzdok | Two Statistical Cultures in Neuroimaging

to automatized knowledge aggregation across thousands of

previously isolated neuroimaging ﬁndings (Yarkoni et al., 2011;

Fox et al., 2014). Rather than laboriously collecting in-house

data published in a single paper, investigators are now routinely

reanalyzing multi-modal data repositories (Derrfuss and Mar,

2009; Markram, 2012; Van Essen et al., 2012; Kandel et al., 2013;

Poldrack and Gorgolewski, 2014). The detail of neuroimaging

datasets is hence growing in terms of information resolution,

sample size, and complexity of meta-information (Van Horn and Toga, 2014; Eickhoff et al., 2016; Bzdok and Yeo, 2017). As a

consequence of the data demand of many pattern-recognition

algorithms, the scope of neuroimaging analyses has expanded

beyond the predominance of regression-type analyses combined

with null-hypothesis testing (Figure 1). Applications of statistical

learning methods (i) are more data-driven due to particularly

ﬂexible models, (ii) have scaling properties compatible with

high-dimensional data with myriads of input variables, and (iii)

follow a heuristic agenda by prioritizing useful approximations to

patterns in data (Jordan and Mitchell, 2015; LeCun et al., 2015;

Blei and Smyth, 2017). Statistical learning (Hastie et al., 2001)

henceforth comprises the umbrella of “machine learning,” “data

mining,” “pattern recognition,” “knowledge discovery,” “high-

dimensional statistics,” and bears close relation to “data science.”

From a technical perspective, one should make a note

of caution that holds across application domains such as

neuroscience: While the research question often precedes the

choice of statistical model, perhaps no single criterion exists

that alone allows for a clear-cut distinction between classical

statistics and statistical learning in all cases. For decades, the

two statistical cultures have evolved in partly independent

sociological niches (Breiman, 2001). There is currently a scarcity

of scientiﬁc papers and books that would provide an explicit

account on how concepts and tools from classical statistics and

statistical learning are exactly related to each other. Efron and

Hastie are perhaps among the ﬁrst to discuss the issue in their

book “Computer-Age Statistical Inference” (2016). The authors

cautiously conclude that statistical learning inventions, such as support vector machines, random-forest algorithms, and “deep” neural networks, cannot be easily situated in the classical

theory of twentieth century statistics. They go on to say that

“pessimistically or optimistically, one can consider this as a

bipolar disorder of the ﬁeld or as a healthy duality that is bound to

improve both branches” (Efron and Hastie, 2016, p. 447). In the

current absence of a commonly agreed-upon theoretical account

from the technical literature, the present concept paper examines

applications of classical statistics vs. statistical learning in the

concrete context of neuroimaging analysis questions.

More generally, ensuring that a statistical eﬀect discovered in

one set of data extrapolates to new observations in the brain can

take diﬀerent forms (Efron, 2012). As one possible deﬁnition,

“the goal of statistical inference is to say what we have learned

about the population X from the observed data x” (Efron and

Tibshirani, 1994). In a similar spirit, a committee report to

the National Academies of the USA stated (Committee on the

Analysis of Massive Data et al., 2013, p. 8): “Inference is the

problem of turning data into knowledge, where knowledge often

is expressed in terms of variables [...] that are not present in the

data per se, but are present in models that one uses to interpret

the data.” According to these deﬁnitions, statistical inference

can be understood as encompassing not only the classical null-

hypothesis testing framework but also Bayesian model inversion

to compute posterior distributions as well as more recently

emerged pattern-learning algorithms relying on out-of-sample

generalization (cf. Gigerenzer and Murray, 1987; Cohen, 1990;

Efron, 2012; Ghahramani, 2015). The important consequence for

the present considerations is that classical statistics and statistical

learning can give rise to diﬀerent categories of inferential

thinking (Chamberlin, 1890; Platt, 1964; Efron and Tibshirani,

1994)—an investigator may ask an identical neuroscientiﬁc

question in diﬀerent mathematical contexts.

For a long time, knowledge generation in psychology,

neuroscience, and medicine has been dominated by classical

statistics with estimation of linear-regression-like models and

subsequent statistical signiﬁcance testing whether an eﬀect

exists in the sample. In contrast, computation-intensive pattern

learning methods have always had a strong focus on prediction

in frequently extensive data with more modest concern for

interpretability and the “right” underlying question (Hastie

et al., 2001; Ghahramani, 2015). In many statistical learning

applications, it is standard practice to quantify the ability of

a predictive pattern to extrapolate to other samples, possibly

in individual subjects. In a two-step procedure, a learning algorithm is fitted on the larger portion of the available data (training data), and the ensuing fitted model is empirically evaluated on the smaller remaining portion of independent data (test data). This

stands in contrast to classical statistical inference where the

investigator seeks to reject the null hypothesis by considering the

entirety of a data sample (Wasserstein and Lazar, 2016), typically

all available subjects. In this case, the desired relevance of a

statistical relationship in the underlying population is ensured

by formal mathematical proofs and is not commonly ascertained

by explicit evaluations on new data (Breiman, 2001; Wasserstein

and Lazar, 2016). As such, generating insight according to

classical statistics and statistical learning serves rather distinct

modeling purposes. Classical statistics and statistical learning

do therefore not judge data on the same aspects of evidence

(Breiman, 2001; Shmueli, 2010; Arbabshirani et al., 2017; Bzdok

and Yeo, 2017). The two statistical cultures perform diﬀerent

types of principled assessment for successful extrapolation of

a statistical relationship beyond the particular observations

at hand.
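As a toy illustration of this two-step procedure, the following sketch fits a simple nearest-centroid classifier on a training portion of synthetic data and evaluates it on held-out test data (all data, variable names, and the choice of classifier are illustrative assumptions, not material from the studies cited above):

```python
# Minimal sketch of out-of-sample generalization (illustrative, not from the paper):
# step 1 fits a model on training data, step 2 evaluates it on independent test data.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                    # e.g., 200 scans x 50 brain features
y = (X[:, 0] > 0).astype(int)             # synthetic binary label (e.g., face vs. house)

# Step 1: fit on the larger training portion (here, the first 150 samples)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]
centroids = np.array([X_train[y_train == k].mean(axis=0) for k in (0, 1)])

# Step 2: evaluate the fitted model on held-out test data
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
y_pred = dists.argmin(axis=1)
accuracy = (y_pred == y_test).mean()      # out-of-sample accuracy, not in-sample fit
print(round(accuracy, 2))
```

The point of the split is that the accuracy is computed on observations the model never saw during fitting, which is the evidence criterion StLe substitutes for a null-hypothesis test on the full sample.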

Taking an epistemological perspective helps one appreciate

that scientiﬁc research is rarely an entirely objective process

but deeply depends on the beliefs and expectations of the

investigator. A new “scientiﬁc fact” about the brain is probably

not established in vacuo (Fleck et al., 1935; terms in quotes

taken from source). Rather, a research “object” is recognized

and accepted by the “subject” according to socially conditioned

“thought styles” that are cultivated among members of

“thought collectives.” A witnessed and measured neurobiological

phenomenon tends to only become “true” if not at odds with

the constructed “thought history” and “closed opinion system”

shared by that subject. The present paper will revisit and

reintegrate two such thought milieus in the context of imaging




FIGURE 1 | Application areas of two statistical paradigms. Lists examples of research domains which apply relatively more classical statistics (blue) or learning algorithms (red). The co-occurrence of increased computational resources, growing data repositories, and improving pattern-learning techniques has initiated a shift toward less hypothesis-driven and more algorithmic methodologies. As a broad intuition, researchers in the empirical sciences on the left tend to use statistics to evaluate a pre-assumed model on the data. Researchers in the application domains on the right tend to derive a model directly from the data: a new function with potentially many parameters is created that can predict the output from the input alone, without explicit programming. One of the key differences becomes apparent when thinking of the neurobiological phenomenon under study as a black box (Breiman, 2001). ClSt typically aims at modeling the black box by making a set of formal assumptions about its content, such as the nature of the signal distribution. Gaussian distributional assumptions have been very useful in many instances to enhance mathematical convenience and, hence, computational tractability. Instead, StLe takes a brute-force approach to model the output of the black box (e.g., tell healthy and schizophrenic people apart) from its input (e.g., volumetric brain measurements) while making a possible minimum of assumptions (Abu-Mostafa et al., 2012). In ClSt the stochastic processes that generated the data are therefore treated as partly known, whereas in StLe the phenomenon is treated as complex, largely unknown, and partly unknowable.

neuroscience: classical statistics (ClSt) and statistical learning (StLe).

DIFFERENT HISTORIES: THE ORIGINS OF

CLASSICAL HYPOTHESIS TESTING AND

PATTERN-LEARNING ALGORITHMS

One of many possible ways to group statistical methods is by

framing them along the lines of ClSt and StLe. The incongruent

historical developments of the two statistical communities are

even evident from their basic terminology. Inputs to statistical

models are usually called independent variables,explanatory

variables, or predictors in the ClSt community, but are typically

called features collected in a feature space in the StLe community.

The model outputs are typically called dependent variables,

explained variable, or responses in ClSt, while these are often

called target variables in StLe. It follows a summary of

characteristic events in the development of what can today be

considered as ClSt and StLe (Figure 2).

Around 1900, the notions of standard deviation, goodness of fit, and the p < 0.05 threshold emerged (Cowles and

Davis, 1982). This was also the period when William S. Gosset

published the t-test under the incognito name “Student” to

quantify production quality in Guinness breweries. Motivated

by concrete problems such as the interaction between potato

varieties and fertilizers, Ronald A. Fisher invented the analysis

of variance (ANOVA) and null-hypothesis testing, promoted p-values,

and devised principles of proper experimental conduct (Fisher

and Mackenzie, 1923; Fisher, 1925, 1935). Another framework

by Jerzy Neyman and Egon S. Pearson proposed the alternative

hypothesis, which allowed for the statistical notions of power,

false positives and false negatives, but left out the concept

of p-values (Neyman and Pearson, 1933). This was a time before the electrical calculators that emerged after World War II (Efron

and Tibshirani, 1991; Gigerenzer, 1993). Student’s t-test and

Fisher’s inference framework were institutionalized by American

psychology textbooks widely read in the 40s and 50s, while

Neyman and Pearson’s framework only became increasingly

known in the 50s and 60s. Today’s applied statistics textbooks




FIGURE 2 | Developments in the history of classical statistics and statistical learning. Examples of important inventions in statistical methodology. Roughly, a number of statistical methods taught in today’s textbooks in psychology and medicine have emerged in the first half of the twentieth century (blue). Instead, many algorithmic techniques and procedures have emerged in the second half of the twentieth century (red). “The postwar era witnessed a massive expansion of statistical methodology, responding to the data-driven demands of modern scientific technology.” (Efron and Hastie, 2016)

have inherited a mixture of the Fisher and Neyman-Pearson

approaches to statistical inference.

It is a topic of current debate1,2,3 whether ClSt is a discipline that is separate from StLe (e.g., Chambers, 1993; Breiman, 2001; Friedman, 2001; Bishop and Lasserre, 2007; Shalev-Shwartz and Ben-David, 2014; Efron and Hastie, 2016) or if “statistics”

denotes a broader methodological class that includes both ClSt

and StLe tools as its members (e.g., Tukey, 1962; Cleveland,

2001; Jordan and Mitchell, 2015; Blei and Smyth, 2017). StLe

methods may be more often adopted by computer scientists,

physicists, engineers, and others who typically have less formal

statistical background and may be more frequently working in

industry rather than academia. In fact, John W. Tukey foresaw

many of the developments that led up to what one might

today call statistical learning (Tukey, 1962). Early on, he proposed a “peaceful collision of computing and statistics.” A modern

reformulation of the same idea states (Efron and Hastie, 2016):

“If the inference/algorithm race is a tortoise-and-hare aﬀair, then

modern electronic computation has bred a bionic hare.” Indeed,

1 “Data Science and Statistics: different worlds?” (Panel at Royal Statistical Society UK, March 2015) (https://www.youtube.com/watch?v=C1zMUjHOLr4)

2“50 years of Data Science” (David Donoho, Tukey Centennial workshop, USA,

September 2015)

3“Are ML and Statistics Complementary?” (Max Welling, 6th IMS-ISBA meeting,

December 2015)

kernel methods, decision trees, nearest-neighbor algorithms,

graphical models, and various other statistical tools actually

emerged in the ClSt community, but largely continued to develop

in the StLe community (Friedman, 2001).

As often cited beginnings of statistical learning approaches,

the perceptron was an early brain-inspired computing algorithm

(Rosenblatt, 1958), and Arthur Samuel created a checker

board program that succeeded in beating its own creator

(Samuel, 1959). Such studies toward artiﬁcial intelligence (AI)

led to enthusiastic optimism and subsequent periods of

disappointment during the so-called “AI winters” in the late 70s

and around the 90s (Russell and Norvig, 2002; Kurzweil, 2005;

Cox and Dean, 2014), while the increasingly available computers

in the 80s encouraged a new wave of statistical algorithms

(Efron and Tibshirani, 1991). Later, the use of StLe methods

increased steadily in many quantitative scientiﬁc domains as

they underwent an increase in data richness from classical “long data” (samples n > variables p) to increasingly encountered “wide data” (n << p) (Tibshirani, 1996; Hastie et al., 2015).

The emerging field of StLe received conceptual consolidation through the seminal book “The Elements of Statistical Learning”

(Hastie et al., 2001). The coincidence of changing data properties,

increasing computational power, and cheaper memory resources

encouraged a still ongoing resurgence in StLe research and applications approximately since 2000 (Manyika et al., 2011; UK House of Commons S.a.T, 2016). For instance, over the last

15 years, sparsity assumptions gained increasing relevance for

statistical and computational tractability as well as for domain

interpretability when using supervised and unsupervised learning

algorithms (i.e., with and without target variables) in the high-dimensional “n << p” setting (Bühlmann and Van De Geer, 2011; Hastie et al., 2015). More recently, improvements in training very “deep” (i.e., many non-linear hidden layers) neural-network architectures (Hinton and Salakhutdinov, 2006) have

much improved automatized feature selection (Bengio et al.,

2013) and have exceeded human-level performance in several

application domains (LeCun et al., 2015).

In sum, “the biggest diﬀerence between pre- and post-war

statistical practice is the degree of automation” (Efron and

Tibshirani, 1994) up to a point where “almost all topics in twenty-

ﬁrst-century statistics are now computer-dependent” (Efron and

Hastie, 2016). ClSt has seen many important inventions in the

ﬁrst half of the twentieth century, which have often developed

at statistical departments of academic institutions and remain in

nearly unchanged form in current textbooks of psychology and

other empirical sciences. The emergence of StLe as a coherent

ﬁeld has mostly taken place in the second half of the twentieth

century as a number of disjoint developments in industry

and often non-statistical departments in academia (e.g., AT&T

Bell Laboratories), which led, for instance, to artificial neural

networks, support vector machines, and boosting algorithms

(Efron and Hastie, 2016). Today, systematic education in StLe

is still rare at the large majority of universities, in contrast to

the many consistently oﬀered ClSt courses (Cleveland, 2001;

Vanderplas, 2013; Burnham and Anderson, 2014; Donoho, 2015).

In neuroscience, the advent of brain-imaging techniques,

including positron emission tomography (PET) and functional




magnetic resonance imaging (fMRI), allowed for the in-vivo

characterization of the neural correlates underlying sensory,

cognitive, or aﬀective tasks. Brain scanning enabled quantitative

brain measurements with many variables per observation

(analogous to the advent of high-dimensional microarrays in

genetics; Efron, 2012). Since the inception of PET and fMRI,

deriving topographical localization of neural activity changes

was dominated by analysis approaches from ClSt, especially

the general linear model (GLM; Scheffé, 1959; Poline and Brett, 2012). The classical approach to neuroimaging analysis is

probably best exempliﬁed by the statistical parametric mapping

(SPM) software package that implements the GLM to provide a

mass-univariate characterization of regionally speciﬁc eﬀects.

As distributed information over voxels is less well captured by

many ClSt approaches, including common GLM applications,

StLe models were proposed early on for neuroimaging

investigations. For instance, principal component analysis

was used to distinguish globally distributed neural activity

changes (Moeller et al., 1987) as well as to study Alzheimer’s

disease (Grady et al., 1990). Canonical correlation analysis

was used to quantify complex relationships between task-free

neural activity and schizophrenia symptoms (Friston et al.,

1992). However, these ﬁrst approaches to “multivariate” brain-

behavior associations did not ignite a major research trend

(cf. Worsley et al., 1997; Friston et al., 2008). As a seminal

contribution, Haxby and colleagues devised an innovative

across-voxel correlation analysis to provide evidence against

the widely assumed face-specificity of neural responses in the ventral temporal cortex (Haxby et al., 2001). This ClSt realization of

one-nearest neighbor classiﬁcation based on correlation distance

foreshadowed several important developments, including (i)

joint analysis of sets of brain locations to capture “distributed and

overlapping representations”, (ii) repeated analysis in diﬀerent

splits of the data sample to compare against chance performance,

and (iii) analysis across multiple stimulus categories to assess

the speciﬁcity of neural responses. The ﬁnding of distributed

face representation was conﬁrmed in independent, similar data

(Cox and Savoy, 2003) and based on neural network algorithms

(Hanson et al., 2004).
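The logic of such an across-voxel correlation analysis, assigning a held-out activity pattern to the condition whose pattern from the other data half it correlates with most, can be sketched on synthetic data (a hedged illustration; the templates, noise level, and variable names are assumptions, not details from the original study):

```python
# Illustrative sketch (not the original study's code): split-half pattern
# correlation, i.e., nearest-neighbor assignment under correlation distance.
import numpy as np

rng = np.random.RandomState(1)
n_vox = 300
face_template = rng.randn(n_vox)          # hypothetical voxel activity patterns
house_template = rng.randn(n_vox)

def measure(template):                    # simulate one measured half-split pattern
    return template + 0.5 * rng.randn(n_vox)

half1 = {"face": measure(face_template), "house": measure(house_template)}
test_pattern = measure(face_template)     # a pattern from the other data half

# Assign the held-out pattern to the condition it correlates with most
corrs = {cond: np.corrcoef(test_pattern, pattern)[0, 1]
         for cond, pattern in half1.items()}
decoded = max(corrs, key=corrs.get)
print(decoded)
```

Repeating this assignment over many held-out patterns and comparing the hit rate against chance is what later became familiar as decoding accuracy.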

The application of StLe methods in neuroimaging increased

further after rebranding as “mind-reading,” “brain decoding,”

and “MVPA” (Haynes and Rees, 2005; Kamitani and Tong,

2005). Note that “MVPA” initially referred to “multivoxel pattern

analysis” (Kamitani and Tong, 2005; Norman et al., 2006)

and later changed to “multivariate pattern analysis” (Haynes

and Rees, 2005; Hanke et al., 2009; Haxby, 2012). Up to that

point, the term prediction had less often been used by imaging

neuroscientists in the sense of out-of-sample generalization of

a learning algorithm and more often in the incompatible sense

of (in-sample) linear correlation such as using Pearson’s or

Spearman’s method (Shmueli, 2010; Gabrieli et al., 2015). While

there was scarce discussion of the position of “decoding” models

in formal statistical terms, growing interest was manifested in

ﬁrst review publications and tutorial papers on applying StLe

methods to neuroimaging data (Haynes and Rees, 2006; Mur

et al., 2009; Pereira et al., 2009). The interpretational gains of

this new access to the neural representation of behavior and

its disturbances in disease was ﬂanked by the availability of

necessary computing power and memory resources. Although

challenging to realize, “deep” neural network algorithms have

recently been introduced to neuroimaging research (Plis et al.,

2014; de Brebisson and Montana, 2015; Güçlü and van Gerven,

2015). These computation-intensive models might help in

approximating and deciphering the nature of neural processing

in brain circuits (Cox and Dean, 2014; Yamins and DiCarlo,

2016). As the dimensionality and complexity of neuroimaging

datasets are constantly increasing, neuroscientiﬁc investigations

will be always more likely to beneﬁt from StLe methods given

their natural scaling to large-scale data analysis (Efron, 2012;

Efron and Hastie, 2016; Blei and Smyth, 2017).

From a conceptual viewpoint (Figure 3), a large majority of

statistical methods can be situated somewhere on a continuum

between the two poles of ClSt and StLe (Committee on the

Analysis of Massive Data et al., 2013; Efron and Hastie,

2016; p. 61). ClSt was mostly fashioned for problems with

small samples that can be grasped by plausible models with

a small number of parameters chosen by the investigator in

an analytical fashion. StLe was mostly fashioned for problems

with many variables in potentially large samples with little

knowledge of the data-generating process that gets emulated by

a mathematical function derived from data in a heuristic fashion.

Tools from ClSt therefore typically assume that the data behave

according to certain known mechanisms, whereas StLe exploits

algorithmic techniques to avoid many a-priori speciﬁcations of

data-generating mechanisms. Neither ClSt or StLe nor any of the

other categories of statistical models can be considered generally

superior. This relativism is captured by the so-called no free

lunch theorem4 (Wolpert, 1996): no single statistical strategy can

consistently do better in all circumstances (cf. Gigerenzer, 2004).

As a very general rule of thumb, ClSt preassumes and formally

tests a model for the data, whereas StLe extracts and empirically

evaluates a model from the data.

CASE STUDY ONE: COGNITIVE

CONTRAST ANALYSIS AND DECODING

MENTAL STATES

Vignette: A neuroimaging investigator wants to reveal the neural

correlates underlying face processing in humans. Forty healthy, right-handed adults are recruited and undergo a block-design

experiment run in a 3T MRI scanner with whole-brain coverage.

In a passive viewing paradigm, 60 colored stimuli of unfamiliar

faces are presented, which have forward head and gaze position.

The control condition presents colored pictures of 60 diﬀerent

houses to the participants. In the experimental paradigm, a

picture of a face or a house is presented for 2 s in each trial and

the interval between trials within each block is randomly jittered

varying from 2 to 7 s. The picture stimuli are presented in pseudo-

randomized fashion and are counterbalanced in each passively

4In the supervised setting, there is no a priori distinction between learning

algorithms evaluated by out-of-sample prediction error. In the optimization

setting of ﬁnite spaces, all algorithms searching an extremum perform identically

when averaged across possible cost functions. (http://www.no-free-lunch.org/)




FIGURE 3 | Key differences in the modeling philosophy of classical statistics and statistical learning. Ten modeling intuitions that tend to be relatively more characteristic of classical statistical methods (blue) or pattern-learning methods (red). In comparison to ClSt, StLe “is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions” (Goodfellow et al., 2016). Broadly, ClSt tends to be more analytical by imposing mathematical rigor on the phenomenon, whereas StLe tends to be more heuristic by finding useful approximations. In practice, ClSt is probably more often applied to experimental data, where a set of target variables is systematically controlled by the investigator and the brain system under study has been subject to experimental perturbation. Instead, StLe is probably more often applied to observational data without such structured influence and where the studied system has been left unperturbed. ClSt fully specifies the statistical model at the beginning of the investigation, whereas in StLe there is a bigger emphasis on models that can flexibly adapt to the data (e.g., learning algorithms creating decision trees).

watching participant. Despite the blocked presentation of stimuli, each experimental trial is modeled separately. The fMRI data are analyzed using a GLM as implemented in the SPM software package. Two task regressors are included in the model for the face and house conditions, based on the stimulus onsets and viewing durations and using a canonical hemodynamic response function. In the GLM design matrix, the face column and house column are hence set to 1 for brain scans from the corresponding task condition and set to 0 otherwise. Separately in each brain voxel, the GLM parameters are estimated, which fits beta_face and beta_house regression coefficients to explain the contribution of each experimental task to the neural activity increases and decreases observed in that voxel. A t-test can then formally assess whether the fMRI signal in the current voxel is significantly more involved in viewing faces as opposed to the house control condition.
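The mass-univariate GLM fit and t-contrast just described can be sketched in a few lines of NumPy. All names, dimensions, and effect sizes below are invented for illustration, and the design omits the HRF convolution and nuisance regressors a real SPM analysis would include:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 40 scans x 50 voxels (all sizes are illustrative).
n_scans, n_voxels = 40, 50
face = np.tile([1.0, 0.0], n_scans // 2)   # indicator: 1 when a face was shown
house = 1.0 - face                          # indicator: 1 when a house was shown
X = np.column_stack([face, house])          # design matrix (conditions partition the scans)

# Simulate one "face-responsive" voxel (index 0) on top of Gaussian noise.
Y = rng.normal(size=(n_scans, n_voxels))
Y[:, 0] += 2.0 * face

# Mass-univariate fit: ordinary least squares for every voxel at once.
beta = np.linalg.lstsq(X, Y, rcond=None)[0]        # shape (2, n_voxels)

# t-statistic for the contrast beta_face - beta_house in each voxel.
c = np.array([1.0, -1.0])
resid = Y - X @ beta
dof = n_scans - 2
sigma2 = (resid ** 2).sum(axis=0) / dof            # residual variance per voxel
t = (c @ beta) / np.sqrt(sigma2 * (c @ np.linalg.inv(X.T @ X) @ c))
```

The simulated face-responsive voxel yields a large t-value, while the null voxels scatter around zero, mirroring the per-voxel tests of the localization agenda.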

Question: What is the statistical difference between subtracting the neural activity from the face vs. house conditions and decoding the neural activity during face vs. house processing?

Computing cognitive contrasts is a ClSt approach that was and still is routinely performed in the mass-univariate regime: it fits a separate GLM model for each individual voxel in the brain scans and then tests for significant differences between the obtained condition coefficients (Friston et al., 1994). Instead, decoding cognitive processes from neural activity is a StLe approach that is typically performed in a multivariate regime: a learning algorithm is trained on a large number of voxel observations in brain scans and then the model's prediction accuracy is evaluated on sets of new brain scans. These ClSt and StLe approaches to identifying the neural correlates underlying cognitive processes of interest are closely related to the notions of encoding models and decoding models, respectively (Kriegeskorte, 2011; Naselaris et al., 2011; Pedregosa et al., 2015; but see Güçlü et al., 2015).

Encoding models regress the brain data against a design matrix with indicators of the face vs. house condition and formally test whether the difference is statistically significant. Decoding models typically aim to predict these indicators by training and empirically evaluating classification algorithms on different splits from the whole dataset. In ClSt parlance, the model explains the neural activity, the dependent or explained variable, measured in each separate brain voxel, by the beta coefficients according to the experimental condition indicators in the design matrix columns, the independent or explanatory variables. That is, the GLM can be used to explain neural activity changes by a linear combination of experimental variables (Naselaris et al., 2011). Answering the same neuroscientific question with decoding models in StLe jargon, the model weights of a classifier are fitted on the training set of the input data to predict the class labels, the target variables, and are subsequently evaluated on the test set by cross-validation to obtain their out-of-sample generalization performance. Here, classification algorithms are used to predict entries of the




design matrix by identifying a linear or more complicated combination between the many simultaneously considered brain voxels (Pereira et al., 2009). More broadly, ClSt applications in functional neuroimaging tend to estimate the presence of cognitive processes from neural activity, whereas many StLe applications estimate properties of neural activity from different cognitive tasks.
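The decoding route can be made concrete with a deliberately simple linear classifier. A nearest-centroid rule stands in here for the support vector machines or logistic regressions more commonly used, and all data and dimensions are simulated assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 80 scans x 200 voxels, weak signal spread over 20 voxels.
n, p = 80, 200
y = np.repeat([0, 1], n // 2)          # 0 = house, 1 = face
X = rng.normal(size=(n, p))
X[y == 1, :20] += 1.0                   # distributed multivariate pattern

# Single hold-out split for brevity; cross-validation would repeat this over folds.
idx = rng.permutation(n)
train, test = idx[:60], idx[60:]

# Nearest-centroid "decoder": fitting = computing one class mean per condition.
mu0 = X[train][y[train] == 0].mean(axis=0)
mu1 = X[train][y[train] == 1].mean(axis=0)

def predict(samples):
    # Assign each scan to the closer class centroid in voxel space.
    d0 = ((samples - mu0) ** 2).sum(axis=1)
    d1 = ((samples - mu1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

accuracy = (predict(X[test]) == y[test]).mean()   # out-of-sample performance
```

The single number reported is the prediction accuracy on unseen scans, not a p-value: the estimation runs from many voxels jointly toward the behavioral label, the reverse direction of the encoding model above.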

A key difference between many ClSt-mediated encoding models and StLe-mediated decoding models thus pertains to the direction of statistical estimation between brain space and behavior space (Friston et al., 2008; Varoquaux and Thirion, 2014). It was noted (Friston et al., 2008) that the direction of brain-behavior association is related to the question whether the stimulus indicators in the model act as causes by representing deterministic experimental variables of an encoding model or as consequences by representing probabilistic outputs of a decoding model. Such considerations also reveal the intimate relationship of ClSt models to the notion of forward inference, while StLe methods are probably more often used for formal reverse inference in functional neuroimaging (Poldrack, 2006; Eickhoff et al., 2011; Yarkoni et al., 2011; Varoquaux and Thirion, 2014). On the one hand, forward inference relates to encoding models by testing the probability of observing activity in a brain location given knowledge of a psychological process. On the other hand, reverse inference relates to brain decoding to the extent that classification algorithms can learn to distinguish experimental fMRI data as belonging to two psychological conditions and can subsequently be used to estimate the presence of specific cognitive processes based on new neural activity observations (cf. Poldrack, 2006). Finally, establishing a brain-behavior association has been argued to be more important than the actual direction of the mapping function (Friston, 2009). This author stated that "showing that one can decode activity in the visual cortex to classify [...] a subject's percept is exactly the same as demonstrating significant visual cortex responses to perceptual changes" and, conversely, "all demonstrations of functionally specialized responses represent an implicit mindreading."

Conceptually, GLM-based encoding models follow a localization agenda by testing hypotheses on regional effects of functional specialization in the brain (where?). A t-test is used to compare pairs of neural activity estimates to statistically distinguish the target face and the non-target house condition (Friston et al., 1996). Essentially, this test for significant differences between the fitted beta coefficients corresponds to comparing two stimulus indicators chosen based on well-founded arguments from cognitive theory. This statistical approach assumes that cognitive subtraction is possible, that is, that the regional brain responses of interest can be isolated by contrasting two sets of brain scans that are believed to differ in the cognitive facet of interest (Friston et al., 1996; Stark and Squire, 2001). For one voxel location at a time, an attempt is made to reject the null hypothesis of no difference between the averaged neural activity level of a target brain state and the averaged neural activity of a control brain state. It is important to appreciate that the localization agenda thus emphasizes the relative difference in fMRI signal during tasks and may neglect the individual neural activity information of each particular task (Logothetis et al., 2001). Note that the univariate GLM analysis can be extended to more than one output (dependent or explained) variable within the ClSt regime by performing a multivariate analysis of covariance (MANCOVA). This allows for tests of more complex hypotheses but incurs multivariate normality assumptions (Kriegeskorte, 2011).

More generally, it is seldom mentioned that the standard GLM would not have been solvable for unique solutions in the high-dimensional "n << p" regime, instead of fitting one model for each voxel in the brain scans. This is because the number of brain voxels p exceeds by far the number of data samples n (i.e., leading to an under-determined system of equations), which incapacitates many statistical estimators from ClSt (cf. Giraud, 2014; Hastie et al., 2015). Regularization by sparsity-inducing norms, such as in modern penalized regression analysis using the LASSO and ElasticNet, emerged only later (Tibshirani, 1996; Zou and Hastie, 2005) as a principled StLe strategy to de-escalate the need for dimensionality reduction or preliminary filtering of important voxels and to enable the tractability of the high-dimensional analysis setting.
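A minimal sketch of why the unregularized fit fails when n << p, and how a penalty restores a unique, stable solution, can be given with an L2 (ridge) penalty, the closed-form cousin of the L1 penalties named above; the data and the choice of lambda are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical n << p setting: 30 scans, 500 voxels.
n, p = 30, 500
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = 1.0                               # only 5 voxels truly informative
y = X @ w_true + 0.1 * rng.normal(size=n)

# X.T @ X is rank-deficient (rank <= 30 < 500): OLS has no unique solution.
assert np.linalg.matrix_rank(X.T @ X) <= n

# Adding a ridge penalty lambda * I makes the normal equations invertible.
lam = 10.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_corr = np.corrcoef(X @ w_ridge, y)[0, 1]   # in-sample fit is recovered
```

The LASSO's L1 penalty, which additionally drives most coefficients to exactly zero, has no such closed form and is typically solved by iterative methods such as coordinate descent.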

Because hypothesis testing for significant differences between beta coefficients of fitted GLMs relies on comparing the means of neural activity measurements, the results from statistical tests are not corrupted by the conventionally applied spatial smoothing with a Gaussian filter. On the contrary, this image preprocessing step even helps the correction for multiple comparisons based on random field theory (cf. below), alleviates inter-individual neuroanatomical variability, and can thus increase sensitivity. Spatial smoothing however discards fine-grained neural activity patterns spatially distributed across voxels that potentially carry information associated with mental operations (cf. Kamitani and Sawahata, 2010; Haynes, 2015). Indeed, some authors believe that sensory, cognitive, and motor processes manifest themselves as "neuronal population codes" (Averbeck et al., 2006). The relevance of such population codes in human neuroimaging was for instance suggested by revealing subject-specific neural responses in the fusiform gyrus to facial stimuli (Saygin et al., 2012). In applications of StLe models, the spatial smoothing step is therefore often skipped because the "decoding" algorithms precisely exploit the locally varying structure of the salt-and-pepper patterns in fMRI signals.

In so doing, decoding models use learning algorithms in an information agenda by showing generalization of robust patterns to new brain activity acquisitions (Kriegeskorte et al., 2006; Mur et al., 2009; de-Wit et al., 2016). Information that is weak in one voxel but spatially distributed across voxels can be effectively harvested in a structure-preserving fashion (Haynes and Rees, 2006; Haynes, 2015). This modeling agenda is focused on the whole neural activity pattern, in contrast to the localization agenda dedicated to separate increases or decreases in neural activity level. For instance, the default mode network typically exhibits activity decreases at the onset of many psychological tasks with visual or other sensory stimuli, whereas the induced activity patterns in that less activated network may nevertheless functionally subserve task execution (Bzdok et al., 2016; Christoff et al., 2016). Some brain-behavior associations might only emerge when simultaneously capturing neural activity in a group of




voxels but disappear in single-voxel approaches, such as mass-univariate GLM analyses (cf. Davatzikos, 2004). Note that, analogous to multivariate variants of the GLM, decoding could also be replaced by classical statistical approaches (cf. Haxby et al., 2001; Brodersen et al., 2011a). For a linear classification algorithm trained to predict face vs. house stimuli based on many brain voxels, model fitting typically searches iteratively through the hypothesis space (= function space) of the chosen learning model. In our case, the final hypothesis selected by the linear classifier commonly corresponds to one specific combination of model weights (i.e., a weighted contribution of individual brain measurements) that equates with one mapping function from the neural activity features to the face vs. house target variable.

Among other views, it has previously been proposed (Brodersen, 2009) that four types of neuroscientific questions become readily quantifiable through StLe applications to neuroimaging: (i) Where is an information category neurally processed? This can extend the interpretational spectrum from increase and decrease of neural activity to the existence of complex combinations of activity variations distributed across voxels. For instance, across-voxel linear correlation could decode object categories from the ventral temporal cortex even after excluding the fusiform gyrus, which is known to be responsive to object stimuli (Haxby et al., 2001). (ii) Whether a given information category is reflected by neural activity? This can extend the interpretational spectrum to topographically similar but neurally distinct processes that potentially underlie different cognitive facets. For instance, linear classifiers could successfully distinguish whether a subject is attending to the first or second of two simultaneously presented stimuli (Kamitani and Tong, 2005). (iii) When is an information category generated (i.e., onset), processed (i.e., duration), and bound (i.e., alteration)? When applying classifiers to neural time series, the interpretational spectrum can be extended to the beginning, evolution, and end of distinct cognitive facets. For instance, different classifiers have been demonstrated to map the decodability time structure of mental operation sequences (King and Dehaene, 2014). (iv) More controversially, how is an information category neurally processed? The interpretational spectrum can be extended to computational properties of the neural processes, including processing in brain regions vs. brain networks or isolated vs. partially shared processing facets. For instance, a classifier trained for evolutionarily conserved eye gaze movement was able to decode evolutionarily more recent mathematical calculation processes as a possible case of "neural recycling" in the human brain (Knops et al., 2009; Anderson, 2010). As an important caveat in interpreting StLe models, the particular technical properties of a chosen learning algorithm (e.g., linear vs. non-linear support vector machines) can probably seldom serve as a convincing argument for reverse-engineering mechanisms of neural information processing as measured by fMRI scanning (cf. Misaki et al., 2010).

In sum, the statistical properties of ClSt and StLe methods have characteristic consequences in neuroimaging analysis and interpretation. They hence offer different access routes and complementary answers to identical neuroscientific questions.

CASE STUDY TWO: SMALL VOLUME CORRECTION AND SEARCHLIGHT ANALYSIS

Vignette: The neuroimaging experiment from case study 1 successfully identified the fusiform gyrus of the ventral visual stream to be more responsive to face stimuli than house stimuli. However, the investigator's initial hypothesis of also observing face-responsive neural activity in the ventromedial prefrontal cortex could not be confirmed in the whole-brain analyses. The investigator therefore wants to follow up with a topographically focused approach that examines differences in neural activity between the face and house conditions exclusively in the ventromedial prefrontal cortex.

Question: What are the statistical implications of delineating task-relevant neural responses in a spatially constrained search space rather than analyzing brain measurements of the entire brain?

A popular ClSt approach to corroborate less pronounced neural activity findings is small volume correction. This region of interest (ROI) analysis involves application of the mass-univariate GLM approach only to the ventromedial prefrontal cortex as a preselected biological compartment, rather than considering the gray-matter voxels of the entire brain in a naïve, topographically unconstrained fashion. Small volume correction allows for significant findings in the ROI that would remain sub-threshold after accounting for the tens of thousands of multiple comparisons in the whole-brain GLM analysis. Small volume correction is therefore a simple means to alleviate the multiple-comparisons problem that motivated more than two decades of still ongoing methodological developments in the neuroimaging domain (Worsley et al., 1992; Smith et al., 2001; Friston, 2006; Nichols, 2012). Whole-brain GLM results were initially reported as uncorrected findings without accounting for multiple comparisons, then with Bonferroni's family-wise error (FWE) correction, later by random field theory correction using neural activity height (or clusters), followed by the false discovery rate (FDR) (Genovese et al., 2002) and a slowly increasing adoption of cluster-thresholding for voxel-level inference via permutation testing (Smith and Nichols, 2009). Rather than the isolated voxel, it was discussed early on that a possibly better unit of interest should be spatially neighboring voxel groups (see here for an overview: Chumbley and Friston, 2009). The setting of high regional correlation of neural activity was successfully addressed by random field theory, which provides inferences not about individual voxels but about topological features in the underlying (spatially continuous) effects. This topological inference is used to identify clusters of relevant neural activity changes from their peak, size, or mass (Worsley et al., 1992). Importantly, the spatial dependencies of voxel observations were not incorporated into the GLM estimation step, but instead taken into account during the subsequent model inference step to alleviate the multiple-comparisons problem.

A related cousin of small volume correction in the StLe world would be to apply classification algorithms to a subset of voxels to be considered as input to the model (i.e., feature




selection). In particular, searchlight analysis is an increasingly popular learning technique that can identify locally constrained multivariate patterns in neural activity (Friman et al., 2001; Kriegeskorte et al., 2006). For each voxel in the ventromedial prefrontal cortex, the brain measurements of the immediate neighborhood are first collected (e.g., within a radius of 10 mm). In each such searchlight, a classification algorithm, for instance a linear support vector machine, is then trained on one part of the brain scans (training set) and subsequently applied to determine the prediction accuracy in the remaining, unseen brain scans (test set). In this StLe approach, the excess of brain voxels is handled by performing pattern recognition analysis in only dozens of locally adjacent voxel neighborhoods at a time. Finally, the mean classification accuracy of face vs. house stimuli across all permutations over the brain data is mapped to the center of each considered sphere. The searchlight is then moved through the ROI until each seed voxel has once been the center voxel of the searchlight. This yields a voxel-wise classification map of accuracy estimates for the entire ventromedial prefrontal cortex. Consistent with the information agenda (cf. above), searchlight analysis quantifies the extent to which (local) neural activity patterns can predict the difference between the house and face conditions. It contrasts with small volume correction, which determines whether one experimental condition exhibited a significant neural activity increase or decrease relative to a particular other experimental condition, consistent with the localization agenda. Further, searchlight analysis alleviates the burden of abundant input variables by fitting learning algorithms restricted to the voxels in small sphere neighborhoods. However, the searchlight procedure thus yields many prediction performances for many brain locations, which motivates correction for multiple comparisons across the considered neighborhoods.
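The searchlight loop itself is conceptually simple. The sketch below uses a 1-D strip of voxels, a single hold-out split, and a nearest-centroid classifier purely for compactness; these simplifications are mine, and a real analysis would use 3-D spheres, cross-validation, and a classifier such as a linear support vector machine:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 40 scans x 60 voxels, informative pattern around voxel 30.
n, p, radius = 40, 60, 2
y = np.tile([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, 28:33] += 1.5

train, test = np.arange(30), np.arange(30, n)

def local_accuracy(cols):
    # Train a nearest-centroid classifier on one neighborhood, score held-out scans.
    Xtr, Xte = X[train][:, cols], X[test][:, cols]
    ytr, yte = y[train], y[test]
    mu0, mu1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (((Xte - mu1) ** 2).sum(1) < ((Xte - mu0) ** 2).sum(1)).astype(int)
    return (pred == yte).mean()

# Slide the searchlight: one out-of-sample accuracy per center voxel.
accuracy_map = np.array([
    local_accuracy(np.arange(max(0, v - radius), min(p, v + radius + 1)))
    for v in range(p)
])
```

The resulting accuracy map peaks around the informative neighborhood, which is exactly the voxel-wise classification map that then requires correction across neighborhoods.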

When considering high-dimensional brain scans through the ClSt lens, the statistical challenge resides in solving the multiple-comparisons problem (Nichols and Hayasaka, 2003; Nichols, 2012). From the StLe stance, however, it is the curse of dimensionality and overfitting that statistical analyses need to tackle (Friston et al., 2008; Domingos, 2012). Many neuroimaging analyses based on ClSt methods can be viewed as testing a particular hypothesis (i.e., the null hypothesis) repeatedly in a large number of separate voxels. In contrast, testing whether a learning algorithm extrapolates to new brain data can be viewed as searching through thousands of different hypotheses in a single process (i.e., walking through the hypothesis space; cf. above) (Shalev-Shwartz and Ben-David, 2014).

As common brain scans offer measurements of >100,000 brain locations, a mass-univariate GLM analysis typically entails the same statistical test being applied >100,000 times. The more often the investigator tests a hypothesis of relevance for a brain location, the more locations will be falsely detected as relevant (false positives, Type I error), especially in the noisy neuroimaging data. All dimensions in the brain data (i.e., voxel variables) are implicitly treated as equally important and no neighborhoods of most expected variation are statistically exploited (Hastie et al., 2001). Hence, the absence of restrictions on observable structure in the set of data variables during the statistical modeling of neuroimaging data takes a heavy toll at the final inference step. This is where random field theory comes to the rescue. As noted above, this form of topological inference dispenses with the problem of inferring which voxels are significant and tries to identify significant topological features in the underlying distributed responses. By definition, topological features like maxima are sparse events and can be thought of as a form of dimensionality reduction, not in data space but in the statistical characterization of where neural responses occur.
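For concreteness, two of the classical correction regimes named above, Bonferroni family-wise error control and Benjamini-Hochberg FDR control, can be applied to a vector of simulated voxel-wise p-values (the counts and effect strengths are invented):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical p-values from 10,000 voxel-wise tests; 100 voxels carry true effects.
m = 10_000
p_signal = rng.uniform(size=100) * 1e-6        # very small p-values for true effects
p_null = rng.uniform(size=m - 100)             # uniform p-values under H0
pvals = np.concatenate([p_signal, p_null])

alpha = 0.05
# Bonferroni: control the chance of even one false positive across all m tests.
bonferroni = pvals < alpha / m

# Benjamini-Hochberg: control the expected fraction of false discoveries.
order = np.argsort(pvals)
thresholds = alpha * np.arange(1, m + 1) / m   # step-up thresholds per rank
passed = pvals[order] <= thresholds
k = passed.nonzero()[0].max() + 1 if passed.any() else 0
fdr = np.zeros(m, dtype=bool)
fdr[order[:k]] = True                           # reject the k smallest p-values
```

Because its per-rank thresholds are never smaller than alpha/m, the FDR procedure rejects at least everything Bonferroni does, trading stricter error control for sensitivity.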

This is contrasted by the high-dimensional StLe regime, where the initial model family chosen by the investigator determines the complexity restrictions on all data dimensions (i.e., all voxels, not single voxels) that are imposed explicitly or implicitly by the model structure. Model choice predisposes existing but unknown low-dimensional neighborhoods in the full voxel space to achieve the prediction task. Here, the toll is taken at the beginning of the investigation because there are so many different alternative model choices that would impose a different set of complexity constraints on the high-dimensional measurements in the brain. For instance, signals from "brain regions" are likely to be well approximated by models that impose discrete, locally constant compartments on the data (e.g., k-means or spatially constrained Ward clustering). Instead, tuning model choice to signals from macroscopical "brain networks" should impose overlapping, locally continuous data compartments (e.g., independent component analysis or sparse principal component analysis) (Yeo et al., 2014; Bzdok and Yeo, 2017; Bzdok et al., 2017).

Exploiting such effective dimensions in the neuroimaging data (i.e., coherent brain-behavior associations involving many distributed brain voxels) is a rare opportunity to simultaneously reduce the model bias and model variance, despite their typical inverse relationship (Hastie et al., 2001). Model bias relates to prediction failures incurred because the learning algorithm systematically cannot represent certain parts of the underlying relationship between brain scans and experimental conditions (formally, the deviation between the target function and the average function space of the model). Model variance relates to prediction failures incurred by noise in the estimation of the optimal brain-behavior association (formally, the difference between the best-choice input-output relation and the average function space of the model). A model that is too simple to capture a brain-behavior association probably underfits due to high bias. Yet, an overly complex model probably overfits due to high variance. Generally, high-variance approaches are better at approximating the "true" brain-behavior relation (i.e., in-sample model estimation), while high-bias approaches have a higher chance of generalizing the identified pattern to new observations (i.e., out-of-sample model evaluation). The bias-variance tradeoff can be useful in explaining why applications of statistical models intimately depend on (i) the amount of available data, (ii) the typically not known amount of noise in the data, and (iii) the unknown complexity of the target function in nature (Abu-Mostafa et al., 2012).
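The underfitting/overfitting behavior described above can be made tangible with a toy polynomial regression, where an overly rigid and an overly flexible model both generalize worse than an intermediate one; the target function, noise level, and sample sizes are invented for illustration:

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")   # silence rank warnings from the degree-9 fit
rng = np.random.default_rng(5)

def fit_and_score(degree, n_repeats=200):
    """Average train/test MSE of a polynomial fit over many simulated datasets."""
    train_err, test_err = [], []
    for _ in range(n_repeats):
        x = rng.uniform(-1, 1, 15)
        y = np.sin(3 * x) + 0.3 * rng.normal(size=15)
        coef = np.polyfit(x, y, degree)
        x_new = rng.uniform(-1, 1, 100)
        y_new = np.sin(3 * x_new) + 0.3 * rng.normal(size=100)
        train_err.append(np.mean((np.polyval(coef, x) - y) ** 2))
        test_err.append(np.mean((np.polyval(coef, x_new) - y_new) ** 2))
    return np.mean(train_err), np.mean(test_err)

tr0, te0 = fit_and_score(0)   # high bias: underfits train and test alike
tr3, te3 = fit_and_score(3)   # balanced complexity
tr9, te9 = fit_and_score(9)   # high variance: excellent training fit, poor test fit
```

The flexible degree-9 model achieves the lowest training error yet the worst out-of-sample error, which is precisely the in-sample vs. out-of-sample distinction drawn in the paragraph above.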

Learning algorithms that overcome the curse of dimensionality, by extracting coherent patterns from all brain voxels at once, typically incorporate an implicit bias for anisotropic neighborhoods in the data (Hastie et al., 2001;




Bach, 2014; Bzdok et al., 2015). Put differently, prediction models successful in the high-dimensional setting have an in-built specialization for representing types of functions that are compatible with the structure to be uncovered in the brain data. Knowledge embodied in a learning algorithm suited to a particular application domain can better calibrate the sweet spot between underfitting and overfitting. When applying a model without any complexity restrictions to high-dimensional data, generalization becomes difficult to impossible because all directions in the data (i.e., individual brain voxels) are treated equally with isotropic structure. At the root of the problem, all data samples look virtually identical to the learning algorithm in high-dimensional data scenarios (Bellman, 1961). The learning algorithm will not be able to see through the idiosyncrasies in the data, will tend to overfit, and will thus be unlikely to generalize to new observations. Such considerations provide insight into why the multiple-comparisons problem is more often an issue in encoding studies, while overfitting is more closely related to decoding studies (Friston et al., 2008). The juxtaposition of ClSt and StLe views offers insights into why restricting neural data analysis to an ROI with fewer voxels, rather than the whole brain, simultaneously alleviates both the multiple-comparisons problem (ClSt) and the curse of dimensionality (StLe).

As a practical summary, drawing classical inference in neuroimaging data has largely been performed by considering each voxel independently and by massive simultaneous testing of the same null hypothesis in all observed voxels. This has incurred a multiple-comparisons problem difficult enough that common approaches may still be prone to incorrect results (Efron, 2012). In contrast, aiming for generalization of a pattern in high-dimensional neuroimaging data to new observations in the brain incurs the equally challenging curse of dimensionality. Successfully accounting for the high number of input dimensions will probably depend on learning models that impose neurobiologically justified bias and on keeping the variance under control by dimensionality reduction and regularization techniques.

More broadly, asking at what point new neurobiological knowledge arises during ClSt and StLe investigations relies on largely distinct theoretical frameworks that revolve around null-hypothesis testing and statistical learning theory (Figure 4). Both ClSt and StLe methods share the common goal of demonstrating the relevance of a given effect in the data beyond the sample of brain scans at hand. However, the attempt to show successful extrapolation of a statistical relationship to the general population is embedded in different mathematical contexts. Knowledge generation in ClSt and StLe is hence rooted in different notions of statistical inference.

ClSt laid down its most important inferential framework in

the Popperian spirit of critical empiricism (Popper, 1935/2005):

scientiﬁc progress is to be made by continuous replacement of

current hypotheses by ever more pertinent hypotheses using

falsiﬁcation. The rationale behind hypothesis falsiﬁcation is

that one counterexample can reject a theory by deductive

reasoning,whileanyquantityofevidencecannotconﬁrma

given theory by inductive reasoning (Goodman, 1999). The

investigator verbalizes two mutually exclusive hypotheses by

domain-informed judgment. The alternative hypothesis should

be conceived as the outcome intended by the investigator and

to contradict the state of the art of the research topic. The

null hypothesis represents the devil’s advocate argument that

the investigator wants to reject (i.e., falsify) and it should

automatically deduce from the newly articulated alternative

hypothesis. A conventional 5%-threshold (i.e., equating with

roughly two standard deviations) guards against rejection due to

the idiosyncrasies of the sample that are not representative of the

general population. If the data have a probability of ≤5% given the null hypothesis [P(result|H0)], the result is deemed significant. Such a test for statistical significance indicates a difference between two means with a 5% chance of being a false-positive finding.

If the null hypothesis cannot be rejected (which depends on power), then the test yields no conclusive result, rather than a null result (Schmidt, 1996). In this way, classical hypothesis

testing continuously replaces currently embraced hypotheses

explaining a phenomenon in nature by better hypotheses with

more empirical support in a Darwinian selection process. Finally,

Fisher, Neyman, and Pearson intended hypothesis testing as a

marker for further investigation, rather than an oﬀ-the-shelf

decision-making instrument (Cohen, 1994; Nuzzo, 2014).

In StLe instead, answers to how neurobiological conclusions can be drawn from a dataset at hand are provided by the Vapnik–Chervonenkis dimensions (VC dimensions) from statistical

learning theory (Vapnik, 1989, 1996). The VC dimensions of

a pattern-learning algorithm quantify the probability with which

the distinction between the neural correlates underlying the

face vs. house conditions can be captured and used for correct

predictions in new, possibly later acquired brain scans from the

same cognitive experiment (i.e., out-of-sample generalization).

Such statistical approaches implement the inductive strategy to

learn general principles (i.e., the neural signature associated

with given cognitive processes) from a series of exemplary

brain measurements, which contrasts the deductive strategy of

rejecting a certain null hypothesis based on counterexamples

(cf. Tenenbaum et al., 2011; Bengio, 2014; Lake et al., 2015).

The VC dimensions measure how complicated the examined

relationship between brain scans and experimental conditions

could become—in other words, the richness of the representation

which can be instantiated by the used model, the complexity

capacity of its hypothesis space, the “wiggliness” of the decision

boundary used to distinguish examples from several classes, or,

more intuitively, the “currency” of learnability. VC dimensions

are derived from the maximal number of diﬀerent brain scans

that can be correctly detected to belong to either the house

condition or the face condition by a given model. The VC

dimensions thus provide a theoretical guideline for the largest set

of brain scan examples fed into a learning algorithm such that

this model is able to guarantee zero classiﬁcation errors.

As one of the most important results from statistical learning

theory, in any intelligent learning system, the opportunity to

derive abstract patterns in the world by reducing the discrepancy

between prediction error from training data (in-sample estimate)

and prediction error from independent test data (out-of-sample

estimate) decreases with higher model capacity and increases

with the number of available training observations (Vapnik

Frontiers in Neuroscience | www.frontiersin.org 10 September 2017 | Volume 11 | Article 543

Bzdok Two Statistical Cultures in Neuroimaging

FIGURE 4 | Key concepts in classical statistics and statistical learning. Schematic with statistical notions that are relatively more associated with classical statistical methods (left column) or pattern-learning methods (right column). As there is a smooth transition between the classical statistical toolkit and learning algorithms, some notions may be closely associated with both statistical cultures (middle column).

and Kotz, 1982; Vapnik, 1996). In brain imaging, a learning

algorithm is hence theoretically backed up to successfully predict

outcomes in future brain scans with high probability if the

chosen model ignores structure that is overly complicated, such as higher-order non-linearities between many brain voxels, and

if the model is provided with a suﬃcient number of training

brain scans. Hence, the VC dimensions explain why increasing the number of considered brain voxels as input features (i.e., entailing an increased number of model parameters) or using a more sophisticated prediction model requires more training data for successful generalization. Notably, the VC

dimensions (analogous to null-hypothesis testing) are unrelated to the target function, that is, the "true" mechanisms underlying the studied phenomenon in nature. Nevertheless, the VC dimensions

provide justification that a certain learning model can be used to approximate that target function by fitting a model to a collection of input-output pairs. In short, the VC dimensions are among the best frameworks for deriving theoretical error bounds for predictive

models (Abu-Mostafa et al., 2012).

Further, some common invalidations of the ClSt and StLe

statistical frameworks are conceptually related. An often-raised

concern in neuroimaging studies performing classical inference is

double dipping or circular analysis (Kriegeskorte et al., 2009). This

occurs when, for instance, ﬁrst correlating a behavioral measure

with brain activity and then using the identiﬁed subset of brain

voxels for a second correlation analysis with that same behavioral

measurement (Vul et al., 2008; Lieberman et al., 2009). In this

scenario, voxels are submitted to two statistical tests with the

same goal in a nested, non-independent fashion⁵ (Freedman,

1983). This corrupts the validity of the null hypothesis on which

the reported test results conditionally depend. Importantly, this

case of repeating the same statistical estimation with iteratively pruned data selections (on the training data split) is a valid routine in the StLe framework, such as in recursive feature elimination (Guyon et al., 2002; Hanson and Halchenko, 2008).

However, double-dipping or circular analysis in ClSt applications to neuroimaging data has an analog in StLe analyses aiming at

out-of-sample generalization: data-snooping or peeking (Pereira

et al., 2009; Abu-Mostafa et al., 2012; Fithian et al., 2014).

This occurs, for instance, when performing simple (e.g., mean-

centering) or more involved (e.g., k-means clustering) target-

variable-dependent or -independent preprocessing on the entire

dataset when it should instead be applied separately to the training

⁵ "If you torture the data enough, nature will always confess." (Coase, R. H., 1982, How should economists choose?)


and test sets. Data-snooping can lead to overly optimistic cross-

validation estimates and a trained learning algorithm that fails

on fresh data drawn from the same distribution (Abu-Mostafa

et al., 2012). Rather than a corrupted null hypothesis, it is

the error bounds of the VC dimensions that are loosened and,

ultimately, invalidated because information from the concealed

test set inﬂuences model selection on the training set.
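The leakage described above can be made concrete with a minimal numpy sketch on synthetic data (all array shapes and numbers are illustrative, not from any actual study): the same preprocessing step, mean-centering, is fit once on the full dataset (snooping) and once on the training split only (the valid routine), and the resulting test data differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for "wide" neuroimaging features: 40 subjects, 500 voxels.
X = rng.normal(size=(40, 500))
train, test = X[:30], X[30:]

# Data-snooping/peeking: the centering statistic is estimated on ALL subjects,
# so information from the concealed test set leaks into preprocessing.
mean_all = X.mean(axis=0)
test_snooped = test - mean_all

# Leakage-free routine: estimate the statistic on the training split only,
# then apply the frozen transformation to the test split.
mean_train = train.mean(axis=0)
test_clean = test - mean_train

# The two versions of the "same" test data differ; this discrepancy is
# exactly the information leak that loosens out-of-sample guarantees.
leak = np.abs(test_snooped - test_clean).max()
print(f"maximal shift introduced by snooping: {leak:.4f}")
```

The same logic applies to more involved preprocessing such as k-means clustering or feature scaling: any statistic estimated from the test observations contaminates the subsequent model evaluation.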

In sum, statistical inference in ClSt is drawn by using the

entire data at hand to formally test for theoretically guaranteed

extrapolation of an eﬀect to the general population. In stark

contrast, inferential conclusions in StLe are typically drawn by

ﬁtting a model on a larger part of the data at hand (i.e., in-

sample model selection) and empirically testing for successful

extrapolation to an independent, smaller part of the data (i.e.,

out-of-sample model evaluation). As such, ClSt has a focus on

in-sample estimates and explained-variance metrics that measure

some form of goodness of ﬁt, while StLe has a focus on out-of-

sample estimates and prediction accuracy.

CASE STUDY THREE: SIGNIFICANT

GROUP DIFFERENCES AND PREDICTING

THE GROUP OF PARTICIPANTS

Vignette: After isolating the neural correlates underlying face

processing, the neuroimaging investigator wants to examine

their relevance in psychiatric disease. In addition to the 40

healthy participants, 40 patients diagnosed with schizophrenia

are recruited and administered the same experimental paradigm

and set of face and house pictures. In this clinical fMRI study

on group diﬀerences, the investigator wants to explore possible

imaging-derived markers that index deﬁcits in social-aﬀective

processing in patients carrying a diagnosis of schizophrenia.

Question: Can metrics of statistical relevance from ClSt and

StLe be combined to corroborate a given candidate biomarker?

Many investigators in imaging neuroscience share a

background in psychology, biology, or medicine, which

includes training in traditional “textbook” statistics. Many

neuroscientists have thus adopted a natural habit of assessing

the quality of statistical relationships by means of p-values,

eﬀect sizes, conﬁdence intervals, and statistical power. These

are ubiquitously taught and used at many universities, although

they are not the only coherent set of statistical diagnostics

(Figure 5). These outcome metrics from ClSt may for instance

be less familiar to some scientists with a background in

computer science, physics, engineering, or philosophy. As

an equally legitimate and internally coherent, yet less widely

known diagnostic toolkit from the StLe community, prediction

accuracy, precision, recall, confusion matrices, F1 score, and

learning curves can also be used to measure the relevance of

statistical relationships (Abu-Mostafa et al., 2012; Yarkoni and

Westfall, 2016).

On a general basis, applications of ClSt and StLe methods

may not judge ﬁndings on identical grounds (Breiman, 2001;

Shmueli, 2010; Lo et al., 2015). There is an often-overlooked

misconception that models with high explanatory performance necessarily exhibit high predictive performance (Wu et al.,

2009; Lo et al., 2015; Yarkoni and Westfall, 2016). For instance,

brain voxels in ventral visual stream found to well explain the

diﬀerence between face processing in healthy and schizophrenic

participants based on an ANOVA may not in all cases be the

best brain features to train a support vector machine to predict

this group eﬀect in new participants. An important outcome

measure in ClSt is the quantiﬁed signiﬁcance associated with a

statistical relationship between a few variables given a pre-speciﬁed

model. ClSt tends to test for a particular structure in the brain data based on analytical guarantees, in the form of mathematical convergence theorems about approximating the population

properties with increasing sample size. The outcome measure

for StLe is the quantiﬁed generalization of patterns between

many variables or, more generally, the robustness of special

structure in the data (Hastie et al., 2001). In the neuroimaging

literature, reports of statistical outcomes have previously been

noted to confuse diagnostic measures from classical statistics and

statistical learning (Friston, 2012).

For neuroscientists adopting a ClSt culture, computing p-values takes a central position. The p-value denotes the

probability of observing a result at least as extreme as a

test statistic, assuming the null hypothesis is true. Results are

considered signiﬁcant when it is equal to or below a pre-speciﬁed

value, like p=0.05 (Anderson et al., 2000). Under the condition

of suﬃciently high power (cf. below), it quantiﬁes the strength

of evidence against the null hypothesis as a continuous function

(Rosnow and Rosenthal, 1989). Counterintuitively, it is not an immediate judgment on the alternative hypothesis H1 preferred by the investigator (Cohen, 1994; Anderson et al., 2000). P-values also do not speak to the possibility of replication. It is

another important caveat that a ﬁnding in the brain becomes

more statistically signiﬁcant (i.e., lower p-value) with increasing

sample size (Berkson, 1938; Miller et al., 2016).
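As a minimal, self-contained sketch of this routine (the contrast values below are simulated for illustration, not taken from the fMRI study described here), a two-sample t-test yields the p-value that is then compared against the pre-specified alpha:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated contrast values (e.g., face > house activity in one region)
# for 40 healthy participants and 40 patients; the numbers are illustrative.
healthy = rng.normal(loc=0.5, scale=1.0, size=40)
patients = rng.normal(loc=0.0, scale=1.0, size=40)

# Two-sample t-test: probability of a result at least this extreme,
# assuming the null hypothesis of equal group means is true.
t_stat, p_value = stats.ttest_ind(healthy, patients)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# The result is judged against the conventional alpha level; note that
# with larger samples the same true effect yields ever smaller p-values.
significant = p_value <= 0.05
```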

The essentially binary p-value (i.e., signiﬁcant vs. not

signiﬁcant) is therefore often complemented by continuous eﬀect

size measures for the importance of rejecting H0. The eﬀect

size allows the identiﬁcation of marginal eﬀects that pass the

statistical signiﬁcance threshold but are not practically relevant

in the real world. The p-value is a deductive inferential measure,

whereas the eﬀect size is a descriptive measure that follows neither

inductive nor deductive reasoning. The (normalized) eﬀect size

can be viewed as the strength of a statistical relationship—how

much H0 deviates from H1, or the likely presence of an eﬀect

in the general population (Chow, 1998; Ferguson, 2009; Kelley

and Preacher, 2012). This diagnostic measure is often unit-

free, sample-size independent, and typically standardized. As a

property of the actual statistical test, the eﬀect size can be essential

to report for biological understanding, but has diﬀerent names

and takes various forms, such as rho in Pearson correlation, eta² in explained variance, and Cohen's d in diﬀerences between

group averages.
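Cohen's d for a group difference can be computed in a few lines; the sketch below uses simulated group data purely for illustration and follows the standard pooled-standard-deviation definition:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1)
                  + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(2)
healthy = rng.normal(0.5, 1.0, size=40)   # simulated contrast values
patients = rng.normal(0.0, 1.0, size=40)

# Unlike the p-value, d is unit-free and does not grow with sample size.
d = cohens_d(healthy, patients)
print(f"Cohen's d = {d:.2f}")
```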

Additionally, the certainty of a point estimate (i.e., the

outcome is a value) can be expressed by an interval estimate

(i.e., the outcome is a value range) using conﬁdence intervals

(Casella and Berger, 2002). These variability diagnostics indicate

a range of values between which the true value will fall a given proportion of the time (Estes, 1997; Nickerson,


FIGURE 5 | Key differences between measuring outcomes in classical statistics and statistical learning. Ten intuitions on quantifying statistical modeling outcomes that tend to be relatively more true for classical statistical methods (blue) or pattern-learning methods (red). ClSt typically yields point estimates and interval estimates (e.g., p-values, variances, conﬁdence intervals), whereas StLe frequently outputs a function or a program that can yield point and interval estimates on new observations (e.g., the k-means centroids or a trained classiﬁer's decision function can be applied to new data). In many cases, classical inference is a judgment about an entire data sample, whereas a trained predictive model can obtain quantitative answers from a single data point.

2000; Cumming, 2009). Typically, a 95% confidence interval is constructed such that, across repeated samples, it contains the population mean in 19 out of 20 cases. The tighter the confidence

interval, the smaller the variance of the point estimate of the

population parameter in each drawn sample. The estimation of

conﬁdence intervals is inﬂuenced by sample size and population

variability. Confidence intervals may be asymmetrical (a feature ignored under Gaussianity assumptions; Efron, 2012), can be reported for

diﬀerent statistics and with diﬀerent percentage borders. Notably,

they can be used as a viable surrogate for formal tests of statistical

signiﬁcance in many scenarios (Cumming, 2009).
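The textbook t-based interval around a sample mean can be sketched as follows; the per-subject contrast values are simulated and the 95% level is the conventional choice, not a requirement:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
effect = rng.normal(loc=0.4, scale=1.0, size=40)  # simulated per-subject contrasts

mean = effect.mean()
sem = effect.std(ddof=1) / np.sqrt(len(effect))

# 95% confidence interval around the sample mean: across repeated samples,
# intervals constructed this way cover the population mean ~19 times in 20.
t_crit = stats.t.ppf(0.975, df=len(effect) - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

If the interval excludes zero, this plays the role of a two-sided significance test at the corresponding alpha level.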

Confidence intervals can be computed in various data scenarios and statistical regimes, whereas power may be especially meaningful within the culture of classical hypothesis testing (Cohen, 1977, 1992; Oakes, 1986). To estimate power the

investigator needs to specify the true eﬀect size and variance

under H1. The ClSt-minded investigator can then estimate the

probability for rejecting null hypotheses that should be rejected,

at the given threshold alpha and given that H1 is true. A

high power thus ensures that statistically signiﬁcant and non-

signiﬁcant tests indeed reﬂect a property of the population

(Chow, 1998). Intuitively, a small conﬁdence interval around

a relevant eﬀect suggests high statistical power. False negatives

(i.e., Type II errors, beta error) become less likely with higher

power (= 1 − beta error) (cf. Ioannidis, 2005). Concretely, an

underpowered investigation means that the investigator is less

likely to be able to distinguish between H0and H1at the

speciﬁed signiﬁcance threshold alpha. Power calculations depend

on several factors, including signiﬁcance threshold alpha, the

eﬀect size in the population, variation in the population, sample

size n, and experimental design (Cohen, 1992).
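The dependence of power on effect size, alpha, and sample size can be sketched with a normal approximation to the two-sample test (a simplification of the exact noncentral-t computation; the effect size and group size below are illustrative):

```python
import numpy as np
from scipy import stats

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test,
    given the true effect size d and the number of subjects per group."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    ncp = d * np.sqrt(n_per_group / 2)  # noncentrality of the test statistic
    return stats.norm.sf(z_alpha - ncp) + stats.norm.cdf(-z_alpha - ncp)

# With 40 participants per group and a medium effect (d = 0.5), a true
# group difference is detected only roughly 60% of the time.
power = approx_power_two_sample(d=0.5, n_per_group=40)
print(f"approximate power: {power:.2f}")
```

Increasing either the effect size or the group size drives the power toward 1, which is the quantitative content of the factors listed above.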

While neuroimaging studies based on classical statistical inference ubiquitously report p-values and confidence intervals, there have been few reports of effect size in the neuroimaging literature (Kriegeskorte et al., 2010). Effect sizes are, however, necessary to compute power estimates, which explains the even rarer occurrence of power calculations in the neuroimaging literature (Yarkoni and Braver, 2010; but see Poldrack et al., 2017). Given the importance of p-values and eﬀect

sizes, the goal of computing both these useful statistics, such as

for group diﬀerences in the neural processing of face stimuli,

can be achieved based on two independent samples of these

experimental data (especially if some selection process has been

used). One sample would be used to perform statistical inference

on the neural activity change yielding a p-value and one sample

to obtain unbiased eﬀect sizes. Further, it has been previously

emphasized (Friston, 2012) that p-values and eﬀect sizes reﬂect

in-sample estimates in a retrospective inference regime (ClSt).

These metrics ﬁnd an analog in out-of-sample estimates issued

from cross-validation in a prospective prediction regime (StLe).

In-sample eﬀect sizes are typically an optimistic estimate of

the “true” eﬀect size (inﬂated by high signiﬁcance thresholds),

whereas out-of-sample eﬀect sizes are unbiased estimates of the

“true” eﬀect size.

In the high-dimensional scenario, the StLe-minded

investigator analyzing “wide” neuroimaging data in our


case, computing and judging statistical signiﬁcance by p-

values can become challenging (Bühlmann and Van De Geer,

2011; Efron, 2012; James et al., 2013). Instead, classiﬁcation

accuracy on fresh data is a frequently reported performance

metric in neuroimaging studies using learning algorithms. The

classiﬁcation accuracy is a simple summary statistic that captures

the fraction of correct prediction instances among all performed

applications of a ﬁtted model. Basing interpretation on accuracy

alone can be an insuﬃcient diagnostic because it is frequently

inﬂuenced by the number of samples, the local characteristics

of hemodynamic responses, eﬃciency of experimental design,

data folding into train and test sets, and diﬀerences in the

feature number p (Haynes, 2015). A potentially under-exploited data-driven tool in this context is bootstrapping. This archetypical example of a computer-intensive statistical method enables population-level inference of unknown distributions largely

independent of model complexity by repeated random draws

from the neuroimaging data sample at hand (Efron, 1979; Efron

and Tibshirani, 1994). This opportunity to equip various point estimates with an interval estimate of certainty (e.g., the possibly

asymmetrical interval for the “true” accuracy of a classiﬁer) is

unfortunately seldom embraced in neuroimaging today (but

see Bellec et al., 2010; Pernet et al., 2011; Vogelstein et al.,

2014). Besides providing conﬁdence intervals, bootstrapping

can also perform non-parametric null hypothesis testing. This

may be one of few examples of a direct connection between

ClSt and StLe methodology. Alternatively, binomial tests have

been used to obtain a p-value estimate of statistical signiﬁcance

from accuracies and other performance scores (Pereira et al.,

2009; Brodersen et al., 2013; Hanke et al., 2015) in the binary

classiﬁcation setting. It has frequently been employed to reject

the null hypothesis that two categories occur equally often.
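A one-sided version of this test is a one-liner with scipy; the decoder performance below is a made-up example, not a result from the study:

```python
from scipy import stats

# Hypothetical outcome: a two-class decoder labels 62 of 80 left-out
# brain scans correctly, against a 50% chance level for balanced classes.
n_correct, n_total, chance = 62, 80, 0.5

# One-sided binomial test: probability of observing at least this many
# correct predictions if the classifier were merely guessing.
p_value = stats.binom.sf(n_correct - 1, n_total, chance)
print(f"accuracy = {n_correct / n_total:.2f}, p = {p_value:.2e}")
```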

There are however increasing concerns about the validity of this

approach if statistical independence between the performance

estimates (e.g., prediction accuracies from each cross-validation

fold) is in question (Pereira and Botvinick, 2011; Noirhomme

et al., 2014; Jamalabadi et al., 2016). Yet another option to derive

p-values from classiﬁcation performances of two groups is label

permutation based on non-parametric resampling procedures

(Nichols and Holmes, 2002; Golland and Fischl, 2003). This

algorithmic signiﬁcance-testing tool can serve to reject the null

hypothesis that the neuroimaging data do not contain relevant

information about the group labels in many complex data

analysis settings.
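A minimal label-permutation scheme of this kind can be sketched with numpy on synthetic data (group sizes, feature count, and the simple nearest-centroid classifier are all illustrative stand-ins): the classification accuracy is re-estimated many times after shuffling the diagnosis labels, and the observed accuracy is located within that null distribution.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic cohort: 40 + 40 participants, 50 features, with a true
# group difference of 0.4 on every feature (purely illustrative).
X = np.vstack([rng.normal(0.0, 1.0, (40, 50)),
               rng.normal(0.4, 1.0, (40, 50))])
y = np.array([0] * 40 + [1] * 40)

def split_half_accuracy(X, y, rng):
    """Accuracy of a nearest-centroid classifier on a random half split."""
    idx = rng.permutation(len(y))
    train, test = idx[:40], idx[40:]
    c0 = X[train][y[train] == 0].mean(axis=0)
    c1 = X[train][y[train] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[test] - c1, axis=1)
            < np.linalg.norm(X[test] - c0, axis=1)).astype(int)
    return (pred == y[test]).mean()

observed = split_half_accuracy(X, y, rng)

# Null distribution: shuffling the group labels destroys any true link
# between brain data and diagnosis; re-estimating the accuracy under
# permuted labels shows what chance-level performance looks like here.
null_acc = np.array([split_half_accuracy(X, rng.permutation(y), rng)
                     for _ in range(500)])
p_value = (np.sum(null_acc >= observed) + 1) / (len(null_acc) + 1)
print(f"observed accuracy = {observed:.2f}, permutation p = {p_value:.3f}")
```

Because the null distribution is built from the data themselves, this procedure makes no parametric assumption about how accuracies are distributed under chance.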

The neuroscientist who adopted a StLe culture is in the habit

of corroborating prediction accuracies using cross-validation: the

de facto standard to obtain an unbiased estimate of a model’s

capacity to generalize beyond the brain scans at hand (Hastie

et al., 2001; Bishop, 2006). Model assessment is commonly done

by training on a bigger subset of the available data (i.e., training

set for in-sample performance) and subsequent application of the

trained model to the typically smaller remaining part of data

(i.e., test set for out-of-sample performance), both assumed to

be drawn from the same distribution. Cross-validation typically

divides the sample into data splits such that the class label

(i.e., healthy vs. schizophrenic) of each data point is to be

predicted once. The pairs of model-predicted label and the

corresponding true label for each data point (i.e., brain scan)

in the dataset can then be submitted to the quality measures

(Powers, 2011), including prediction accuracy (inversely related

to prediction error), precision, recall, and F1 score. Accuracy and

the other performance metrics are often computed separately

on the training set and the test set. Additionally, the measures

from training and testing can be expressed by their inverse

(e.g., training error as in-sample error and test error as out-

of-sample error) because the positive and negative cases are

interchangeable.
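The cross-validation loop itself is short; the sketch below uses simulated data and a deliberately simple nearest-centroid classifier (both illustrative assumptions) to contrast the in-sample and out-of-sample accuracy estimates:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in: 40 healthy and 40 schizophrenic participants,
# 100 features per brain scan, with a modest true group difference.
X = np.vstack([rng.normal(0.0, 1.0, (40, 100)),
               rng.normal(0.3, 1.0, (40, 100))])
y = np.array([0] * 40 + [1] * 40)

def fit_centroids(X, y):
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict(X, c0, c1):
    return (np.linalg.norm(X - c1, axis=1)
            < np.linalg.norm(X - c0, axis=1)).astype(int)

# 5-fold cross-validation: every scan's label is predicted exactly once
# by a model that never saw that scan during training.
order = rng.permutation(len(y))
out_of_sample = np.empty(len(y), dtype=int)
for fold in np.array_split(order, 5):
    train = np.setdiff1d(order, fold)
    c0, c1 = fit_centroids(X[train], y[train])
    out_of_sample[fold] = predict(X[fold], c0, c1)

# In-sample accuracy (model fitted and evaluated on the same data) is
# typically an optimistic estimate relative to the cross-validated one.
in_sample = predict(X, *fit_centroids(X, y))
print(f"in-sample accuracy:     {(in_sample == y).mean():.2f}")
print(f"out-of-sample accuracy: {(out_of_sample == y).mean():.2f}")
```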

The classiﬁcation accuracy can be further decomposed into

group-wise metrics based on the so-called confusion matrix, the juxtaposition of the true and predicted group memberships. The precision measures (Table 1) how many of the labels predicted

from brain scans are correct, that is, how many participants

predicted to belong to a certain class really belong to that class.

Put diﬀerently, among the participants predicted to suﬀer from

schizophrenia, how many have really been diagnosed with that

disease? On the other hand, the recall measures how many labels

are correctly predicted, that is, how many members of a class

were predicted to really belong to that class. Hence, among the

participants known to be aﬀected by schizophrenia, how many

were actually detected as such? Precision can be viewed as a

measure of “exactness” and recall as a measure of “completeness”

(Powers, 2011).
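Written out from hypothetical confusion-matrix counts (the numbers are invented for illustration), these definitions mirror the formulas in Table 1:

```python
# Hypothetical confusion-matrix counts for the schizophrenia ("positive")
# class in 80 test participants; the numbers are made up for illustration.
tp, fp, fn, tn = 30, 5, 10, 35

precision = tp / (tp + fp)    # "exactness": of those predicted ill, how many are ill?
recall = tp / (tp + fn)       # "completeness": of those actually ill, how many found?
specificity = tn / (tn + fp)  # correctly identified healthy participants
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision = {precision:.2f}, recall = {recall:.2f}, "
      f"specificity = {specificity:.2f}, accuracy = {accuracy:.2f}")
```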

Neither accuracy, precision, nor recall allows injecting subjective importance into the evaluation process of the learning algorithm.

This disadvantage is addressed by the Fbeta score: a weighted combination of the precision and recall prediction scores. Concretely, the F1 score equally weighs precision and recall of class predictions, while the F0.5 score puts more emphasis on precision and the F2 score more on recall. Moreover, applications

of recall, precision, and Fbeta scores have been noted to ignore the

true negative cases as well as to be highly susceptible to estimator

bias (Powers, 2011). Needless to say, no single outcome metric

can be equally optimal in all contexts.
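The weighting behavior of the Fbeta score follows directly from its standard definition as a weighted harmonic mean; the precision and recall values below are arbitrary illustrations:

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall: beta < 1 favors
    precision, beta > 1 favors recall, beta = 1 weighs both equally."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.6  # illustrative precision and recall of some classifier
print(f"F0.5 = {fbeta(p, r, 0.5):.3f}")  # emphasizes precision
print(f"F1   = {fbeta(p, r, 1.0):.3f}")  # balances both
print(f"F2   = {fbeta(p, r, 2.0):.3f}")  # emphasizes recall
```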

Extending from the setting of healthy-diseased classiﬁcation

to the multi-class setting (e.g., comparing healthy, schizophrenic,

bipolar, and autistic participants) injects ambiguity into the

interpretation of accuracy scores. Rather than reporting mere

better-than-chance ﬁndings in StLe analyses, it becomes more

important to evaluate the F1, precision and recall scores for

each class to be predicted in the brain scans (e.g., Brodersen

et al., 2011b; Schwartz et al., 2013). It is important to appreciate

that the sensitivity/speciﬁcity metrics, perhaps more frequently

reported in ClSt communities, and the precision/recall metrics,

probably more frequently reported in StLe communities, tell

slightly diﬀerent stories about identical neuroscientiﬁc ﬁndings.

TABLE 1 |

Notion Formula

Specificity true negative/(true negative + false positive)

Sensitivity/Recall true positive/(true positive + false negative)