Bayesian Model Selection for Group Studies

Klaas Enno Stephan (1,2), Will D. Penny (1), Jean Daunizeau (1), Rosalyn J. Moran (1), and Karl J. Friston (1)

(1) Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, London, UK

(2) Laboratory for Social and Neural Systems Research, Institute for Empirical Research in Economics, University of Zurich, Switzerland

Abstract

Bayesian model selection (BMS) is a powerful method for determining the most likely among a

set of competing hypotheses about the mechanisms that generated observed data. BMS has

recently found widespread application in neuroimaging, particularly in the context of dynamic

causal modelling (DCM). However, so far, combining BMS results from several subjects has

relied on simple (fixed effects) metrics, e.g. the group Bayes factor (GBF), that do not account for

group heterogeneity or outliers. In this paper, we compare the GBF with two random effects

methods for BMS at the between-subject or group level. These methods provide inference on

model-space using a classical and Bayesian perspective respectively. First, a classical (frequentist)

approach uses the log model evidence as a subject-specific summary statistic. This enables one to

use analysis of variance to test for differences in log-evidences over models, relative to inter-

subject differences. We then consider the same problem in Bayesian terms and describe a novel

hierarchical model, which is optimised to furnish a probability density on the models themselves.

This new variational Bayes method rests on treating the model as a random variable and

estimating the parameters of a Dirichlet distribution which describes the probabilities for all

models considered. These probabilities then define a multinomial distribution over model space,

allowing one to compute how likely it is that a specific model generated the data of a randomly

chosen subject as well as the exceedance probability of one model being more likely than any

other model. Using empirical and synthetic data, we show that optimising a conditional density of

the model probabilities, given the log-evidences for each model over subjects, is more informative

and appropriate than both the GBF and frequentist tests of the log-evidences. In particular, we

found that the hierarchical Bayesian approach is considerably more robust than either of the other

approaches in the presence of outliers. We expect that this new random effects method will prove

useful for a wide range of group studies, not only in the context of DCM, but also for other

modelling endeavours, e.g. comparing different source reconstruction methods for EEG/MEG or

selecting among competing computational models of learning and decision-making.

Keywords

Random effects; variational Bayes; hierarchical models; model evidence; Bayes factor; model

comparison; dynamic causal modelling; DCM; fMRI; EEG; MEG; source reconstruction

Address for correspondence: Klaas Enno Stephan, Wellcome Trust Centre for Neuroimaging, Institute of Neurology, UCL, 12 Queen

Square, London, UK, WC1N 3BG, Tel (44) 20 7833 7472, Fax (44) 20 7813 1420, Email k.stephan@fil.ion.ucl.ac.uk.

Software note

The method described in this paper is freely available to the community as part of the open-source software package Statistical

Parametric Mapping (SPM8; http://www.fil.ion.ucl.ac.uk/spm).

Europe PMC Funders Group

Author Manuscript

Neuroimage. Author manuscript; available in PMC 2009 July 15.

Published in final edited form as:

Neuroimage. 2009 July 15; 46(4): 1004–1017. doi:10.1016/j.neuroimage.2009.03.025.

Europe PMC Funders Author Manuscripts

Introduction

Model comparison and selection is central to the scientific process, in that it allows one to

evaluate different hypotheses about the way data are caused (Pitt & Myung 2002). Nearly all

scientific reporting rests upon some form of model comparison, which represents a

probabilistic statement about the beliefs in one hypothesis relative to some other(s), given

observations or data. The fundamental Neyman-Pearson lemma states that the best statistic

upon which to base model selection is simply the probability of observing the data under one

model, divided by the probability under another model (Neyman & Pearson 1933). This is

known as a likelihood ratio; its logarithm is the log-likelihood ratio. In a classical (frequentist) setting, the distribution of the

log-likelihood ratio, under the null hypothesis that there is no difference between models,

can be computed relatively easily for some models. Common examples include Wilks'

Lambda for linear multivariate models and the F- and t-statistics for univariate models. In a

Bayesian setting, the equivalent to the likelihood ratio is the evidence ratio, which is

commonly known as a Bayes factor (Kass & Raftery 1995). An important property of Bayes

factors is that they can deal with both nested and non-nested models. In contrast,

frequentist model comparison can be seen as a special case of Bayes factors where, under

certain hierarchical restrictions on the models, their null distribution is readily available.

In this paper, we will consider the general case of how to use the model evidence for

analyses at the group level, without putting any constraints on the models compared. These

models can be nonlinear, possibly dynamic and, critically, do not necessarily bear a

hierarchical relationship to each other, i.e. they are not necessarily nested. The application

domain we have in mind is the comparison of dynamic causal models (DCMs) for fMRI or

electrophysiological data (Friston et al. 2003; Stephan et al. 2007a) that have been inverted

for each subject. However, the theoretical framework described in this paper can be applied

to any model, for example when comparing different source reconstruction methods for

EEG/MEG or selecting among competing computational models of learning and decision-

making.

This paper is structured as follows. First, to ensure this paper is self-contained, particularly

for readers without an in-depth knowledge of Bayesian statistics, we summarise the concept

of log-evidence as a measure of model goodness and review commonly used approximations

to it, i.e. the Akaike Information Criterion (AIC; Akaike 1974), the Bayesian Information

Criterion (BIC; Schwarz 1978), and the negative free-energy (F). These approximations

differ in how they trade off model fit against model complexity. Given any of these

approximations to the log-evidence, we then consider model comparison at the group level.

We address this issue both from a classical and Bayesian perspective. First, in a frequentist

setting, we consider classical inference on the log-evidences themselves by treating them as

summary statistics that reflect the evidence for each model for a given subject.

Subsequently, using a hierarchical model and variational Bayes (VB), we describe a novel

technique for inference on the conditional density of the models per se, given data (or log-

evidences) from all subjects. This rests on treating the model as a random variable and

estimating the parameters of a Dirichlet distribution, which describes the probabilities for all

models considered. These probabilities then define a multinomial distribution over model

space, allowing one to compute how likely it is that a specific model generated the data of a

subject chosen at random.

We compare and contrast these random effects approaches to the conventional use of the

group Bayes factor (GBF), an approach for model comparison at the between-subject level

that has been used extensively in previous group studies in neuroimaging. For example, the

GBF has been used frequently to decide between competing dynamic causal models fitted to

fMRI (Acs & Greenlee 2008; Allen et al. 2008; Grol et al. 2007; Heim et al. 2008; Kumar et

al. 2007; Leff et al. 2008; Smith et al. 2006; Stephan et al. 2007b, 2007c; Summerfield &

Koechlin 2008) and EEG data (Garrido et al. 2007, 2008). While the GBF is a simple and

straightforward index for model comparison at the group level, it assumes that all subjects’

data are generated by the same model (i.e. a fixed effects approach) and can be influenced

adversely by violations of this assumption.

The novel Bayesian framework presented in this paper does not suffer from these

shortcomings: it can quantify the probability that a particular model generated the data for

any randomly selected subject, relative to other models, and it is robust to the presence of

outliers. In the analyses below, we illustrate the advantages of this new approach using

synthetic and empirical data. We show that computing a conditional density of the model

probabilities, given the log-evidences for all subjects, can be superior to both the GBF and

frequentist tests applied to the log-evidences. In particular, we found that our Bayesian

approach is markedly more robust than either of the other approaches in the presence of

outlying subjects.

Methods

THE MODEL EVIDENCE AND ITS APPROXIMATIONS

The model evidence p(y | m) is the probability of obtaining observed data y given a

particular model m. It can be considered the holy grail of any model inversion and is

necessary to compare different models or hypotheses. The evidence for some models can be

computed relatively easily (e.g., for linear models); however, in general, computing the

model evidence entails integrating out any dependency on the model parameters ϑ:

$$ p(y \mid m) = \int p(y \mid \vartheta, m)\, p(\vartheta \mid m)\, d\vartheta \tag{1} $$

In many cases, this integration is analytically intractable and numerically difficult to

compute. Usually, it is therefore necessary to use computationally tractable approximations

to the model evidence (or the log-evidence; see footnote 1). A detailed description of some of the most

common approximations is given in Appendix A.

A systematic evaluation of the relative usefulness of different approximations to the log-

evidence is not the focus of this paper and will be presented in forthcoming work. This

article deals with a different question, namely: Given a particular approximation to the log-

evidence and a number of inverted models, how can we infer which of several competing

models is most likely to have generated the data from a group of subjects? In other words,

how can we make inference on model space at the group level, taking into account potential

heterogeneity across the group?

INFERENCE ON MODEL SPACE

In this section, we consider inference at the group level, using subject-specific model-

evidences obtained by inverting a generative model for each subject. We will first describe a

classical approach, testing the null hypothesis that there are no differences among the

relative log-evidences for various models over subjects. We then move on to more formal

Bayesian inference on model space per se. In contrast to the GBF, which, as described

above, represents a fixed effects analysis, both the classical and Bayesian approaches are

random effects procedures and thus consider inter-subject heterogeneity explicitly.

Footnote 1: Due to the monotonic nature of the logarithmic function, model comparisons yield equivalent results regardless of whether one

maximises the model evidence or the log-evidence. Since the latter is numerically easier, it is usually the preferred metric.

Classical (frequentist) inference—A straightforward random effects procedure to

evaluate the between-subject consistency of evidence for one model relative to others is to

use the log-evidences across subjects as the basis for a classical log-likelihood ratio statistic,

testing the null hypothesis that no single model is better (in terms of their log-evidences)

than any other. This essentially involves performing an ANOVA, using the log-evidence as

a summary statistic of model adequacy for each subject. This ANOVA then compares the

differences among models to the differences among subjects with a classical F-statistic. If

this statistic is significant one can then compare the best model with the second best using a

post hoc t-test. Effectively, this tests for differences between models that are consistent and

large in relation to differences within models over subjects. The most general

implementation would be a repeated-measures ANOVA, where the log-evidences for the

different models represent the repeated measure. At its simplest, the comparison of just two

models over subjects reduces to a simple paired t-test on the log-evidences (or a one-sample

t-test on the log-evidence differences). Log-evidences tend to be fairly well behaved, and the

residuals of a simple ANOVA model, or tests of normality like Kolmogorov-Smirnov,

usually indicate that parametric assumptions are appropriate. In those cases when they are

not, e.g. due to outlier subjects, one can use robust regression methods that are less sensitive

to violations of normality (Diedrichsen et al. 2005; Wager et al. 2005) or non-parametric

tests that do not make any distributional assumptions (e.g. a Wilcoxon signed rank test; see

one of our examples below).
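As a minimal sketch of this classical procedure for the two-model case, the Python snippet below applies a paired t-test and a Wilcoxon signed rank test to subject-wise log-evidences. The values in `log_ev` are invented purely for illustration (they are not taken from any data set analysed in this paper), and the use of SciPy is an assumption on our part; any statistics package offering these tests would serve equally well.

```python
import numpy as np
from scipy import stats

# Hypothetical log-evidences: one row per subject, one column per model.
log_ev = np.array([
    [-1210.3, -1214.9],
    [-1187.6, -1190.2],
    [-1302.1, -1301.4],
    [-1255.0, -1259.8],
    [-1199.4, -1203.7],
    [-1240.2, -1243.1],
    [-1221.8, -1226.0],
    [-1275.5, -1277.3],
])

# Subject-wise log-evidence differences (log Bayes factors, m1 vs m2).
diff = log_ev[:, 0] - log_ev[:, 1]

# Paired t-test on the log-evidences ...
t_paired, p_paired = stats.ttest_rel(log_ev[:, 0], log_ev[:, 1])
# ... which is equivalent to a one-sample t-test on the differences.
t_one, p_one = stats.ttest_1samp(diff, 0.0)

# Non-parametric alternative that does not assume normality.
w_stat, p_wilcoxon = stats.wilcoxon(diff)

print(f"paired t = {t_paired:.3f}, two-tailed p = {p_paired:.4f}")
print(f"Wilcoxon W = {w_stat:.1f}, two-tailed p = {p_wilcoxon:.4f}")
```

Note that both SciPy calls return two-tailed p-values by default; a one-tailed test, as used in some of the analyses reported later, would require setting the `alternative` argument accordingly.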

This classical random effects approach is simple to implement, straightforward and easily

interpreted. In this sense, there seems little reason not to use it. However, as shown in the

empirical examples below, this type of inference can be affected markedly by group

heterogeneity, even when the distribution of log-evidence differences is normal. A more

robust analysis obtains by quantifying the density on model space itself, using a Bayesian

approach as described in the next section.

Bayesian inference on model space—Previously, we have suggested the use of a

group Bayes factor (GBF) that is simply the product of Bayes factors over N subjects

(Stephan et al. 2007b). This is equivalent to a fixed effects analysis that rests on multiplying

the likelihoods over subjects to furnish the probability of the multi-subject data, conditioned

on each model:

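$$ GBF_{i,j} = \frac{p(y_1, \ldots, y_N \mid m_i)}{p(y_1, \ldots, y_N \mid m_j)} = \prod_{n=1}^{N} BF_{i,j}^{(n)}, \qquad BF_{i,j}^{(n)} = \frac{p(y_n \mid m_i)}{p(y_n \mid m_j)} \tag{2} $$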

Here, the subscripts i,j refer to the models being compared, and the bracketed superscript

refers to the n-th subject. The reason one can simply multiply the probabilities (or add the

log-evidences) is that the measured data can be regarded as conditionally independent

samples over subjects. However, this does not represent a formal evaluation of the

conditional density of a particular model given data from all subjects. Furthermore, it rests

upon a very particular generative model for group data: first, select one of K models from a

multinomial distribution and then generate data, under this model, for each of the N

subjects. This is fundamentally different from a generative model which treats subjects as

random effects: here we would select a model for each subject by sampling from a

multinomial distribution, and then generate data under that subject-specific model. The

distinction between these two generative models is illustrated graphically in Figure 1.
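To make the fixed effects logic concrete, the following Python sketch computes a log GBF by summing subject-wise log-evidence differences. The numbers are hypothetical and chosen only to illustrate how a single extreme subject can dominate the sum; they do not reproduce any result in this paper.

```python
import numpy as np

# Hypothetical log-evidence differences ln p(y_n|m1) - ln p(y_n|m2)
# for 12 subjects: 11 favour m1, one outlier strongly favours m2.
log_bf = np.array([3.0] * 11 + [-36.0])

# Fixed effects: the GBF is the product of subject-wise Bayes factors,
# i.e. the log GBF is the sum of subject-wise log Bayes factors.
log_gbf = log_bf.sum()

n_favouring_m1 = int((log_bf > 0).sum())

print(f"log GBF (m1 vs m2) = {log_gbf:.1f}")
print(f"GBF in favour of m2 = {np.exp(-log_gbf):.1f}")
print(f"subjects favouring m1: {n_favouring_m1} of {log_bf.size}")
```

In this constructed case the GBF favours m2 although eleven of twelve subjects individually favour m1, which is the kind of behaviour a random effects treatment is meant to avoid.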

In short, the GBF encodes the relative probability that the data were generated by one model

relative to another, assuming the data were generated by the same model for all subjects.

What we often want, however, is the density from which models are sampled to generate

subject-specific data. In other words, we seek the conditional estimates of the multinomial

parameters, i.e. the model probabilities $r = [r_1, \ldots, r_K]$, that generate switches or indicator

variables, $m_n = [m_{n1}, \ldots, m_{nK}]$, where $m_{nk} \in \{0, 1\}$ for any given subject $n \in \{1, \ldots, N\}$, and only

one of these switches is equal to one; i.e., $\sum_k m_{nk} = 1$. These indicator variables prescribe the

model for the n-th subject, where $p(m_{nk} = 1) = r_k$. In the following, we describe a hierarchical

Bayesian model that can be inverted to obtain an estimate of the posterior density over r.

A variational Bayesian approach for inferring model probabilities—We will deal

with K models with probabilities r=[r1,...,rK] that are described by a Dirichlet distribution

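$$ p(r \mid \alpha) = \mathrm{Dir}(r; \alpha) = \frac{1}{Z(\alpha)} \prod_k r_k^{\alpha_k - 1}, \qquad Z(\alpha) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma\left(\sum_k \alpha_k\right)} \tag{3} $$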

Here, α=[α1,...,αK] are related to the unobserved “occurrences” of models in the population;

i.e. $\alpha_k - 1$ can be thought of as the effective number of subjects in which model k generated

the observed data. Given the probabilities r, the distribution of the multinomial variable mn

describes the probability that model k generated the data of subject n:

$$ p(m_n \mid r) = \prod_k r_k^{m_{nk}} \tag{4} $$

For any given subject n, we can sample from this multinomial distribution to obtain a

particular model k. The marginal likelihood of the data in the n-th subject, given this model

k, is then obtained by integrating over the parameters of the model selected

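$$ p(y_n \mid m_{nk}) = \int p(y_n \mid \vartheta, m_{nk})\, p(\vartheta \mid m_{nk})\, d\vartheta \tag{5} $$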

The graphical model summarising the dependencies among r, m and y as described by

Equations 3-5 is shown in Figure 1B and 1C. Our goal is to invert this hierarchical model

and estimate the posterior distribution over r.
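The difference between the fixed effects and random effects generative models can be illustrated by sampling the model indicators, as in the Python sketch below. The numbers of models and subjects, the seed and all variable names are our own illustrative choices. The final step of each generative model, generating data under the selected model (Equation 5), is deliberately left out because it depends on the particular models being compared.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only

K, N = 3, 20                     # hypothetical numbers of models and subjects
alpha0 = np.ones(K)              # flat Dirichlet prior, alpha_0 = [1, ..., 1]

# Model probabilities r drawn from the Dirichlet density (Equation 3).
r = rng.dirichlet(alpha0)

# Fixed effects: one model is selected once and shared by all subjects.
k_ffx = rng.choice(K, p=r)
m_ffx = np.tile(np.eye(K, dtype=int)[k_ffx], (N, 1))

# Random effects: each subject draws its own model from Mult(1, r)
# (Equation 4), giving an N x K matrix of indicator variables m_nk.
m_rfx = rng.multinomial(1, r, size=N)

# Each row is a one-hot vector m_n; column sums count subjects per model.
print("r =", np.round(r, 3))
print("FFX subjects per model:", m_ffx.sum(axis=0))
print("RFX subjects per model:", m_rfx.sum(axis=0))
```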

Given the structure of the hierarchical model in Figure 1, the joint probability of the

parameters and the data y can be written as:

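$$ p(y, r, m) = p(y \mid m)\, p(m \mid r)\, p(r \mid \alpha_0) = p(r \mid \alpha_0) \prod_n p(y_n \mid m_n)\, p(m_n \mid r) = \frac{1}{Z(\alpha_0)} \left[\prod_k r_k^{\alpha_{0k} - 1}\right] \left[\prod_n \prod_k \left(p(y_n \mid m_{nk})\, r_k\right)^{m_{nk}}\right] \tag{6} $$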

The log joint probability is therefore given by

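$$ \ln p(y, r, m) = -\ln Z(\alpha_0) + \sum_k (\alpha_{0k} - 1) \ln r_k + \sum_n \sum_k m_{nk} \left[\ln p(y_n \mid m_{nk}) + \ln r_k\right] \tag{7} $$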

The inversion of our hierarchical model relies on the following variational Bayesian (VB)

approach in which we assume that an approximate posterior density q can be described by

the following mean-field factorisation:

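$$ q(r, m) = q(r)\, q(m), \qquad q(r) \propto \exp(I(r)), \qquad q(m) \propto \exp(I(m)), $$

$$ I(r) = \langle \ln p(y, r, m) \rangle_{q(m)}, \qquad I(m) = \langle \ln p(y, r, m) \rangle_{q(r)} \tag{8} $$

Here, I(r) and I(m) are variational energies for the mean-field partition. The mean-field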

assumption in Equation 8 means that the VB posterior will only be approximate but, as we

shall see, it provides a particularly simple and intuitive algorithm (c.f. Equation 14). This

algorithm provides precise estimates of the parameters α defining the approximate Dirichlet

posterior q(r) ≈ p(r | y); this was verified by comparisons with a sampling method which is

described in Appendix B.

To obtain the approximate posterior q(m) ≈ p(m | y), we have to do two things: first,

compute I(m) and second, determine the normalising constant or partition function for

exp(I(m)), which renders q(m) a probability density. Making use of the log joint probability

in Equation 7 and omitting terms that do not depend on m, the variational energy is:

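$$ I(m) = \int q(r) \ln p(y, r, m)\, dr = \sum_n \sum_k m_{nk} \left[\ln p(y_n \mid m_{nk}) + \Psi(\alpha_k) - \Psi(\alpha_S)\right] \tag{9} $$

Here $\alpha_S = \sum_k \alpha_k$ and Ψ is the digamma function (see footnote 2):

$$ \Psi(\alpha_k) = \frac{\partial \ln \Gamma(\alpha_k)}{\partial \alpha_k} \tag{10} $$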

The next step is to obtain the approximate posterior, q(m): If gnk is our (normalized)

posterior belief that model k generated the data from subject n, i.e. gnk=q(mnk=1), then

Equation 9 tells us that

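$$ q(m) \propto \exp(I(m)), \qquad g_{nk} = \frac{u_{nk}}{u_n}, \qquad u_{nk} = \exp\left(\ln p(y_n \mid m_{nk}) + \Psi(\alpha_k) - \Psi(\alpha_S)\right), \qquad u_n = \sum_k u_{nk} \tag{11} $$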

where unk is the equivalent (non-normalized) belief and un is the partition function for

exp(I(m)) that ensures that the posterior probabilities sum to one.

We now repeat the above procedure but this time for the approximate posterior over r. By

substituting in the log joint probability from Equation 7 and omitting terms that do not

depend on r, we have

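$$ I(r) = \langle \ln p(y, r, m) \rangle_{q(m)} = \sum_k \left[(\alpha_{0k} - 1) \ln r_k + \sum_n g_{nk} \ln r_k\right] = \sum_k (\alpha_{0k} + \beta_k - 1) \ln r_k \tag{12} $$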

Footnote 2: See Appendix B in Bishop (2006) concerning the use of the digamma function in Equation 10.

Here, $\beta_k = \sum_n g_{nk}$ is the expected number of subjects whose data we believe were generated

by model k. Now, from Equation 8 we have log q(r) = I(r) + ... and from Equation 3 we see

that the log of a Dirichlet density is given by $\ln \mathrm{Dir}(r; \alpha) = \sum_k (\alpha_k - 1) \ln r_k + \ldots$. Hence, by

comparing terms we see that the approximate posterior q(r) = Dir(r; α), where:

$$ \alpha = \alpha_0 + \beta \tag{13} $$

In short, Equation 13 simply adds the ‘data counts’, β, to the ‘prior counts’, α0. This is an

example of a free-form VB approximation, where the optimal form of the approximate

posterior (in this case a Dirichlet), has been derived rather than assumed before-hand (c.f.

fixed-form VB approximations; Friston et al. 2007). It should be stressed, however, that due

to the mean-field assumption used by our VB approach (see Equation 8), q(r) is only an

approximate posterior and the true posterior distribution p(r | y) does not have the exact form

of a Dirichlet distribution.

The above equations can be implemented as an optimisation algorithm which updates

estimates of α iteratively until convergence. By combining Equations 11, 12 and 13 we get

the following pseudo-code of a simple algorithm that gives us the parameters of the

conditional density we seek, i.e. q(r)=Dir(r;α)

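$\alpha = \alpha_0$

Until convergence

    $u_{nk} = \exp\left(\ln p(y_n \mid m_{nk}) + \Psi(\alpha_k) - \Psi\left(\sum_k \alpha_k\right)\right)$

    $\beta_k = \sum_n \dfrac{u_{nk}}{\sum_k u_{nk}}$

    $\alpha = \alpha_0 + \beta$    (14)

end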

We make the usual assumption that, a priori, no models have been “seen” (i.e. the Dirichlet

prior is α0 = [1,...,1]; see footnote 3). Critically, this scheme requires only the log-evidences over models

and subjects (c.f. Equation 11).
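The iterative scheme can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the SPM code: the function name `vb_group_bms`, the convergence criterion (change in α below `tol`), the iteration cap and the subtraction of the row-wise maximum for numerical stability are all our own choices, and the example log-evidences are hypothetical.

```python
import numpy as np
from scipy.special import digamma


def vb_group_bms(log_ev, alpha0=None, tol=1e-4, max_iter=1000):
    """Estimate Dirichlet parameters alpha from an N x K array of log-evidences."""
    log_ev = np.asarray(log_ev, dtype=float)
    n_subjects, n_models = log_ev.shape
    # Flat prior alpha_0 = [1, ..., 1] unless another prior is supplied.
    alpha0 = np.ones(n_models) if alpha0 is None else np.asarray(alpha0, dtype=float)
    alpha = alpha0.copy()

    for _ in range(max_iter):
        # log u_nk = ln p(y_n | m_nk) + Psi(alpha_k) - Psi(sum_k alpha_k)
        log_u = log_ev + digamma(alpha) - digamma(alpha.sum())
        # Subtracting the row-wise maximum leaves g_nk unchanged but
        # avoids numerical underflow when exponentiating (our addition).
        log_u -= log_u.max(axis=1, keepdims=True)
        u = np.exp(log_u)
        g = u / u.sum(axis=1, keepdims=True)   # g_nk = u_nk / sum_k u_nk
        beta = g.sum(axis=0)                   # expected subjects per model
        alpha_new = alpha0 + beta              # prior counts plus data counts
        converged = np.linalg.norm(alpha_new - alpha) < tol
        alpha = alpha_new
        if converged:
            break
    return alpha, g


# Hypothetical example: 11 subjects favour m1 by 3 log units,
# one outlier favours m2 by 36 log units.
log_bf = np.array([3.0] * 11 + [-36.0])
log_ev = np.column_stack([log_bf, np.zeros_like(log_bf)])

alpha, g = vb_group_bms(log_ev)
print("alpha =", np.round(alpha, 2))
print("<r>   =", np.round(alpha / alpha.sum(), 3))
```

Because only differences between log-evidences within a subject affect the normalised beliefs g_nk, the second column can be set to zero without loss of generality. In this constructed example the outlier contributes roughly one "count" to m2, rather than dominating the result as it would in a summed log-evidence.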

Using the Dirichlet density p(r | y;α) for model comparison—After the above

optimization of the Dirichlet parameters, α, the Dirichlet density p(r | y;α) can be used for

model comparisons at the group level. There are several ways to report this comparison that

result in equivalent model rankings. The simplest option is to report the

Dirichlet parameter estimates α. Another possibility is to use those estimates to compute the

expected multinomial parameters $\langle r_k \rangle$ and thus the expected likelihood of obtaining the

k-th model, i.e. p(mnk=1|r)=Mult(m;1,r), for any randomly selected subject (see footnote 4):

$$ \langle r_k \rangle = \frac{\alpha_k}{\alpha_1 + \ldots + \alpha_K} \tag{15} $$

Footnote 3: Note that this choice of Dirichlet prior is a “flat” prior, assigning uniform probabilities to all models. In contrast, a Dirichlet prior

with elements below unity results in a highly concave probability density that concentrates the probability mass around zero and one,

respectively.

Footnote 4: For the special case of “drawing” a single “sample” (model), the multinomial distribution of models reduces to p(mnk=1 | r)=rk.

Therefore, for any given subject, rk represents the conditional expectation that the k-th model generated the subject’s data.

A third option is to use the conditional model probability p(r | y;α) to quantify an

exceedance probability, i.e. our belief that a particular model k is more likely than any other

model (of the K models tested), given the group data:

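$$ \forall k \in \{1, \ldots, K\},\ \forall j \in \{1, \ldots, K \mid j \neq k\}: \quad \varphi_k = p(r_k > r_j \mid y; \alpha) \tag{16} $$

The exceedance probabilities $\varphi_k$ sum to one over all models tested. They are particularly

intuitive when comparing two models (or model subsets, see below). In this case, because

the conditional probabilities of the models $r_k$ also sum to one, the exceedance probability

of one model, compared to another, can be written as

$$ \varphi_1 = p(r_1 > r_2 \mid y; \alpha) = p(r_1 > 0.5 \mid y; \alpha) \tag{17} $$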

The analyses of empirical data below include several examples where two models are

compared; the associated exceedance probabilities are shown in Figures 3, 6, 9 and 13.
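Given a set of Dirichlet parameter estimates, the expected model probabilities and exceedance probabilities can be computed along the following lines. This Python sketch rests on two assumptions of ours: for two models it uses the fact that r_1 then follows a Beta(α_1, α_2) distribution, and for more than two models it uses simple Monte Carlo sampling from the Dirichlet density, which is one possible way of evaluating the exceedance probability and not necessarily the one used in SPM. The function names and the values of `alpha` are hypothetical.

```python
import numpy as np
from scipy import stats


def expected_r(alpha):
    """Conditional expectation <r_k> = alpha_k / sum_k alpha_k."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()


def exceedance_two_models(alpha):
    """phi_1 = p(r_1 > 0.5) for K = 2, where r_1 ~ Beta(alpha_1, alpha_2)."""
    a1, a2 = alpha
    return stats.beta.sf(0.5, a1, a2)


def exceedance_sampling(alpha, n_samples=100_000, seed=0):
    """Monte Carlo estimate of phi_k = p(r_k > r_j for all j != k)."""
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha, size=n_samples)   # n_samples x K draws of r
    winners = r.argmax(axis=1)                 # model with the largest r_k
    return np.bincount(winners, minlength=len(alpha)) / n_samples


alpha = np.array([11.0, 3.0])   # hypothetical Dirichlet parameter estimates
exp_r = expected_r(alpha)
phi_beta = exceedance_two_models(alpha)
phi_mc = exceedance_sampling(alpha)

print("<r>          =", np.round(exp_r, 3))
print("phi_1 (Beta) =", round(float(phi_beta), 4))
print("phi (MC)     =", np.round(phi_mc, 4))
```

The closed-form Beta calculation is exact for two models, whereas the sampling estimate carries Monte Carlo error that shrinks as `n_samples` grows; the two should agree closely for the two-model case.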

Either the Dirichlet parameter estimates α, the conditional expectations of model

probabilities $\langle r_k \rangle$ or the exceedance probabilities $\varphi_k$ can be used to rank models at the

group level. In the next section, we present several practical examples of our method,

applying it to both synthetic and empirical data. In this paper, we focus on comparing two

models (or two model subsets) and largely rely on exceedance probabilities when discussing

the results of our analyses. However, for each analysis we also report the estimates of α and

the conditional expectations of model probabilities, $\langle r_k \rangle$; these are shown in the figures.

Model space partitioning—A particular strength of the approach presented in this paper

is that it can not only be used to compare specific models, but also to compare particular

classes or subsets of models, resulting from a partition of model space. For example, one

may want to compute the probability that a specific model attribute, say the presence vs.

absence of a particular connection in a DCM, improves or reduces model performance,

regardless of any other differences among the models considered. This type of inference

rests on comparing two (or more) subsets of model space, pooling information over all

models in these subsets. This effectively removes uncertainty about any aspect of model

structure, other than the attribute of interest (which defines the partition). Heuristically, this

sort of analysis can be considered a Bayesian analogue of tests for “main effects” in classical

ANOVA.

Within our framework this type of analysis can be performed by exploiting the

agglomerative property of the Dirichlet distribution. Generally, for any partition of model

space into J disjoint subsets, N1,N2,...,NJ, this property ensures that

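$$ r_{N_j} = \sum_{k \in N_j} r_k, \qquad \alpha_{N_j} = \sum_{k \in N_j} \alpha_k, \qquad p(r_{N_1}, \ldots, r_{N_J}) = \mathrm{Dir}(\alpha_{N_1}, \ldots, \alpha_{N_J}) \tag{18} $$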

In other words, once we have estimates of the Dirichlet parameters αk for all K models, it is

easy to evaluate the relative importance of different model subspaces: For any given

partition of model space, a new Dirichlet density reflecting this partition can be defined by

simply adding αk for all models k belonging to the same subset. The resulting Dirichlet can

then be used to compare different subsets of model space in exactly the same way as one

compares individual models, e.g. using exceedance probabilities. An example of this

application is shown in Figures 12 and 13.
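A partition-level comparison of this kind amounts to summing Dirichlet parameters within each subset, as the Python sketch below illustrates. The parameter values, the subset labels and the assignment of models to subsets are hypothetical; the Beta-based exceedance probability applies because the partition has exactly two subsets.

```python
import numpy as np
from scipy import stats

# Hypothetical Dirichlet parameter estimates for K = 4 models.
alpha = np.array([6.5, 4.0, 2.5, 3.0])

# Hypothetical partition of model space: models 0 and 1 contain the
# connection of interest, models 2 and 3 do not.
partition = {"with_connection": [0, 1], "without_connection": [2, 3]}

# Subset-level Dirichlet parameters: sum alpha_k over each subset.
alpha_subsets = np.array([alpha[idx].sum() for idx in partition.values()])

# Expected probability of each subset and, for two subsets, the
# exceedance probability p(r_subset1 > 0.5) from the Beta distribution.
expected_r_subsets = alpha_subsets / alpha_subsets.sum()
phi_with = stats.beta.sf(0.5, alpha_subsets[0], alpha_subsets[1])

for name, a, r in zip(partition, alpha_subsets, expected_r_subsets):
    print(f"{name}: alpha = {a:.1f}, <r> = {r:.3f}")
print(f"exceedance probability (with_connection) = {phi_with:.3f}")
```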

Results

In what follows, we compare classical inference, the GBF (fixed effects) and inference on

model space (random effects) using both synthetic and real data. These data have been

previously published and have been analysed in various ways, including group level model

inference using GBFs (Stephan et al. 2007b, 2007c; Stephan et al. 2008).

Synthetic data: nonlinear vs. bilinear modulation

To demonstrate the face validity of our method, we used simulated data, where the true

model was known. Specifically, we used one of the synthetic data sets described by Stephan

et al. (2008), consisting of twenty synthetic BOLD time-series that were generated using a

three-area nonlinear DCM with fixed parameters and adding Gaussian observation noise to

achieve a signal-to-noise ratio (SNR) of two. Each time-series consisted of 100 data points

that were obtained by sampling the model output at a frequency of 1 Hz over a period of 100

seconds. For each time-series, we fitted (i) a nonlinear DCM with the same model structure

as the model that generated the data (“correct model” in Fig. 2, model m1), and (ii) a second

DCM that was similar in structure but included a bilinear (instead of a nonlinear)

modulatory influence (“incorrect model” in Fig. 2, model m2). Using the negative free-

energy approximation to the log-evidence, the differences in log-evidences for all twenty

time-series are plotted in the lower part of Fig. 2. It can be seen that in 17 out of 20 cases the

nonlinear model was correctly identified as the more likely model. The overall GBF (9 × 10^14)

was also clearly in favour of the correct model.

Here, we revisit this synthetic data set using random effects BMS procedures. We first used

classical inference, applying a paired t-test to the log-evidences of the two models. This test

rejected the null hypothesis of no difference in model goodness (t = 4.615, df = 19, p <

10^-4). Applying the novel hierarchical BMS approach gave an even clearer (and arguably

also more useful) answer: the exceedance probability $\varphi_1$, i.e. the probability of m1 being a

more likely model than m2, was 100% (Figure 3). In other words, using the exceedance

probability as a criterion, the correct model was identified perfectly, given all twenty data

sets and the chosen level of noise. To further corroborate this result, we compared the result

from our VB algorithm to an independent method which estimates the parameters α by

sampling from the approximate Dirichlet posterior q(r) ≈ p(r | y). This comparison showed

that the VB estimate of α resulted in an estimate of the negative free-energy F(y,α) ≤ ln p(y

| α) that was consistent with the results from the sampling approach (Figure 4). This

provides an additional validation of our VB technique. We used this sampling approach to

verify the correctness of our VB estimates in all subsequent analyses.

It should be noted that this simulation study concerned the extreme case that only one model

had generated all data, i.e. r1=100% and r2=0%, making it easy to intuitively understand the

performance of the proposed model selection procedure. However, this simulation did not

probe the robustness of our method when randomly sampling from a heterogeneous

population of subjects whose data had been generated by different models. We will revisit

this scenario in a later section of this paper once we have introduced and compared two

alternative DCMs of inter-hemispheric interactions using empirical data.

Comparing different six-area DCMs of the ventral visual stream

As a first empirical application, we investigated a case we had encountered in our previous

research (Stephan et al. 2007b) and which had actually triggered our interest in developing

more powerful group level inference about models. This model comparison concerned

DCMs describing alternative mechanisms of inter-hemispheric integration in terms of

context-dependent modulation of connections. In one of the analyses of the original report

(Stephan et al. 2007b), competing DCMs had been constructed for the ventral stream of the

visual system by systematically changing which of the experimentally controlled conditions

modulated the intra- and/or the inter-hemispheric connections.

First, we focused on the six-area model of the ventral stream, comprising the lingual gyrus

(LG), middle occipital gyrus (MOG) and fusiform gyrus (FG) in both hemispheres, and

revisited the comparison of the best two models as indexed by the GBF. In the first model, m1,

inter-hemispheric connections were modulated by a letter decision task, but conditional on

the visual field of stimulus presentation (LD|VF); intra-hemispheric connections were

modulated by LD alone (see right side of Figure 5). In the second model, m2, these

modulations were reversed: inter-hemispheric connections were modulated by LD and intra-

hemispheric connections were modulated by LD|VF (see left side of Figure 5). The

distribution of log-evidence differences (approximated by AIC/BIC, following the procedure

suggested by Penny et al. 2004) is shown in the centre of Figure 5: Although m1 was

robustly superior in 11 of the 12 subjects, a single outlier was so extreme that the GBF

indicated an overall superiority of m2 (GBF=15 in favour of m2). In contrast, model

comparison using our novel Bayesian method was not affected by this outlier: the

exceedance probability in favour of m1 was very high ($\varphi_1$ = 99.7%), and the conditional

expectation $\langle r_1 \rangle$ that m1 generated the data of any randomly selected subject was 84.3%

(Figure 6). The estimates of our VB method were confirmed by the sampling approach

(Figure 7).

For comparison, we also applied frequentist statistics to the log-evidences as described

above. The single outlier subject made the distribution of the log-evidence differences non-

normal (Kolmogorov-Smirnov test: p < 10^-7, D_N = 0.822), and thus prevented detection of a

significant difference between the two models by a one-tailed paired t-test (t = 0.073, df =

11, p = 0.471). Given this deviation from normality, we applied a nonparametric Wilcoxon

signed rank test which makes no distributional assumptions; this test was indeed able to find

a significant difference between the models (p = 0.034).

Comparing different four-area DCMs of the ventral visual stream

Next, we investigated a variant of the previous case where the distribution of log-evidences

across subjects was more heterogeneous. This model comparison was essentially identical to

the previous one, except that the models in question only contained four areas (LG and FG

in both hemispheres), instead of six. Visual inspection of the distribution of log-evidence

differences (Figure 8) shows that the same subject as in the previous example favoured m2,

albeit far less strongly; in addition three more subjects showed evidence in favour of m2,

albeit only weakly. Given this constellation, the original analysis by Stephan et al. (2007b)

only found a relatively weak superiority of m1 (GBF = 8). In contrast, the VB method gave an

exceedance probability of φ1 = 92.8% in favour of m1, indicating more clearly that m1 is a

superior model (Figure 9). As above, the estimates of our VB method were confirmed by

sampling (Figure 10).

When comparing this result to the frequentist random effects approach, a one-tailed paired t-

test was unable to detect a significant difference between the two models (t = 0.165, df = 11,

p = 0.436). In contrast to the previous example, this failure was not due to outlier-induced

Neuroimage. Author manuscript; available in PMC 2009 July 15.

deviations from normality: a Kolmogorov-Smirnov test applied to the log-evidences was

unable to reject the null hypothesis that they were normally distributed (p = 0.743). Here, the

between-subject variability, while in accordance with normality assumptions, was simply

too large to reject the null hypothesis with the classical t-test. A nonparametric Wilcoxon

signed rank test did not fare any better (p = 0.266).

Synthetic data: randomly sampling from a heterogeneous population

In a second simulation study, we examined the robustness of our method when randomly

sampling from a heterogeneous population of subjects. Specifically, we dealt with a

population in which 70% of subjects showed brain responses as generated by model m1

shown in Figure 8, whereas brain activity in the remaining 30% of the population was

generated by model m2. We randomly sampled 20 subjects from this population and

generated synthetic fMRI data by integrating the state equations of the associated models

with fixed parameters and inputs5 and adding Gaussian observation noise to achieve an SNR

of two. Each synthetic data set had exactly the same structure as the empirical data described

in the previous section (700 data points, TR = 3 s). Both m1 and m2 were then fitted to all 20

synthetic data sets, and the resulting log-evidences were used to perform both fixed effects

BMS and random effects BMS, using the VB method described in this paper. This sampling

and data generation procedure was repeated 20 times, resulting in a total of 400 generated

data sets and 800 fitted models. For each of the 20 sets of 20 subjects, we computed the

different indices provided by random effects BMS (i.e., α, ⟨r⟩, φ) and fixed effects BMS

(log GBF). The means of these indices are plotted in Figure 11, together with 95%

confidence intervals (CI). If our random effects BMS method were perfect in uncovering the

underlying structure of the population we sampled from, one would expect to find the

following average estimates: (i) α1 = 22×0.7 = 15.4, α2 = 22×0.3 = 6.6 for the Dirichlet

parameters, (ii) ⟨r1⟩ = 0.7, ⟨r2⟩ = 0.3 for the posterior expectations of model probabilities,

and (iii) φ1 = 1, φ2 = 0 as exceedance probabilities (note that the exceedance probability is not

the posterior model probability itself, but a statement of belief about the posterior

probability of one model being higher than the posterior probability of any other model).

The actual estimates of the BMS indices for the simulated data were (i) α1 = 15.4 (CI: 14.1 -

16.7) and α2 = 6.6 (CI: 5.3 - 7.9), (ii) ⟨r1⟩ = 0.7 (CI: 0.64 - 0.76) and ⟨r2⟩ = 0.3 (CI: 0.24 -

0.36), and (iii) φ1 = 0.89 (CI: 0.83 - 0.96) and φ2 = 0.11 (CI: 0.04 - 0.17). For comparison, the

average log GBF in favour of model m1 was 548.9 (CI: 446.2 - 651.6).

In conclusion, while our random effects BMS method provides a slightly overconservative

estimate of exceedance probabilities for the chosen sample size, it shows very good

performance overall, providing BMS indices that accurately reflect the structure of the

population we sampled from. In particular, the Dirichlet parameters and posterior

expectations of model probabilities (which represent the expected probability of obtaining

the k-th model when randomly selecting a subject) were estimated very precisely. This result

not only validates the results obtained for the empirical data set described above, but

demonstrates more generally that our BMS procedure is robust when randomly sampling

from a heterogeneous population of subjects.
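
As a rough illustration of how the indices discussed above relate to one another, the following Python sketch computes expected model probabilities and a Monte Carlo estimate of exceedance probabilities from a given set of Dirichlet parameters. It is a sketch under stated assumptions rather than the implementation used for the results above: the function names, the number of samples and the seed are hypothetical choices, and sampling from the Dirichlet by normalising gamma variates is simply one standard way of obtaining such estimates.

```python
import random

def expected_probabilities(alpha):
    """Posterior expectations <r_k> = alpha_k / sum_j alpha_j of a Dirichlet density."""
    total = sum(alpha)
    return [a / total for a in alpha]

def sample_dirichlet(alpha, rng):
    """Draw one sample r ~ Dir(alpha) by normalising independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def exceedance_probabilities(alpha, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability that r_k exceeds all other r_j."""
    rng = random.Random(seed)
    wins = [0] * len(alpha)
    for _ in range(n_samples):
        r = sample_dirichlet(alpha, rng)
        wins[r.index(max(r))] += 1
    return [w / n_samples for w in wins]

# Idealised Dirichlet parameters quoted in the text for the 70%/30% population.
alpha_ideal = [22 * 0.7, 22 * 0.3]
r_mean = expected_probabilities(alpha_ideal)
phi = exceedance_probabilities(alpha_ideal)
```

The same functions apply unchanged to more than two models, since the winning model is determined per sample.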

5 The coupling parameters of all endogenous connections were set to 0.1 s⁻¹, except for the inhibitory self-connections whose

strengths were set to −1 s⁻¹. Furthermore, the strengths of all modulatory and driving inputs were set to 0.3 s⁻¹. The input functions

were the same as in the empirical dataset described above.

Comparing different hemodynamic models by model space partitioning

Finally, we revisited a comparison of DCMs, which were identical in network architecture

(the same as m1 in Figure 8) but differed in the hemodynamic forward model employed

(Stephan et al. 2007c). A three-factor design was used to construct 8 different models: (i)

nonlinear vs. linear BOLD equations, (ii) classical vs. revised coefficients of the BOLD

equation, and (iii) free vs. fixed parameter (ε) for the ratio of intra- and extravascular signal

changes. In the original analysis by Stephan et al. (2007c), the GBF (based on the negative

free-energy approximation) was used to establish the best among the eight models. The best

model, abbreviated as RBMN(ε) in Figure 12, was characterised by (i) a nonlinear BOLD

equation, (ii) revised coefficients of the BOLD equation, and (iii) free ε. The difference of

its summed log-evidence compared to the second-best model, its linear counterpart

RBML(ε), was 5.26, corresponding to a GBF of 192 in favour of the nonlinear model. The

summed log-evidences for all 8 models are shown in Figure 12A.
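
The relation between a difference in summed log-evidence and the corresponding GBF is a simple exponentiation, which the following short Python sketch makes explicit. The function name is hypothetical, and the absolute log-evidence values passed to it are placeholders; only their difference (5.26, as reported above) matters.

```python
import math

def bayes_factor_from_log_evidence(log_ev_a, log_ev_b):
    """Bayes factor of model a over model b, given their (summed) log-evidences."""
    return math.exp(log_ev_a - log_ev_b)

# Placeholder absolute values; the difference of 5.26 is the one reported in the text.
gbf = bayes_factor_from_log_evidence(5.26, 0.0)
print(round(gbf))
```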

Here, we demonstrate how one can use the agglomerative property of the Dirichlet

distribution (Equation 18) to go beyond selective comparisons of specific models and

instead examine the relative importance of particular model attributes or model subspaces.

Given the three factors above, we focussed on the importance of nonlinearities: what is the

posterior probability that nonlinear BOLD equations improve the model compared to linear

BOLD equations, regardless of any other dimensions of model space (i.e., classical vs.

revised coefficients and free vs. fixed ε)?

Following Equation 18, this question is addressed easily. In a first step, the VB procedure

was applied to the entire set of eight models, yielding posterior estimates of the Dirichlet

parameters α1,...,α8 (see Figure 12B). Subsequently, a new Dirichlet density reflecting the

partition of model space into nonlinear and linear subspaces was computed by summing αk

separately for the nonlinear and linear models (Figure 12C; for simplicity the ordering of the

models in Figure 12 has been chosen such that the first four models are nonlinear [left of the

dashed line], whereas the last four models are linear [right of the dashed line]). The resulting

Dirichlet can then be used to compare nonlinear and linear models in exactly the same way

as one compares two models; e.g. using exceedance probabilities. Figure 13 shows the result

of this comparison: the probability that nonlinear hemodynamic models are better than linear

models, regardless of other model attributes, was φ1 = 98.6%.
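
The partitioning step can be sketched in a few lines of Python. The Dirichlet parameters below are hypothetical (they are not the estimates shown in Figure 12B), as are the function names; the sketch only assumes that summing Dirichlet parameters within each subspace yields the Dirichlet density over subspaces, and that for two subspaces this density is a Beta density whose exceedance probability can be estimated by sampling.

```python
import random

def partition_dirichlet(alpha, families):
    """Sum Dirichlet parameters within each family (a list of model indices)."""
    return [sum(alpha[k] for k in members) for members in families]

def exceedance_two_families(alpha_a, alpha_b, n_samples=20000, seed=0):
    """Estimate P(r_a > 0.5) for r_a ~ Beta(alpha_a, alpha_b) by sampling."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha_a, alpha_b) > 0.5 for _ in range(n_samples))
    return hits / n_samples

# Hypothetical posterior Dirichlet parameters for eight models;
# indices 0-3 are taken to be the nonlinear models, 4-7 the linear ones.
alpha = [4.5, 2.5, 3.0, 2.0, 2.5, 1.5, 2.0, 2.0]
families = [[0, 1, 2, 3], [4, 5, 6, 7]]

alpha_nonlinear, alpha_linear = partition_dirichlet(alpha, families)
phi_nonlinear = exceedance_two_families(alpha_nonlinear, alpha_linear)
```

Any other partition (for example, free versus fixed ε) can be examined by changing only the `families` index lists.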

For comparison, we also used classical inference, applying a repeated-measures ANOVA

(with Greenhouse-Geisser correction for non-sphericity) to the log-evidences of the eight

models. The result of this test was compatible with the above analysis, rejecting the null

hypothesis that linear and nonlinear models were equal in log-evidence (F = 24.330, df =

1,11, p < 0.0004).

Discussion

In this paper, we have introduced a novel approach for model selection at the group level.

Provisional experience suggests that this approach represents a more powerful way of

quantifying one’s belief that a particular model is more likely than any other at the group

level, relative to the conventional GBF. Critically, this variational Bayesian approach rests

on treating the model switches mi as a random variable, within a full hierarchical model for

multi-subject data (see Figure 1), and thus accommodates random effects at the between-

subject level. Notably, this inference procedure needs only the log-evidences for each model

and subject.
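
To make concrete what "needing only the log-evidences" means in practice, the following Python sketch iterates a variational update of the Dirichlet parameters from a subjects-by-models table of log-evidences. It is a minimal sketch, not the authors' code: the function names, the convergence tolerance and the example data are hypothetical, and the update is written under the assumption that each subject's posterior belief is proportional to exp(log-evidence + Ψ(αk) − Ψ(Σk αk)), with Ψ the digamma function, followed by α = α0 + β.

```python
import math

def digamma(x):
    """Digamma function via upward recurrence and an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def vb_random_effects_bms(log_evidence, alpha0=1.0, tol=1e-6, max_iter=1000):
    """log_evidence[n][k] is the log-evidence of model k for subject n.

    Returns the Dirichlet parameters alpha and the subject-wise posterior
    beliefs g[n][k] that model k generated the data of subject n.
    """
    n_models = len(log_evidence[0])
    alpha = [alpha0] * n_models
    g = []
    for _ in range(max_iter):
        psi_sum = digamma(sum(alpha))
        g = []
        for row in log_evidence:
            log_u = [lme + digamma(a) - psi_sum for lme, a in zip(row, alpha)]
            shift = max(log_u)  # subtract the maximum for numerical stability
            u = [math.exp(v - shift) for v in log_u]
            total = sum(u)
            g.append([v / total for v in u])
        # beta_k: expected number of subjects whose data were generated by model k
        beta = [sum(g_n[k] for g_n in g) for k in range(n_models)]
        new_alpha = [alpha0 + b for b in beta]
        converged = max(abs(a - b) for a, b in zip(new_alpha, alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha, g

# Hypothetical log-evidences: 11 subjects favour model 1, one outlier favours model 2.
log_evidence = [[4.5, 0.0] for _ in range(11)] + [[0.0, 55.0]]
alpha, g = vb_random_effects_bms(log_evidence)
```

The expected model probabilities then follow as `alpha[k] / sum(alpha)`. In this toy example the outlier contributes at most one unit to the Dirichlet parameter of model 2, however large its log-evidence difference.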

In the empirical examples above, we showed two cases where frequentist tests failed to

indicate clear differences between models, while the novel Bayesian approach succeeded. In

one case (the six-area ventral stream model), a strong outlier subject made the distribution of

log-evidences non-normal and thus rendered the t-test (but not a non-parametric test) unable

to find a significant difference between models. In another case (the four-area ventral stream

model), the distribution of log-evidences was normal, but with a between-subject variance

that was big enough to prevent significant results by frequentist tests (parametric or non-

parametric). It should be noted, however, that the frequentist and Bayesian approaches do

not test the same thing. The frequentist approach tries to reject the null hypothesis that there

are no differences in log-evidence across models. In contrast, the Bayesian approach

estimates the models’ probabilities, given the data, and enables inference in terms of

exceedance probabilities: the exceedance probability φk is the probability that a given

model k is more likely than any other model (of the K models tested). Furthermore, we can

compute the posterior probabilities of the models themselves: ⟨rk⟩ is the expected

probability that the k-th model generated the data for a randomly selected subject.

The exceedance probability of a model differs in a subtle but important way from the

conventional posterior probability of a model in Bayesian model comparison: Because we

have a hierarchical model, the posterior probability that any particular model caused the data

from a subject chosen at random, is itself a random variable (r in the derivations above).

This means that the exceedance probability is a statement of belief about the posterior

probability, not the posterior probability itself. So, for example, when we say that the

exceedance probability is 98%, we mean that we can be 98% confident that the favoured

model has a greater posterior probability than any other model tested. This is not the same as

saying that the posterior probability of the favoured model is 98%. The advantage of using

exceedance probabilities is that they are sensitive to the confidence in the posterior

probability and easily interpretable (since they sum to unity over all models tested).

As can be seen from Equations 9 and 11, our method is sensitive to both the distribution and

the magnitude of log-evidence differences. The same is true for frequentist tests applied to

log-evidence differences, e.g. t-tests. However, a critical difference between these

frequentist approaches and the VB method is that for the latter the influence of outliers has a

natural bound. There is a simple and intuitive reason for this nice property of the VB

method: if we keep increasing the log-evidence of model k for a particular subject n, our

posterior belief that k generated the data of subject n (that is, gnk=q(mnk=1); see Eq. 11) will

asymptote to one. Once it has reached unity (which corresponds to complete certainty), any

further increase in the log-evidence of model k for subject n has no further influence. This is

because the model probabilities are distributed according to the approximate posterior

Dirichlet Dir(r;α0+β)=q(r), where βk represents the conditional expectation of the number of

subjects whose data we believe were generated by model k and is simply the sum of the

subject-specific posterior probabilities that model k generated their individual data. In

contrast, frequentist tests like t-tests do not show this bounded behaviour with regard to

outliers. This is because the sample variance increases monotonically with the magnitude of

the outlier, leading to a monotonic decrease of the t-statistic. We demonstrated this

difference between frequentist approaches and our VB method by two empirical examples

with outliers.
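
The contrast between bounded and unbounded outlier influence can be illustrated numerically. The Python sketch below uses hypothetical log-evidence differences and hypothetical function names, and it simplifies the VB quantity to the case of two models with equal expected frequencies, in which the posterior belief for a subject reduces to a logistic function of that subject's log-evidence difference.

```python
import math
import statistics

def paired_t_statistic(diffs):
    """One-sample t-statistic of paired differences against zero."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def posterior_model1(log_ev_diff):
    """Posterior belief in model 1 for one subject (two models, equal expected frequencies)."""
    return 1.0 / (1.0 + math.exp(-log_ev_diff))

# Hypothetical differences ln p(y_n|m1) - ln p(y_n|m2) for 11 subjects favouring m1.
base = [4.0, 5.5, 3.2, 6.1, 4.8, 5.0, 3.9, 4.4, 5.7, 4.1, 5.3]

# Add a twelfth subject whose evidence for m2 becomes increasingly extreme.
outliers = [-20.0, -50.0, -100.0, -200.0]
t_values = [paired_t_statistic(base + [o]) for o in outliers]
belief_m2 = [1.0 - posterior_model1(o) for o in outliers]

for o, t, b in zip(outliers, t_values, belief_m2):
    print(f"outlier={o:7.1f}  t={t:7.3f}  belief in m2 for outlier subject={b:.6f}")
```

The t-statistic keeps falling as the outlier grows, whereas the outlier subject's posterior belief in m2 saturates at one and so contributes at most one "count" to β.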

Another important advantage of the method proposed here is that it can go beyond the

selective comparison of specific models and enables one to assess the importance of changes

along any specific dimension of model space. This type of inference, which could be seen as

a Bayesian analogue of testing for “main effects” in classical ANOVA, rests on comparing

two (or more) subsets of models (i.e., model subspaces). These partitions would typically

reflect those components of model structure that one seeks inference about; e.g. whether a

specific connection should be included in the model or not, whether a particular connection

is modulated by one experimental condition or another, or whether certain effects are linear

or nonlinear. We used this approach to demonstrate that hemodynamic models with

nonlinear BOLD equations are superior to those with linear ones. This result is in

accordance with previous studies that highlight the importance of nonlinearities in the

BOLD signal (Deneux & Faugeras 2006; Friston et al. 2000; Miller et al. 2001; Stephan et

al. 2007c; Vazquez & Noll 1998; Wager et al. 2005). However, in these earlier studies, this

conclusion was based on comparisons of specific and single instances of linear and

nonlinear hemodynamic models.

The inferential advance achieved by the present method is that arbitrarily large sets of models

can be considered together, allowing one to integrate out uncertainty over any aspect of

model structure, other than the one of interest.

At first glance, it may appear surprising that the hierarchical model described above has

been introduced as a generative model for the data y, given that its inversion does not need the

data but only the model evidence, p(y | m). This apparent contradiction can be resolved by

noting that the log-evidence is a function of the data and represents a sufficient ‘summary

statistic’. To generate data, one would need to introduce the model parameters ϑk to the

graphical model shown in Figure 1B,C. In the context of DCM, for example, once one has

drawn a model k from the multinomial distribution for a specific subject n (i.e., generated a

label mnk = 1), one could generate fMRI time-series by drawing model parameters ϑk from

their prior distributions and adding some observation error. However, because the model

evidence p(y | m) results from integrating out the influence of the parameters ϑk on the data

y (see Equation 1), this component is unnecessary during inversion of the generative model.

One property of the method proposed in this paper is that for each subject n our posterior

beliefs about model k having generated their data sum to one over all models that are

considered, that is Σk gnk = 1 (cf. Equation 11). In other words, our posterior belief about

which model k is most likely to have generated the data for a given subject n is a function of

the entire set of models considered. This means that reducing or extending model space can

change our inference about which model is most likely at the group level. Although this is a

fairly trivial corollary, it should not be forgotten when using this method in practice. In

short, one should infer the most likely model by comparing the entire set of plausible models

at once, instead of selectively analysing subparts of model space.

To our knowledge, there has been relatively little work on group level methods for Bayesian

model comparison so far. In addition to the GBF (Stephan et al. 2007b), we had previously

suggested a metric called the “positive evidence ratio” (PER; Stephan et al. 2007b, 2007c).

Based on the conventional definition of “positive evidence” as a Bayes factor larger than

three (Kass & Raftery 1995), the PER is simply the number of subjects where there is

positive (or stronger) evidence for model 1 divided by the number of subjects with positive

(or stronger) evidence for model 2. While the PER is insensitive to outliers, it is also

insensitive to the magnitude of the differences across subjects. More importantly, however,

it is only a descriptive index that does not allow for probabilistic inference in a

straightforward manner. In the approach described in this paper, the sufficient statistics for

the model frequencies are the posterior estimates of the Dirichlet parameters (α). When the

differences in model evidences are very strong, these simply boil down to the number of

subjects with positive (or stronger) evidence in favour of a particular model. In the case

where for each subject there is one highly superior model, the expected model frequencies

become identical to the PER. From this perspective, the present approach can be considered

a (probabilistic) generalisation of the PER.
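
For reference, the PER as defined above can be computed directly from subject-wise log-evidence differences. The following Python sketch uses hypothetical data and a hypothetical function name; returning infinity when no subject shows positive evidence for model 2 is a convention of this sketch, not part of the original definition.

```python
import math

def positive_evidence_ratio(log_ev_diffs, bf_threshold=3.0):
    """Count subjects with a Bayes factor above the threshold for each model.

    log_ev_diffs[n] = ln p(y_n|m1) - ln p(y_n|m2).
    Returns (count for model 1, count for model 2, their ratio).
    """
    log_thr = math.log(bf_threshold)
    n1 = sum(d > log_thr for d in log_ev_diffs)
    n2 = sum(-d > log_thr for d in log_ev_diffs)
    ratio = n1 / n2 if n2 > 0 else math.inf
    return n1, n2, ratio

# Hypothetical log-evidence differences for eight subjects.
diffs = [4.0, 2.5, 0.4, -0.8, 3.1, -6.0, 1.9, 5.2]
n1, n2, per = positive_evidence_ratio(diffs)
```

Subjects whose Bayes factor lies below the threshold in both directions (here the third and fourth) count for neither model, which is one way of seeing why the PER ignores the magnitude of the differences.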

The only other work on group level methods for Bayesian model comparison that we are

aware of is a recent paper by Li et al. (2008) who suggested a “group-level BIC score”. This

score is derived by summing the BIC for each model across subjects. As explained earlier in

this paper, the BIC is a well-known approximation to the log-evidence (Schwarz 1978). The

group-level BIC score by Li et al. (2008) thus approximates the sum of log-evidences and

simply corresponds to the log GBF. Effectively, the analysis by Li et al. (2008) thus used a

fixed effects analysis across models that is formally identical to that used in reports of DCM

studies (e.g. Acs & Greenlee 2008; Allen et al. 2008; Grol et al. 2007; Heim et al. 2008;

Kumar et al. 2007; Smith et al. 2006; Stephan et al. 2007a,b; Summerfield & Koechlin

2008).

Finally, it should be noted that a random effects model selection approach is not necessarily

preferable to a fixed effects approach. The choice between fixed and random effects BMS

depends on the specific scientific question addressed. In the context of basic mechanisms

that are unlikely to differ across subjects, the conventional GBF is both sufficient and

appropriate. For example, it is unlikely that subjects differ with regard to basic physiological

mechanisms such as the involvement of sodium ion channels in action potential generation

or the presence of certain types of connections in the brain. In this context, it is perfectly

tenable to assume that all subjects generate data under the same model; and the data from all

subjects can be pooled to select this model in the usual way. In contrast, whenever subjects

can exhibit different models or functional architectures, the random effects BMS technique

presented in this paper is a more appropriate method. For example, there is evidence that

many higher cognitive functions can rely on more than one neurobiological system (Price &

Friston 2002). Also, it is likely that in some mental diseases, e.g. schizophrenia, patients

with identical symptoms show heterogeneity with regard to the pathophysiological processes

involved (Stephan et al. 2006).

In summary, in contrast to the GBF and other established approaches for group-level model

comparison, the approach suggested in this paper rests on a hierarchical model for multi-

subject data that accommodates random effects at the between-subject level (Figure 1) and

thus provides a generic framework for hypothesis testing. We expect this method to be a

useful tool for group studies, not only in the context of dynamic causal modelling, but also

for a range of other modelling endeavours; for example, comparing different source

reconstruction methods for EEG/MEG at the group level (Henson et al. 2007; Litvak &

Friston 2008; Mattout et al. 2007), or selecting among competing computational models of

learning and decision-making, given data from a group of subjects (Brodersen et al. 2008;

Hampton et al. 2006).

Acknowledgments

This work was funded by the Wellcome Trust (KES, WDP, RJM, KJF) and the University Research Priority

Program “Foundations of Human Social Behaviour” at the University of Zurich (KES). JD is funded by a Marie

Curie Fellowship. We are very grateful to Marcia Bennett for helping prepare this manuscript, to the FIL Methods

Group, particularly Justin Chumbley, for useful discussions and to Jon Roiser and Dominik Bach for helpful

comments on practical applications. Finally, we would like to thank the two anonymous reviewers for their

constructive comments which have greatly helped to improve this paper.

Appendix A: Approximations to the log model evidence

With the exception of some special cases (e.g., linear models), the integral expression for the

model evidence (Equation 1) is analytically intractable and numerically difficult to compute.

Under these circumstances, people generally adopt a bound approach where, instead of

evaluating the integral above, one optimises a bound on the integral using iterative sampling

or analytic techniques. The most common approach of the latter kind is variational Bayes. In
