Variability in fMRI: A Re-Examination of
Stephen M. Smith,1*Christian F. Beckmann,1Narender Ramnani,1
Mark W. Woolrich,1Peter R. Bannister,1Mark Jenkinson,1
Paul M. Matthews,1and David J. McGonigle2
1Oxford Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB), Department of
Clinical Neurology, Oxford University, John Radcliffe Hospital, Headington,
Oxford, United Kingdom
2Laboratoire de Neurosciences Cognitives et Imagerie Ce ´re ´brale, Ho ˆpital de la Salpe ˆtrie `re,
CNRS UPR 640-LENA, Paris, France
Abstract: We revisit a previous study on inter-session variability (McGonigle et al. : Neuroimage
11:708–734), showing that contrary to one popular interpretation of the original article, inter-session
variability is not necessarily high. We also highlight how evaluating variability based on thresholded
single-session images alone can be misleading. Finally, we show that the use of different first-level
preprocessing, time-series statistics, and registration analysis methodologies can give significantly differ-
ent inter-session analysis results. Hum Brain Mapp 24:248–257, 2005.
© 2005 Wiley-Liss, Inc.
Key words: fMRI; session variability; reproducibility; longitudinal studies
The blood oxygenation level-dependent (BOLD) effect in
functional magnetic resonance imaging (fMRI), a marker of
neuronal activation, is often only of similar magnitude to the
noise present in the measured signal. To increase power and
to allow conclusions to be made about subject populations,
it is common practice to combine data from multiple sub-
jects. It is also common to take multiple sessions from each
subject, again to increase sensitivity to activation, or for
other experimental design reasons such as tracking changes
in function over time. It is therefore important that inter-
session variability present in fMRI data is understood, and
in response, McGonigle et al.  presented an in-depth
study of this issue.
In designing both multi-subject and single-subject multi-
session studies, it is critical for the experimenter to have
some idea of the relative sizes of within-session variance and
inter-session variance. For example, if inter-session variance
is large, it could be difficult to detect longitudinal experi-
mental effects (e.g., in studies of learning [Ungerleider et al.,
2002] and poststroke recovery [Johansen-Berg et al., 2002]).
If fMRI is to be used in presurgical mapping [e.g., Fernandez
et al., 2003], which by its nature will involve only a single
subject, correct interpretation will be dependent on an ap-
preciation of the potential uncertainty due simply to a ses-
sion effect. In multi-subject studies, it is advantageous to
have some idea of the expected inter-session variance, as this
will contribute to the observed inter-subject variance.
To investigate how well a single-session dataset from a
single subject typified the subject’s responses across multi-
ple sessions, McGonigle et al.  carried out the same
Contract grant sponsor: Medical Research Council (UK); Contract
grant sponsor: Engineering and Physical Sciences Research Council
(UK); Contract grant sponsor: EPSRC Medical Images and Signals
Collaboration; Contract grant sponsor: GSK.
*Correspondence to: Stephen M. Smith, FMRIB (Oxford Centre for
Functional Magnetic Resonance Imaging of the Brain), Department
of Clinical Neurology, Oxford University, John Radcliffe Hospital,
Headington, Oxford OX3 9DU, United Kingdom.
Received for publication 21 January 2004; Accepted 7 July 2004
Published online in Wiley InterScience (www.interscience.wiley.
? Human Brain Mapping 24:248–257(2005) ?
© 2005 Wiley-Liss, Inc.
fMRI protocol on 33 separate days. On each day, three
paradigms were run (visual, motor, and cognitive), and the
variation in activation was studied. The study drew three
main conclusions: (1) the use of voxel-counting on thresh-
olded statistical maps was not an ideal way to examine
reproducibility in fMRI; (2) a reasonably large number of
repeated sessions was essential to properly estimate inter-
session variability; and (3) the results of a single session
from a single subject should be treated with care if nothing
was known about inter-session variability.
Although McGonigle et al.  noted the presence of
between-session variability in their experiment, they did not
attempt to assess systematically the causes of this variance.
There are a number of potential contributors, such as phys-
iologic variance (subject), acquisition variance (scanner), and
differences in analysis methodology and implementation.
As noted in their original article, “it is possible that spatial
preprocessing (for example) may affect inter-session vari-
ance quite independently of underlying physical or physio-
logical variability.” This view is supported by Shaw et al.
, where analysis methodology is shown to affect ap-
parent inter-session variance. In the present study, we revisit
the analysis of data from McGonigle et al.  and con-
sider session variability in the light of the effects that differ-
ent first-level processing methods can have.
Others have taken from McGonigle et al.  the simple
broadbrush conclusion that there was a “large amount of
session variability” [e.g., Beisteiner et al., 2001; Chee et al.,
2003]. One of the purposes here is to address this miscon-
ception; for example, we show that for this dataset, inter-
session variability was of similar magnitude to within-ses-
We start with a brief theoretical overview of the compo-
nents of variance present in multiple-session data. We then
describe the original data and analysis, as well as the new
analyses carried out for this study, with explanation of the
measures used in this study to assess session variability. We
then present the variability results as found from these data,
centering around the use of mixed-effects Z values in rele-
vant voxels as the primary measure of interest. We also
show qualitatively why it is dangerous to judge variability
through the use of thresholded single-session images.
Researchers often refer to different group analyses, the
most common being fixed-effects and mixed-effects. What
these terms are actually referring to are different inter-ses-
sion (or inter-subject) noise (variance) models. We now sum-
marize what the terms and associated models mean.
We start with the equation for the t-statistic:
i.e., we are asking how big the mean effect size is compared
to the noise (the mean’s standard deviation1). The standard
deviation is the square root of either the fixed-effects vari-
ance of the mean or the random-effects variance of the mean.
With fixed-effects modeling, we assume that we are only
interested in the factors and levels present in the study, and
therefore our higher-level fixed-effects variance FV is de-
rived from pooling2the first-level (within-session) variances
(of first-level effect size mean) FVi, according to:
, DoFFV??DoFFVi, (2)
where DoF is the degrees of freedom, which is usually large
in the case of fMRI time series. This modeling therefore
ignores the cross-session (or cross-subject) variance com-
pletely and the results cannot be generalized outside of the
group of sessions/subjects involved in the study.
With simple mixed-effects3modeling, we derive the
mixed-effects variance MV directly from the variance of the
first-level parameter estimates PEi(effect sizes) or contrasts
of parameter estimates:
, DoFMV? n ? 1,(3)
with a (normally) much smaller DoF than with fixed-effects.
The modeling thus uses the cross-session (or cross-subject)
variance, and the results (which are generally more conservative
than with a fixed-effects analysis) are relevant to the whole pop-
ulation from which the group of sessions/subjects was taken.
The mixed-effects variance is the sum of the fixed-effects
(within-session) variance and random-effects (pure inter-
session) variance, although simple estimation methods cal-
culate this directly, as above, and do not explicitly use the
fixed-effects variance. The estimated mixed-effects variance
therefore should in theory and in practice be larger than the
fixed-effects variance. We expect that when there is large
inter-session variance, there will be a large difference be-
tween fixed- and mixed-effects analyses.
There have been recent significant developments in group-
level analysis. For example, it has been shown [Beckmann et
al., 2003a] that there is value in carrying up lower-level vari-
ances to higher-level analyses of mixed-effects variance, and
one implementation of this using Bayesian modeling/estima-
tion methodology has been reported [Woolrich et al., 2003].
Whereas the dataset used in this study may well prove useful
1Note that in the simplest cases the variance of the mean is the
variance of the residuals divided by the number of data points.
2The first factor of 1/n in FV comes from taking the mean of the
first-level variances, i.e., pooling them, and the second factor comes
from converting this higher level variance from a variance of resid-
uals into the variance of the (higher-level) mean [for more detail, see
Leibovici and Smith, 2000].
3Note that the terms “mixed effects” and “random effects” are often
(incorrectly) used interchangeably.
?Inter-Session Variability in fMRI?
? 249 ?
in investigating these developments further, this is beyond the
scope of this article. Instead, we concentrate primarily on two
other questions, namely the magnitude of session variability,
and the effect that first-level analysis methodologies can have
on its apparent magnitude. For mixed-effects analyses in the
present work, we therefore have only used ordinary least-
squares (OLS) estimators (see equation  and [Holmes and
MATERIAL AND METHODS
Original Experiments and Analysis
We describe here the experiment and original analysis car-
ried out by McGonigle et al. . A healthy, 23-year-old,
right-handed male was scanned on 33 separate days (over 2
months) with as many factors as possible held constant. On
each day, three block-design paradigms were run (all using
block lengths for rest ? 24.6 s and activation ? 24.6 s): visual
(8-Hz reversing black-white checkerboard, 36 time points after
deleting the first two); motor (finger tapping, right index finger
at 1.5 Hz, 78 time points); and cognitive (0.66-Hz random
number generating vs. counting, 78 time points), with the
paradigm order randomized. The data were collected on a
Siemens Vision at 2 T (repetition time [TR] ? 4.1 s, 64 ? 64
? 48, 3 ? 3 ? 3 mm voxels). A single T1-weighted 1.5 ? 1 ? 1
mm structural scan was taken.
Original analysis was carried out using SPM99 (online at
http://www.fil.ion.ucl.ac.uk/spm). All 99 sessions were re-
aligned (motion-corrected) to the same target (the first scan
of the first session of the first day) and then a mean over all
99 sessions was created. This was used to find normalization
(to a T2-weighted target in MNI space [Evans et al., 1993])
parameters for all 99 sessions (using 12-parameter affine
followed by 7 ? 8 ? 7 basis-function nonlinear registration).
Sinc interpolation on final output was used.
Sessions containing “obvious movement artefacts” were
identified by eye and removed from consideration (three
motor, two visual, and three cognitive). Cross-session anal-
ysis was carried out for voxels in standard space that were
present in all sessions. Spatial filtering with a Gaussian
kernel of full-width half-maximum (FWHM) 6 mm was
applied. Each volume of each session was intensity normal-
ized (rescaled) so that all had the same mean intensity.
Voxel time-series analysis was carried out using general
linear modeling (GLM). The data was first precolored by
temporally smoothing the data with a Gaussian of 6 s
FWHM. Slow drifts in the data were removed by including
drift terms in the model (a set of cosine basis functions
effectively removing signals of period longer than 96 s).
For presentation of within-session results, voxel-wise
thresholding (P ? 0.05) was used, correcting for multiple
comparisons using Gaussian random field theory (GRF)
[Friston et al., 1994].
Both fixed- and mixed-effects analyses were carried out to
examine the effects of using different variance components,
and an extra-sum-of-squares (ESS) F-test was performed
across all sessions of each paradigm to assess the presence of
significant inter-session variance.
We now describe the analysis approaches used for this arti-
cle. The two packages used for our investigations were
SPM99b (Statistical Parametric Mapping) and FSL v1.3 (FMRIB
Software Library; online at http://www.fmrib.ox.ac.uk/fsl,
June 2001). Both are available freely and used widely.
SPM includes a motion-correction (realignment) tool, a
tool for registration (normalization) to standard space,
GLM-based time-series statistics [Worsley and Friston,
1995], and GRF-based inference [Friston et al., 1994]. SPM
carries out standard-space registration before time-series
statistics. The SPM99b time series statistics correct for tem-
poral smoothness by precoloring [Friston et al., 2000].
GLM-based analysis in FSL is carried out with the fMRI
Expert Analysis Tool (FEAT), which uses other FSL tools
such as Brain Extraction Tool (BET [Smith, 2002]), an affine
registration tool, FMRIB’s Linear Image Registration Tool
(FLIRT [Jenkinson and Smith, 2001; Jenkinson et al., 2002]),
and a motion-correction tool based on FLIRT (MCFLIRT
[Jenkinson et al., 2002]). FEAT carries out standard-space
registration after time-series statistics. FSL time-series statis-
tics correct for temporal smoothness by applying prewhit-
ening [Woolrich et al., 2001].
Six different, complete analyses were carried out with
various combinations of preprocessing and time-series sta-
tistics options to allow a variety of comparisons to be made.
In tests A, C, and G, FSL was used for preprocessing and
registration whereas in tests D, E, and F, SPM was used. For
tests A, D, and G, FEAT time-series statistics was used
whereas for C, E, and F, SPM time-series statistics was used.
In tests A–E, the various controlling parameters were kept as
similar as possible, both to each other and to default settings in
the relevant software packages. Tests A versus D and C versus
E hold the statistics method constant while comparing spatial
methods, therefore showing the relative merits of the spatial
components (motion correction and registration). Tests A ver-
sus C and D versus E hold the spatial method constant while
comparing statistical components, thus showing the relative
merits of the statistical components (time-series analysis). A
versus E tests pure-FSL against pure-SPM. F and G test pure-
SPM and pure-FSL, respectively, with these analyses set up to
match the specifications of the original analyses in McGonigle
et al.  as closely as possible, including turning on inten-
sity normalization in both cases. For a summary, see Table I.
(For B, model-free independent component analysis (ICA) was
carried out; the model-free results are not included here but
will be presented elsewhere.)
Because the methods for high-pass temporal filtering in
FSL and SPM are intrinsically different, they cannot be set to
act in exactly the same way (within A–E and within F and G)
by choosing the same cutoff period in each; instead, the
cutoff choices were made to match as closely as possible the
extent to which the relevant signal and noise frequencies
were attenuated by the different methods. For the purposes
of the present work, high-pass temporal filtering is consid-
ered part of the temporal statistics, where it fits most natu-
?Smith et al.?
? 250 ?
rally. The non-default “Adjust for sampling errors” motion-
correction option in SPM was not used.
Eight sessions (of the 99) were excluded from the original
analysis in McGonigle et al.  due to “obvious move-
ment artefacts.” These were included in our analyses, how-
ever, as we did not consider that there was sufficient objec-
tive reason to exclude them. The estimated motions for these
sessions were not, in general, significant outliers relative to
the average motion across sessions and any apparent (acti-
vation map) motion artefacts were not in general signifi-
cantly different from most of the sessions. The quantitative
results below were in fact recalculated without these eight
sessions, i.e., reproducing the same dataset as used in
McGonigle et al. , but without any significant change
in results, and therefore are not reported here.
Inter-Session Evaluation Methods
For all paradigms and analysis methods, simple fixed-
effects (FE) and OLS mixed-effects (ME) Z-statistics were
formed. For each paradigm, a mask of voxels that FE con-
sidered potentially activated (Z ? 2.3) was created. This
contains voxels in which an ME analysis is potentially inter-
ested (given that ME generally gives lower Z-statistics than
does FE4). This mask was averaged over A, C, D, and E to
balance across the various methods, and then eroded
slightly (2 mm in 3D) to avoid possible problems due to
different brain mask effects.
We initially investigated the size of inter-session variance
by estimating the ratio of random-effects variance to fixed
effects, averaged over the voxels of interest as defined
above. Given that ME variance is the sum of FE and RE
variance, we estimated the RE (inter-session) variance by
subtracting the FE variance from the ME variance. We then
took the ratio image of RE to FE variance, and averaged over
the masks described above. This ratio would be 0 if there
were no inter-session variability and rises as the contribu-
tion by inter-session variability increases. A ratio of 1 occurs
when inter- and intra-session variabilities make similar con-
tributions to the overall measured ME variance.
We next investigated whether session variability is indeed
Gaussian distributed. If it is not, then inference based on the
OLS method used for ME modeling and estimation in the
present work would need a much more complicated inter-
pretation (as also would be the case with many other group-
level methods used in the field). We used the Lilliefors
modification of the Kolmogorov-Smirnov test [Lilliefors,
1967] to measure in what fraction of voxels the session effect
was significantly non-Gaussian.
The variance ratio figures do not take into account estimated
effect size, which in general will vary between methods, and so
the primary quantification in this study uses the mixed-effects
Z (ME-Z). This is roughly proportional to the mean effect size
and inversely proportional to the inter-session variability. This
makes ME-Z a good measure with which to evaluate session
variability; it is affected directly by the variability while being
weighted higher for voxels of greater interest (i.e., voxels con-
taining activation). We are not particularly interested in vari-
ability in voxels that contain no mean effect. We therefore base
our cross-subject quantitation on ME-Z comparisons within
regions of interest (defined above).5
If one of the analysis methods tested here results in in-
creased ME-Z, then this implies reduced overall method-
4We are attempting to identify voxels of potential interest in ME-Z
scaled down by a factor related to session variance, this seems like a
good way of choosing voxels which have the potential to be activated
in the ME-Z image, depending on the session variance. To investigate
the dependency of this approach on the FE-Z threshold chosen, we
re-ran the tests leading to the ME-Z plots presented in Figure 8, having
determined the regions of interest using a lower FE-Z threshold (Z
relative) results were identical to those presented in Figure 8.
5Although we are primarily investigating analysis efficiency and ses-
sion variance by looking at regions of potential activation, note that it
is also necessary to ensure that the non-activation (null) part of the ME
distribution is valid, i.e., not producing incorrect numbers of false
positives. This investigation/correction of the ME null distribution is
addressed below and uses the whole ME-Z image, not just the regions
of potential activation.
TABLE I. Different analyses carried out
FSL (MCFLIRT spat ? 5 intnorm ? n)
FSL (MCFLIRT spat ? 5 intnorm ? n)
SPM (SPM-mc&norm spat ? 5 intnorm ? n)
SPM (SPM-mc&norm spat ? 5 intnorm ? n)
SPM match  (SPM-mc&norm spat ? 6 intnorm ? y)
FSL match  (MCFLIRT spat ? 6 intnorm ? y)
FSL (FEAT) (hp-FSL ? 40)
SPM (hp-cos ? 72)
FSL (FEAT) (hp-FSL ? 40)
SPM (hp-cos ? 72)
SPM (hp-cos ? 98.4)
FSL (FEAT) (hp-FSL ? 53)
SPM (done in preproc)
SPM (done in preproc)
SPM (done in preproc)
Spat, spatial filtering with full-width-half-maximum given in mm.
intnorm, intensity normalization (the intensity rescaling of each volume in a 4-D fMRI dataset so that all have the same mean within-brain intensity).
hp-FSL, FSL’s high-pass temporal filtering with cutoff period given in seconds.
hp-cos, high-pass temporal filtering (in seconds) via cosine basis functions.
?Inter-Session Variability in fMRI?
? 251 ?
related error (increased accuracy) in the method, because
unrelated variances add. Although a single-session analysis
cannot eliminate true inter-session variance intrinsic to the
data, it can add (induce) variance to the effective inter-
session variance due to failings in the method itself (for
example, poor estimates of first-level effect/variance, or reg-
istration inaccuracies). The best methods should therefore
give ME-variance that approaches (from above) the true,
intrinsic inter-session ME-variance. Recall that the same sim-
ple OLS second-level estimation method was used for all
analyses carried out, and it is only the first-level processing
that is varied.
Mean ME-Z was then calculated within the FE-derived
masks. As well as reporting these uncorrected mean ME-Z
values, we also report the mean values after adjusting the
ME-Z images for the fact that in their histograms (suppos-
edly a combination of a null and an activation distribution)
the null part, ideally a zero mean and unit standard devia-
tion Gaussian, was often significantly shifted away from
having the null peak at zero. This makes Z-values incompa-
rable across methods, and needed to be corrected for. The
causes of this effect include spatially structured noise in the
data and in differences in the success between the different
methods for correcting for temporal smoothness, a problem
enhanced potentially for all methods given the unusually
low number of time points in the paradigms.
We used two methods to correct ME-Z for null-distribu-
tion imperfections, and report results for both methods.
With hand-corrected peak shift correction, the peak of the ME-Z
distribution was identified by eye and assumed to be the
mean of the null distribution; the ME-Z image then had this
value subtracted. With mixture-model-based null shift correc-
tion, a nonspatial histogram mixture model was automati-
cally fitted to the data using expectation-maximization. This
involved a Gaussian for the null part, and gammas for the
activation and deactivation parts [Beckmann et al., 2003b].
The center of the Gaussian fit was then used to correct the
ME-Z image. The advantage of the hand-corrected method
is that it is potentially less sensitive to failings in the as-
sumed form of the mixture components; the advantage of
the mixture-model-corrected fit is that it is fully automated
and therefore more objective.
It is not yet standard practice (with either SPM or FSL) to
correct for null-Z shifts in ME-Z histograms; the most com-
mon method of inference is to use simple null hypothesis
testing on uncorrected T or Z maps (typically via Gaussian
random field theory). By correcting for the shifts, we are able
to investigate the effects of using the different individual
analysis components in the absence of confounding effects
of null distribution imperfections.
Figure 1 shows an example ME-Z histogram including the
estimates (by eye and mixture-modeling) of the null mode.
The estimated ME-Z shifts that were applied to the mean
ME-Z values before comparing methods are plotted for all
analysis methods and all paradigms in Figure 2. The shift is
clearly more related to the choice of time-series statistics
method than to the choice of spatial processing method
(motion correction and registration), but there is no clear
indication of one statistics method giving a greater shift
extent than another. The two correction methods are largely
in agreement with each other.
RESULTS AND DISCUSSION
Fixed-Effects Activation Maps
The FE-based mask images (used to define the voxels used
in the quantitative analyses reported below) are shown in
Figure 3 as overlaid onto the MNI152 standard head image.
Inter-Session Effect Size Plots
For analysis methods A and E, we now show the effect size
and its (fixed-effects, within-session) temporal standard devi-
ation, as a function of session number. Both the effect size and
the temporal standard deviation are estimated as means over
interesting voxels, as defined above. The plots were normal-
this to be unity, scaling the standard deviation by the same
factor, and demeaning the effect size plot (Figs. 4 and 5). These
plots show (as does the following section) that the within-
session variance has similar magnitude to the inter-session
variance. They also show that variability in effect size is higher
than variability in its standard deviation (although the impli-
cation of this fact is not necessarily important to the primary
points in the present work). The results presented here corre-
spond to the uncorrected plot in Figure 8.
Quantification of Inter-Session Variance
To quantitate better the size of inter-session variance, we
estimated the mean ratio of RE (ME minus FE) to FE vari-
ance. Any comparison between the RE and FE variance will
be dependent on the number of time points in each session,
with a larger number of time points leading to an increase in
the RE:FE ratio.
Example ME-Z histogram showing null-distribution shift, from
analysis A of the visual paradigm.
?Smith et al.?
? 252 ?
The results are shown in Table II. The interpretation of this is
simple yet important: in these datasets, inter-session variability
is not large compared to within-session variability.6
We cannot make very useful interpretations of the varia-
tions across methods of the variance ratio, particularly with-
out also taking into account the estimated effect size; hence
the use of mixed-effects Z for the main method comparison
results shown below.
Test for Gaussianity of Inter-Session Variability
Using the results of analysis A, for each paradigm we
tested whether the session variability was Gaussian. At each
voxel in standard space, we took the (first-level) parameter
estimates (effect sizes) from the relevant voxel in each of the
33 relevant first-level analyses, (i.e., the same data that was
fed into the group-level ME analysis). The variance of these
is the ME variance. For each set of 33 first-level parameter
estimates, we ran the Lilliefors modification of the Kolmog-
orov-Smirnov test [Lilliefors, 1967] for non-Gaussianity,
with a significance threshold of 0.05. In null data, we would
therefore expect rejection of the Gaussianity null hypothesis
at this 5% rate by random chance.
We calculated the fraction of voxels failing the normality
test across the whole brain and within the FE-derived masks
described above. In both cases and for all three paradigms,
the fraction of failed tests was less than 7.5% (range, 4.5–
7.3%), which is very close to the expected 5% rate of null
hypothesis rejections if in fact all the data is normal. This
provides strong quantitative evidence for the normality of
the session variability in this data. Qualitatively, the voxels
where the null hypothesis was rejected were scattered ran-
domly through the images, not clumped, again suggesting
that they were rejected by pure random chance rather than
because of some true underlying non-Gaussian process.
On (Not) Drawing Conclusions About Session
Variability Based on Thresholded
McGonigle et al.  does not include any such state-
ment as “session variability is high,” or even any quantifi-
cation explicitly suggesting in a simple way that session
6Noting the much greater variability (across methods) in these ratios
than in the plots in Figure 8, and by looking in detail at separate ME
and FE variances, it is clear that the variation in these figures across
methods is due primarily to variation in FE variance. This is possi-
bly caused by methodologic differences in correcting for temporal
autocorrelation at first level.
Estimated ME-Z null distribution shifts. Different tasks: C, cogni-
tive; M, motor; V, visual. Different correction methods: h, hand-
shifted; m, mixture-model-shifted.
Masks of potentially activated voxels, within which mean ME-Z was calculated for each analysis
method. Red, visual; orange, motor; yellow, cognitive.
?Inter-Session Variability in fMRI?
? 253 ?
variability is a serious problem. Nevertheless, unfortunately,
many researchers [e.g., Beisteiner et al., 2001; Chee et al.,
2003] seem to have taken these messages from the study.
One of the causes of this is the apparent variability in Fig-
ures 2–4 in McGonigle et al. , which show for each
paradigm each session’s thresholded activation image (as a
single sagittal slice maximum intensity projection). All three
figures give the impression of large inter-session variability,
even for the strong visual paradigm.
The most important point to make with respect to this
issue is that it is not safe to judge inter-session variability by
looking at variability in thresholded statistic images. It is
perfectly possible for two unthresholded activation images
to not be significantly different statistically and yet one
contains activation just over threshold and the other just
under, giving the false impression of large variability. The
fact that thresholds are in any case chosen arbitrarily in-
creases the weakness of this method of judging variability.
To illustrate these issues, Figures 6 and 7 show single-
session thresholded images from analysis F of the visual
experiment. Figure 6 is created using the same threshold
as that used in McGonigle et al. , namely P ? 0.05,
corrected for multiple comparisons using Gaussian ran-
dom field theory [Friston et al., 1994]. In contrast, Figure
7 is created using a reduced threshold (the t threshold
used in Figure 6 is reduced by 33%). Obviously there is
more apparent activation when the threshold is reduced
(although it has clearly not been reduced so far that there
is generally a huge amount of spatially variable noise
activation caused by this). The interesting point, however,
is that the subjective impression of inter-session variabil-
ity is much reduced.
Finally, a question arises as to why Figure 6, which
should match the original figure in McGonigle et al. 
having been processed in the same manner, seems to
show less variability than that in the original figures. This
was found to be because suboptimal timing was used in
the original model generation (caused by a particular
default setting of the point within a TR that the model is
sampled, which also corresponds to the point during a TR
when that time point’s whole fMRI volume is assumed to
have been sampled instantaneously; this default was
changed between SPM99 and SPM99b). The reanalysis
was more efficient at estimating activation as better-
matched models were used, causing less apparent inter-
As part of the investigation of this effect, we tested the
variability in peak Z-values as the model timing was changed
slightly. The mean-across-sessions (max-across-space[Z]) value
for five different phase shifts of the model (?1 TR to ?1 TR)
were found to be 6.6, 7.5, 7.9, 7.5, 6.9 (model timing running
from earlier to later, respectively). This is quite a large effect for
these phase shifts, given that the paradigm is a block design.
Mean first-level effect size and its (within-session) standard devi-
ation, as a function of session number, for analysis A.
Mean first-level effect size and its (within-session) standard devi-
ation, as a function of session number, for analysis E.
TABLE II. Mean estimated ratio of RE (inter-session)
variance to FE (within-session pooled) variance
?Smith et al.?
? 254 ?
This is another illustration of the danger of judging variability
solely based on thresholded results.
Mean ME-Z plots
Mean ME-Z plots are shown in Figure 8. Higher ME-Z
implies less analysis-induced inter-session variance, or
viewed another way, greater robustness to session effects.
Before discussing these plots it is instructive to get a
feeling for what constitutes significant difference in the
plots. Suppose that in these figures, two ME-Z maps were
separated by a Z difference of 0.25. This would correspond
to a general relative scaling between the two maps of ap-
proximately 0.25/6 ? 4%. We are interested in the effect that
this difference has on the final thresholded activation map.
We can therefore estimate this effect by thresholding an
ME-Z map at a standard level and at this level scaled by 4%.
Thresholding at P ? 0.05, when corrected (using Gaussian
random field theory) for multiple comparisons, corresponds
to a Z threshold of approximately 5. We therefore thresh-
olded the three ME-Z images from analysis F at levels of
Z ? 5 and Z ? 5.2. For the cognitive, motor, and visual ME-Z
images, this resulted in reductions in suprathreshold voxel
counts by 11, 8, and 6%, respectively. These are not small
percentages; we conclude that a difference in 0.25 between
the various plots can be considered significant in terms of
the effect on the final reported mixed-effects activation
maps. Note that these different thresholdings were carried
out with two threshold levels on the same ME-Z image for
each comparison, hence the previous criticism of not com-
paring thresholded maps is not relevant here.
We consider here plots A, C, D, and E, the various tests
that attempted to match all settings both to each other and to
default usage. Firstly, consider comparisons that show the
relative merits of the spatial components (motion correction
Visual paradigm; analysis F single-session thresholded maximum intensity projections, thresholded
with the t threshold reduced from the “P ? 0.05 GRF-corrected” level by 33%. Note that as well
as the obvious increase in reported activation, “apparent variability” is decreased significantly.
Visual paradigm; analysis F single-session thresholded maximum intensity projections, P ? 0.05
GRF-corrected. Each image corresponds to a different day’s dataset.
?Inter-Session Variability in fMRI?
? 255 ?
and registration): A versus D and C versus E hold the
statistics method constant while comparing spatial methods.
Next, consider comparisons that show the relative merits of
the statistical components (time-series analysis): A versus C
and D versus E hold the spatial method constant while
comparing statistical components. Finally, A versus E tests
pure-FSL against pure-SPM.
Plots F and G test pure-SPM and pure-FSL, respectively,
with these analyses set up to match the specifications of the
original analyses in McGonigle et al. , including turn-
ing on intensity normalization in both cases.
The results show that both time-series statistics and spa-
tial components (primarily head motion correction and reg-
istration to standard space) add to apparent session variabil-
ity. Overall, with respect both to spatial alignment
processing and time-series statistics, FSL induced less error
than did SPM, i.e., was more efficient with respect to higher-
level activation estimation.
The experiments used a block-design, and as such are not
expected to show up the increased estimation efficiency of
prewhitening over precoloring [Woolrich et al., 2001]. In a
study similar to that presented here [Bianciardi et al., 2003],
first-level statistics were obtained using SPM99 and FSL (i.e.,
only time-series statistics were compared, not different
alignment methods). The data were primarily event related
and, as in this work, simple second-level mixed-effects anal-
ysis was used to compare efficiency of the different methods.
The results showed that prewhitening was not just more
efficient at first-level, but also gave rise to increased effi-
ciency in the second-level analysis.
FEAT offers the option of intensity normalization of all
volumes in each time series to give constant mean volume
intensity over time; however, this option is turned off by
default, as it is considered that this is an oversimplistic
approach to a complicated problem [see for example De
Luca et al., 2002].
We investigated the effect on inter-session variance of
turning intensity normalization on. It was found that this
preprocessing step does reduce the overall fixed- and ran-
dom-effects variance (on average by about 10%), and there-
fore slightly increases the fixed- and random-effects Z-val-
ues (again giving on average approximately a 10% increase
in the number of suprathreshold voxels).
One- or Two-Step Registration
FEAT does not transform the fMRI data directly into
standard space but carries all statistics out in the original
(low-resolution) space and then transforms the final statis-
tics images into standard space. The transformation from
original space into standard space is carried out normally
(automatically) in a two-step process. First, an example func-
tional image (the one used as the reference in the motion
correction) is registered to the subject’s structural image
(normally a T1-weighted image that has been brain-ex-
tracted using BET [Smith, 2002]) and then the structural
image is registered to a standard space template (normally
the MNI152). The two resulting transformations are concat-
enated resulting in a single transform that takes the low-
resolution statistic images into standard space. This is the
default FEAT registration procedure and is what was used
for the analyses presented above.
We investigated whether for this data FEAT’s two-step
process (using FLIRT) is an improvement over registering
the example functional image directly into standard space
(using FLIRT). The two-step registration resulted in a slight
decrease in cross-session fixed- and random-effects overall
variance (by approximately 3%). The number of activated
voxels in general stayed the same, but the peak Z-statistic
improved (again by approximately 3%) when two-step reg-
istration was used and the activation seemed qualitatively to
Mean mixed-effects Z-values, uncorrected and with both Z-shift correction methods.
?Smith et al.?
? 256 ?
contain more structural detail (i.e., was less blurred). The
conclusion therefore is that even for this within-subject
across-session analysis, the two-step registration approach
was of value in the FEAT analyses.
Inter-session variability is an important consideration in
power calculations for the design of fMRI experiments. It is
also a critical issue for interpretation of studies that allow for
only single observations, e.g., in many clinical applications
of fMRI. We have provided here quantitative data confirm-
ing that inter-session variability in fMRI is not large relative
to within-session variability. We also emphasize that inter-
session variability should not be judged by apparent vari-
ability in thresholded activation maps.
There are several mechanisms by which inter-session vari-
ability can be minimized. Although considerable attention
has been paid in the past to hardware and experimental
design factors, we have shown here that additional benefits
can come with optimization of analysis methodology, as
analysis methods add extra variance to the true inter-session
variance, causing an apparent increase in inter-session vari-
ance. It was found that with respect both to spatial align-
ment processing and time-series statistics, FSL v1.3 induced
less error than did SPM99b, i.e., was more efficient with
respect to higher-level activation estimation.
We thank C. Freemantle for retrieval and transfer of the
data from the Functional Imaging Lab, London, UK.
Beckmann CF, Jenkinson M, Smith SM (2003a): General multilevel
linear modeling for group analysis in fMRI. Neuroimage 20:
Beckmann CF, Woolrich MW, Smith SM (2003b): Gaussian/gamma
mixture modelling of ICA/GLM spatial maps. Neuroimage
Beisteiner R, Windischberger C, Lanzenberger R, Edward V, Cun-
nington R, Erdler M, Gartus A, Streibl B, Moser E, Deecke L
(2001): Finger somatotopy in human motor cortex. Neuroimage
Bianciardi M, Cerasa A, Hagberg G (2003): How experimental de-
sign and first-level filtering influence efficiency in second-level
analysis of event-related fMRI data. Neuroimage 19(Suppl):785.
Chee MW, Lee HL, Soon CS, Westphal C, Venkatraman V (2003):
Reproducibility of the word frequency effect: comparison of
signal change and voxel counting. Neuroimage 18:468–482.
De Luca M, Beckmann CF, Behrens T, Clare S, Matthews PM, De
Stefano N, Woolrich M, Smith SM (2002): Low frequency signals
in FMRI—“resting state networks” and the “intensity normal-
isation problem.” In: Proc Int Soc Magn Reson Med, 10th Annual
meeting, Honolulu, USA.
Evans AC, Collins DL, Mills SR, Brown ED, Kelly RL, Peters TM
(1993): 3D statistical neuroanatomical models from 305 MRI
volumes. In: Proc IEEE-Nuclear Science Symposium and Medi-
cal Imaging Conference. p 1813–1817.
Fernandez G, Specht K, Weis S, Tendolkar I, Reuber M, Fell J, Klaver
P, Ruhlmann J, Reul J, Elger CE (2003): Intrasubject reproduc-
ibility of presurgical language lateralization and mapping using
fMRI. Neurology 60:969–975.
Friston KJ, Josephs O, Zarahn E, Holmes AP, Rouquette S, Poline JB
(2000): To smooth or not to smooth? Neuroimage 12:196–208.
Friston KJ, Worsley KJ, Frackowiak RSJ, Mazziotta JC, Evans AC
(1994): Assessing the significance of focal activations using their
spatial extent. Hum Brain Mapp 1:214–220.
Holmes AP, Friston KJ (1998): Generalisability, random effects and
population inference. Neuroimage 7(Suppl):754.
Jenkinson M, Bannister PR, Brady JM, Smith SM (2002): Improved
optimisation for the robust and accurate linear registration and
motion correction of brain images. Neuroimage 17:825–841.
Jenkinson M, Smith SM (2001): A global optimisation method for
robust affine registration of brain images. Med Image Anal
Johansen-Berg H, Dawes H, Guy C, Smith SM, Wade DT, Matthews
PM (2002): Correlation between motor improvements and al-
tered fMRI activity after rehabilitative therapy. Brain 125:2731–
Leibovici DG, Smith S (2000): Comparing groups of subjects in fMRI
studies: a review of the GLM approach. Technical Report
TR00DL1, Oxford Centre for Functional Magnetic Resonance
Imaging of the Brain, Department of Clinical Neurology, Oxford
University, Oxford, UK. Available at www.fmrib.ox.ac.uk/
analysis/techrep for downloading.
Lilliefors L (1967): On the Kolmogorov-Smirnov test for normality
with mean and variance unknown. J Am Stat Assoc 62:399–402.
McGonigle DJ, Howseman AM, Athwal BS, Friston KJ, Frackowiak
RSJ, Holmes AP (2000): Variability in fMRI: an examination of
intersession differences. Neuroimage 11:708–734.
Shaw ME, Strother SC, Gavrilescu M, Podzebenko K, Waites A,
Watson J, Anderson J, Jackson G, Egan G (2003): Evaluating
subject specific preprocessing choices in multisubject fMRI data
sets using data-driven performance metrics. Neuroimage 19:
Smith SM (2002): Fast robust automated brain extraction. Hum
Brain Mapp 17:143–155.
Ungerleider LG, Doyon J, Karni A (2002): Imaging brain plasticity
during motor skill learning. Neurobiol Learn Mem 78:553–564.
Woolrich MW, Behrens TEJ, Beckman CF, Jenkinson M, Smith SM
(2004): Multi-level linear modelling for fMRI group analysis
using Bayesian inference. Neuroimage 21:1732–1747.
Woolrich MW, Ripley BD, Brady JM, Smith SM (2001): Temporal
autocorrelation in univariate linear modelling of FMRI data.
Worsley KJ, Friston KJ (1995): Analysis of fMRI time series revisited—
again. Neuroimage 2:173–181.
?Inter-Session Variability in fMRI?
? 257 ?