
Linked Independent Component Analysis for Multimodal Data Fusion

Adrian R. Groves∗,1, Christian F. Beckmann1,2, Steve M. Smith1,

and Mark W. Woolrich1,3

1FMRIB Centre, University of Oxford, UK.

2Imperial College, London, UK.

3Oxford Centre for Human Brain Activity, University of Oxford, UK.

∗Corresponding author: adriang@fmrib.ox.ac.uk.

Note: this document is a preprint, incorporating all revisions from peer review.

The full article is in NeuroImage, doi:10.1016/j.neuroimage.2010.09.073.

Abstract

In recent years, neuroimaging studies have increasingly been acquiring multiple modal-

ities of data and searching for task- or disease-related changes in each modality separately.

A major challenge in analysis is to find systematic approaches for fusing these differing data

types together to automatically find patterns of related changes across multiple modali-

ties, when they exist. Independent Component Analysis (ICA) is a popular unsupervised

learning method that can be used to find the modes of variation in neuroimaging data

across a group of subjects. When multimodal data is acquired for the subjects, ICA is typi-

cally performed separately on each modality, leading to incompatible decompositions across

modalities. Using a modular Bayesian framework, we develop a novel “Linked ICA” model

for simultaneously modelling and discovering common features across multiple modalities,

which can potentially have completely different units, signal- and contrast-to-noise ratios,

voxel counts, spatial smoothnesses and intensity distributions. Furthermore, this general

model can be configured to allow tensor ICA or spatially-concatenated ICA decompositions,

or a combination of both at the same time. Linked ICA automatically determines the opti-

mal weighting of each modality, and also can detect single-modality structured components

when present. This is a fully probabilistic approach, implemented using Variational Bayes.

We evaluate the method on simulated multimodal data sets, as well as on a real data set

of Alzheimer’s patients and age-matched controls that combines two very different types of

structural MRI data: morphological data (grey matter density) and diffusion data (fractional

anisotropy, mean diffusivity, and tensor mode).

1 Introduction

One of the greatest strengths of MR neuroimaging is its flexibility; by using different pulse

sequences in a single scanning session, one can acquire information about the subject’s tissue

volume and morphology (using high-resolution structural scans), functional activity (using BOLD

FMRI), white matter integrity (using diffusion-weighted imaging), perfusion (using ASL), and

other distinct acquisition types. The result of this is that many recent studies have acquired

these multimodal MRI data sets for each subject and analysed them separately to find changes

in different aspects of the brain. For example, several recent studies have used structural and


diffusion tensor imaging (DTI) to find changes in grey matter density and white matter tracts that

are related to schizophrenia (Douaud et al., 2007) or learning (Scholz et al., 2009). Other possible

combinations are DTI and task-related FMRI (Watkins et al., 2008) or structural, diffusion, and

resting-state FMRI (Filippini et al., 2009).

A major challenge is to find systematic approaches for fusing data across multiple MRI

modalities, in order to find any patterns of related change that may be present. We develop

a model based on Bayesian ICA to extract linked components from multimodal data, using

as inputs the subject-wise contrast images from modality-specific analyses. For example, these

inputs could be GLM contrasts from FMRI, cortical-thickness or VBM maps from structural

MRI, and skeletonised tensor measures from diffusion-weighted imaging. ICA is a particularly

effective model for finding meaningful, spatially-independent components in an unsupervised

setting because it searches for non-Gaussian spatial sources that are likely to represent real

structured features in the data. This is because linear mixing processes tend to turn non-Gaussian independent sources into more Gaussian observed signals, so seeking non-Gaussianity

is an unsupervised way of isolating the original independent sources.

Standard ICA decompositions treat the input data as a 2D matrix, typically voxels × time-

points or voxels × subjects. Multimodal data does not naturally fit into this form and there

are a number of different configurations one could consider for performing combined ICA on

multimodal data:

• Separate ICA analysis of each modality reveals the salient features for each modality.

Since some of these features are caused by distributed neurological variations they could

be visible (to varying degrees) in all modalities, with similar subject-courses.

Corresponding components can then be matched up using heuristics; however there is no

guarantee that components with strongly-correlated subject-courses will be extracted, for

example a single component in one modality might be explained as a mixture of components

in another. When potential matches are found, it can be difficult to determine if they are

simply noisy estimates of the same subject-course or whether the underlying subject-courses

are different but correlated.

A slightly more sophisticated approach to this is the Parallel ICA method described by

Liu et al. (2009) which runs separate ICAs on each modality simultaneously; when cor-

related components are detected, it adds terms to the cost function to encourage these

components to become more correlated in later iterations. This relies on a number of tun-

able constraints (learning rates and weights) to ensure convergence and balance between

modalities. Furthermore, it is still not clear how to interpret paired components where the

subject-courses are significantly, but not perfectly, correlated.

• Spatial concatenation has also been used for analysing multimodal data, combining all

of the data from each subject into a single dataset with more voxels. This “joint ICA”

method has been used before for simultaneously analysing functional maps and gray matter

maps (Calhoun et al., 2006), and has been used to extract correlations in structural grey-

matter/white-matter density data (Xu et al., 2009). Since concatenation is a preprocessing

step, the ICA model is completely unaware of which voxels belong to which modality.

However, different modalities may have different spatial source histograms. ICA effectively

assumes that each component has a single, non-Gaussian histogram as the prior distribu-

tion for all voxels in its spatial map. If this map consists of voxels from several different

modalities, the modelled histogram (which is effectively an estimate of the source distribu-

tion) may have to compromise. For example, this can occur if one modality has a small area

of strong activation (or signal change in the case of structural modalities), while the other


has a large region of weak activation. This can cause sub-optimal estimates of intensities

in spatial maps.

A related problem is that the contribution each modality makes to the ICA cost function

greatly depends on the scaling. One of the difficulties of concatenating multimodal data

is that the modalities may have different noise levels and different numbers of voxels. If

the scaling is mismatched, unsupervised methods such as PCA and ICA will be dominated

by the largest-variance modalities, or those with the most voxels. Typically these concate-

nation methods also require the same resolution and smoothing for all modalities, rather

than using optimized values for each.

There is also an issue of noise co-variance, for example due to spatial smoothing; in partic-

ular, adding more smoothing to one modality reduces the noise level but leaves the number

of voxels unchanged. The proposed method deals with this explicitly using a precalculated

correction for the number of effective degrees of freedom (eDOF), which is closely related

to the number of resolution elements (RESELs) in the image (Worsley et al., 1995).

We also expect that some of the structured signals modelled by ICA will be observable

in only one modality, and may be extremely weak or even absent in some of the other

modalities. It would therefore be useful for sources to be “switched off” in the models

where they are not needed, just as it is important to eliminate unneeded components in

the single-modality Bayesian ICA model (Choudrey and Roberts, 2001).

• Tensor ICA stacks the modalities to create a 3D data matrix. This has been used for

multi-subject FMRI analysis, with dimensions of voxels × time × subjects (Beckmann and

Smith, 2005). In the multimodal scenario this would most likely translate into voxels ×

subjects × modalities. This is related to the PARAFAC model (see Nielsen 2004 for a

VB-based implementation) but with the addition of spatial-independence priors.

This method assumes that each component has a single spatial map for all modalities, applied to each modality with different weightings. This can be a beneficial feature because it avoids unnecessary duplication of the spatial maps and can allow them to be inferred more accurately when the assumption holds. However, this is effectively a strong prior on the nature of the spatial distribution and it may be inappropriate, for example if the number of voxels is different or if the spatial maps in different modalities are not similar.
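The scaling-mismatch problem noted above for spatial concatenation is easy to demonstrate. In the toy PCA below (sizes and scalings invented for illustration), two noise-only "modalities" are stacked voxelwise, and the leading principal component's energy concentrates almost entirely in the larger-variance block:

```python
import numpy as np

rng = np.random.default_rng(5)
R = 50
a = rng.standard_normal((200, R))          # modality A: unit-variance noise
b = 100 * rng.standard_normal((200, R))    # modality B: 100x larger scaling
Ycat = np.vstack([a, b])                   # spatial concatenation
Ycat -= Ycat.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Ycat, full_matrices=False)
energy_A = (U[:200, 0] ** 2).sum()         # leading PC's weight on modality A
```

With these made-up scalings, `energy_A` is a tiny fraction of the (unit) total, i.e. modality A is effectively ignored by the decomposition.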

Using a modular Bayesian framework, we have developed a novel “Linked ICA” general model

that allows for either tensor ICA or spatially-concatenated ICA, or a combination of both at the

same time. The same subject loading matrix is shared between all of the modalities, so each

component consists of a single subject-course and one spatial map in each of the modalities.

The subject weighting matrix automatically balances information from all of the modalities.

This novel Linked ICA method will be applied to a data set with four different modalities, ac-

quired from 93 subjects (probable-Alzheimer’s patients and age-matched controls). One of these

modalities is a grey matter partial volume map (“GM”) derived from Voxel-Based Morphome-

try (VBM) methods (Ashburner and Friston, 2000), and the other three are measures of white

matter integrity: Fractional Anisotropy (FA), Mean Diffusivity (MD), and an orthogonal Tensor

Mode (MO) described in Ennis and Kindlmann (2006). These last three modalities have been

projected onto a two-dimensional white matter surface (the “skeleton”) using a Tract-Based

Spatial Statistics (TBSS) analysis (Smith et al., 2006).


(a) Linked ICA matrix diagram (b) Spatially-concatenated ICA

Figure 1: (a) The main matrices of the Linked ICA that models multimodal data Y. Note that the same subject loading matrix H is used for all of the modality groups, but otherwise they are K separate Tensor ICAs, each with separate data dimensions $N_k \times T_k \times R$ (voxels × modalities × subjects). Each of the modality groups contains one or more modalities stacked together, expressed in terms of spatial maps $X^{(k)}$, modality weights $W^{(k)}$, a shared subject-weighting matrix H, and additive noise $E^{(k)}$. (b) The spatially-concatenated ICA configuration, for comparison. This model is almost identical to a standard Bayesian ICA.

2 Theory

2.1 Linked ICA Model for Multimodal Data Sets

We assume that the data set is from a group of R subjects, each scanned using several different

modalities. It should be noted that the proposed method has the potential to be applied in

any situation where multiple modalities have been collected across a single shared dimension

(subjects, trials, timepoints, etc.). Each of the scans is prepared for analysis using whatever

methods are recommended for a linear regression analysis (or a single-modality ICA) of the

group data. This produces maps for each modality, which can have different spatial masks and

different numbers of voxels. In this model, a “modality” is defined as a single contrast image (per subject) representing a particular output extracted from the data. Typically, different

modalities will have different units, different scalings and different noise levels. In some cases,

a single analysis may result in several different contrast images; for example, a diffusion tensor

imaging (DTI) analysis can produce maps of FA (fractional anisotropy), MD (mean diffusivity)

and MO (tensor mode). These are treated as separate modalities as they contain distinct,

complementary, biophysical information.

However, to maintain the benefits of tensor ICA (inferring the same spatial patterns across

modalities) as much as possible, similar modalities can be collected into K “modality groups”.

Modalities in the same modality group must be observations of the same points in space; this

means the modalities must be spatially aligned to each other and have the same spatial mask, and

should also have similar spatial properties (for example, the same amount of smoothing). A good

example of this are multiple diffusion-derived measures projected onto a white matter skeleton

using TBSS. The data can then be packed into a set of 3D arrays $Y^{(k)} \in \mathbb{R}^{N_k \times T_k \times R}$, where $N_k$ is the number of voxels in the shared spatial map and $T_k \geq 1$ is the number of modalities in the $k$th modality group. Each modality group is modelled using a Bayesian tensor ICA model. This general configuration is shown in figure 1. Note that the Bayesian ICA differs from


standard methods like FastICA (Hyvärinen and Oja, 2000) in that it incorporates dimensionality reduction into the ICA method itself by the use of automatic relevance determination (ARD)

priors on components (Choudrey and Roberts, 2001; Bishop, 1999). The model works on the

full-dimensionality data directly and has an additive noise model. The Bayesian ICA also models

an explicitly parametrized non-Gaussian source model (in this case a Gaussian mixture model)

instead of maximizing negentropy (as used in FastICA).

2.2 Bayesian Tensor ICA Model

Within each modality-group k the data is modelled as a sum of components using a tensor

decomposition. Each component i = 1...L can be expressed as the tensor product of one

spatial map, one subject-course, and one modality-course. These model the data in modality

group $k = 1:K$, modality $t = 1:T_k$, subject $r = 1:R$ and voxel $n = 1:N_k$ as

$$Y^{(k)}_{n,t,r} = \sum_{i=1}^{L} X^{(k)}_{n,i}\, W^{(k)}_{t,i}\, H_{i,r} + E^{(k)}_{n,t,r} \qquad (1)$$

where $X^{(k)}_{n,i}$ are the spatial maps for component $i$ in modality group $k$, $W^{(k)}_{t,i}$ are the modality weightings for component $i$ in modality $t$ (of modality group $k$), and $H_{i,r}$ are the weights for component $i$ in subject $r$. For simplicity this model is used even when $T_k = 1$, so that $W^{(k)}_{\cdot,i}$ is just a scalar. Crucially, the same $H$ matrix is shared between all of the modality-groups; this forms a link between the different modality groups, which are otherwise modelled completely separately. The $i$th component has the same subject weightings across modality groups but each group has its own spatial map. Thus the number of repeats $R$ and the maximum number of components $L$ must be the same everywhere, because these dimensions are shared, while $N_k$ and $T_k$ are not. Uncorrelated Gaussian residuals are assumed, with the modality-dependent noise precision (inverse variance) $\lambda^{(k)}_t$:

$$E^{(k)}_{n,t,r} \sim N\!\left(0,\, 1/\lambda^{(k)}_t\right). \qquad (2)$$

Note that this assumes the same noise variance for each voxel, while in the original data there

may actually be large (orders-of-magnitude) differences in the white noise intensity. To correct

for this, we rely on a robust preprocessing method called variance normalization which is widely

used for ICA on functional MRI (Beckmann and Smith, 2004).
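As an illustration, the generative model of equations (1) and (2) can be sketched in NumPy. All dimensions and parameter values below are arbitrary stand-ins, and Gaussian spatial maps are used in place of the mixture-model sources:

```python
import numpy as np

rng = np.random.default_rng(0)
L, R = 5, 30                            # shared dims: components, subjects
dims = {1: (1000, 3), 2: (3000, 1)}     # per group k: (N_k voxels, T_k modalities)

H = rng.standard_normal((L, R))         # shared subject-courses (unit-variance prior)
Y = {}
for k, (N, T) in dims.items():
    X = rng.standard_normal((N, L))     # spatial maps X^(k)
    W = rng.standard_normal((T, L))     # modality weightings W^(k)
    lam = 10.0 * np.ones(T)             # modality-dependent noise precisions
    E = rng.standard_normal((N, T, R)) / np.sqrt(lam)[None, :, None]
    # Equation (1): Y[n,t,r] = sum_i X[n,i] W[t,i] H[i,r] + E[n,t,r]
    Y[k] = np.einsum('ni,ti,ir->ntr', X, W, H) + E
```

Note that only H is shared across the loop over modality groups; everything else, including the noise precisions, is group-specific.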

The sketch of the Linked ICA matrices is shown in figure 1, and figure 2 shows how these

variables fit into the full Linked ICA graphical model; this also includes the hyperparameters

explained in the next two sections. Aside from the shared matrix H, the Linked ICA model is identical

to performing separate tensor ICA analyses: each modality group k has its own separate source

mixture model, as well as having its own noise model and separate ARD priors to drive different

patterns of sparsity. Note that r indexes the “repeats” dimension and is the dimension that is

shared across modality groups, for example r indexes subjects in the multi-subject application.

2.2.1 Adaptive Modality-weighting

The tensor model (equation 1) implies that the same spatial sources $X^{(k)}_{\cdot,i}$ are used for all of the different maps $t \in 1..T_k$, with weightings given by $W^{(k)}_{t,i}$. In previous tensor ICA applications (Beckmann and Smith, 2005), this $t$ dimension indexes over repeats of the same scan, such as in multi-subject FMRI data from a study with identical stimulus timings. In that case it makes sense to assume the same noise level for all timepoints, i.e. to use only a scalar $\lambda^{(k)}$. Instead, the tensor model is being used here to consolidate several different contrast maps produced by the analysis of a single MRI scan. Since in this case each $t$ refers to a different underlying modality of data, it needs to model differences in scaling and noise levels, hence the modality-specific $\lambda^{(k)}_t$ is used.

Figure 2: The graphical model showing the relationships between all parameters and hyperparameters. Plates represent replicated variables or matrix sizes, with dimension in a top corner; for example, H is $L \times R$, and Y consists of K arrays of dimension $N_k \times T_k \times R$. Fixed hyperpriors have been omitted.

To adapt to different scalings of the signal in each modality, an ARD prior (Wipf and Nagarajan, 2008) is used on the modality-courses (W). Just like the noise level, the relative scaling of the data in each $t$ needs to be determined independently; thus independent ARDs are placed on each element of W:

$$P\!\left(W^{(k)}\right) = \prod_{t=1}^{T_k} \prod_{i=1}^{L} N\!\left(W^{(k)}_{t,i} \,\middle|\, 0,\, (\omega^{(k)}_{t,i})^{-1}\right) \qquad (3)$$

with an approximately scale-free prior on each $\omega^{(k)}_{t,i}$; as $\omega^{(k)}_{t,i} \to \infty$ that will effectively eliminate a source from that timepoint by forcing $W^{(k)}_{t,i}$ to zero with very high precision. In this approach, the number of sources is not explicitly chosen, but the method automatically determines the number of sources that are needed to optimally describe the data. We start with a full set of sources and allow the model to gradually downweight and eliminate sources that are too weak.

This means that it is now possible to eliminate a source from some modalities while keeping it in others, so it is possible to model effects like single-modality structured noise/artefacts. This means that the subject-course no longer needs an ARD, so it has a simple fixed prior:

$$P(H) = \prod_{i=1}^{L} \prod_{r=1}^{R} N\!\left(H_{i,r} \,\middle|\, 0,\, 1\right). \qquad (4)$$

(4)

This dimensionless prior on H is analogous to the fixed unit variance priors used in variational

PCA (Bishop, 1999). When a source has not been eliminated, the ARD priors on W will tend to

balance with this fixed-scale prior to keep the rows of H close to a (root-mean-squared) amplitude

of one. This means that each column of H is a dimensionless vector summarising everything that


varies between different subjects’ scans (apart from residual noise). This means that H models

normalized variability over subjects with the overall scale of this variability in the data being

modelled elsewhere, in W. This modality-independent hidden state provides a probabilistic link

between separate ICAs.

There are situations in which one modality in itself consists of several distinct timepoints; for example, multi-subject whole FMRI scans with synchronised stimuli (Beckmann and Smith, 2005) or identical structural scans acquired longitudinally to image individual neurodegeneration. These can easily be modelled in this framework by returning to a single $\omega^{(k)}_i$ and $\lambda^{(k)}$ for all timepoints of that component.

2.2.2 Independent Spatial Sources

The driving force behind an ICA decomposition is that the data is derived from a number of statistically independent spatial sources; these are the spatial maps ($X^{(k)}_{\cdot,i}$ for $i = 1...L$). By the central limit theorem, linear mixing of independent sources will produce output that is more Gaussian than the sources. An approach for finding this non-Gaussianity is to explicitly fit a non-Gaussian distribution to each source by assuming a particular functional form. This is the approach taken here, using an M-component Gaussian mixture model as proposed for independent factor analysis by Attias (1998).

This models the elements of each spatial source ($X^{(k)}_{\cdot,i}$) as being drawn from an M-component Gaussian mixture model with means $\mu_{i,m}$, precisions $\beta_{i,m}$, and component proportions $\pi_{i,m}$. This is a good approximate model for a variety of underlying distributions (Choudrey and Roberts, 2001). The Gaussian mixture model prior on the spatial maps can be expressed as

$$P\!\left(X^{(k)}_{n,i} \,\middle|\, \mu,\beta,\pi\right) = \sum_{m=1}^{M} \pi^{(k)}_{i,m}\, N\!\left(X^{(k)}_{n,i} \,\middle|\, \mu^{(k)}_{i,m},\, 1/\beta^{(k)}_{i,m}\right) \qquad (5)$$

and a hidden mixture membership variable $q^{(k)}_{n,i} = m$ indicates which mixture component $X^{(k)}_{n,i}$ was drawn from.

For simplicity, the model presented in this paper uses a fixed M = 3 mixture components. In practice, using 2 or 3 mixture components seems to extract the non-Gaussian sources from noisy simulated data reasonably well, although a Gamma-Gaussian mixture model may actually be a more appropriate model (Makni et al., 2006; Beckmann and Smith, 2004; Woolrich et al., 2005; Hartvig and Jensen, 2000).

The relatively-uninformative priors on $\mu$, $\beta$, and $\pi$ are given in appendix D. It is worth noting that there is theoretically a scale ambiguity in this model, in that simultaneously rescaling $\mu$, $\beta^{-1/2}$, $W^{-1}$ and $\omega^{1/2}$ can result in the same model fit. However, in practice, the scaling parameter W and its ARD prior $\omega$ respond much more quickly to produce overall changes in component weight, while this adaptation occurs far more slowly (if at all) when the GMM is relied upon for component scaling and elimination.
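To illustrate the source model of equation (5), the sketch below draws samples from a hypothetical M = 3 mixture (a narrow "background" component plus symmetric activation/deactivation components; all parameter values are invented) and confirms that the resulting source is sharply non-Gaussian:

```python
import numpy as np

rng = np.random.default_rng(4)
pi = np.array([0.90, 0.05, 0.05])       # mixture proportions pi_{i,m}
mu = np.array([0.0, 4.0, -4.0])         # means mu_{i,m}
beta = np.array([4.0, 1.0, 1.0])        # precisions beta_{i,m}

q = rng.choice(3, size=50_000, p=pi)    # hidden memberships q_{n,i}
x = rng.normal(mu[q], 1.0 / np.sqrt(beta[q]))
# Excess kurtosis well above 0 indicates a heavy-tailed, non-Gaussian source
kurt = ((x - x.mean()) ** 4).mean() / x.var() ** 2 - 3
```

A Gaussian would give excess kurtosis near zero; this mixture's is strongly positive, which is exactly the non-Gaussian structure ICA exploits.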

2.3 Variational Bayesian Inference

We fit this model using Variational Bayes (VB), which is a fast iterative approach for approximate

Bayesian inference (Attias, 2000). The full posterior distribution is intractable, so the mean field


approximation is used and the posterior distribution is factorized as

$$P(H,\beta,\mu,\pi,W,\omega,\lambda,X,q \mid Y) \approx \tilde{P}(H)\,\tilde{P}(\beta)\,\tilde{P}(\mu)\,\tilde{P}(\pi)\,\tilde{P}(W)\,\tilde{P}(\omega)\,\tilde{P}(\lambda) \prod_{k=1}^{K} \prod_{i=1}^{L} \tilde{P}\!\left(X^{(k,i)}, q^{(k,i)}\right) \qquad (6)$$

Notice that this explicitly factorizes the spatial sources X over components i. The solution

is still analytic without this factorization, but the number of components in the joint mixture

model grows as $M^L$ (see Attias 1998). This is an approximation, but a reasonable one because

the component sources are assumed to be statistically independent of one another.

For all details of these updates and the free energy F, see appendix C. The free energy F

was used to validate the VB updates (by ensuring that ∆F ≥ 0 after each update) and also to

monitor convergence. The analysis software is implemented in MATLAB.
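The $\Delta F \geq 0$ check used to validate the updates can be expressed as a one-line utility (a generic sketch, not the paper's MATLAB code):

```python
def free_energy_monotone(F_values, tol=1e-8):
    """True if the free energy never decreases (up to numerical tolerance)."""
    return all(b - a >= -tol for a, b in zip(F_values, F_values[1:]))
```

Running this after every update of a single parameter group is a standard way to catch bugs in VB update derivations, since each correctly-derived update is guaranteed not to lower F.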

2.3.1 Precision Contributions

When a component represents some real underlying variation between subjects, it can fuse in-

formation across several modalities. We might expect these variations to show up more clearly

in some modalities than in others, this simply being a reflection of possibly marked changes in

contrast-to-noise ratio (CNR) of each particular feature. This subsection describes a simple mea-

sure for assessing the relative influence of each different modality in defining the subject-course

of each component.

The latent space H is shared between all modalities, so its posterior P?(H) combines contri-

butions from all of the modalities (as well as the prior). This is the only time that estimates

from across modalities are combined, so it is an appropriate place to look in order to find out

which modalities are driving a particular component.

The technical details are explained in appendix E. Basically, each modality provides its own

estimate of H and the posterior mean $\langle H \rangle$ is a precision-weighted average of these. To find the dominant

modalities in estimating each source’s subject-course, it is informative to look at these precisions.

In the results section the figures will show these “precision contributions” normalized so that

the sum of all contributions is 1 for each component i. This makes it easy to see if a component

is dominated by one modality or is informed by a combination of several modalities. The prior is

also included in this sum, so if a component is eliminated, the dominant precision contribution

will be from the prior.
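As a toy illustration of these normalized precision contributions (all precision values made up): with two modality groups plus the N(0,1) prior, a surviving component is dominated by the data, while an eliminated one is dominated by the prior:

```python
import numpy as np

# Rows: precision contributed to the posterior of H by modality group 1,
# modality group 2, and the N(0,1) prior; columns: two components.
prec = np.array([[50.0, 0.2],
                 [30.0, 0.3],
                 [ 1.0, 1.0]])
contrib = prec / prec.sum(axis=0)   # normalize so each column sums to 1
```

In this example the first component draws most of its subject-course precision from group 1, while the second component's largest contribution comes from the prior, marking it as effectively eliminated.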

2.3.2 Spatial smoothness correction

Most MRI data used in real analyses will have had some level of spatial smoothing applied as

a deliberate preprocessing step. This is primarily because the signals tend to be extended (in

the case of BOLD FMRI) or are not perfectly aligned across subjects (in the case of a VBM-

style analysis). In these cases, spatially smoothing improves the SNR of the signals of interest.

However, we are using an uncorrelated white noise model (equation 2), which means that the

model believes it has more independent data points than there are actual degrees of freedom in

the data.

There are several approaches available to correct for smoothness. The most direct approach

is to use an explicit model of spatially-smooth noise, e.g. using a Gaussian process as the noise

model. This can adaptively determine the amount of smoothness in the noise and deconvolve

the smoothing kernel. However, this approach is prohibitively slow and can introduce severe

numerical problems.


A simpler approach for reducing the impact of this eDOF/voxel mismatch is to decimate the

data, reducing the number of voxels while retaining as many independent measurements as possi-

ble. There are practical problems with this approach, particularly in the choice of exactly which

voxels to remove and the fact that some information will always be lost (and some correlations

will always remain).

Instead of decimating the data to remove this spatial correlation structure, we perform a “vir-

tual decimation” by downweighting the effective number of voxels in the VB update equations.

This keeps all of the original voxels but downweights any summation over voxels to reflect the

effective degrees of freedom (eDOF) $\nu_k$ instead of the actual number of voxels $N_k$.

By keeping this decimation factor fixed, the VB updates and free energy F remain consistent, even permitting model comparison as usual. At present, the smoothness is estimated as a preprocessing step and kept constant across all methods. This approach is fully described in appendix A and its effect will be demonstrated on simulated data in the Results section.
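The effect of virtual decimation on the update equations amounts to a single scalar weight per modality group (values below are hypothetical): every sum over voxels is multiplied by $\nu_k / N_k$, as if only $\nu_k$ independent voxels had been observed:

```python
import numpy as np

N_k = 1000          # actual voxels in modality group k
nu_k = 250.0        # effective DOF after smoothing (precomputed in preprocessing)
w = nu_k / N_k      # virtual-decimation weight

resid_sq = np.random.default_rng(1).standard_normal(N_k) ** 2
weighted_ss = w * resid_sq.sum()   # e.g. as used in a noise-precision update
```

Because `w` is fixed before inference, the free energy remains a consistent quantity across iterations and across candidate models.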

2.3.3 Preprocessing

Each modality’s data is de-meaned in the subject dimension. This removes the mean spatial

map to emphasize differences between subjects. The mean level of each map (e.g. each subject’s

total grey matter density) is not removed because this contains important information that

distinguishes subjects.

A more serious issue is that the variance can vary enormously from voxel to voxel. This is

especially true in FMRI data (where there can be a two order-of-magnitude difference in noise

variance between CSF and white matter voxels), and it is also true (to a lesser extent) for the

structural modalities used in this paper. The current model assumes the same noise precision λ(k)

for all voxels in a modality. In principle it is possible to estimate per-voxel noise levels but instead

we use the well-established empirical method for variance normalization from probabilistic ICA

(Beckmann, 2004). This attempts to estimate the per-voxel scaling of the underlying white noise

by looking only at the centre of the intensity histograms and ignoring the tails.
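A minimal stand-in for this idea (not the exact method of Beckmann, 2004) estimates each voxel's noise scale from the central part of its intensity histogram, here via the interquartile range, so that signal in the tails does not inflate the estimate:

```python
import numpy as np

def variance_normalize(Y):
    """Rescale each voxel (row) by a robust noise-std estimate.

    Uses the interquartile range of each voxel's values across subjects,
    converted to a standard deviation assuming Gaussian noise, so that
    the histogram tails (signal) are largely ignored.
    """
    q25, q75 = np.percentile(Y, [25, 75], axis=1, keepdims=True)
    sigma = (q75 - q25) / 1.349      # IQR -> std for a Gaussian
    sigma[sigma == 0] = 1.0          # guard against constant voxels
    return Y / sigma
```

After this step the white-noise floor has roughly unit variance in every voxel, which is what the shared per-modality precision $\lambda^{(k)}$ assumes.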

As with most ICA implementations, Linked ICA is initialized from a PCA decomposition.

For multimodal data, the natural method is to concatenate all of the voxels across modalities $k,t$ (to get a $(\sum_{k=1}^{K} N_k T_k) \times R$ matrix) and then do a PCA decomposition on that. The full details are given in appendix B.
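A sketch of this initialization (arbitrary sizes, noise-only data for illustration): stack all modality groups voxelwise, then take the leading right singular vectors as the initial subject-courses:

```python
import numpy as np

rng = np.random.default_rng(2)
R, L = 30, 5
# Two modality groups with (N_k, T_k) = (1000, 3) and (3000, 1):
blocks = [rng.standard_normal((1000 * 3, R)), rng.standard_normal((3000 * 1, R))]
Ycat = np.vstack(blocks)                     # (sum_k N_k T_k) x R matrix
U, s, Vt = np.linalg.svd(Ycat, full_matrices=False)
H0 = Vt[:L, :]                               # initial L x R subject-courses
```

The spatial maps for each group can then be initialized by regressing each group's data onto `H0`.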

2.3.4 Component elimination

To provide a large (∼10x) speedup in inference, eliminated components (or part-components) are

removed from the model completely, avoiding additional inference on these zero-weight spatial

maps. These are actually removed from the model as well, removing the free energy cost asso-

ciated with keeping these parameters around; this cost is partially related to the factorization,

and is highly dependent on the priors selected, e.g. the cost is doubled if an “uninformative”

$P(\omega) \sim \mathrm{Ga}(10^{12}, 10^{-12})$ prior is used instead of a $\mathrm{Ga}(10^{6}, 10^{-6})$. This does not affect the estimated posterior in any significant way, but it does greatly change the model comparison, especially with many initial components (L = 90). When using this elimination approach, the model comparison results are independent of the number of extra components in the model.
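The pruning step itself is straightforward; a sketch (threshold and ARD values made up): once a column's ARD precision has diverged, drop that column from the model matrices rather than continuing to update it:

```python
import numpy as np

omega = np.array([1.2, 0.8, 5e7, 2.1, 9e9])   # ARD precisions on columns of W
keep = omega < 1e6                             # finite precision => still active
W = np.random.default_rng(3).standard_normal((4, 5))
W = W[:, keep]                                 # eliminated components removed entirely
```

The corresponding columns of H and the associated spatial maps would be dropped in the same way, which is where the ~10x speedup comes from.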

2.3.5 Convergence

Convergence is monitored by evaluating the free energy F. In practice, evaluating F takes longer

than a full cycle of updates so it is not done in every iteration. We evaluate F at logarithmically-spaced numbers of iterations, so that it is evaluated when the iteration count is $\lceil (\sqrt{2})^{n} \rceil$ for integer $n$.

Convergence is declared when the change in F per iteration drops below 0.1.
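For concreteness, this evaluation schedule can be computed as follows (a direct transcription of the rule above):

```python
import math

# Evaluate F when the iteration count equals ceil((sqrt 2)^n) for integer n:
sched = sorted({math.ceil(2 ** (n / 2)) for n in range(1, 15)})
```

Early iterations are checked frequently, while later checks are spaced roughly 41% apart, keeping the cost of evaluating F negligible over a long run.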

2.4 Compared Approaches

Of the existing data configurations discussed in the introduction, spatially-concatenated ICA

(fig. 1b) represents the only reasonable way to arrange all of the available multimodal data into

a single ICA decomposition. This enables all of the data to be used in inferring subject-courses

H, but treats all voxels the same regardless of which modality they came from, and also loses

the spatial correspondence between the modalities in any modality groups.

All voxels are concatenated across modalities and a single, large spatial map containing all

of the modalities is used, with the same noise-normalizing scaling as used for the initial PCA.

This is the same configuration of the data used in joint ICA (Calhoun et al., 2006) but we use

Bayesian ICA for inference. This “Concatenated ICA” approach is the baseline model against

which the Linked ICA models will be evaluated.

In this situation, all of the spatial maps are concatenated voxelwise and the same mixture

model is used for the entire concatenated spatial map. Furthermore the tensor model is flattened

out so that instead of modalities sharing the same basic pattern with different scalings, the

link between corresponding voxels is lost and the spatial map has to be learned separately for

each modality, as illustrated in figure 1(b). The W matrix is still present but it can only scale

(and eliminate) each component from the entire model. This makes the model identical to the

Bayesian ICA model of Choudrey and Roberts (2001), aside from the fact that the ARD is on

the scaling matrix diag(Wt,·) rather than on the rows of H.

For fairness, the correction for spatial smoothness described in appendix A is also applied

to the concatenated data. Because the concatenated modalities may have different amounts of

intrinsic smoothness, the weight applied to each voxel will depend on which modality it came

from. To our knowledge this correction is not used in existing multimodal methods (Calhoun

et al., 2006; Xu et al., 2009) which avoid the issue by using identical spatial smoothing levels

even across very different modalities.

3 Evaluation and Results

3.1 Simulated Multimodal Data

This section presents a simulated multimodal data set, which will be analysed using linked tensor

ICA and spatially-concatenated ICA to demonstrate the differences between the two approaches

in terms of modelling common (multimodal) components and single-modality structured noise

components.

A simulated multimodal data set was constructed with four modalities in two modality groups.

The first group contains three modalities of 1000 voxels each that share the same spatial patterns

with different weightings, and the second group has a single, 3000-voxel modality. The

spatial maps are shown in figure 3. For all of the images of simulated data, the spatial maps

shown have a consistent colour scale running from Z = ±5, i.e. five standard deviations of the

noise in estimating the spatial map.

There were three common components (labelled C1–C3) that were expressed in each of the

modalities, and a structured noise source that was unique to each modality (N1–N4).

The subject-courses (not shown, R = 100) are white noise, but C1 and C2 were generated with somewhat correlated subject-courses (30%) in order to make crosstalk (cross-contamination) between this pair of sources more likely. This means that the initial PCA will mix the signal components together, so the ICA method must move away from its initialization point and seek non-Gaussian histograms in order to accurately separate components with non-orthogonal subject-courses.

(a) Spatial maps for modalities 1a–1c and 2 (b) Source intensity distributions: histograms of the input activations/deactivations for modality groups 1 and 2, with inactive voxels at zero.

Figure 3: The simulated multimodal data. (a) There are seven components: three shared sources

(C1–C3) that appear in all four modalities, and four structured noise sources (N1–N4) each ap-

pearing in a single modality. The first three modalities (1a–1c) have 1000 voxels each (20×50) and are in the same modality group and therefore share the same spatial maps; in this example all of the weights are the same (set to 1) but they have different noise levels. The last modality (labelled modality 2) has 3000 voxels (60×50) and different spatial maps from 1a–c. All of these

images are scaled consistently relative to the noise level in the data. (b) The histograms of the

two modality groups are very different: group k = 1 has 40% of its voxels active (half positive

and half negative), while group k = 2 has only 10% active (all positive) but the activation is

stronger. Note that the remaining 60% or 90% of voxels in both histograms are completely

inactive so they are shown by the peak at exactly zero.


(a) High noise (b) Low noise: precision contribution plots (components 1–10 vs. modalities and the prior) for Concatenated ICA and Linked tensor ICA (c) Relative free energy of the Linked and Concatenated models under high and low noise.

Figure 4: Inferred precision contributions for the simulated data set, showing which modalities dominate in determining each source's subject-course. Notice that the Linked ICA method consistently identifies the structured noise components (4–7) while the Concatenated ICA exhibits mixing (most severely in high noise). (c) Model comparison results, showing that the true Linked ICA model is preferred to the Concatenated ICA model in both data sets (more so in low noise).

Figure 3(b) shows the histogram of the true activation levels used in the simulation. These

were chosen so that the two modality groups had very different histograms in terms of sparsity,

symmetry, voxel count and SNR. Gamma distributions were used because the heavy tails are

thought to more accurately reflect the properties of real activation than a Gaussian. The acti-

vation distributions shown only account for part of the intensity histogram; the remaining 60%

or 90% of voxels are inactive and therefore collect in a large point mass at exactly zero. In the

high-noise simulation, the white noise added to the four modalities had standard deviations of

25, 30, 35, and 50 respectively. In the low-noise simulation, the same signal was used but with

noise scales of 15, 20, 25, and 40.
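One modality of such a simulation can be sketched as follows: a sparse spatial map with a point mass at exactly zero and heavy-tailed Gamma magnitudes for the active voxels, multiplied by a white-noise subject-course and corrupted with white noise. The Gamma shape/scale parameters here are illustrative assumptions; the paper does not state the exact values used:

```python
import numpy as np

rng = np.random.default_rng(2)
R, N = 100, 1000          # subjects, voxels (modality group 1 size)

def make_spatial_map(n_vox, frac_active, frac_positive, rng):
    """Sparse spatial map: inactive voxels are exactly zero (point mass),
    active voxels draw heavy-tailed Gamma magnitudes with random sign."""
    x = np.zeros(n_vox)
    active = rng.choice(n_vox, int(frac_active * n_vox), replace=False)
    mag = rng.gamma(shape=3.0, scale=3.0, size=active.size)  # illustrative params
    sign = np.where(rng.random(active.size) < frac_positive, 1.0, -1.0)
    x[active] = sign * mag
    return x

x = make_spatial_map(N, frac_active=0.40, frac_positive=0.5, rng=rng)
h = rng.normal(size=R)                                     # white-noise subject-course
Y = np.outer(h, x) + rng.normal(scale=25.0, size=(R, N))   # high-noise level
print(np.mean(x == 0))  # 0.6 of voxels exactly zero
```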

The number of components was set to L = 10 so that with a true dimensionality of 7 there

was work for the dimensionality estimation (via the ARD prior on W) to do. The inference ran

until it converged (∆F < 0.1), which took 200–1000 iterations. This MATLAB code took about

5 minutes to run each inference on a single core of a 2.4 GHz AMD Opteron 8431 processor.

3.2 Results on Unsmoothed Simulated Data

The precision contribution plots are shown in figure 4(a,b). In the Linked ICA model at both noise

levels, the three shared sources (components 1–3) each split their precision fairly equally between

the four modalities. The structured noise sources (4,5,6,7 or 4,5,6,9) are mostly determined by

a single modality, and the others are eliminated completely (dominated by the prior precision).

In the concatenated model at high noise, only 4 sources are inferred and the rest are eliminated.

The last component (4) primarily models the strongest structured noise source.

This occurs in high noise (but not low noise) because these structured noise components

are only slightly detectable above the high noise level, so the Concatenated model determined that switching off those additional components (by inferring W_i = 0) provides a more concise explanation of the data. In the Linked ICA model it is possible to switch off each modality's contribution to component i separately (by inferring W^(k)_ti = 0 for only some modalities k, t), so


the complexity penalty for keeping each component is reduced. Model comparison also strongly

prefers the correct model (figure 4c) over several alternatives. The subject-courses are all ex-

tracted slightly more accurately with Linked ICA, but this improvement is relatively small (not

shown).

Figure 5 shows the inferred spatial maps. Linked ICA correctly extracts all seven sources in

both the high- and low-noise data, while 3 of the structured noise sources are lost in the high-

noise data in the concatenated approach. All of the common-signal spatial maps are recovered

well. The signal from C1 is shown contaminating signal C2 in the concatenated model, while

linked ICA separates these signals much more cleanly.

The ROC curves show that the Linked ICA method discriminates active from non-active

voxels more accurately than the concatenation approach, at both noise levels. This advantage is

largest in modality group 1 because Linked ICA knows that the three modalities share the same

spatial map whereas the concatenated approach does not. A small improvement is observed even

in the maps for modality 2 where there is no such benefit. This may be partially due to the

improvements in subject-course estimates or because it models separate histograms for modality

group 1 and modality group 2.
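The ROC comparison can be reproduced schematically: threshold the absolute inferred spatial map and sweep the threshold from high to low, accumulating true- and false-positive rates. `roc_curve_abs` below is a hypothetical helper on synthetic data, not the evaluation code used in the paper:

```python
import numpy as np

def roc_curve_abs(est_map, true_map):
    """ROC for discriminating active (true_map != 0) from inactive voxels
    by thresholding |est_map|."""
    scores = np.abs(est_map)
    labels = true_map != 0
    order = np.argsort(-scores)          # descending threshold sweep
    tp = np.cumsum(labels[order])
    fp = np.cumsum(~labels[order])
    return fp / max(fp[-1], 1), tp / max(tp[-1], 1)

rng = np.random.default_rng(3)
true = np.zeros(1000)
true[:100] = 5.0                       # 10% active voxels, as in modality group 2
est = true + rng.normal(size=1000)     # noisy estimated spatial map
fpr, tpr = roc_curve_abs(est, true)
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoidal AUC
print(auc > 0.9)
```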

To assess the sensitivity to initialization, we re-ran these simulations using random initial

subject-courses instead of the PCA decomposition. On the low-noise simulation, the results for

both Linked ICA and Concat ICA were nearly identical in terms of final subject-courses and

spatial maps (although of course the order of the components was different). The modality-

weights were also very similar, with the final weights varying by ±0.6% on Linked ICA and

±3.5% on Concatenated ICA.

On the high-noise simulation, some of the weaker components (especially N4) were lost due

to the poorer initializations. Note that in these cases we used L = 90 so that there were more

components to eliminate; starting with L = 10 gave similar results but was more likely to lose

components.

We also initialized using the minor PCA subspace, i.e. discarding the strongest 10 PCA

components instead of using them as initial components; again, the resulting subject-courses,

spatial maps and modality weights were nearly identical. This illustrates that the initial PCA is

not a hard dimensionality-reduction step. The variance that is not part of any initial component

is not discarded, but is instead initially treated as part of the noise. The signals buried in this

noise can be sought out and recovered within the iterative Bayesian ICA framework.

We also assessed the effect of changing the Gaussian mixture model order (from the default

value of M = 3) on the low-noise simulated data. In Concatenated ICA, using the simpler M = 2

model made some components slightly worse (e.g. reducing correlation with the true spatial map

from 0.772 to 0.767) and others slightly better (e.g. from 0.743 to 0.748). This is not surprising

because Concatenated ICA uses a single average histogram for two different sources; changing

the prior on that histogram can bias the estimates towards having slightly heavier or lighter tails,

which may fit one source better and one worse. In the other direction, increasing from M = 3

to M = 4 had a far smaller effect. In Linked ICA, the differences were all negligible. This shows

that M=3 is a sensible choice and it is used throughout this paper.

3.3 Smoothed simulation results

The results in figure 4 were based on simulated data with no spatial correlations in the noise.

However, real neuroimaging data often has significant spatial smoothing and this section shows

the behaviour of the Linked ICA model with and without the smoothness correction (described

in section 2.3.2 and appendix A).

The data set from the previous section was re-generated using higher noise standard deviations


(a) Linked ICA spatial maps (b) Concat. ICA spatial maps (c) ROC curves (high noise) (d) ROC curves (low noise)

Figure 5: Inferred spatial maps on simulated data in high noise (a) using linked ICA and (b)

using concatenated ICA; the true sources are shown in figure 3(a). (c–d) ROC curves show how well each method's |⟨H⟩| discriminates active voxels (both positive and negative) from

inactive voxels. The solid red lines show Linked ICA results while the dashed blue lines show

the concatenated ICA results, in the (c) high noise and (d) low noise conditions. The diagonal

dotted line indicates chance. Note that there are no dashed blue lines for N2–N4 in high noise

because Concatenated ICA did not find components to match those sources.


of [30 40 50 80]. We applied a spherical Gaussian smoothing kernel to each simulated modality,

using a FWHM of 2 voxels on modalities 1a–1c, and 4 voxels for modality 2. The exact DOF

in each case (as given by equation A.1) is 0.23 DOF/voxel and 0.058 DOF/voxel respectively.

Estimating this from the data yields estimates of 0.18, 0.19, and 0.20 DOF/voxel for modalities

1a, 1b, and 1c respectively; these imply greater smoothness than is really present because they

include the very smooth signal in the data, and 1a is more severely affected because it has a

higher SNR. Modality 2 is estimated at 0.047 DOF/voxel. When these estimates are computed

on the noise only (i.e. the residuals), the estimates are considerably more accurate, 0.25 and 0.061

DOF/voxel. Clearly it would be more accurate to compute DOF/voxel on the residuals rather

than the original data; however, for reasons discussed in appendix A, the current implementation

uses the values derived from the full data set. We also average the estimate of DOF/voxel across

modalities 1a-c, because they share spatial maps and therefore ought to have the same true

amount of smoothness. Very similar results are obtained if the exact DOF/voxel values are used

instead.
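The effect being corrected for can be demonstrated empirically in one dimension: smoothing white noise with a normalized kernel w reduces the per-voxel variance by roughly Σw², which is one simple proxy for the effective DOF per voxel. A boxcar kernel stands in here for the Gaussian FWHM kernels; this is a generic illustration, not the RESEL-based formula of appendix A:

```python
import numpy as np

rng = np.random.default_rng(4)
noise = rng.normal(size=(200, 4000))   # many realisations of 1-D white noise

k = 4                                  # boxcar width, stand-in for FWHM
kernel = np.ones(k) / k
smooth = np.apply_along_axis(
    lambda r: np.convolve(r, kernel, mode="same"), 1, noise)

# For a normalised kernel w, smoothed noise variance is sigma^2 * sum(w^2),
# so the variance ratio approximates the effective DOF per voxel.
ratio = smooth.var() / noise.var()
print(round(ratio, 2))  # ~0.25 for a width-4 boxcar
```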

The precision contribution plots are shown in figure 6, illustrating the necessity of correcting

for spatial smoothness. Note how the DOF/voxel weighting enables the method (fig 6a) to

estimate the dimensionality almost perfectly, even starting from L=90. Only one extraneous

component survives (and the large weighting given to the prior contribution to this component

indicates that it was close to being eliminated). The inferred components are accurate (fig 6c). In

contrast, fig. 6b shows that without the DOF correction all 90 components are kept. A subset

of these are shown in 6d, demonstrating that most of these are modelling spatially-smooth noise

patterns.

The common signal and structured noise components are well-recovered by both methods,

but in the uncorrected approach note that the common components (C1–C3) are dominated by the

smoother modality (2), because smoothing reduces the noise level without changing the number of

voxels. In the corrected method, the weighting of these components is similar to the unsmoothed

case. Also, note that one of the structured noise terms is pushed down to component 76 by the

many erroneous noise-smoothness components in modality group 2, which have larger apparent

significances.

3.4 Separation of correlated subject-courses

All of the configurations discussed in this paper assume that there is a single subject-course

for each component, which is identical across modalities. If the subject-courses are different

across modalities, even if they are strongly correlated, then they should be split into separate

components. The model has to make this hard distinction, so it is possible for two similar

components to be combined into one, or for a single component to be split due to noise.

The Linked ICA methods reduce this problem by informing the model where the divisions

between the modalities lie, making it easier to split the component apart when the modalities’

subject-courses are different. We simulated this by modifying the low-noise simulated data to

change the subject-course used to generate modality group 2’s data, starting from the subject-

courses being identical (correlation = 1) down to them being almost decorrelated (correlation =

0.1). For high correlations (>0.75), both methods combine the components; for low correlations

(<0.35), both methods split them. However, Linked ICA splits these subject-courses much earlier

as shown in figure 7(a).
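Subject-courses with a controlled correlation can be constructed by mixing the original course with noise that has been orthogonalized and norm-matched to it. This is a sketch of one way to do it; the exact generation procedure used in the paper is not specified in this excerpt:

```python
import numpy as np

def correlated_course(h, target_r, rng):
    """Return a new subject-course whose correlation with h is target_r,
    by mixing h with orthogonalized, norm-matched white noise."""
    z = rng.normal(size=h.size)
    z -= z.mean()
    h0 = h - h.mean()
    z -= h0 * (z @ h0) / (h0 @ h0)          # make noise exactly orthogonal to h
    z *= np.linalg.norm(h0) / np.linalg.norm(z)
    return target_r * h0 + np.sqrt(1 - target_r**2) * z

rng = np.random.default_rng(5)
h = rng.normal(size=100)
h2 = correlated_course(h, 0.5, rng)
print(round(np.corrcoef(h, h2)[0, 1], 3))  # 0.5 by construction
```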

Furthermore, figure 7(b) shows that this earlier splitting results in more accurate recovery of the modality-group-2 subject-courses (C1′ to C3′). This indicates that the modalities are being modelled separately, rather than using one component to model a weighted average of the two subject-courses and creating a new “difference” component (e.g. with a subject-course C1′ − C1) to


(a) Precision contributions with DOF/voxel correction (b) Precision contributions without DOF/voxel correction (c) All components with DOF correction (d) Some components (1–10, 70–79) without DOF correction

Figure 6: Analysis of simulated data smoothed with Gaussian kernels with FWHM of 2 voxels

and 4 voxels on modality groups 1 and 2 respectively. (a-b) The precision contribution plots

with and without the correction for smoothness. (c-d) Spatial maps of the components inferred

with and without this correction. Without the correction, many extra components are inferred

that model spatially-smooth noise in the data.


(a) Number of components extracted vs. correlation of subject-courses between modality groups, for Linked ICA and Concat ICA (b) Accuracy of the best recovered subject-course (C1–C3, C1′–C3′) vs. correlation, for Linked ICA and Concatenated ICA

Figure 7: The effect of progressively decorrelating the subject-courses between the two modality-

groups: (a) on the number of components extracted by each method, (b) on the accuracy of

recovering these subject-courses. C1–C3 denote the subject-courses in modality group 1 of the

original three common components, and C1’–C3’ are subject-courses of the similar components

in modality group 2. At the right side of each figure (correlation between subject-courses = 1),

C1–C3 are identical to C1’–C3’.

soak up the variance caused by the mismatched subject-courses.

3.5 Analysis of Real Structural and Diffusion Data

To evaluate this method on real multimodal group data, several Linked ICA configurations (figure

8) were compared to Concatenated ICA in the task of extracting independent components from

a structural and diffusion data set with 47 probable Alzheimer’s patients and 46 age-matched

controls. Exploratory techniques can be used to find inter-subject variability and identify whether

any of these are correlated with regressors of interest; the Linked ICA approach provides a way

to perform this across multiple modalities in a data set. Both grey matter density and white

matter integrity have previously been used as biomarkers for neurodegeneration.

To assess these, structural and diffusion scans were collected for these subjects. The diffusion scans were

preprocessed to extract maps of Fractional Anisotropy (FA), Mean Diffusivity (MD) and Tensor

Mode (MO). These were projected onto a white matter skeleton using TBSS, which improves

registration and makes sure that observed differences are due to white-matter tract properties

and not just movement or misregistration. The grey matter (GM) partial volume maps were

extracted using the FSL-VBM tools (including non-linear registration).

Generally, neurodegeneration results in reduced GM density, decreased FA, and increased

MD. Tensor mode MO is another measure derived from the diffusion tensor which is mathemat-

ically orthogonal to the other two (FA and MD) and is related to whether diffusion is restricted

in a line or in a plane (Ennis and Kindlmann, 2006), and therefore may have significance for

assessing degeneration in areas where fibre bundles cross.

The TBSS modalities were resampled to (2 mm)³ voxels on the skeleton, yielding N1 = 28997

voxels. The grey matter maps were already smoothed by 7.1mm FWHM to account for anatom-

ical differences, and then subsampled by a factor of two to reduce computation time, yielding


(a) Linked tensor ICA (b) Linked flat ICA (c) Concatenated ICA

Figure 8: Matrix diagrams of the ICA configurations evaluated for the real data: (a) Linked ICA

with the three DTI modalities configured as a tensor and linked to GM, (b) Linked ICA in flat

configuration, with all four modalities linked together by a shared subject-course matrix H only,

(c) spatially Concatenated ICA, where all modalities share subject-courses H, scaling factors

W, and the GMM source models on X.

N2 = 23402; nearest-neighbour downsampling the GM to 4×4×4 mm voxels does not discard much information in this case because of the high smoothing. The degrees of freedom per voxel were estimated to be 0.017 for FA, 0.018 for MD, 0.022 for MO and 0.029 for GM; the first three were averaged yielding ν1/N1 = 0.019 and ν2/N2 = 0.029.

It is worth noting that these are completely different preprocessing steps, which are not

explicitly matched in terms of voxel count, smoothness, or SNR. Instead, these parameters are

estimated from the data and the Bayesian model uses this information to automatically weight

the modalities appropriately.

Since all the DTI modalities exist in the same space, they can be combined in a single

modality group to give a linked tensor model. They can also be linked in a flat formation, with

different histograms and spatial maps for each modality. The concatenation approach is used as

the baseline. These configurations are presented in figure 8.

The following models were evaluated:

• Linked tensor ICA is the Linked ICA model with the three DTI modalities stacked into

a tensor, linked to the GM modality; so K = 2 modality groups and T = [3,1] modalities

in each.

• Linked flat ICA is the linked ICA model which assumes unrelated spatial maps for each

of the four modalities, with K = 4 modality groups and a single modality in each group

(T = [1,1,1,1]).

• Concatenated ICA is the standard spatial-concatenation model. This is different from

Linked flat ICA because the same histogram is used for all modalities and there is no


(a) Precision contribution plots for Linked Tensor ICA, Flat Linked ICA, and Concatenated ICA (components vs. FA, MD, MO, GM, and the prior) (b) Model comparison: relative free energy for the LT, LF, and C configurations.

Figure 9: Left: Precision contribution plots for the Alzheimer’s data set; the brightness indicates

the relative strength of the modalities’ contributions to each component. Linked tensor ICA uses

two modality groups: the three DTI modalities (FA, MD, and MO) form one group, while GM

is by itself. Linked non-tensor ICA puts each of the four modalities in its own group, yielding

35 components overall (many of them explaining only a single modality). Spatial concatenation

yields 24 components, with nearly all of them providing a mixture of signals from across all

modalities. Right: Model comparison results for Linked tensor ICA (LT), Linked flat ICA (LF)

and Concatenated ICA (C). The strongest evidence is for the LT model, followed by LF.

sparseness in modalities. Concatenated ICA also assumes the same noise variance for

all modalities, but this is a reasonable assumption because the voxelwise noise-variance

normalization is designed to leave all voxels (in all modalities) with the same noise level.

As far as the model is concerned, there is only one modality (K = 1, T = [1]).

The VB inference was allowed to run until a fairly stringent ∆F < 0.1 condition was satisfied

(around 1000-5000 iterations). The current MATLAB implementation takes around 10-40 hours

of calculation for L = 90.

The precision contributions plots on the resulting fits of each model are shown in figure 9.

The model comparison results also show that the tensor models are greatly preferred to the

flat configuration, and that all Linked models are preferred to the concatenated approach. In

Concatenated ICA, most of the components are spread across all of the modalities. In Linked

ICA there is much more separation between the modalities: some components appear to explain

variability in the white matter only, while some are shared between white matter and grey matter.

This may indicate that there is some variability between subjects in the white matter that is


not reflected in grey matter, and vice versa; alternatively, some of these components may be

artefacts present in the individual modalities. Both Linked tensor ICA and Linked flat ICA give

sparser solutions (by excluding some modalities from some components) and therefore choose

to have more components than the Concatenated ICA model. The Linked tensor ICA model

has the most restrictive model of each component and therefore uses the most components. In

this way, components that have correlated but distinct subject loadings in different modalities

are more likely to be split because the Linked (flat and tensor) ICA models are provided with

information about where the modality boundaries lie.

Figures 10 and 11 show a strong component that is well-preserved across all of these models,

and is correlated with age and pathology. In Linked flat ICA (fig.10a) this component involves

all modalities. The widespread increase in MD and decrease in GM are consistent with neural

degeneration and brain atrophy (the smaller areas of apparent increase in GM are on the edge

of GM, hence are most likely indicative of imperfect alignments between the groups in this fairly

high-atrophy dataset). The other DTI patterns are more complicated: in the corpus callosum and

forceps major both FA and MO decrease, while both of these measures increase in the internal

capsule, corona radiata, and superior longitudinal fasciculus. These areas of increasing FA and

MO, for example, where the superior longitudinal fasciculus crosses the descending fibres, are

probably due to degeneration of the “weaker” fibre in this crossing-fibre area, as investigated by

Douaud et al. (2010). Both of these are also consistent with neurodegeneration, because these

signals will decrease in single-fibre direction areas but will increase due to selective degeneration

of one tract in areas of crossing fibres. Concatenated ICA (fig.10b) reveals a similar pattern in

this component; however, it is far less extensive in all modalities, which may partially be due to

the use of a single, compromise source model for all modalities. Furthermore, MD both increases

and decreases in places, which is less physiologically interpretable.

Note that the patterns present in FA and MO are very different from those seen in MD, so

this pattern can not be expressed as a single component in the tensor model. As a result, Linked

tensor ICA appears to split this into three components, shown in figure 11.

Finally, figure 12 shows a component that is extracted very similarly by all three ICA methods.

There are a number of components like this in the decomposition, automatically decomposing the

white matter into bilateral pairs of tracts that vary across groups of subjects. This component

isolates the external/extreme capsule, and in this component the tensor assumption is valid

because all methods infer that FA and MD have nearly identical spatial maps (down to a scale

factor). The Concatenated ICA does not isolate the tract as strongly and also shows scattered

“related” changes in GM that appear to be spurious.

We also re-analysed the real data using a random initialization rather than the PCA decom-

position. Free energy results showed that this produced a poorer fit to the data: the final free

energy changed by -31 for Concatenated ICA, -110 for the Linked flat ICA and -249 for the

Linked tensor ICA. For context, a difference of 3 is usually considered strong evidence in favour

of the higher-free-energy model over the lower-free-energy model. Looking at this in more detail,

the inferred subject-courses and spatial maps can change considerably when the initialization

changes. Of course the component orders and signs will also be completely different, so these

are paired in a greedy way (the two components with the highest absolute correlation are paired

first). The correlations of the resulting pairs are plotted in figure 13. In most components,

the Concatenated approach is less sensitive to initialization (many correlations near 1), and the

Linked ICA methods are considerably more sensitive. However, model comparison provides a

good way to choose which initialization produces the best final result, and the size of the drop

in free energy is related to how sensitive the results were to initialization. Indeed the model

comparisons over different initializations show that the PCA initialization is the better one to

use.
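The greedy pairing of components across runs can be sketched as follows. `greedy_pair` is a hypothetical helper; order and sign are handled by repeatedly matching the highest remaining absolute correlation:

```python
import numpy as np

def greedy_pair(H_a, H_b):
    """Greedily pair components (rows) of two runs by absolute correlation:
    repeatedly take the highest remaining |r|, ignoring order and sign."""
    C = np.abs(np.corrcoef(H_a, H_b)[:len(H_a), len(H_a):])
    pairs = []
    while C.size and not np.all(np.isnan(C)):
        i, j = np.unravel_index(np.nanargmax(C), C.shape)
        pairs.append((i, j, C[i, j]))
        C[i, :] = np.nan   # each component is used at most once
        C[:, j] = np.nan
    return pairs

rng = np.random.default_rng(6)
A = rng.normal(size=(3, 50))
B = np.vstack([-A[2], A[0] + 0.1 * rng.normal(size=50), A[1]])  # permuted/flipped
pairs = greedy_pair(A, B)
print(sorted((i, j) for i, j, r in pairs))  # [(0, 1), (1, 2), (2, 0)]
```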


Figure 10: Spatial maps of a strong component detected by all three models in the real multimodal data analysis: (a) Linked flat ICA and (b) Concatenated ICA are shown here, while Linked tensor ICA is shown in figure 11. The components shown are positively correlated with age (r = 0.49 for Linked flat ICA and 0.26 for Concatenated ICA) and pathology (0.30 and 0.24 respectively). See the text for a full description of these components. Note that these images are thresholded at |Z| > 2 for display.

Figure 11: In the Linked tensor ICA model, the component found in Figure 10 cannot be expressed as a single component due to the tensor constraint. However, it can be clearly identified across the three components, which have subject-courses that are strongly (0.53–0.78) correlated with each other. These components each correlate positively with age (0.42, 0.51, and 0.58) and pathology (0.29, 0.28, 0.17). In each component, the FA, MD, and MO spatial maps are identical down to a scaling factor; any apparent differences are due to the effect of thresholding |Z| > 2.

Figure 12: Components from each of the 3 methods that most cleanly extract the external/extreme capsule. These subject-courses correlate positively with age (0.23, 0.39, 0.32) but only very weakly with pathology (r = 0.08, 0.14, 0.02).

(a) Variability in subject-courses due to initialization (b) Variability in spatial maps due to initialization: correlation of greedily-paired components (sorted) for Linked tensor ICA, Linked flat ICA, and Concat ICA.

Figure 13: These figures show the consistency of the inferred subject-courses (a) and spatial

maps (b) under the two different initializations (PCA vs. random). Since the component or-

der/sign is irrelevant, components from the two runs are greedily paired, matching up the highest

absolute correlations first, and stopping when one run has no components remaining. Ideally, all

components should be stable (i.e. all correlations near 1, so each component in one run is paired

with a near-identical component in the other run). The proposed linked ICA methods are more

sensitive to poor initialization than the Concatenated approach. However, model comparison

provides a way to determine which is a better model of the data; in all cases the PCA initialization gave a higher free energy than the random one, so

only the PCA-initialized results have been given in this paper.

4 Discussion

The Linked ICA method presented in this paper provides a general, flexible way to perform

ICA on multimodal data sets that allows components to be sparse in modalities and allows

different noise levels and histograms for each modality group. This method also permits part-

tensor configurations when the same spatial maps are expected across some of the modalities

and allows model comparison.

Linked-tensor ICA performed well in simulated data, and combined information from across

modalities more accurately than standard Bayesian ICA with spatial concatenation. In partic-

ular, it was better at detecting and isolating single-modality noise, which may help to provide

more interpretable components in real data. The linked-tensor ICA method estimates the spa-

tial maps more accurately (in terms of ROC performance), benefiting from the partial tensor

configuration and from the more accurate histogram estimates found by separating voxels into

modality groups.

A simple RESEL-based correction is developed to account for spatial smoothness in the data,

and this does not interfere with the VB updates or model comparisons. On smoothed simulated

data, we show the importance of correcting for smoothness to allow reasonable dimensionality

estimation. The actual smoothness estimation is currently quite basic, being a point estimate

derived from the data rather than the residuals. This could be estimated iteratively by examining

the residuals, either after a preliminary analysis or as part of the main VB inference. The latter

would be at the cost of losing direct model comparability, because changing ν_k effectively changes the number of voxels in the data.

One alternative to this virtual decimation approach is to work directly on the unsmoothed


data (thus making the unsmoothed noise model more valid) while using adaptive priors to en-

code the belief that nearby voxels should have similar levels of signal or are likely to have the

same mixture-model labels (Woolrich et al., 2005; Hartvig and Jensen, 2000). This introduces

additional complexity into the GMM part of the ICA model. Furthermore, one of the goals of

this multimodal ICA is to be able to combine modalities that have been previously analysed

in a way that is optimal for each modality; spatial smoothness is usually an intrinsic part of

optimal preprocessing. Another alternative is to adaptively model the spatial smoothness in the

data, much as the temporal smoothness in FMRI is estimated using autoregressive noise models

(Woolrich et al., 2004; Roberts and Penny, 2002; Woolrich et al., 2001). Factorization across

voxels in P⋆(X) will probably also be required to achieve a tractable solution, losing some of the

benefits of modelling this spatial correlation structure.

Model comparison showed that the Linked tensor model is preferred for the real multimodal

data set because of the ability to re-use spatial maps, since there are some obvious similarities

between the patterns observed in FA, MD, and MO. Linked flat ICA was still ranked higher than

the Concatenated ICA model on the real data. On the real data set presented, the Flat config-

uration of Linked ICA produced more interpretable spatial maps than the Tensor configuration

by avoiding the assumption of identical spatial patterns between the three DTI modalities. It

therefore appears that Linked flat ICA is a more interpretable model for this data. However,

this does depend on the question being asked. For example in figure 11, the tensor decomposi-

tion may be more meaningful in some situations because it attempts to split the DTI changes

into several components based on each component having a fixed, proportional relationship be-

tween changes in FA, MD, and MO. Further investigation will be required, but this is potentially

a meaningful distinction as it may actually be separating fibre regions in a microstructurally-

relevant way, such as areas of single-direction vs. crossing-fibre composition. This separation of

subtly-different types of change may be valuable in certain applications, for example if it more

cleanly separates Alzheimer’s-related from age-related differences; this remains a topic for future

research.

While model comparison clearly preferred the tensor configuration, this may in part be be-

cause the assumption of spatially-similar maps was actually valid for most of the other compo-

nents. It would be interesting to allow each component to individually decide whether its spatial

maps are sufficiently similar to use a tensor configuration; this even more flexible variant would

require a modified inference method.

Like any ICA method, Linked ICA relies on certain assumptions about the noise. It is

assumed that the noise floor is the same in all voxels and across all subjects; the former should

be guaranteed by the variance normalization preprocessing, and the latter effects are likely to

be subtle, e.g. noise due to increased head motion in one group. The spatial smoothness of

the noise is also assumed to be homogeneous and well-estimated, which may not be true due

to the complex nature of noise in VBM or TBSS data. In fact, the voxelwise noise may not be

Gaussian at all, and non-parametric statistics (permutation testing) are usually recommended

for statistical testing on these modalities.

One important assumption of ICA is that the changes are strictly linear, and the models

presented here make the further assumption that this linearity is valid across different modalities.

This ignores any saturation effects, for example that FA must by definition be between 0 and 1.

ICA also assumes that the signals are encoded by patterns that vary in intensity, without

moving spatially. In a grey matter map, the underlying change is often a volume reduction, which

will cause strong effects at the edges that move for significant amounts of atrophy. Although

the VBM protocol reduces this effect through the use of nonlinear registration, Jacobian-based

volume correction and spatial smoothing, it is still likely that the same type of change will show

up in somewhat different ways for different levels of atrophy. It would likely be better to use


surface-based measures (such as cortical thickness or volume) for ICA, as these might be expected

to vary more linearly with the underlying biophysical changes.

There are also limitations in the interpretation of linked components. As demonstrated in

figure 7, splitting occurs more often as the true subject-courses decorrelate, but this will depend

on a number of factors such as noise level. The fact that a component is linked between two

modalities does not actually mean that the relationship between the two modalities is significant,

but rather it is simply a more concise explanation of the data.

The ICA methods presented here are all susceptible to finding local minima in real data,

partially related to the ARD prior which is used to eliminate and part-eliminate components.

Although the elimination is data-driven it is also essentially irreversible in practice, and many of

these components are eliminated early on when the components are still very different from their

final values. Based on the model comparison results, PCA is a much better initialization than

random values, but it may still not find the global optimum. Conversely, on the simulated data

sets the methods were robust against poor initialization, presumably because there were only

seven signals to separate and they were strong independent sources. Improving the initialization

and optimization of this method to avoid these local minima is a key direction for future work.

One possible alternative to the ARD-based elimination approach considered in this paper is

to use a greedy search method (Friston et al., 2008) to build up the model one component at

a time, stopping when the free energy starts to decrease. This could speed up computation by

starting with a small number of components and growing, rather than starting with a large num-

ber and shrinking. It would also simplify initialization, because only one new component needs

to be initialized at a time (either from a PCA decomposition of the residuals, or randomly). This

would allow the strongest components to become settled before increasing the model complex-

ity, which may improve robustness. Unlike the deflation-based approach to fastICA (in which

components are extracted sequentially), this would still allow the earlier components to change

as new components are extracted and the model becomes more detailed.

There is also the possibility of applying these techniques to non-MRI neuroimaging data, for

example by combining FMRI volumes and EEG epochs on a trial-by-trial (rather than subject-

by-subject) basis. In MEG and EEG, tensorial decompositions like PARAFAC are a natural way

to model the space × time × frequency information in single-subject data (Miwakeichi et al.,

2004), and ICA has been used in this context to localise sources. This framework only requires

that the modalities share a single dimension (e.g. subjects or trials), so finding covarying patterns

in modalities as different as EEG and FMRI may still be possible. The major challenge will be

in finding appropriate preprocessing methods to keep the data size down while extracting the

relevant features.

Because of the Linked ICA’s ability to automatically balance information from very different

modalities, the same approach can be used to include non-imaging modalities like behavioural

regressors or genetic data directly in a multimodal ICA. In practice, some components will be

driven by these regressors, while others will model structured noise in the data (Groves, 2010,

Ch. 5). Since only the single matrix H is shared, the generative models of these new modalities

are extremely flexible in terms of source models and noise models. In this way, the Linked ICA

framework can be extended to bridge the gap between data-driven unsupervised learning and

supervised learning of multiple regressors simultaneously.

Acknowledgments

The authors would like to thank Achim Gass and Andreas Monsch for providing the structural

and diffusion data, Gwen Douaud for assistance in interpreting the real data results and Salima

Makni for many helpful discussions on Bayesian ICA.


A Correcting for spatial smoothness

One of the complications of a fully Bayesian inference approach is that it requires a generative

model of the data, not just of the signal. A consequence of this is that an inaccurate noise

model often leads to biased inferences. Like standard Bayesian ICA, our model assumes uncor-

related white noise (equation 2); however, in the case of neuroimaging data, there is often a very

significant amount of spatial smoothness in the noise.

The ICA models are unaware of the spatial structure of the data, i.e. an image is simply

a vector of voxels with no attached information about their position relative to one another.

When presented with spatially-smoothed data, this spatial structure is learned as a consistent

pattern across all images, and as a result many extra components can be inferred in order to

model this spatial structure. However, we do not consider this structure to be interesting and

wish to remove it, because these extra components obscure and interfere with the components

based on the non-Gaussian signals of interest.

There are several approaches available. The most direct approach is to use an explicit model

of spatially-smooth noise, e.g. using a Gaussian process as the noise model. This can adaptively

determine the amount of smoothness in the noise and has the property of emphasizing sharp

edges; this would be of great benefit in detecting sharp edges in the presence of spatially-smooth

additive noise. However, this approach introduces a series of practical problems when combined

with the ICA model (see discussion).

In neuroimaging, however, most of the smoothness is introduced to the raw signal intention-

ally. This is primarily because the signals tend to be extended (in the case of BOLD FMRI) or

are not perfectly aligned across subjects (in the case of a VBM-style analysis). In these cases,

spatially smoothing improves the SNR of the signals of interest. This smoothing of the data tends

to reduce the effective degrees of freedom of the image. It is this mismatch between the number

of voxels and the number of independent measurements that leads to incorrect estimation in the

ICA.

For a given linear N × N transformation kernel K, the effective degrees of freedom can be calculated exactly as

\[ \nu = \mathrm{Tr}\left(KK^T\right)^2 \Big/ \mathrm{Tr}\left(KK^TKK^T\right). \tag{A.1} \]

Thus, for the identity kernel K = I, ν = N²/N = N. When K is a Gaussian smoothing kernel, a simple approximation to the effective DOF is given by Worsley et al. (1995):

\[ \frac{\nu}{N} = \frac{\mathrm{RESELS}}{N} = \frac{(4\log(2)/\pi)^{D/2}}{\mathrm{FWHM}_x\,\mathrm{FWHM}_y\,\mathrm{FWHM}_z} = \left(\frac{0.9394}{\mathrm{FWHM}_x}\right)\left(\frac{0.9394}{\mathrm{FWHM}_y}\right)\left(\frac{0.9394}{\mathrm{FWHM}_z}\right) \tag{A.2} \]

where the last term is omitted in the case of 2-D data. The smoothing kernel's FWHM (in voxels, in each direction) is estimated by assessing the correlation between adjacent voxels:

\[ \left(\mathrm{FWHM}_x\right)^2 = -2\log(2)\big/\log(\mathrm{corr}_x) \tag{A.3} \]

where corr_x is the correlation between adjacent voxels in the x-direction.
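The smoothness estimate of eqns (A.2)–(A.3) can be sketched in a few lines of numpy. This illustrative function (not the actual implementation; it assumes positive adjacent-voxel correlations and measures FWHM in voxel units) estimates the per-axis FWHM from lag-1 correlations and returns the effective-DOF fraction ν/N:

```python
import numpy as np

def effective_dof_fraction(img):
    """Estimate nu/N for an image: per-axis FWHM from adjacent-voxel
    correlation (eqn A.3), then the RESEL ratio (eqn A.2)."""
    frac = 1.0
    for axis in range(img.ndim):
        a = np.swapaxes(img, 0, axis)
        corr = np.corrcoef(a[:-1].ravel(), a[1:].ravel())[0, 1]
        fwhm = np.sqrt(-2.0 * np.log(2.0) / np.log(corr))  # voxels, eqn (A.3)
        frac *= 0.9394 / fwhm       # 0.9394 = sqrt(4 log 2 / pi), eqn (A.2)
    return min(frac, 1.0)           # unsmoothed data gives nu ~= N
```

For unsmoothed white noise the estimated FWHM is below one voxel and the fraction is capped at 1, i.e. no correction.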

One approach for reducing the impact of this RESEL/voxel mismatch is to decimate the data,

reducing the number of voxels while retaining as many independent measurements as possible.

There are practical problems with this approach, particularly in the choice of exactly which

voxels to remove and the fact that some information will always be lost (and some correlations

will always remain).

Instead, we opt for a virtual decimation approach, in which all N_k voxels in each modality

group are retained, but anything that sums over voxels is downweighted by a factor of ν_k/N_k.


This is analogous to deciding in advance that only a random fraction of the data points will be kept, but at

each stage averaging over all possible choices of decimated voxels. Surprisingly, a variant of the

variational free energy F can also be retained in this model. For F to remain valid for model

comparison, the ν_k for each modality must remain constant across all models to be compared;

the same is also true of real decimation, where changing the decimation fraction would mean

changing the number of voxels N_k and thus changing the data.

For the simulations on unsmoothed data (section 3.2), no correction is needed (i.e. ν_k = N_k).

For smoothed data (simulated and real), effective degrees of freedom are estimated from the data

using the expressions given above. Ideally, the smoothness would be estimated from the model

residuals rather than the raw data, because spatially-extended signals will cause smoothness to

be overestimated. However, the requirement to keep the F consistent across all approaches makes

it impractical to re-estimate this during inference. The Results section shows that on simulated

data this estimate is reasonable and certainly much better than not correcting at all.

In practice, this adjustment acts much like a regularization weight in a non-Bayesian ap-

proach, adjusting the relative cost of accuracy versus complexity. However, this parameter is

determined directly from the data before the analysis begins; in fact, if intentional Gaussian

smoothing is the dominant form of image smoothness, it could even be obtained simply by in-

specting the preprocessing procedures used and applying equation A.1. Thus this is a constant

and does not need to be tuned using empirical approaches (such as cross-validation). It is very

similar in principle to the correction used in the dimensionality-estimation of PICA; however,

in that case the output of the estimator is a hard dimensionality estimate rather than a fixed

reweighting constant.

B Initialization

As with most ICA implementations, Linked ICA is initialized from a PCA decomposition. For

multimodal data, the natural method is to concatenate all of the voxels across modalities (k,t) (to get a $(\sum_{k=1}^{K} N_k T_k) \times R$ matrix) and then do a PCA decomposition on that. To avoid being overly biased by data scaling, the PCA is done after the data has been de-meaned and variance-normalized. In the current implementation the ν_k/N_k smoothness correction is not

taken into account for the PCA, so smoother modalities will tend to be overemphasized in the

initial decomposition. This initial bias is easily undone by the first few VB iterations and is not

present in the final ICA components.

Taking the first L components of the decomposition, the loading matrix provides an initial

point estimate for the shared subject-course matrix H and eigenvectors are used to initialize

the spatial maps X as described below. However, this is not a hard dimensionality reduction

step, and the subject-courses are free to take any combination of weights in the full R-subject space.

Furthermore, we ensure that L is sufficiently large so that there are always extra components

(which will be eliminated automatically by the ARD hyperparameters ω).

In the Bayesian ICA model, L is simply the maximum number of components that are allowed.

If the number of distinct signals in the data is greater than L, this means that some of them

will not be represented and this energy will end up in the white noise term. The upper limit is

L = R − 2, to avoid the degenerate solution where the demeaned data is perfectly represented

by components and there is no residual white noise. The model will automatically eliminate

unneeded components using ARD, so the exact number is not particularly important. We chose

L close to this upper limit, using L = 90 for most examples (where R = 93 or 100). The first

examples use the greatly reduced (but still sufficient) L = 10 to make the figures clearer, but

this has little impact on the results.



The initialization of X^{(k)} depends on whether modality group k is tensor (T_k > 1) or non-tensor (T_k = 1). If modality group k is non-tensor then this PCA also yields initial point estimates of X^{(k)}, and the weights W^{(k)} are initialized to 1. For tensor modality groups, the PCA decomposition yields a N_k T_k × 1 vector. Reversing the original concatenation step reshapes this into one N_k × T_k matrix per component, but in general this will yield a different spatial map for each modality, while the tensor model demands that each component's spatial map be identical for all modalities (aside from scaling). This N_k × T_k matrix is approximated by the bilinear matrix $X^{(k)}_{\cdot,i} W^{(k)T}_{\cdot,i}$, using another PCA to find the best approximation. This provides a starting point which the Bayesian method can improve upon by looking for independent components.

Once the X^{(k)} matrices are found, each component's mixture model is initialized by locating the means µ^{(k)}_{·,i} at the 25th, 50th, and 75th percentiles of the spatial map intensities, and setting the standard deviations $\beta^{(k)-1/2}_{\cdot,i}$ equal to half of the spacing between the means.
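The bilinear (rank-1) approximation for tensor groups can be computed with a single SVD; a small illustrative sketch (names are hypothetical, not from the released code):

```python
import numpy as np

def rank1_init(component_map):
    """Rank-1 (bilinear) approximation of an (N_k, T_k) per-component
    matrix, used to initialize tensor modality groups: one shared spatial
    map and per-modality weights, via the leading SVD pair."""
    U, s, Vt = np.linalg.svd(component_map, full_matrices=False)
    x = U[:, 0] * s[0]       # shared spatial map, X^{(k)}_{.,i}
    w = Vt[0]                # per-modality weights, W^{(k)}_{.,i}
    return x, w
```

For an exactly rank-1 input the outer product of the two returned factors reproduces the matrix.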

C Free Energy

In the VB framework, the free energy F is a commonly-used measure for model comparison. It is given by

\[ F = \left\langle \log\frac{P(Y|\Theta)\,P(\Theta)}{P^\star(\Theta)} \right\rangle_{P^\star(\Theta)}, \tag{C.1} \]

where Θ is the set of all model parameters, and the choice of model itself is implicit in each of the P(·) expressions. It can be shown that F is a lower bound on the model evidence log P(Y), and that they become identical when the factorized posterior matches the real posterior, i.e. P⋆(Θ) = P(Θ|Y). The overall goal of the VB framework is to find the factorized posterior P⋆(Θ) that maximizes F for a given model, usually by updating the posterior factors one at a time.

In order to make the free energy consistent with spatial-smoothness adjustments, the ν_k/N_k correction also needs to be made whenever there is a sum over N_k. This effectively reduces the number of data points in the relevant accuracy and complexity terms of the expression, without actually losing data; this is equivalent to deciding to only use a fixed subset of the data points but not deciding which, and building a single model that fits all possible subsets simultaneously. Since this virtual "decimation factor" remains consistent across all models, model comparison using F is still valid. The (smoothness-corrected) free energy is given by

\[ F = \sum_k \left[ \frac{\nu_k}{N_k}\left\langle \log P\left(Y^{(k)}|X^{(k)},W^{(k)},H,\lambda^{(k)}\right)\right\rangle - \mathrm{KL}\left(\lambda^{(k)}\right) - \mathrm{KL}\left(\omega^{(k)}\right) - \mathrm{KL}\left(W^{(k)}\right) \right] - \mathrm{KL}(H) \]
\[ \qquad + \sum_k\sum_i \left[ -\frac{\nu_k}{N_k}\mathrm{KL}\left(X^{(k,i)}\right) - \mathrm{KL}\left(\mu^{(k,i)}\right) - \mathrm{KL}\left(\beta^{(k,i)}\right) - \mathrm{KL}\left(\pi^{(k,i)}\right) - \frac{\nu_k}{N_k}\mathrm{KL}\left(q^{(k,i)}\right) \right] \tag{C.2} \]

where KL(·) is shorthand for the complexity cost of the specified variable, defined in terms of the K-L divergence of the posterior from the prior

\[ \mathrm{KL}(z) = \mathrm{KL}\left(P^\star(z)\,\middle\|\,P(z)\right) = \left\langle \log P^\star(z)\right\rangle_{P^\star(z)} - \left\langle \log P(z)\right\rangle_{P^\star(z)} \tag{C.3} \]


and the kth likelihood term expands to

\[ \left\langle \log P\left(Y^{(k)}|X^{(k)},W^{(k)},H,\lambda^{(k)}\right)\right\rangle = \sum_{t=1}^{T_k}\Bigg[ \frac{N_k R}{2}\left\langle \log\frac{\lambda_t}{2\pi}\right\rangle - \frac{\langle\lambda_t\rangle}{2}\sum_{n=1}^{N_k}\sum_{r=1}^{R} Y_{ntr}^2 \]
\[ \qquad + \langle\lambda_t\rangle \sum_{n=1}^{N_k}\sum_{r=1}^{R} Y_{ntr} \sum_{i=1}^{L} \left\langle X^{(k,i)}_n\right\rangle\left\langle W^{(k)}_{t,i}\right\rangle\langle H_{ir}\rangle - \frac{\langle\lambda_t\rangle}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}\left\langle X^TX\right\rangle_{ij}\langle W_{ti}W_{tj}\rangle\left\langle HH^T\right\rangle_{ij} \Bigg] \tag{C.4} \]

and the KL divergence on the Gaussian mixture model expressed as

\[ \mathrm{KL}\left(X^{(k,i)}\right) = \sum_{n=1}^{N_k}\sum_{m=1}^{M} P^\star\left(q^{(k,i)}_n = m\right)\, \mathrm{KL}\left( P^\star\left(X^{(k,i)}_n \,\middle|\, q^{(k,i)}_n = m\right) \,\middle\|\, P\left(X^{(k,i)}_n \,\middle|\, q^{(k,i)}_n = m\right)\right) \tag{C.5} \]

where both the conditional prior and conditional posterior are normal distributions.
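For the univariate normal distributions appearing in (C.5), each KL term has a simple closed form; a small sketch for reference (illustrative, not from the implementation):

```python
import numpy as np

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between univariate normals: the per-voxel complexity
    cost appearing in eqns (C.3) and (C.5)."""
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
```

The divergence is zero exactly when posterior and prior coincide, so unused structure contributes no complexity cost.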

D Priors and VB updates

Shared latent space (subject-course) matrix H

The VB updates are found in a similar way to F in equation C.1, but integrating over all factors of the posterior apart from the one being updated:

\[ \log P^\star(H) = \left\langle \log\frac{P(Y|\Theta)\,P(\Theta)}{P^\star(\neg H)}\right\rangle_{P^\star(\neg H)} + \mathrm{const}, \tag{D.1} \]

where ¬H is the set of all random variables in Θ except H. The function corresponding to P⋆(H) is found by evaluating the right hand side algebraically and matching all terms of H. Many of the terms do not involve H, so in practice (and with the inclusion of the DOF/voxel correction factor as in eqn C.2) this becomes

\[ \log P^\star(H) = \log P(H) + \sum_{k=1}^{K}\frac{\nu_k}{N_k}\left\langle \log P\left(Y^{(k)}|H,X^{(k)},W^{(k)},\lambda^{(k)}\right)\right\rangle + \mathrm{const}. \tag{D.2} \]

It turns out that the posterior on H is a matrix normal distribution:

\[ P^\star(H) = \mathcal{MN}\left(H\,\middle|\,M_H,\Omega_H,\Sigma_H\right). \tag{D.3} \]

Since all subjects r have the same noise levels and the same prior, the posterior row covariance Ω_H = I_R. The L × L column covariance is given by

\[ \left(\Sigma_H^{-1}\right)_{ij} = \left(\Sigma_{H,0}^{-1}\right)_{ij} + \sum_{k=1}^{K}\left(\frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\left\langle X^{(k)}_{ni}X^{(k)}_{nj}\right\rangle\right)\left(\sum_{t=1}^{T_k}\left\langle W^{(k)}_{ti}W^{(k)}_{tj}\right\rangle\left\langle\lambda^{(k)}_t\right\rangle\right) \tag{D.4} \]

while the posterior mean is

\[ \left(M_H\Sigma_H^{-1}\right)_{ir} = \left(M_{H,0}\Sigma_{H,0}^{-1}\right)_{ir} + \sum_{k=1}^{K}\sum_{t=1}^{T_k}\left\langle\lambda^{(k)}_t\right\rangle\left\langle W^{(k)}_{ti}\right\rangle\frac{\nu_k}{N_k}\sum_{n=1}^{N_k} Y^{(k)}_{n,r,t}\left\langle X^{(k)}_{n,i}\right\rangle. \tag{D.5} \]

The priors are simply N(0,1) on each element, i.e. Σ_{H,0} = I_L, Ω_{H,0} = I_R and M_{H,0} = 0. For MATLAB implementation, these updates are more efficiently expressed in matrix form:

\[ \Sigma_H^{-1} = I_L + \sum_{k=1}^{K}\frac{\nu_k}{N_k}\left\langle X^{(k)T}X^{(k)}\right\rangle \circ \left\langle W^{(k)T}\,\mathrm{diag}\left(\lambda^{(k)}\right)W^{(k)}\right\rangle \tag{D.6} \]

\[ M_H\Sigma_H^{-1} = \sum_{k=1}^{K}\sum_{t=1}^{T_k}\frac{\nu_k}{N_k}\, Y^{(k)T}_{\cdot,\cdot,t}\left\langle X^{(k)}\right\rangle \mathrm{diag}\left(\left\langle\lambda^{(k)}_t\right\rangle\left\langle W^{(k)}_{t,\cdot}\right\rangle\right) \tag{D.7} \]

where ∘ represents the elementwise (Schur) product of two matrices.
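As a concrete sketch of the matrix-form updates (D.6)–(D.7), the following simplified function (illustrative only; it substitutes point estimates of X, W, and λ for the full posterior second moments) computes the posterior mean of H:

```python
import numpy as np

def update_H(Y, X, W, lam, nu_frac):
    """Precision-weighted VB update for the shared subject-courses H
    (matrix forms D.6-D.7), with point estimates standing in for the
    posterior moments -- a simplified sketch, not the full update.
    Y[k]: (N_k, R, T_k) data;  X[k]: (N_k, L) maps;  W[k]: (T_k, L)
    weights;  lam[k]: (T_k,) noise precisions;  nu_frac[k] = nu_k/N_k."""
    L = X[0].shape[1]
    prec = np.eye(L)                        # prior precision I_L
    rhs = np.zeros((L, Y[0].shape[1]))
    for Yk, Xk, Wk, lk, f in zip(Y, X, W, lam, nu_frac):
        prec += f * (Xk.T @ Xk) * (Wk.T @ (lk[:, None] * Wk))  # eqn (D.6)
        for t in range(Wk.shape[0]):
            rhs += f * lk[t] * Wk[t][:, None] * (Xk.T @ Yk[:, :, t])  # (D.7)
    return np.linalg.solve(prec, rhs)       # posterior mean of H, (L, R)
```

The prior's identity term shrinks the estimate toward zero, but with strong data the likelihood precision dominates.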

Noise precision λ

When separate noise is estimated for each modality (k,t), the posterior distribution of the noise precision is found by calculating

\[ \log P^\star(\lambda) = \log P(\lambda) + \sum_{k=1}^{K}\frac{\nu_k}{N_k}\left\langle \log P\left(Y^{(k)}|H,X^{(k)},W^{(k)},\lambda^{(k)}\right)\right\rangle_{P^\star(\neg\lambda^{(k)})} + \mathrm{const} \tag{D.8} \]

which yields

\[ P^\star(\lambda) = \prod_{k=1}^{K}\prod_{t=1}^{T_k} P^\star\left(\lambda^{(k)}_t\right) \tag{D.9} \]

\[ P^\star\left(\lambda^{(k)}_t\right) = \mathrm{Ga}\left(\lambda^{(k)}_t\,\middle|\,b_{kt},c_{kt}\right) \tag{D.10} \]

\[ c_{kt} = c_0 + \nu_k R/2 \tag{D.11} \]

\[ b_{kt}^{-1} = b_0^{-1} + \frac{\nu_k}{2N_k}\sum_{n=1}^{N_k}\sum_{r=1}^{R} Y^{(k)2}_{ntr} - \frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\sum_{r=1}^{R} Y^{(k)}_{ntr}\sum_{i=1}^{L}\left\langle X^{(k)}_{ni}\right\rangle\left\langle W^{(k)}_{ti}\right\rangle\langle H_{ir}\rangle \]
\[ \qquad + \frac{\nu_k}{2N_k}\sum_{i=1}^{L}\sum_{j=1}^{L}\left(\sum_{n=1}^{N_k}\left\langle X^{(k)}_{ni}X^{(k)}_{nj}\right\rangle\right)\left\langle W^{(k)}_{ti}W^{(k)}_{tj}\right\rangle\left(\sum_{r=1}^{R}\langle H_{ir}H_{jr}\rangle\right) \tag{D.12} \]

Spatial sources X and hidden mixture memberships q

We explicitly model the hidden mixture labels using the categorical variable q^{(k,i)}_n = m, which specifies that voxel n (in modality group (k,t) and in ICA component i) is drawn from the mth mixture component. The notation q^{(k,i)}_{·,n} is also used as the M × 1 vector of all zeros except for a single one in row q^{(k,i)}_n. This is convenient because it means that ⟨q^{(k,i)}_{·,n}⟩ gives the posterior probability of a voxel being drawn from each mixture component.

By conditioning on q^{(k,i)}_n, the GMM prior (equation 5) can be rewritten as

\[ X^{(k)}_{n,i}\,\Big|\,q^{(k,i)}_n \sim \mathcal{N}\left(\mu^{(k)}_{i,q^{(k,i)}_n},\; 1\big/\beta^{(k)}_{i,q^{(k,i)}_n}\right), \qquad q^{(k,i)}_{\cdot,n} \sim \mathrm{Cat}\left(\pi^{(k,i)}\right) \tag{D.13} \]

where the categorical distribution Cat(π) is defined by the probability mass distribution

\[ \mathrm{Cat}(q|\pi) = q^T\pi \tag{D.14} \]

assuming that q has a single element equal to 1 and the rest of the elements equal 0. In other words, P(q = m|π) = π_m.

The posterior distribution is also a Gaussian mixture model, factorized over components:

\[ P^\star(X) = \prod_{k=1}^{K}\prod_{i=1}^{L} P^\star\left(X^{(k,i)}\right) \tag{D.15} \]

\[ P^\star\left(X^{(k,i)}_n\right) = \sum_{m=1}^{M} P^\star\left(X^{(k,i)}_n\,\middle|\,q^{(k,i)}_n = m\right) P^\star\left(q^{(k,i)}_n = m\right) \tag{D.16} \]

Note that the Gaussian mixture model prior naturally factors itself into two parts: the posterior distributions conditional on the label, and the mixture labels. However, these remain tightly connected; in particular, the update P⋆(q) depends on Y, despite q and Y not being directly connected in the graphical model. These are therefore treated as a single monolithic update during VB inference.

The conditional posterior distributions are Gaussian, shown below. A more complete derivation can be found in Choudrey and Roberts (2001).

\[ P^\star\left(X^{(k,i)}_n\,\middle|\,q^{(k,i)}_n = m\right) = \mathcal{N}\left(X^{(k,i)}_n\,\middle|\,M_{X,n,m},\Sigma_{X,n,m}\right) \tag{D.17} \]

\[ \Sigma_{X,n,m}^{-1} = \beta^{(k)}_{i,m} + \left(\sum_{r=1}^{R}\left\langle H_{ir}^2\right\rangle\right)\left(\sum_t \langle\lambda_t\rangle\left\langle W_{ti}^2\right\rangle\right) \tag{D.18} \]

\[ M_{X,n,m}\Sigma_{X,n,m}^{-1} = \mu^{(k)}_{i,m}\beta^{(k)}_{i,m} + \sum_t\left\langle\lambda^{(k)}_t\right\rangle\left[\langle W_{t,i}\rangle\sum_{r=1}^{R} Y_{n,t,r}\langle H_{i,r}\rangle - \sum_{j\neq i}\langle X_{nj}\rangle\langle W_{ti}W_{tj}\rangle\sum_{r=1}^{R}\langle H_{ir}H_{jr}\rangle\right] \tag{D.19} \]

while the mixture labels are distributed as

\[ P^\star\left(q^{(k,i)}_n\right) = \mathrm{Cat}\left(q^{(k,i)}_n\,\middle|\,Q\Big/\textstyle\sum_m Q_m\right) \tag{D.20} \]

\[ \log(Q_m) = \left\langle\log\pi^{(k)}_{i,m}\right\rangle + \frac{1}{2}\left\langle\log\beta^{(k)}_{i,m}\right\rangle - \frac{1}{2}\left\langle\beta^{(k)}_{i,m}\right\rangle\left\langle\left(\mu^{(k)}_{i,m}\right)^2\right\rangle - \frac{1}{2}\log\Sigma_{X,n,m}^{-1} + \frac{1}{2}\left(M_{X,n,m}\right)^2\Sigma_{X,n,m}^{-1}. \tag{D.21} \]
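The label update (D.20)–(D.21) is a normalized softmax over the log(Q_m) terms; a small numerically-stable sketch (inputs are the length-M expectation vectors appearing in D.21; names are illustrative):

```python
import numpy as np

def label_posteriors(log_pi, log_beta, beta, mu2, M_x, prec_x):
    """Mixture-label responsibilities of eqns (D.20)-(D.21): assemble
    log Q_m from expected prior terms and conditional-posterior moments,
    then normalize with a max-shifted (stable) softmax."""
    logQ = (log_pi + 0.5 * log_beta - 0.5 * beta * mu2
            - 0.5 * np.log(prec_x) + 0.5 * M_x ** 2 * prec_x)
    logQ -= logQ.max()              # stabilize before exponentiating
    Q = np.exp(logQ)
    return Q / Q.sum()
```

With symmetric inputs the responsibilities are uniform, as expected.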

Mixture component means µ and precisions β

The priors are given by

\[ P\left(\mu^{(k,i)}\right) = \prod_{m=1}^{M_{k,i}} \mathcal{N}\left(\mu^{(k,i)}_m\,\middle|\,u_0,v_0\right) \tag{D.22} \]

\[ P\left(\beta^{(k,i)}\right) = \prod_{m=1}^{M_{k,i}} \mathrm{Ga}\left(\beta^{(k,i)}_m\,\middle|\,b_0,c_0\right) \tag{D.23} \]

with the relatively uninformative priors u_0 = 0, v_0 = 10^6, b_0 = 10^3, c_0 = 10^{−6}, and the Gamma distribution in terms of the Gamma function:

\[ \mathrm{Ga}(x|b,c) = \frac{x^{(c-1)}e^{-x/b}}{\Gamma(c)\,b^c} \tag{D.24} \]

The posterior forms are given by

\[ P^\star\left(\mu^{(k,i)}_m\right) = \mathcal{N}(u,v) \tag{D.25} \]

\[ v^{-1} = v_0^{-1} + \left\langle\beta^{(k,i)}_m\right\rangle\frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\left\langle q^{(k,i)}_{m,n}\right\rangle \tag{D.26} \]

\[ uv^{-1} = u_0v_0^{-1} + \left\langle\beta^{(k,i)}_m\right\rangle\frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\left\langle X^{(k,i)}_n\,\middle|\,q^{(k,i)}_{m,n}=1\right\rangle\left\langle q^{(k,i)}_{m,n}\right\rangle \tag{D.27} \]

\[ P^\star\left(\beta^{(k,i)}_m\right) = \mathrm{Ga}(b,c) \tag{D.28} \]

\[ b^{-1} = b_0^{-1} + \frac{1}{2}\frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\left\langle q^{(k,i)}_{m,n}\right\rangle\left[\left\langle \left(X^{(k,i)}_n\right)^2\,\middle|\,q^{(k,i)}_{m,n}=1\right\rangle - 2\left\langle X^{(k,i)}_n\,\middle|\,q^{(k,i)}_{m,n}=1\right\rangle\left\langle\mu^{(k,i)}_m\right\rangle + \left\langle\left(\mu^{(k,i)}_m\right)^2\right\rangle\right] \tag{D.29} \]

\[ c = c_0 + \frac{1}{2}\frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\left\langle q^{(k,i)}_{m,n}\right\rangle \tag{D.30} \]
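As a sketch, the mean update (D.25)–(D.27) for a single mixture component is a precision-weighted combination of the prior and the responsibility-weighted data (illustrative names; expectations replaced by point values):

```python
import numpy as np

def mu_update(x, q_m, beta_m, nu_frac, u0=0.0, v0=1e6):
    """Posterior mean/variance for one mixture-component mean, following
    eqns (D.25)-(D.27). x: (N,) map values; q_m: (N,) responsibilities
    for component m; beta_m: expected mixture precision."""
    v_inv = 1.0 / v0 + beta_m * nu_frac * q_m.sum()          # (D.26)
    u = (u0 / v0 + beta_m * nu_frac * (x * q_m).sum()) / v_inv  # (D.27)
    return u, 1.0 / v_inv
```

When all voxels belong to the component and the mixture precision is high, the posterior mean approaches the sample mean.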

Mixture model weights π

The priors are given as follows:

\[ P(\pi) = \prod_{k=1}^{K}\prod_{i=1}^{L} \mathrm{Dir}\left(\pi^{(k,i)}\,\middle|\,\pi_0\right) \tag{D.31} \]

where the uniform prior (π_0 ∈ R^M is a vector of all ones) was used. The Dirichlet distribution is defined as

\[ \mathrm{Dir}(Q|\pi) \propto \prod_{m=1}^{M}\left(Q_m\right)^{\pi_m - 1}. \tag{D.32} \]

The VB updates are given by:

\[ P^\star(\pi) = \prod_{k=1}^{K}\prod_{i=1}^{L} \mathrm{Dir}\left(\pi^{(k,i)}\,\middle|\,\tilde\pi^{(k,i)}\right) \tag{D.33} \]

\[ \tilde\pi^{(k,i)} = \frac{\nu_k}{N_k}\sum_{n=1}^{N_k}\left\langle q^{(k,i)}_n\right\rangle + \pi_0 \tag{D.34} \]
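The Dirichlet update (D.34) is just a smoothness-weighted responsibility count plus the prior; an illustrative sketch:

```python
import numpy as np

def pi_update(q_post, nu_frac, pi0=None):
    """Dirichlet parameter update of eqn (D.34): decimation-weighted
    label counts plus the uniform prior. q_post: (N, M) posterior label
    probabilities for one component (k, i)."""
    if pi0 is None:
        pi0 = np.ones(q_post.shape[1])   # uniform prior, all ones
    return nu_frac * q_post.sum(axis=0) + pi0
```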


Modality weight matrix W and ARD prior ω

The posterior distribution of W is found by expanding

\[ \log P^\star(W) = \left\langle\log P(W|\omega)\right\rangle_{P^\star(\omega)} + \sum_{k=1}^{K}\frac{\nu_k}{N_k}\left\langle\log P\left(Y^{(k)}|H,X^{(k)},W^{(k)},\lambda^{(k)}\right)\right\rangle_{P^\star(\neg W^{(k)})} + \mathrm{const}. \tag{D.35} \]

The posterior on W naturally factors across modalities:

\[ P^\star(W) = \prod_{k=1}^{K}\prod_{t=1}^{T_k} P^\star\left(W_{t,\cdot}\right) \tag{D.36} \]

each of which is a normal distribution given by

\[ P^\star\left(W_{t,\cdot}\right) = \mathcal{N}\left(W_{t,\cdot}\,\middle|\,m,V\right) \tag{D.37} \]

\[ V^{-1} = \langle\omega_t\rangle I + \langle\lambda_t\rangle\frac{\nu_k}{N_k}\left\langle X^TX\right\rangle\circ\left\langle HH^T\right\rangle \tag{D.38} \]

\[ \left(mV^{-1}\right)_i = \langle\lambda_t\rangle\sum_{r=1}^{R}\langle H_{ir}\rangle\frac{\nu_k}{N_k}\sum_{n=1}^{N_k} Y_{nrt}\langle X_{ni}\rangle. \tag{D.39} \]

The posterior on the ARD parameter ω naturally factorizes as

\[ P^\star(\omega) = \prod_{k=1}^{K}\prod_{t=1}^{T_k}\prod_{i=1}^{L} P^\star\left(\omega^{(k)}_{ti}\right) \tag{D.40} \]

with each element of ω distributed as

\[ P^\star\left(\omega^{(k)}_{ti}\right) = \mathrm{Ga}\left(\omega^{(k)}_{ti}\,\middle|\,b,c\right) \tag{D.41} \]

\[ b^{-1} = b_0^{-1} + \frac{1}{2}\left\langle\left(W^{(k)}_{ti}\right)^2\right\rangle \tag{D.42} \]

\[ c = c_0 + 1/2. \tag{D.43} \]
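The ARD updates (D.42)–(D.43) can be sketched as follows; the returned posterior mean ⟨ω⟩ = bc grows as the expected squared weight shrinks, which is what drives the component and modality elimination (names are illustrative):

```python
def ard_update(w2, b0=1e3, c0=1e-6):
    """ARD hyperparameter update of eqns (D.42)-(D.43): Gamma posterior
    on omega_ti given the expected squared weight <W_ti^2>. Returns the
    posterior mean b*c."""
    b = 1.0 / (1.0 / b0 + 0.5 * w2)   # (D.42)
    c = c0 + 0.5                      # (D.43)
    return b * c
```

A small weight therefore yields a large expected precision, pushing that weight further toward zero on the next iteration, which is why the elimination is effectively irreversible (as noted in the Discussion).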

E Precision contributions

The VB update for P⋆(H) is given by equations D.4 and D.5. Ignoring the off-diagonal elements of Σ⁻¹_H (which should be small because the spatial maps are independent), this formulation means that each modality provides its own ideal (likelihood-maximizing) estimate of H and the posterior ⟨H⟩ is a precision-weighted average of these. Conveniently, this precision is the same for each subject r. To find the dominant modalities in estimating each source's subject-course, it is informative to look at these precisions. The "precision contribution" by modality t in modality group k to each source i is defined by looking at the parts of the sum in equation D.4:

\[ pc(k,t,i) = \frac{\nu_k}{N_k}\left(\sum_{n=1}^{N_k}\left\langle X^{(k)2}_{n,i}\right\rangle\right)\left\langle W^{(k)2}_{t,i}\right\rangle\left\langle\lambda^{(k)}_t\right\rangle \tag{E.1} \]

So overall the precision of P⋆(H_{i,·}) is given by $1 + \sum_{k=1}^{K}\sum_{t=1}^{T_k} pc(k,t,i)$, because the prior makes a constant precision contribution of 1. This provides a fixed scale against which to measure

these contributions; if pc(k,t,i) < 1 then that modality is considered to have been eliminated

from that component.

In the results section the figures will show these precision contributions normalized by the

overall precision, so that the sum of all contributions is 1 for each component i. This makes it

easy to see if a component is dominated by one modality or is informed by a combination of

several modalities.
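A sketch of eqn (E.1) and the normalization used in the figures (illustrative only; posterior second moments are approximated here by squared point estimates):

```python
import numpy as np

def precision_contributions(X, W, lam, nu_frac):
    """pc(k,t,i) of eqn (E.1) for one modality group. X: (N, L) maps;
    W: (T, L) weights; lam: (T,) noise precisions. Returns a (T, L)
    array; values below 1 indicate an effectively eliminated modality."""
    sum_x2 = (X ** 2).sum(axis=0)                  # sum_n <X_{n,i}^2>
    return nu_frac * sum_x2[None, :] * (W ** 2) * lam[:, None]

def normalized_contributions(pc_list):
    """Normalize contributions (including the prior's constant 1) so
    they sum to 1 per component, as plotted in the results figures."""
    total = 1.0 + sum(pc.sum(axis=0) for pc in pc_list)   # shape (L,)
    return [pc / total for pc in pc_list]
```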

For comparisons, the precision contributions from individual modalities can also be calculated for the Concatenated ICA results by using only the relevant voxels in the spatial maps:

\[ pc(\hat k,\hat t,i) = \frac{\nu_{\hat k}}{N_{\hat k}}\sum_{n\in N(\hat k,\hat t)}\left\langle X^2_{n,i}\right\rangle\left\langle W^2_i\right\rangle\langle\lambda\rangle \tag{E.2} \]

where N(k̂,t̂) is the set of voxels relating to modality k = k̂, t = t̂ in the Linked ICA.

References

J. Ashburner and K. J. Friston. Voxel-based morphometry - the methods. NeuroImage, 11(6):

805–821, 2000.

H. Attias. A variational Bayesian framework for graphical models. Advances in Neural Informa-

tion Processing Systems, 12(1–2):209–215, 2000.

H. Attias. Independent factor analysis. Neural Computation, 11:803–851, 1998.

C. F. Beckmann. Independent component analysis for functional magnetic resonance imaging.

D.Phil. in information engineering, Image Analysis Group, FMRIB Centre and Robotics Re-

search Group, University of Oxford, UK, 2004.

C. F. Beckmann and S. M. Smith. Probabilistic independent components analysis for functional

magnetic resonance imaging. IEEE Transactions on Medical Imaging, 23:137–152, 2004.

C. F. Beckmann and S. M. Smith. Tensorial extensions of independent component analysis for

multisubject FMRI analysis. NeuroImage, 25(1):294–311, 2005.

C. M. Bishop. Variational principal components. Artificial Neural Networks, 7–10 September

1999(Conference Publication No. 470):509–514, 1999.

V. D. Calhoun, T. Adali, N. R. Giuliani, J. J. Pekar, and G. D. Pearlson. Method for multimodal

analysis of independent source differences in schizophrenia: combining gray matter structural

and auditory oddball functional data. Hum Brain Mapp, 27(1):47–62, Jan 2006.

R. A. Choudrey and S. J. Roberts. Flexible Bayesian independent component analysis for blind

source separation. Proc. Int. Conf. on Independent Component Analysis, 2001.

G. Douaud, S. Smith, M. Jenkinson, T. Behrens, H. Johansen-Berg, J. Vickers, S. James,

N. Voets, K. Watkins, P. M. Matthews, and A. James. Anatomically related grey and white

matter abnormalities in adolescent-onset schizophrenia. Brain, 130:2375–2386, 2007. doi:

10.1093/brain/awm184.

G. Douaud, S. Jbabdi, T. Behrens, R. Menke, A. Gass, A. Monsch, A. Rao, B. Whitcher,

G. Kindlmann, P. Matthews, and S. Smith. Increased diffusion anisotropy in crossing fibres

reveals very early white matter alteration in prodromal and mild Alzheimer’s disease. 2010.

Submitted.


D. B. Ennis and G. Kindlmann. Orthogonal tensor invariants and the analysis of diffusion tensor magnetic resonance images. Magnetic Resonance in Medicine, 55:136–146, 2006.

N. Filippini, B. J. MacIntosh, M. G. Hough, G. M. Goodwin, G. B. Frisoni, S. M. Smith, P. M. Matthews, C. F. Beckmann, and C. E. MacKay. Distinct patterns of brain activity in young carriers of the APOE-ε4 allele. PNAS, 106(17):7209–7214, 2009.

K. Friston, C. Chu, J. Mourao-Miranda, O. Hulme, G. Rees, W. Penny, and J. Ashburner. Bayesian decoding of brain images. NeuroImage, 39(1):181–205, 2008.

A. R. Groves. Bayesian Learning Methods for Modelling Functional MRI. D.Phil. thesis in Clinical Neurology, Image Analysis Group, FMRIB Centre, University of Oxford, UK, 2010.

N. V. Hartvig and J. L. Jensen. Spatial mixture modeling of fMRI data. Human Brain Mapping, 11:233–248, 2000.

A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.

J. Liu, G. Pearlson, A. Windemuth, G. Ruano, N. I. Perrone-Bizzozero, and V. Calhoun. Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA. Human Brain Mapping, 30(1):241–255, 2009.

S. Makni, P. Ciuciu, J. Idier, and J.-B. Poline. Bayesian joint detection-estimation of brain activity using MCMC with a gamma-gaussian mixture prior model. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, 2006.

F. Miwakeichi, E. Martinez-Montes, P. A. Valdes-Sosa, N. Nishiyama, H. Mizuhara, and Y. Yamaguchi. Decomposing EEG data into space-time-frequency components using Parallel Factor Analysis. NeuroImage, 22:1035–1045, 2004.

F. B. Nielsen. Variational Approach to Factor Analysis and Related Models. Master's thesis, Intelligent Signal Processing Group, Institute of Informatics and Mathematical Modelling, Technical University of Denmark, Kgs. Lyngby, Denmark, 2004.

S. J. Roberts and W. D. Penny. Variational Bayes for generalized autoregressive models. IEEE Transactions on Signal Processing, 50(9):2245–2257, Sept 2002.

J. Scholz, M. C. Klein, T. E. J. Behrens, and H. Johansen-Berg. Training induces changes in white-matter architecture. Nature Neuroscience, 12(11):1367–1368, 2009.

S. M. Smith, M. Jenkinson, H. Johansen-Berg, D. Rueckert, T. E. Nichols, C. E. Mackay, K. E. Watkins, O. Ciccarelli, M. Z. Cader, and T. E. J. Behrens. Tract-based spatial statistics: Voxelwise analysis of multi-subject diffusion data. NeuroImage, 31:1487–1505, 2006.

K. E. Watkins, S. M. Smith, S. Davis, and P. Howell. Structural and functional abnormalities of the motor system in developmental stuttering. Brain, 131:50–59, 2008.

D. Wipf and S. Nagarajan. A new view of automatic relevance determination. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1625–1632. MIT Press, Cambridge, MA, 2008.

M. W. Woolrich, B. D. Ripley, M. Brady, and S. M. Smith. Temporal autocorrelation in univariate linear modeling of FMRI data. NeuroImage, 14:1370–1386, 2001.


M. W. Woolrich, M. Jenkinson, J. M. Brady, and S. M. Smith. Fully Bayesian spatio-temporal modeling of FMRI data. IEEE Transactions on Medical Imaging, 23(2):213–231, 2004.

M. W. Woolrich, T. E. Behrens, C. F. Beckmann, and S. M. Smith. Mixture models with adaptive spatial regularization for segmentation with an application to FMRI data. IEEE Transactions on Medical Imaging, 24(1):1–11, 2005.

K. J. Worsley, J. B. Poline, A. C. Vandal, and K. J. Friston. Tests for distributed, nonfocal brain activation. NeuroImage, 2(3):183–194, 1995.

L. Xu, G. Pearlson, and V. D. Calhoun. Joint source based morphometry identifies linked gray and white matter group differences. NeuroImage, 44(3):777–789, 2009.
