Available via license: CC BY 4.0


Equivariance Allows Handling Multiple Nuisance Variables

When Analyzing Pooled Neuroimaging Datasets

Vishnu Suresh Lokhande

lokhande@cs.wisc.edu

Rudrasis Chakraborty

rudrasischa@gmail.com

Sathya N. Ravi

sathya@uic.edu

Vikas Singh

vsingh@biostat.wisc.edu

Abstract

Pooling multiple neuroimaging datasets across institutions often enables improvements in statistical power when evaluating associations (e.g., between risk factors and disease outcomes) that may otherwise be too weak to detect. When there is only a single source of variability (e.g., different scanners), domain adaptation and matching the distributions of representations may suffice in many scenarios. But in the presence of more than one nuisance variable concurrently influencing the measurements, pooling datasets poses unique challenges, e.g., variations in the data can come from both the acquisition method as well as the demographics of participants (gender, age). Invariant representation learning, by itself, is ill-suited to fully model the data generation process. In this paper, we show how bringing recent results on equivariant representation learning (for studying symmetries in neural networks) instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution. In particular, we demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples. Our code is available at https://github.com/vsingh-group/DatasetPooling

1. Introduction

Observational studies in many disciplines acquire cross-sectional/longitudinal clinical and imaging data to understand diseases such as neurodegeneration and dementia [44]. Typically, these studies are sufficiently powered for the primary scientific hypotheses of interest. However, secondary analyses to investigate weaker but potentially interesting associations between risk factors (such as genetics) and disease outcomes are often difficult when using common statistical significance thresholds, due to the small/medium sample sizes.

Over the last decade, coordinated large-scale multi-institutional imaging studies have emerged (e.g., ADNI [26], NIH All of Us and HCP [19]), but the types of data collected or the project's scope (e.g., demographic pool of participants) may not be suited for studying specific secondary scientific questions. A "pooled" imaging dataset obtained from combining roughly similar studies across different institutions/sites, when possible, is an attractive alternative.

Figure 1. Learning Invariant Representations. In our framework, input images X are pooled together from multiple sites. An encoder Φ maps X to the latent representations Φ(X) that correspond to the high-level causal features X_C that influence the label prediction. Unlike the input images X, Φ(X) is robust to nuisance attributes like site (scanner) and covariates (age). Φ is trained alongside the predictor h and decoder Ψ.

The pooled datasets provide much larger sample sizes and improved statistical power to identify early disease biomarkers – analyses which would not otherwise be possible [16,30]. But even when study participants are consistent across sites, pooling poses challenges. This is true even for linear regression [56] – improvement in statistical power is not always guaranteed. Partly due to these as well as other reasons, high-visibility projects such as ENIGMA [47] have reported findings using meta-analysis methods.

Data pooling and fairness. Even under ideal conditions, pooling imaging datasets across sites requires care. Assume that the participants across two sites, say site1 and site2, are perfectly gender matched with the same proportion of male/female, and the age distribution (as well as the proportion of diseased/healthy controls) is also identical. In this idealized setting, the only difference between sites may come from variations in scanners or acquisition (e.g., pulse sequences). When training modern neural networks for a regression/classification task with imaging data obtained in this scenario, we may ask that the representations learned by the model be invariant to the categorical variable denoting "site". While this is not a "solved" problem, this strategy has been successfully deployed based on results in invariant representation learning [3,5,34] (see Fig. 1). One may alternatively view this task via the lens of fairness – we want the model's performance to be fair with respect to the site variable. This approach is effective, via constraints [52] or using adversarial modules [17,53]. This setting also permits re-purposing tools from domain adaptation [35,50,55] or transfer learning [12] as a pre-processing step, before analysis of the pooled data proceeds.

arXiv:2203.15234v1 [cs.LG] 29 Mar 2022

Figure 2. (a) A Causal Diagram listing the variables of interest and their relationships for the multi-site pooling problem. Nodes D_popul, D_acqui and D_preval denote the population, acquisition and prevalence biases that vary across sites. C's are covariates (like age or gender). X_C denotes the high-level causal features of an image X that influence the labels Y. Nodes in red d-separate the nodes in blue and green. (b) Variation due to site (scanner) for a particular age group: MRI images of control subjects from the ADNI [26] dataset for different scanners in the age group 70-80. (c) Variation due to covariates (age) in the Siemens scanner: images obtained from the Siemens scanner (i.e., fixing site) on control subjects for three extreme age groups. The Gantt chart on top of the image indicates the respective age ranges in the Philips and GE scanners. As observed, different scanner groups do not share a common support on the "age" covariate, causing samples outside of the common support to be discarded in naïve pooling approaches.

Nuisance variables/confounds. Data pooling problems one often encounters in scientific research typically violate many of the conditions in the aforementioned example. The measured data X at each site is influenced not only by the scanner properties but also by a number of other covariates/nuisance variables. For instance, if the age distribution of participants is not identical across sites, comparison of the site-wise distributions is challenging because it is influenced both by age and by the scanner. An example of the differences introduced due to age and scanner biases is shown in Figures 2b, 2c. With multiple nuisance variables, even effective tools for invariant representation learning, when used directly, can provide limited help. The data generation process, and the role of covariates/nuisance variables, available via a causal diagram (Figure 2a), can inform how the formulation is designed [6,45]. Indeed, concepts from causality have benefited various deep learning models [37,41]. Specifically, recent work [31] has shown the value of integrating structural causal models for domain generalization, which is related to dataset pooling.

Causal Diagram. Dataset pooling under completely arbitrary settings is challenging to study systematically. So, we assume that the site-specific imaging datasets are not significantly different to begin with, although the distributions of covariates such as age/disease prevalence may not be perfectly matched, and each of these factors will influence the data. We assume access to a causal diagram describing how these variables influence the measurements. We show how the distribution matching criteria provided by a causal diagram can be nicely handled for some ordinal covariates that are not perfectly matched across sites by adapting ideas from equivariant representation learning.

Contributions. We propose a method to pool multiple neuroimaging datasets by learning representations that are robust to site (scanner) and covariate (age) values (see Fig. 1 for a visualization). We show that continuous nuisance covariates which do not have the same support and are not identically distributed across sites can be effectively handled when learning invariant representations. We do not require finding "closest match" participants across sites – a strategy loosely based on covariate matching [39] from statistics, which is less feasible if the distributions for a covariate (e.g., age) do not closely overlap. Our model is based on adapting recent results on equivariance together with known concepts from group theory. When tied with common invariant representation learners, our formulation allows far improved analysis of pooled imaging datasets. We first perform evaluations on common fairness datasets and then show applicability on two separate neuroimaging tasks with multiple nuisance variables.

2. Reducing Multi-site Pooling to Infinite Dimensional Optimization

Let X denote an image of a participant and let Y be the corresponding (continuous or discrete) response variable or target label (such as a cognitive score or disease status). For simplicity, consider only two sites – site1 and site2. Let D represent the site-specific shifts, biases or covariates that we want to take into account. One possible data generation process relating these variables is shown in Figure 2a.

Site-specific biases/confounds. Observe that Y is, in fact, influenced by high-level (or latent) features X_C specific to the participant. The images (or image-based disease biomarkers) X are simply our (lossy) measurement of the participant's brain X_C [14]. Further, X also includes an (unknown) confound: a contribution from the scanner (or acquisition protocol). Figure 2a also lists covariates C, such as age and other factors, which impact X_C (and therefore X). A few common site-specific biases D are shown in Fig. 2a. These include (i) a population bias D_popul that leads to differences in the age or gender distributions of the cohort [9]; (ii) an acquisition shift D_acqui resulting from different scanners or imaging protocols – this affects X but not X_C; and (iii) a class prevalence bias D_preval, e.g., healthier individuals over-represented in site2 will impact the distribution of cognitive scores across sites.

For imaging data, in principle, site-invariance can be achieved via an encoder-decoder style architecture that maps the images X into a "site invariant" latent space Φ(X). Here, Φ(X), in the idealized setting, corresponds to the true "causal" features X_C that are comparable across sites. In practice, we know that images cannot fully capture the disease – so Φ(X) is simply a surrogate, limited by the measurements we have on hand. Given these caveats, an architecture is shown in Fig. 1. Ideally, the encoder will minimize the Maximum Mean Discrepancy (MMD) [20] or another discrepancy between the distributions of the latent representations Φ(·) of the data from site1 and site2.

The site-specific attributes D are often unobserved or otherwise unavailable. For instance, we may not have full access to the D_popul from which our participants are drawn. To tackle these issues, we use a causal diagram, see Fig. 2a, similar to existing works [31,55] with minimal changes. For dealing with unobserved D's, some standard approaches are known [22]. Let us see how they can help here. Applying d-separation (see [22,36]) to Fig. 2a, we see that the nodes (D_popul, C, X_C) form a so-called "head-to-tail" branch and the nodes (D_acqui, X, X_C), (D_preval, Y, X_C) form "head-to-head" branches. This implies that X_C ⊥⊥ D | C. This is exactly an invariance condition: X_C should not change across different sites for samples with the same value of C. To enforce this using Φ(·), we must optimize a discrepancy between the site-wise Φ(X)'s at a given value of C,

min_Φ  MMD( P_site1( Φ(X) | C ),  P_site2( Φ(X) | C ) )    (1)
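To make (1) concrete, a minimal empirical sketch can stratify C into bins and average a kernel MMD between site-wise representations within each stratum. The function names, the RBF kernel, and the binning scheme here are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rbf_mmd2(a, b, sigma=1.0):
    """Biased empirical MMD^2 between samples a, b under an RBF kernel."""
    def k(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def conditional_mmd(phi1, c1, phi2, c2, bins):
    """Average MMD^2 between site-wise Phi(X) within each stratum of C."""
    vals = []
    for lo, hi in bins:
        s1 = phi1[(c1 >= lo) & (c1 < hi)]   # site1 latents with C in [lo, hi)
        s2 = phi2[(c2 >= lo) & (c2 < hi)]   # site2 latents with C in [lo, hi)
        if len(s1) and len(s2):
            vals.append(rbf_mmd2(s1, s2))
    return float(np.mean(vals))
```

With matched conditional distributions the criterion stays near zero; a shift in either site's representations within any age bin drives it up.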

Provably solving (1)? A brief comment on the difficulty of the distributional optimization in (1) is useful. Generic tools for (worst case) convergence rates for such problems are actively being developed [51]. For the average case, [38] presents an online method for a specific class of (finite dimensional) distributionally robust optimization problems that can be defined using standard divergence measures. Observe that even these convergence guarantees are local in nature, i.e., they output a point that satisfies necessary conditions which may not be sufficient.

In practice, the outlook is a little better. Intuitively, an optimal matching of the conditional distributions P(Φ(X) | C) across the two sites corresponds to a (globally) optimal solution of the probabilistic optimization task in (1). Existing works show that it is indeed possible to approach this computationally via sub-sampling methods [55] or by learning elaborate matching functions to identify image or object pairs across sites that are "similar" [31] or have the same value for C. Sub-sampling, by definition, reduces the number of samples from the two sites by discarding samples outside of the common support. This impacts the quality of the estimator – for instance, [55] must restrict the analysis to only the age range of C which overlaps or is shared across sites. Such discarding of samples is clearly undesirable when each image acquisition is expensive. Matching functions also do not work if the support of C is not identical across the sites, as briefly described next.

Example 2.1. Let C denote an observed covariate, e.g., age. Consider X_i at site1 with C = c_1 and X_j at site2 with C = c_2. If c_1 ≈ c_2, a matching will seek Φ(X_i) ≈ Φ(X_j) in X_C space. If c_1 falls outside the support of the c's acquired at site2, one must estimate not only Φ(·) but also a transport expression Γ_{c_2→c_1}(·) on X_C such that Φ(X_i) ≈ Γ_{c_2→c_1}(Φ(X_j)). The "transportation" involves estimating what a latent image acquired at age c_2 would look like at age c_1. This means that matching would itself require a solution to the key difficulty, obtained further upstream.
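To make the cost of the common-support restriction concrete, a toy computation (with hypothetical, not ADNI, age ranges) shows how many samples a sub-sampling approach would simply drop:

```python
import numpy as np

rng = np.random.default_rng(0)
age_site1 = rng.uniform(55, 75, size=300)   # hypothetical age range at site1
age_site2 = rng.uniform(65, 90, size=300)   # hypothetical age range at site2

# Common support of the age covariate across the two sites: roughly [65, 75].
lo = max(age_site1.min(), age_site2.min())
hi = min(age_site1.max(), age_site2.max())

kept = ((age_site1 >= lo) & (age_site1 <= hi)).sum() + \
       ((age_site2 >= lo) & (age_site2 <= hi)).sum()
frac_discarded = 1 - kept / (len(age_site1) + len(age_site2))
# A large fraction of samples falls outside the overlap and would be dropped.
assert frac_discarded > 0.3
```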

2.1. Improved Distribution Matching via Equivariant Mappings may be possible

Ignoring Y for the moment, recall that matching here corresponds to a bijection between unlabeled (finite) conditional distributions. If the conditional distributions take specific forms such as a Poisson process, it is indeed possible to use simple matching algorithms that only require access to pairwise ranking information on the corresponding empirical distributions [43], for example, the well-known Gale-Shapley algorithm [46]. Unfortunately, in the applications we consider here, such distributional assumptions may not be fully faithful with respect to site-specific covariates C. In essence, we want representations Φ (when viewed as a function of C) that vary in a predictable (or say deterministic) manner – if so, we can avoid matching altogether and instead match a suitable property of the site-wise distributions of the representation Φ(X). We can make this criterion more specific: we want the site-wise distributions to vary in a manner where the "trend" is consistent across the sites. Assume that this were not true, say P(Φ(X) | C) is continuous and monotonically increases with C for site1 but monotonically decreases for site2. A match of P(Φ(X) | C) across the sites at a particular value C = c then implies at least one C = c′ where P(Φ(X) | C) does not match. The monotonicity argument is weak for high-dimensional Φ; plus, we have multiple nuisance variables.

It turns out that our requirement for P(Φ(X) | C) to vary in a predictable manner across sites can be handled using the idea of equivariant mappings, i.e., P(Φ(X) | C) must be equivariant with respect to C for both sites. In addition, we will also seek invariance to scanner attributes.

Figure 3. Visualization of Stage one. First, an image pair X_i, X_j is mapped onto a hypersphere using an encoder E. The resulting pair ℓ_i, ℓ_j is passed through the τ network to map them into the space of rotation matrices (the quotient group denoted by G/H). Fact 3 ensures that τ is a G = SO(n)-equivariant map. G(i, j) / G(i, j)^{-1} is the group action transforming τ(ℓ_i) to τ(ℓ_j) / τ(ℓ_j) to τ(ℓ_i), respectively.

we will also seek invariance to scanner attributes.

While we are familiar with the well-studied notion of

invariance through the MMD criterion [29], we will brieﬂy

formalize our idea behind an equivariant mapping which is

less common in this setting.

Deﬁnition 1. A mapping f:X → Y deﬁned over mea-

surable Borel spaces Xand Yis said to be G−equivariant

under the action of group Giff

f(g·x) = g·f(x), g ∈G

We refer the reader to two recent surveys, Section 2.1 of [7] and Section 3.1 of [8], which provide a detailed review.
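A minimal numeric check of Definition 1, using G = SO(2) acting on R^2 and a simple scaling map f (our toy choice; any map commuting with rotations would do), is:

```python
import numpy as np

def rot(theta):
    """An element g of the group G = SO(2)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def f(x):
    # Scaling commutes with rotation, so f is G-equivariant.
    return 2.0 * x

g = rot(0.7)
x = np.array([1.0, -0.5])
assert np.allclose(f(g @ x), g @ f(x))   # f(g·x) == g·f(x)
```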

Equivariance is often understood in the context of a group action (say, of a matrix group) [24,28]. While the covariate vector C lives in a vector space (and every vector space is an abelian group), since this group will eventually act on the latent space of our images, imposing additional structure will be beneficial. To do so, we will utilize a mapping between the C's and a group suitable for our setting. Once this is accomplished, we will derive an equivariant encoder. We discuss these steps next.

3. Methods

The dual goals of (i) equivariance to covariates (such as age) and (ii) invariance to site involve learning multiple mappings. For simplicity, and to keep the computational effort manageable, we divide our method into two stages. Briefly, our stages are: (a) Stage one: Equivariance to Covariates. We learn a mapping to a space that provides the essential flexibility to characterize changes in a covariate C as a group action. This enables us to construct a space satisfying the equivariance condition as per Def. 1. (b) Stage two: Invariance to Site. We learn a second encoding to a generic vector space while ensuring a priori that the equivariance properties from Stage one are preserved. Such an encoding is then tuned to optimize the MMD criterion, thus generating a latent space that is invariant to site while remaining equivariant to the covariates. We describe these stages one by one in the following sections.

3.1. Stage one: Equivariance to Covariates

Given the space of images X with covariates C, we first want to characterize the effect of C on X as a group action for some group G. Here, an element g ∈ G characterizes the change from covariate c_i ∈ C to c_j ∈ C (for short, we will use i and j). The change in C corresponds to a translation action, which is difficult to instantiate in X without invoking expensive conditional generative models. Instead, we propose to learn a mapping to a latent space L such that the change in C can be characterized by a group action pertaining to G in the space L (the latent space of X). As an example, say X_i goes to X_j in X, written (X_i → X_j), and that this change is caused by the covariate change (c_i → c_j) in C. Let E be a mapping between the image space X and the latent space L. In the latent space L, for the moment, we want (EX_i → EX_j) to correspond to the change in covariate (c_i → c_j).

Remark 2. We are mostly interested in normalized covariates, for example via an ℓ_p norm, though other deterministic volume-based normalization functions may also be applicable. In the simplest case of the p = 2 norm, the corresponding group action is naturally induced by the matrix group of rotations.

Based on this choice of group, we will learn an autoencoder (E, D) with an encoder E: X → L and a decoder D: L → X, where L is the encoding space. Due to Remark 2, we can choose L to be a hypersphere, S^{n−1}, and (E, D) to be a hyperspherical autoencoder [54]. Then, we can characterize the "action of C on X" as the action of G on S^{n−1}. That is to say that a covariate change (translation in C) is a change of angles on L. This corresponds to a rotation due to the choice of our group G. Note that for L = S^{n−1}, G is the space of n×n rotation matrices, denoted by SO(n), and the action of G is well-defined. What remains is to encourage the latent space L to be G-equivariant. We start with some group theoretic properties that will be useful.
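As a sketch of the hyperspherical encoding described above (with a toy linear encoder standing in for a deep network), the latent outputs are simply normalized onto S^{n−1}:

```python
import numpy as np

def to_sphere(z, eps=1e-12):
    """Project encoder outputs onto the hypersphere S^{n-1}."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

class SphereEncoder:
    """Toy linear stand-in for E: X -> S^{n-1} (a real E is a deep network)."""
    def __init__(self, w):
        self.w = w
    def __call__(self, x):
        return to_sphere(x @ self.w)

rng = np.random.default_rng(0)
enc = SphereEncoder(rng.normal(size=(8, 3)))
ell = enc(rng.normal(size=(5, 8)))
assert np.allclose(np.linalg.norm(ell, axis=1), 1.0)   # points lie on S^2
```

A hyperspherical autoencoder keeps this same final normalization while replacing the linear map with a learned network.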

3.1.1 Review: Group theoretic properties of SO(n)

Let SO(n) = { X ∈ R^{n×n} | XᵀX = I_n, det(X) = 1 } be the group of n×n special orthogonal matrices. The group SO(n) acts on S^{n−1} with the group action "·" given by g·ℓ ↦ gℓ, for g ∈ SO(n) and ℓ ∈ S^{n−1}; here gℓ denotes the multiplication of the matrix g with ℓ. Under this group action, we can identify S^{n−1} with the quotient space G/H, with G = SO(n) and H = SO(n−1) (see Ch. 3 of [13] for more details). Let τ: S^{n−1} → G/H be such an identification, i.e., τ(ℓ) = gH for some g ∈ G. The identification τ is equivariant to G in the following sense.

Fact 3. Given τ: S^{n−1} → G/H as defined above, τ is equivariant with the action of G, i.e., τ(g·ℓ) = gτ(ℓ).

Next, we see that given two points ℓ_i, ℓ_j on S^{n−1}, there is a unique group element in G to move from τ(ℓ_i) to τ(ℓ_j).

Lemma 4. Given two latent space representations ℓ_i, ℓ_j ∈ S^{n−1}, and the corresponding cosets g_i H = τ(ℓ_i) and g_j H = τ(ℓ_j), ∃! g_ij = g_j g_i^{-1} ∈ G such that ℓ_j = g_ij · ℓ_i.
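Lemma 4 can be checked numerically in the simplest case n = 2, where S^1 is identified with SO(2)/SO(1) and g_ij is the relative rotation:

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

# Two latent points on S^1, written as g_i, g_j applied to a base point.
a, b = 0.3, 1.4
base = np.array([1.0, 0.0])
li, lj = rot(a) @ base, rot(b) @ base

g_ij = rot(b) @ rot(a).T          # g_j g_i^{-1}
assert np.allclose(g_ij @ li, lj) # l_j = g_ij · l_i
```

For n > 2, uniqueness comes from fixing the coset representatives g_i, g_j, exactly as the lemma states.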

Thanks to Fact 3 and Lemma 4, simply identifying a suitable τ will provide us the necessary equivariance property. To do so, next, we parameterize τ by a neural network and describe a loss function to learn such a τ and (E, D).

3.1.2 Learning a G-equivariant τ with DNNs

Now that we have established the key components – (a) an autoencoder (E, D) to map from X to the latent space S^{n−1} and (b) a mapping τ: S^{n−1} → SO(n) which is G = SO(n)-equivariant, see Figure 3 – we discuss how to learn such an (E, D) and a G-equivariant τ.

Let X_i, X_j ∈ X be two images with the corresponding covariates i, j ∈ C, i ≠ j. Let ℓ_i = E(X_i), ℓ_j = E(X_j). Using Lemma 4, we can see that a g_ij ∈ G to move from ℓ_i to ℓ_j does exist and is unique. Now, to learn a τ that satisfies the equivariance property (Fact 3), we will need τ to satisfy two conditions: τ(g_ij · ℓ_i) = g_ij τ(ℓ_i) and τ(g_ji · ℓ_j) = g_ji τ(ℓ_j). The two conditions are captured in the following loss function,

ℓ_i = E(X_i),   ℓ_j = E(X_j)    (2)

L_stage1 = Σ_{{(X_i, i), (X_j, j)} ⊂ X×C}  || G(i, j)·τ(ℓ_i) − τ(ℓ_j) ||²  +  || G^{-1}(i, j)·τ(ℓ_j) − τ(ℓ_i) ||²    (3)

Here, G: C×C → G is a table lookup given by (i, j) ↦ g_ij, i.e., the function that takes two values i, j of the covariate c corresponding to X_i, X_j ∈ X and simply returns the group element (rotation) g_ij needed to move from E(X_i) to E(X_j). Choice of G: In general, learning G is difficult since C may not be continuous. In this work, we fix G and learn τ by minimizing (3). We simplify the choice of G as follows: assuming that C is a numerical/ordinal random variable, we define G by (i, j) ↦ expm((i − j)·1_m). Here m = n(n−1)/2 is the dimension of G and expm is the matrix exponential, i.e., expm: so(n) → SO(n), where so(n) is the Lie algebra [21] of SO(n). Since so(n) is a vector space, (i − j)·1_m ∈ so(n). To reduce the runtime of expm, we replace expm with a Cayley map [32,42] defined by so(n) ∋ A ↦ (I − A)(I + A)^{-1} ∈ SO(n). (We describe the parameterization using expm; other choices are also suitable.)

Finally, we learn the encoder-decoder (E, D) by combining a reconstruction loss with L_stage1 in (3). The combined loss for this stage is L_stage1 + Σ_i ||X_i − D(E(X_i))||², where the second term is the reconstruction loss. The loss balances two terms and requires a scaling factor (see appendix §A.7). A flowchart of all steps in this stage is shown in Fig. 3.

Algorithm 1: Learning representations that are Equivariant to Covariates and Invariant to Site
Input: Training sets from multiple sites (X, Y)_site1, (X, Y)_site2. Nuisance covariates C.
Stage one: Equivariance to Covariates
  1: Parameterize the encoder-decoder pair (E, D) and the τ mapping with neural networks
  2: Optimize over (E, D) and τ to minimize L_stage1 + Σ_i ||X_i − D(E(X_i))||²
  Output: First latent space mapping E and a supporting mapping function τ. Here, τ is G-equivariant to the covariates C (see Lemma 4 and (3)).
Stage two: Invariance to Site
  1: Parameterize the encoder b, predictor h and decoder Ψ with neural networks
  2: Preserve equivariance from stage one with the equivariant mapping Φ (see Lemma 5)
  3: Optimize Φ, b, h and Ψ to minimize L_stage2 + MMD
  Output: Second latent space mapping Φ. Here, Φ is equivariant to the covariates and invariant to site.
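A sketch of the Cayley-map parameterization of G(i, j) described above; the placement of the all-ones element 1_m onto the upper triangle of a skew-symmetric matrix is our own choice of basis identification, and the helper names are ours:

```python
import numpy as np

def skew_from_ones(n, scale):
    """Embed scale * 1_m (m = n(n-1)/2) into so(n) as a skew-symmetric matrix."""
    A = np.zeros((n, n))
    iu = np.triu_indices(n, k=1)
    A[iu] = scale            # all m upper-triangular entries set to `scale`
    return A - A.T           # skew-symmetric, so A is an element of so(n)

def cayley(A):
    """Cayley map so(n) -> SO(n): A -> (I - A)(I + A)^{-1}."""
    I = np.eye(A.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

def G(i, j, n=4):
    """Group element for the covariate change i -> j (Cayley parameterization)."""
    return cayley(skew_from_ones(n, i - j))

g = G(72.0, 75.0)
assert np.allclose(g.T @ g, np.eye(4))       # orthogonal
assert np.isclose(np.linalg.det(g), 1.0)     # det 1, hence g ∈ SO(4)
```

Because the Cayley map only needs one matrix inverse, it avoids the eigendecomposition/series computation behind a generic matrix exponential.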

3.2. Stage two: Invariance to Site

Having constructed a latent space L that is equivariant to changes in the covariates C, we must now handle the site attribute, i.e., invariance with respect to site. Here, it will be convenient to project L onto a space that simultaneously preserves the equivariant structure of L and offers the flexibility to enforce site-invariance. The following lemma, inspired by the functional representations of probabilistic symmetries (§4.2 of [7]), provides us strategies to achieve this goal. In what follows, Φ: L → Z denotes this projection.

Lemma 5. For a τ: L → G/H as defined above, and for any arbitrary mapping b: L → Z, the function Φ: L → Z defined by

Φ(ℓ) = τ(ℓ) · b( τ(ℓ)^{-1} · ℓ )    (4)

is G-equivariant, i.e., Φ(g·ℓ) = gΦ(ℓ).

Figure 4. t-SNE plots of latent representations τ(ℓ). For ADNI (left) and Adult (right), an equivariant encoder ensures that the latent features are evenly distributed and bear a monotonic trend with respect to changes in the age covariate value. The non-equivariant space is generated from the Naïve pooling baseline. Each color denotes a discretized age group. Age was discretized only for the figure to highlight the density of samples in each age group.
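Lemma 5 can be sanity-checked numerically in a toy S^1 setting, with τ the angle-based coset representative and b an arbitrary nonlinear map (both are illustrative stand-ins, not the paper's networks):

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def tau(ell):
    """Identify ell ∈ S^1 with the rotation taking (1,0) to ell (coset rep in G/H)."""
    return rot(np.arctan2(ell[1], ell[0]))

def b(z):
    # Arbitrary (even nonlinear) map; Lemma 5 says Phi stays equivariant regardless.
    return np.tanh(3.0 * z) + np.array([0.5, -0.2])

def Phi(ell):
    t = tau(ell)
    return t @ b(t.T @ ell)   # tau(ell) · b( tau(ell)^{-1} · ell )

ell = rot(0.4) @ np.array([1.0, 0.0])
g = rot(1.1)
assert np.allclose(Phi(g @ ell), g @ Phi(ell))   # Phi(g·ell) == g·Phi(ell)
```

The trick is visible in the composition: τ(ℓ)^{-1}·ℓ is a group-invariant "canonical" point, so b can be any trainable network without breaking equivariance.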

The proof is available in the appendix §A.1. Note that Φ remains equivariant for any mapping b. This provides us the option to parameterize b as a neural network and train the entirety of Φ for the desired site invariance, where equivariance is preserved due to (4). In this work, we learn such a Φ: L → Z with the help of a decoder Ψ: Z → L by minimizing the following loss,

L_stage2 = Σ_{X∈X, Y∈Y, ℓ=E(X)∈L}  || ℓ − Ψ(Φ(ℓ)) ||²  (reconstruction loss)  +  || Y − h(Φ(ℓ)) ||²  (prediction loss)    (5)

subject to  Φ(ℓ) = τ(ℓ) · b( τ(ℓ)^{-1} · ℓ )   (G-equivariant map)    (6)

Minimizing the loss (5) with the constraint (6) allows learning the network b: L → Z and the decoder Ψ: Z → L. We are now left with requiring that the representations Z ∈ Z be invariant across the sites. We simply use the following MMD criterion, although other statistical distance measures can also be utilized.

MMD = || E_{Z1 ∼ P(Φ(ℓ))_site1} [K(Z1, ·)] − E_{Z2 ∼ P(Φ(ℓ))_site2} [K(Z2, ·)] ||_H    (7)

The criterion is defined using a Reproducing Kernel Hilbert Space with norm ||·||_H and kernel K. We combine (5), (6) and (7) as the objective function to ensure site invariance. Thus, the combined loss function L_stage2 + MMD is minimized to learn (Φ, Ψ). Scaling factor details are available in the appendix §A.7.
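Putting (5)–(7) together, a toy (non-neural) sketch of the combined stage-two objective might look as follows; Phi, Psi and h are placeholder callables, lam is a hypothetical scaling factor, and none of this reflects the paper's actual architectures:

```python
import numpy as np

def mmd2(a, b, sigma=1.0):
    """Biased RBF-kernel MMD^2 between two samples (empirical form of (7))."""
    k = lambda x, y: np.exp(-((x[:, None] - y[None, :]) ** 2).sum(-1)
                            / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

def stage2_loss(ell, y, site, Phi, Psi, h, lam=1.0):
    """L_stage2 + lam * MMD for vectorized toy maps Phi, Psi, h."""
    z = Phi(ell)
    recon = ((ell - Psi(z)) ** 2).sum(axis=1).mean()   # ||l - Psi(Phi(l))||^2
    pred = ((y - h(z)) ** 2).mean()                    # ||Y - h(Phi(l))||^2
    return recon + pred + lam * mmd2(z[site == 0], z[site == 1])
```

Since every term is a sum of squares (and the biased MMD^2 is a squared RKHS norm), the objective is nonnegative.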

Summary of the two stages. Our overall method comprises two stages. The first stage, Section 3.1, involves learning the τ function; the function learned in this stage is G-equivariant by the choice of the loss L_stage1, see (3). Our next stage, Section 3.2, employs the learned τ function and a trainable mapping b to generate invariant representations. This stage preserves G-equivariance due to the Φ mapping in (4). The loss for the second stage is L_stage2 + MMD, see (5). Our method is summarized in Algorithm 1. The convergence behavior of the proposed optimization (of τ, Φ) still seems challenging to characterize exactly, but recent papers provide some hope, and opportunities. For example, if the networks are linear, then results from [18] may be applicable, which could explain our superior empirical performance.

4. Experiments

We evaluate our proposed encoder for site-invariance and robustness to changes in the covariate values C. Evaluations are performed on two multi-site neuroimaging datasets, where algorithmic developments are likely to be most impactful. Prior to the neuroimaging datasets, we also conduct experiments on two standard fairness datasets, German and Adult. The inclusion of fairness datasets in our analysis provides a means for sanity tests and checks of optimization feasibility on an established problem. Here, the goal of achieving fair representations is treated as pooling multiple subsets of data indexed by separate sensitive attributes. We begin our analysis by first describing our measures of evaluation and then reporting baselines for comparisons.

Measures of Evaluation. Recall that our method involves learning τ as in (3) to satisfy the equivariance property. Moreover, we need to learn Φ as in (4)–(5) to achieve site invariance. Our measures assess the structure of the latent spaces τ(ℓ) and Φ(ℓ). The measures are: (a) ∆Eq: This metric evaluates the ℓ_2 distance between τ(ℓ_i) and τ(ℓ_j) for all pairs i, j. Formally, it is computed as

∆Eq = Σ_{{(X_i, i), (X_j, j)} ⊂ X×C; ℓ_i = E(X_i), ℓ_j = E(X_j)}  |i − j| · || τ(ℓ_i) − τ(ℓ_j) ||²    (8)
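A direct pairwise computation of the ∆Eq metric in (8) can be sketched as follows, with τ(ℓ) supplied as an array of latent features (a quadratic-time loop; our own naming):

```python
import numpy as np

def delta_eq(tau_ell, c):
    """Delta_Eq of eq. (8): sum over pairs of |c_i - c_j| * ||tau(l_i) - tau(l_j)||^2."""
    total = 0.0
    n = len(c)
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(c[i] - c[j]) * ((tau_ell[i] - tau_ell[j]) ** 2).sum()
    return total
```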

A higher value of this metric indicates that τ(ℓ_i) and τ(ℓ_j) are well separated in accordance with the group action g_ij. Additionally, we use t-SNE [48] to qualitatively visualize the effect of τ. (b) Adv: This metric quantifies the site-invariance achieved by the encoder Φ. We evaluate whether Φ(ℓ) for a learned ℓ ∈ L retains any information about the site. A three-layer fully connected network (see appendix §A.6) is trained as an adversary to predict site from Φ(ℓ), similar to [49]. A lower value of Adv, close to random chance, is desirable. (c) M: Here, we compute the MMD measure, as in (7), on the test set. A smaller value of M indicates better invariance to site. Lastly, (d) ACC: This metric reports the test set accuracy in predicting the target variable Y.

Figure 5. Statistical Analysis on the reconstructed outputs. The voxels that are significantly associated with Alzheimer's disease (p < 0.001) are shown. Adjustments for multiple comparisons were made using Bonferroni correction. A high density of significant voxels indicates that our method preserves disease-related signal after pooling across scanners.

Baselines for Comparison. We contrast our method's performance with a few well-known baselines. (i) Naïve: a naïve approach of pooling data from multiple sites without any scheme to handle nuisance variables. (ii) MMD [29]: this method minimizes the distribution differences across the sites without any requirement of equivariance to the covariates. The latent representations, being devoid of the equivariance property, result in lower accuracy values, as we will see shortly. (iii) CAI [49]: this method introduces a discriminator to train the encoder in a minimax adversarial fashion. The training routine directly optimizes the Adv measure above. While being a powerful implicit data model, adversarial methods are known to have unstable training and lack convergence guarantees [40]. (iv) SS [55]: this method adopts a Sub-sampling (SS) framework to divide the images across the sites by the covariate values C. An MMD criterion is minimized individually for each of the sub-sampled groups and an average estimate is computed. Lastly, (v) RM [33]: also used in [31], RandMatch (RM) learns invariant representations on samples across sites that "match" in terms of the class label (we match based on both Y and C values). Below, we summarize each method and the nuisance attribute correction adopted by it.

Correction    Naïve  MMD [29]  CAI [49]  SS [55]  RM [33]  Ours
Site          ✗      ✓         ✓         ✓        ✓        ✓
Covariates    ✗      ✗         ✗         ✓        ✓        ✓

Table 1. Baselines in the paper and their nuisance attribute correction.

We evaluate all methods on the test partition provided with the datasets. The mean of each metric over three random seeds is reported. Hyper-parameter selection is done on a validation split from the training set, such that the prediction accuracy falls within a 5% window relative to the best performing model [10] (more details in appendix §A.2).

4.1. Obtaining Fair Representations

We approach the problem of learning fair representations

through our multi-site pooling formulation. Speciﬁcally, we

consider each sensitive attribute value as a separate site. Re-

sults on two benchmark datasets, German and Adult [11],

are described below.

German Dataset. This dataset poses a classification problem: predicting defaults on consumer loans in the German market. Among the several features in the dataset, the attribute foreigner is chosen as the sensitive attribute. We train our encoder while maintaining equivariance with respect to the continuous-valued age feature. Table 2 provides a summary of the results in comparison to the baselines. Our equivariant encoder maximizes the ΔEq metric, indicating that the latent space τ(ℓ) is well separated for different values of age. Further, the invariance constraint improves the Adv metric, signifying better elimination of sensitive attribute information from the representations. The M metric is higher relative to the other baselines. The ACC for all the methods is within a 2% range.

Adult Dataset. In the Adult dataset, the task is to predict whether a person has an income higher (or lower) than $50K per year. The dataset is biased with respect to gender: roughly 1-in-5 women (in contrast to 1-in-3 men) are reported to make over $50K. Thus, the female/male genders are considered as two separate sites, with age as a nuisance covariate feature. As shown in Table 2, our equivariant encoder improves on the metrics ΔEq and Adv relative to all the baselines, similar to the German dataset. In addition to the quantitative metrics, we visualize the t-SNE plots of the representations τ(ℓ) in Fig. 4 (right). It is clear from the figure that an equivariant encoder imposes a certain monotonic trend as the Age values are varied.

4.2. Pooling Brain Images across Scanners

For our motivating application, we focus on pooling tasks for two different brain imaging datasets, where the problem is to classify individuals diagnosed with Alzheimer's disease (AD) versus healthy controls (CN).

Setup. Images are pre-processed by first normalizing and then skull-stripping using Freesurfer [15]. A linear (affine) registration is performed to register each image to MNI template space. Models are trained using 3D convolutions with a ResNet [23] backbone (details in the appendix §A.6). Since the datasets are small, we report results over five random training-validation splits.

Figure 6. Distribution of the age covariate in the ADNI dataset. Two settings are considered: (left) the intersection of the support is large, and (right) a smaller common support. Despite the mismatch of support across scanner attributes, our approach minimizes the MMD measure (desirable) on the test set relative to the naïve pooling method.

ADNI Dataset. The data for this experiment has been downloaded from the Alzheimer's Disease Neuroimaging


∆Eq: Equivariance Gap, Adv: Adversarial Test Accuracy, M: Test MMD measure, ACC: Test prediction accuracy
↑: higher value is preferred, ↓: lower value is preferred

German
Method      ∆Eq ↑      Adv ↓        M ↓       ACC ↑
Naïve       4.6(0.7)   0.62(0.03)   7.7(0.8)  74(0.9)
MMD [29]    4.5(1.0)   0.66(0.04)   1.5(0.3)  73(1.5)
CAI [49]    1.9(0.6)   0.65(0.01)   1.2(0.2)  76(1.3)
SS [55]     3.8(0.5)   0.70(0.07)   1.5(0.6)  76(0.9)
RM [33]     3.4(0.4)   0.66(0.04)   7.5(0.9)  74(2.1)
Ours        6.4(0.6)   0.54(0.01)   2.7(0.6)  75(3.3)

Adult
Method      ∆Eq ↑      Adv ↓     M ↓       ACC ↑
Naïve       3.4(0.7)   83(0.1)   9.8(0.3)  84(0.1)
MMD [29]    3.4(0.9)   83(0.1)   3.1(0.3)  84(0.1)
CAI [49]    0.1(0.0)   81(0.7)   4.2(2.4)  84(0.04)
SS [55]     2.8(0.5)   83(0.2)   1.5(0.2)  84(0.1)
RM [33]     0.8(0.1)   82(0.4)   4.8(0.7)  84(0.3)
Ours        5.3(0.9)   75(1.4)   7.1(0.6)  83(0.1)

ADNI
Method      ∆Eq ↑      Adv ↓     M ↓       ACC ↑
Naïve       3.1(1.0)   59(2.9)   27(1.6)   80(2.6)
MMD [29]    3.1(1.0)   59(3.3)   27(1.7)   80(2.6)
CAI [49]    2.4(0.7)   61(2.1)   27(1.5)   74(3.6)
SS [55]     3.7(0.5)   57(2.1)   26(1.6)   81(3.7)
RM [33]     0.8(0.9)   52(5.4)   22(0.6)   78(3.8)
Ours        5.1(1.2)   50(4.2)   16(7.2)   77(4.8)

ADCP
Method      ∆Eq ↑      Adv ↓      M ↓        ACC ↑
Naïve       4.1(0.9)   49(8.4)    90(8.7)    83(4.4)
MMD [29]    3.6(1.0)   49(11.9)   86(11.0)   84(6.5)
CAI [49]    2.8(1.6)   56(6.9)    85(12.3)   82(5.1)
SS [55]     3.4(1.3)   51(6.7)    88(14.6)   82(3.5)
RM [33]     0.4(0.5)   40(4.7)    77(13.8)   84(5.3)
Ours        7.5(1.2)   49(7.3)    70(22.3)   81(1.8)

Table 2. Quantitative Results. We show Mean(Std) results over multiple runs. For our baselines, we consider a Naïve encoder-decoder model, learning representations via minimizing the MMD criterion [29], and Adversarial training [49], termed CAI. We also compare against Sub-sampling (SS) [55], which minimizes the MMD criterion separately for every age group, and the RandMatch (RM) [33] baseline, which generates matching input pairs based on the Age and target label values. The SS and RM baselines discard a subset of samples if a match across sites is not available. The measure Adv represents the adversarial test accuracy, except for the German dataset where ROC-AUC is used due to the high degree of skew in the data.

Initiative (ADNI) database (adni.loni.usc.edu). There are three scanner types in the dataset, namely GE, Siemens and Philips. Similar to the fairness experiments, equivariance is sought relative to the covariate Age. The values of Age lie in the range 50-95, as indicated in the density plot of Fig. 6 (left). The Age distribution is observed to vary across different scanners, albeit minimally, in the full dataset. In the t-SNE plot, Fig. 4 (left), we see that the latent space has an equivariant structure. Closer inspection of the plot shows that the representations vary in the same order as Age. Different colors indicate different Age sub-groups.

Next, in Fig. 5, we present the t-statistics in the template space on the reconstructed images after pooling. Here, the t-statistics measure the association with the AD/CN target labels. As seen in the figure, the voxels significantly associated with Alzheimer's disease (p < 0.001) are considerable in number. This result supports our goal of combining datasets to increase sample size and obtain high power in statistical analysis. Next, in Fig. 6 (right), we increase the difficulty of our problem by randomly sub-sampling each scanner group such that the intersection of support is minimized. In this extreme case, our method attains a better M metric relative to the Naïve method, thus justifying its applicability to situations where there is a mismatch of support across the sites. Lastly, we inspect the performance on the quantitative metrics on the entire dataset in Table 2. All metrics (ΔEq, Adv and M) improve relative to the baselines, with a small drop in ACC.

ADCP dataset. This experiment's data was collected as part of the NIH-sponsored Alzheimer's Disease Connectome Project (ADCP) [1,25]. It is a two-center MRI, PET, and behavioral study of brain connectivity in AD. Study inclusion criteria for AD/MCI (Mild Cognitive Impairment) patients consisted of age 55-90 years, retaining decisional capacity at the initial visit, and meeting criteria for probable AD or MCI. MRI images were acquired at three sites, which differ primarily in terms of patient demographics. We inspect the quantitative results of this experiment in Table 2 and place the qualitative results in the appendix §A.4, A.5. The table reveals considerable improvements in all our metrics relative to the Naïve method.

Limitations. Currently, our formulation assumes that the

to-be-pooled imaging datasets are roughly similar – there is

deﬁnitely a role for new developments in domain alignment

to facilitate deployment in a broader range of applications.

Secondly, larger latent space dimensions may incur computational overhead due to the matrix exponential parameterization. Finally, algorithmic improvements can potentially simplify the overhead of the two-stage training.

5. Conclusions

Retrospective analysis of data pooled from previous /

ongoing studies can have a sizable inﬂuence on identifying

early disease processes, not otherwise possible to glean

from analysis of small neuroimaging datasets. Our devel-

opment based on recent results in equivariant representation

learning offers a strategy to perform such analysis when

covariates/nuisance attributes are not identically distributed

across sites. Our current work is limited to a few such

variables but suggests that this direction is promising and

can potentially lead to more powerful algorithms.

Acknowledgments The authors are grateful to Vib-

hav Vineet (Microsoft Research) for discussions on

the causal diagram used in the paper. Thanks to Amit

Sharma (Microsoft Research) for the conversation on

their MatchDG project. Special thanks to Veena Nair

and Vivek Prabhakaran from UW Health for helping with

the ADCP dataset. Research supported by NIH grants to

UW CPCP (U54AI117924), RF1AG059312, Alzheimer’s

Disease Connectome Project (ADCP) U01 AG051216,

and RF1AG059869, as well as NSF award CCF 1918211.

Sathya Ravi was also supported by UIC-ICR start-up funds.


References

[1] Nagesh Adluru, Veena A Nair, Vivek Prabhakaran, Shi-Jiang

Li, Andrew L Alexander, and Barbara B Bendlin. Geodesic

path differences in neural networks in the alzheimer’s dis-

ease connectome project: Developing topics. Alzheimer’s &

Dementia, 16:e047284, 2020. 8

[2] Paul S Aisen, Jeffrey Cummings, Clifford R Jack, John C

Morris, Reisa Sperling, Lutz Frölich, Roy W Jones, Sherie A

Dowsett, Brandy R Matthews, Joel Raskin, et al. On the path

to 2025: understanding the alzheimer’s disease continuum.

Alzheimer’s research & therapy, 9(1):1–10, 2017. 11,12

[3] Aditya Kumar Akash, Vishnu Suresh Lokhande, Sathya N

Ravi, and Vikas Singh. Learning invariant represen-

tations using inverse contrastive loss. arXiv preprint

arXiv:2102.08343, 2021. 2

[4] Jesper LR Andersson, Mark Jenkinson, Stephen Smith, et al.

Non-linear registration, aka spatial normalisation fmrib tech-

nical report tr07ja2. FMRIB Analysis Group of the University

of Oxford, 2(1):e21, 2007. 12

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David

Lopez-Paz. Invariant risk minimization. arXiv preprint

arXiv:1907.02893, 2019. 2

[6] Elias Bareinboim and Judea Pearl. Causal inference and the

data-fusion problem. Proceedings of the National Academy

of Sciences, 113(27):7345–7352, 2016. 2

[7] Benjamin Bloem-Reddy and Yee Whye Teh. Probabilistic

symmetries and invariant neural networks. Journal of Ma-

chine Learning Research, 21(90):1–61, 2020. 4,5

[8] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar

Veličković. Geometric deep learning: Grids, groups, graphs,

geodesics, and gauges. arXiv preprint arXiv:2104.13478,

2021. 4

[9] Daniel C Castro, Ian Walker, and Ben Glocker. Causal-

ity matters in medical imaging. Nature Communications,

11(1):1–10, 2020. 3

[10] Michele Donini, Luca Oneto, Shai Ben-David, John S

Shawe-Taylor, and Massimiliano Pontil. Empirical risk min-

imization under fairness constraints. In S. Bengio, H. Wal-

lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R.

Garnett, editors, Advances in Neural Information Processing

Systems, volume 31. Curran Associates, Inc., 2018. 7

[11] Dheeru Dua, Casey Graff, et al. Uci machine learning repos-

itory. 2017. 7

[12] Abhimanyu Dubey, Vignesh Ramanathan, Alex Pentland,

and Dhruv Mahajan. Adaptive methods for real-world do-

main generalization. In Proceedings of the IEEE/CVF Con-

ference on Computer Vision and Pattern Recognition, pages

14340–14349, 2021. 2

[13] David S. Dummit and Richard M. Foote. Abstract algebra.

Wiley, 3rd ed edition, 2004. 5

Simon F Eskildsen, Pierrick Coupé, Vladimir S Fonov,

Jens C Pruessner, D Louis Collins, Alzheimer’s Dis-

ease Neuroimaging Initiative, et al. Structural imaging

biomarkers of alzheimer’s disease: predicting disease pro-

gression. Neurobiology of aging, 36:S23–S31, 2015. 2

[15] Bruce Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.

7,11

Jean-Philippe Fortin, Drew Parker, Birkan Tunç, Takanori

Watanabe, Mark A Elliott, Kosha Ruparel, David R Roalf,

Theodore D Satterthwaite, Ruben C Gur, Raquel E Gur, et al.

Harmonization of multi-site diffusion tensor imaging data.

Neuroimage, 161:149–170, 2017. 1

[17] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pas-

cal Germain, Hugo Larochelle, François Laviolette, Mario

Marchand, and Victor Lempitsky. Domain-adversarial train-

ing of neural networks. The journal of machine learning

research, 17(1):2096–2030, 2016. 2

[18] Avishek Ghosh and Ramchandran Kannan. Alternating min-

imization converges super-linearly for mixed linear regres-

sion. In International Conference on Artiﬁcial Intelligence

and Statistics, pages 1093–1103. PMLR, 2020. 6

[19] Matthew F Glasser, Stamatios N Sotiropoulos, J Anthony

Wilson, Timothy S Coalson, Bruce Fischl, Jesper L Anders-

son, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R

Polimeni, et al. The minimal preprocessing pipelines for the

human connectome project. Neuroimage, 80:105–124, 2013.

1,12

[20] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard

Schölkopf, and Alex Smola. A kernel method for the two-

sample-problem. Advances in neural information processing

systems, 19:513–520, 2006. 3

[21] Marshall Hall. The theory of groups. Courier Dover Publi-

cations, 2018. 5

[22] Moritz Hardt and Benjamin Recht. Patterns, predictions,

and actions: A story about machine learning.https:

//mlstory.org, 2021. 3

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.

Identity mappings in deep residual networks. In European

conference on computer vision, pages 630–645. Springer,

2016. 7

[24] Mark Hunacek. Lie groups, lie algebras, and representa-

tions: An elementary introduction, by Brian Hall. The Mathematical Gazette, 89(514):149–151, 2005. 4

[25] Gyujoon Hwang, Cole John Cook, Veena A Nair, Andrew L

Alexander, Piero G Antuono, Sanjay Asthana, Rasmus Birn,

Cynthia M Carlsson, Guangyu Chen, Dorothy Farrar Ed-

wards, et al. Ic-p-161: Characterizing structural brain alter-

ations in alzheimer’s disease patients with machine learning.

Alzheimer’s & Dementia, 14(7S Part 2):P135–P136, 2018.

8

[26] Clifford R Jack Jr, Matt A Bernstein, Nick C Fox,

Paul Thompson, Gene Alexander, Danielle Harvey, Bret

Borowski, Paula J Britson, Jennifer L. Whitwell, Chadwick

Ward, et al. The alzheimer’s disease neuroimaging initia-

tive (adni): Mri methods. Journal of Magnetic Resonance

Imaging: An Ofﬁcial Journal of the International Society for

Magnetic Resonance in Medicine, 27(4):685–691, 2008. 1,

2

[27] Mark Jenkinson, Christian F Beckmann, Timothy EJ

Behrens, Mark W Woolrich, and Stephen M Smith. Fsl. Neu-

roimage, 62(2):782–790, 2012. 12

[28] Anthony W Knapp and AW Knapp. Lie groups beyond an

introduction, volume 140. Springer, 1996. 4


[29] Yujia Li, Kevin Swersky, and Richard Zemel. Learning un-

biased features. arXiv preprint arXiv:1412.5244, 2014. 4,7,

8

[30] Jingqin Luo, Folasade Agboola, Elizabeth Grant, Colin L

Masters, Marilyn S Albert, Sterling C Johnson, Eric M Mc-

Dade, Jonathan Vöglein, Anne M Fagan, Tammie Benzinger,

et al. Sequence of alzheimer disease biomarker changes in

cognitively normal adults: A cross-sectional study. Neurol-

ogy, 95(23):e3104–e3116, 2020. 1

[31] Divyat Mahajan, Shruti Tople, and Amit Sharma. Domain

generalization using causal matching. In International Con-

ference on Machine Learning, pages 7313–7324. PMLR,

2021. 2,3,7

[32] Ronak Mehta, Rudrasis Chakraborty, Yunyang Xiong, and

Vikas Singh. Scaling recurrent models via orthogonal

approximations in tensor trains. In Proceedings of the

IEEE/CVF International Conference on Computer Vision,

pages 10571–10579, 2019. 5

[33] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gi-

anfranco Doretto. Uniﬁed deep supervised domain adapta-

tion and generalization. In Proceedings of the IEEE inter-

national conference on computer vision, pages 5715–5725,

2017. 7,8

[34] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Gal-

styan, and Greg Ver Steeg. Invariant representations with-

out adversarial training. In S. Bengio, H. Wallach, H.

Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,

editors, Advances in Neural Information Processing Systems,

volume 31. Curran Associates, Inc., 2018. 2,11

[35] Prashant Pandey, Mrigank Raman, Sumanth Varambally,

and Prathosh AP. Domain generalization via inference-

time label-preserving target projections. arXiv preprint

arXiv:2103.01134, 2021. 2

[36] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell.

Causal inference in statistics: A primer. John Wiley & Sons,

2016. 3

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. El-

ements of causal inference: foundations and learning algo-

rithms. The MIT Press, 2017. 2

[38] Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An

online method for distributionally deep robust optimization,

2020. 3

[39] Paul R Rosenbaum and Donald B Rubin. Constructing a con-

trol group using multivariate matched sampling methods that

incorporate the propensity score. The American Statistician,

39(1):33–38, 1985. 2

[40] Florian Schaefer and Anima Anandkumar. Competitive gra-

dient descent. In H. Wallach, H. Larochelle, A. Beygelzimer,

F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in

Neural Information Processing Systems, volume 32. Curran

Associates, Inc., 2019. 7

Bernhard Schölkopf, Francesco Locatello, Stefan Bauer,

Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and

Yoshua Bengio. Toward causal representation learning. Pro-

ceedings of the IEEE, 109(5):612–634, 2021. 2

[42] Jon M Selig. Cayley maps for se (3). In 12th International

Federation for the Promotion of Mechanism and Machine

Science World Congress, page 6. London South Bank Uni-

versity, 2007. 5

[43] Nihar B Shah and Martin J Wainwright. Simple, robust and

optimal ranking from pairwise comparisons. The Journal of

Machine Learning Research, 18(1):7246–7283, 2017. 3

[44] Anja Soldan, Corinne Pettigrew, Anne M Fagan, Suzanne E

Schindler, Abhay Moghekar, Christopher Fowler, Qiao-Xin

Li, Steven J Collins, Cynthia Carlsson, Sanjay Asthana, et al.

Atn proﬁles among cognitively normal individuals and lon-

gitudinal cognitive outcomes. Neurology, 92(14):e1567–

e1579, 2019. 1

[45] Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. Pre-

venting failures due to dataset shift: Learning predictive

models that transport. In The 22nd International Conference

on Artiﬁcial Intelligence and Statistics, pages 3118–3127.

PMLR, 2019. 2

[46] Chung-Piaw Teo, Jay Sethuraman, and Wee-Peng Tan. Gale-

shapley stable marriage problem revisited: Strategic issues

and applications. Management Science, 47(9):1252–1267,

2001. 3

[47] Paul M Thompson, Jason L Stein, Sarah E Medland, Der-

rek P Hibar, Alejandro Arias Vasquez, Miguel E Renteria,

Roberto Toro, Neda Jahanshad, Gunter Schumann, Barbara

Franke, et al. The enigma consortium: large-scale collab-

orative analyses of neuroimaging and genetic data. Brain

imaging and behavior, 8(2):153–182, 2014. 1

[48] Laurens Van der Maaten and Geoffrey Hinton. Visualiz-

ing data using t-sne. Journal of machine learning research,

9(11), 2008. 6

[49] Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Gra-

ham Neubig. Controllable invariance through adversarial

feature learning. In I. Guyon, U. V. Luxburg, S. Bengio,

H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed-

itors, Advances in Neural Information Processing Systems,

volume 30. Curran Associates, Inc., 2017. 6,7,8,11

[50] Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Her-

ranz, and Shangling Jui. Generalized source-free domain

adaptation. In Proceedings of the IEEE/CVF International

Conference on Computer Vision, pages 8978–8987, 2021. 2

[51] Zhuoran Yang, Yufeng Zhang, Yongxin Chen, and Zhao-

ran Wang. Variational transport: A convergent particle-

based algorithm for distributional optimization. arXiv

preprint arXiv:2012.11554, 2020. 3

[52] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cyn-

thia Dwork. Learning fair representations. In International

conference on machine learning, pages 325–333. PMLR,

2013. 2

[53] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell.

Mitigating unwanted biases with adversarial learning. In

Proceedings of the 2018 AAAI/ACM Conference on AI,

Ethics, and Society, pages 335–340, 2018. 2

[54] Deli Zhao, Jiapeng Zhu, and Bo Zhang. Latent variables on

spheres for autoencoders in high dimensions. arXiv preprint

arXiv:1912.10233, 2019. 4

[55] Hao Henry Zhou, Vikas Singh, Sterling C Johnson, Grace

Wahba, Alzheimer’s Disease Neuroimaging Initiative, et al.

Statistical tests and identiﬁability conditions for pooling and


analyzing multisite datasets. Proceedings of the National

Academy of Sciences, 115(7):1481–1486, 2018. 2,3,7,8

[56] Hao Henry Zhou, Yilin Zhang, Vamsi K. Ithapu, Sterling C.

Johnson, Grace Wahba, and Vikas Singh. When can multi-

site datasets be pooled for regression? Hypothesis tests, ℓ2-

consistency and neuroscience applications. In Doina Precup

and Yee Whye Teh, editors, Proceedings of the 34th Interna-

tional Conference on Machine Learning, volume 70 of Pro-

ceedings of Machine Learning Research, pages 4170–4179.

PMLR, 06–11 Aug 2017. 1

A. Appendix

A.1. Proofs of theoretical results

In this section, we provide the proofs of Lemma 4 and Lemma 5 discussed in the main paper.

Lemma. Given two latent space representations $\ell_i, \ell_j \in S^{n-1}$, and the corresponding cosets $g_i H = \tau(\ell_i)$ and $g_j H = \tau(\ell_j)$, $\exists!\ g_{ij} = g_j g_i^{-1} \in G$ such that $\ell_j = g_{ij} \cdot \ell_i$.

Proof. Given $g_i H = \tau(\ell_i)$ and $g_j H = \tau(\ell_j)$, set $g_{ij} = g_j g_i^{-1} \in G$, so that $g_j H = g_{ij}\, g_i H$. Now, using the equivariance fact (3), we get

$g_j H = g_{ij}\, g_i H \;\Longrightarrow\; \tau(\ell_j) = g_{ij}\, \tau(\ell_i) \;\Longrightarrow\; \tau(\ell_j) = \tau(g_{ij} \cdot \ell_i).$

Since $\tau$ is an identification, i.e., a diffeomorphism, we get $\ell_j = g_{ij}\, \ell_i$. Note that $S^{n-1}$ is a Riemannian homogeneous space and the group $G$ acts transitively on $S^{n-1}$, i.e., given $x, y \in S^{n-1}$, $\exists\, g \in G$ such that $y = g \cdot x$. Hence, from $\ell_j = g_{ij}\, \ell_i$ and the transitivity property, we can conclude that $g_{ij}$ is unique.
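As a sanity check (our illustration, not part of the proof), the lemma can be verified numerically in the simplest homogeneous-space setting, the circle $S^1$ with $G = \mathrm{SO}(2)$:

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, an element of G = SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# coset representatives g_i, g_j mapping the base point e1 to l_i, l_j
e1 = np.array([1.0, 0.0])
g_i, g_j = rot(0.3), rot(1.2)
l_i, l_j = g_i @ e1, g_j @ e1

# the lemma's unique group element g_ij = g_j g_i^{-1}
g_ij = g_j @ g_i.T  # the inverse of a rotation is its transpose
assert np.allclose(g_ij @ l_i, l_j)  # l_j = g_ij . l_i
```

On $S^1$ the element is simply the rotation by the angle difference, and transitivity of the $\mathrm{SO}(2)$ action makes it unique.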

Lemma. For $\tau : \mathcal{L} \to G/H$ as defined above, and a mapping $b : \mathcal{L} \to \mathcal{Z}$, the function $\Phi : \mathcal{L} \to \mathcal{Z}$ defined by

$\Phi(\ell) = \tau(\ell) \cdot b\big(\tau(\ell)^{-1} \cdot \ell\big) \qquad (9)$

is $G$-equivariant, i.e., $\Phi(g \cdot \ell) = g\, \Phi(\ell)$.

Proof. Let $\ell \in \mathcal{L}$. Consider the $\Phi$ mapping of $g \cdot \ell$, that is, $\Phi(g \cdot \ell) = \tau(g \cdot \ell) \cdot b\big(\tau(g \cdot \ell)^{-1} \cdot (g \cdot \ell)\big)$.

Using fact (3) from the main paper, we have $\tau(g \cdot \ell) = g\, \tau(\ell)$ and $\tau(g \cdot \ell)^{-1} = \tau(\ell)^{-1} g^{-1}$. Substituting these into $\Phi(g \cdot \ell)$, we get

$\Phi(g \cdot \ell) = g\, \tau(\ell) \cdot b\big(\tau(\ell)^{-1} g^{-1} g \cdot \ell\big) = g\, \tau(\ell) \cdot b\big(\tau(\ell)^{-1} \cdot \ell\big).$

Thus, $\Phi(g \cdot \ell) = g\, \Phi(\ell)$.
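The $G$-equivariance of $\Phi$ can likewise be checked numerically on the circle, where the stabilizer $H$ of the base point is trivial and $\tau$ reduces to the rotation taking the base point to $\ell$; the map $b$ below is an arbitrary illustrative choice, not the paper's $b$:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def tau(l):
    """Equivariant section L -> G for the circle: the rotation
    taking the base point e1 = (1, 0) to l."""
    return rot(np.arctan2(l[1], l[0]))

def b(z):
    # an arbitrary, generally non-equivariant map b : L -> Z
    return np.array([z[0] ** 2 + 1.0, np.sin(z[1])])

def phi(l):
    # Phi(l) = tau(l) . b(tau(l)^{-1} . l), as in Eq. (9)
    return tau(l) @ b(tau(l).T @ l)

l = rot(0.7) @ np.array([1.0, 0.0])
g = rot(1.1)
assert np.allclose(phi(g @ l), g @ phi(l))  # G-equivariance holds
```

Even though `b` itself is not equivariant, conjugating by the section `tau` produces an equivariant `phi`, which is exactly the mechanism the lemma formalizes.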

A.2. Details on Evaluation Metrics

Recall from Section 4 of the paper our discussion of three metrics: ΔEq, Adv and M. While ΔEq and M are variants of distance measures on the latent space, Adv assesses the ability to predict the nuisance attributes from the latent representation (and is therefore probabilistic in nature). Observe that ΔEq and M are (Euclidean) distance measures and could be very different depending on the normalization of the vectors. For our purposes of evaluating these latent vectors/features in downstream tasks, we perform a simple feature normalization to obtain latent vectors in $[0, 1]$, given by

$\tilde{z}_i = \dfrac{z_i - \min(z_i)}{\max(z_i) - \min(z_i)}. \qquad (10)$

Our feature normalization is composed of two steps: (i) centering: the numerator in (10) ensures that the minimum of $z_i$ (along its coordinates) is $0$; and (ii) scaling: the denominator projects the features onto a ball at the origin of radius $\|z_i\|_\infty^{\geq} := \max(z_i) - \min(z_i) \geq 0$. Note that our scaling step can be thought of as the usual projection in a special case: when $z_i$ is guaranteed to be nonnegative (for example, when $z_i$ represents activations), $\|z_i\|_\infty^{\geq}$ simply corresponds to a lower bound of the usual infinity norm $\|z_i\|_\infty$ (hence a projection onto a scaled $\ell_\infty$ ball). We adopt this normalization only to compute the ΔEq and M measures, and not for model training.
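A minimal sketch of the normalization in (10), applied per latent vector (our illustration):

```python
import numpy as np

def minmax_normalize(z):
    """Per-vector min-max normalization of Eq. (10): subtract the
    coordinate-wise minimum, divide by the range, giving values in [0, 1]."""
    z = np.asarray(z, dtype=float)
    return (z - z.min()) / (z.max() - z.min())

z = np.array([2.0, 5.0, 11.0])
print(minmax_normalize(z))  # [0.         0.33333333 1.        ]
```

After this step the ΔEq and M distance measures are computed on a common scale across methods.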

For computing the Adv measure, we follow [49] to train an adversarial neural network that predicts the nuisance attributes. We use a three-layer fully connected network with batch normalization and train for 150 epochs. [34] uses a similar architecture for the adversaries, with 0, 1, 2, or 3 hidden layers. We found that a three-layer adversary is powerful enough to predict the nuisance attributes, and hence we use it to report the Adv measure.

A.3. Understanding ADNI dataset

Dataset. The data was downloaded from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. ADNI was set up with the objective of measuring the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD) using serial magnetic resonance imaging (MRI), positron emission tomography (PET), and other biological markers. We have three imaging protocol (scanner) types in the dataset, namely GE, Siemens and Philips. The count of AD/CN samples in each of these imaging protocols is provided in Table 3. An example illustration (borrowed from [2]) of the effect of different scanners on the images is shown in Figure 8.

Preprocessing. All images were first normalized and skull-stripped using Freesurfer [15]. A linear (affine) registration was performed to register each image to MNI template space.

(a) Variation due to scanner for a particular age group. (b) Variation due to covariates (age) in scanner 3.
Figure 7. Sample images from the ADCP dataset. (a) MRI images of control subjects from the ADCP dataset for different sites in the age group 70-80. (b) Images obtained from Site 3 for three extreme age groups. The gantt chart on top of the image indicates the respective age range in the other sites.

(a) GE (b) Siemens
Figure 8. Scanner effects on images. Two imaging protocols are shown: (a) Siemens, (b) GE. The yellow region is the cortical ribbon segmentation, and the green circle shows that imaging protocols from different manufacturers have an effect on the scan. Image borrowed from [2].

A.4. Understanding ADCP dataset

Participants. The data for ADCP was collected through an NIH-sponsored Alzheimer's Disease Connectome Project (ADCP) U01 AG051216. The study inclusion criteria for AD (Alzheimer's disease) / MCI (Mild Cognitive Impairment) patients consisted of age between 55-90 years, being willing and able to undergo all procedures, retaining decisional capacity at the initial visit, and meeting criteria for probable AD or for MCI.

Scanners. MRI images were acquired at three distinct sites on GE scanners. T1-weighted structural images were acquired using a 3D gradient-echo pulse sequence (repetition time (TR) = 604 ms, echo time (TE) = 2.516 ms, inversion time = 1060 ms, flip angle = 8°, field of view (FOV) = 25.6 cm, 0.8 mm isotropic). T2-weighted structural images were acquired using a 3D fast spin-echo sequence (TR = 2500 ms, TE = 94.398 ms, flip angle = 90°, FOV = 25.6 cm, 0.8 mm isotropic).

Table 3. Sample counts for the ADNI dataset
Imaging Protocol                          AD    CN
Manufacturer = GE Medical Systems         44    78
Manufacturer = Philips Medical Systems    32    50
Manufacturer = Siemens                    83   162

Preprocessing. The Human Connectome Project (HCP) minimal preprocessing pipeline version 3.4.0 [19] was followed for data processing. This pipeline is based on the FMRIB Software Library (FSL) [27]. Next, the T1w and T2w images are aligned, a B1 (bias field) correction is performed, and the subject's image in native structural volume space is registered to MNI space using FSL's FNIRT [4]. Only T1w images in the MNI space were used for further analysis and experiments.

Data Statistics. We plot the distributions of several attributes in this dataset conditioned on the site. In Figure 10, we show that the values of age and cognitive scores differ across the three sites in this dataset. Cognitive scores are computed based on a test administered to the patients; higher scores indicate higher cognitive function. Table 4 shows the sample counts for the target variable of prediction: AD (Alzheimer's disease) and the control group.

A.5. Visualizing the latent space

In Figure 4 of the paper, we have seen the latent space τ(ℓ) for the samples in the ADNI and the Adult datasets. Here, we show similar qualitative results for the German and the ADCP datasets in Figure 9 of the supplement. In the plots, the latent representations of a non-equivariant encoder are spread throughout the latent space. In contrast, the representations of an equivariant encoder, for a discretized value of Age, are localized to specific regions. Further, these representations have a monotonic behaviour with respect to the values of Age.

Table 4. Sample counts for the ADCP dataset
          AD   Control   Female   Male
Site 1    10     39        29      20
Site 2    10     33        30      13
Site 3     5     19        14      10


(a) ADCP Dataset (b) German Dataset
Figure 9. t-SNE plots of latent representations τ(ℓ). For both ADCP (left) and German (right), the latent vectors of the equivariant encoder are evenly distributed with respect to the age covariate value. The non-equivariant space is generated from the naïve pooling model. Different colors denote the discretized set of age covariate values present in the data.

(a) Age (b) Cognitive Score
Figure 10. Distribution of attributes in the ADCP dataset. On the left, we observe the distribution of age for the three different sites present in the ADCP dataset. On the right, we see the distribution of the cognitive scores. The cognitive scores are computed based on a test that assesses executive function. Higher scores indicate a higher level of cognitive flexibility. Both age and cognitive scores are observed to vary across the sites.

Listing 1. Residual Block
BatchNorm3d
Swish
Conv3d
BatchNorm3d
Swish
Conv3d

Listing 2. Fully Connected Block
AdaptiveAvgPool3d
Flatten
Dropout
Linear
BatchNorm1d
Swish
Dropout
Linear

A.6. Hyper-parameters and NN Architectures

For tabular datasets such as German and Adult, our encoders and decoders comprise fully connected networks with a hidden layer of 64 nodes. The dimension of the quotient latent space τ(ℓi) is 30. Adam is used as the default optimizer and the learning rate is adjusted based on the validation set.

Imaging datasets like ADNI and ADCP require 3D convolutions and a ResNet architecture as the backbone. The last layer is used to describe the quotient space τ(ℓi). We present the residual and the fully connected blocks below. Detailed architectures can be viewed in the code.

A.7. Scaling factors

Recall from Algorithm 1 of the main paper that our loss function for each stage comprises reconstruction and prediction losses, in addition to the objectives concerning equivariance and invariance. These multi-objective loss functions require scaling factors that upweight one objective over another; the scaling factors serve as hyper-parameters for the algorithm. In our experiments, we observed that the results were robust to a range of scaling factor choices. For the results reported in Table 1 of the paper, they were identified through cross-validation. Here we provide an example of the scaling factors used for the Adult dataset; please refer to the bash scripts available in the code for the scaling factors of the other datasets.

• Stage one: Equivariance to Covariates
  – Equivariance loss $\mathcal{L}^{\mathrm{stage1}}$. Scaling factor: 1.0
  – Reconstruction loss $\sum_i \|X_i - D(E(X_i))\|$. Scaling factor: 0.02

• Stage two: Invariance to Site
  – Invariance loss (MMD). Scaling factor: 0.1
  – Prediction loss $\|Y - h(\Phi(\ell))\|^2$. Scaling factor: 1.0
  – Reconstruction loss $\|\ell - \Psi(\Phi(\ell))\|^2$. Scaling factor: 0.1

We refer the reader to Algorithm 1 and Section 3 of the main paper for details on the notation used above.
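The two stage objectives can be sketched as weighted sums; the per-batch loss values below are hypothetical placeholders, and only the scaling factors come from the Adult dataset list above:

```python
def stage1_loss(equivariance, reconstruction, w_eq=1.0, w_rec=0.02):
    """Stage one: equivariance to covariates, with the reconstruction
    term down-weighted by its scaling factor."""
    return w_eq * equivariance + w_rec * reconstruction

def stage2_loss(invariance_mmd, prediction, reconstruction,
                w_inv=0.1, w_pred=1.0, w_rec=0.1):
    """Stage two: invariance to site, dominated by the prediction term."""
    return w_inv * invariance_mmd + w_pred * prediction + w_rec * reconstruction

# hypothetical loss values for one batch
print(stage1_loss(0.5, 10.0))      # 0.5 + 0.2 = 0.7
print(stage2_loss(2.0, 0.4, 1.0))  # 0.2 + 0.4 + 0.1 = 0.7
```

Cross-validation over these weights amounts to re-running both stages with different `w_*` values and applying the model selection rule from Section 4.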
