Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets

Pooling multiple neuroimaging datasets across institutions often enables improvements in statistical power when evaluating associations (e.g., between risk factors and disease outcomes) that may otherwise be too weak to detect. When there is only a single source of variability (e.g., different scanners), domain adaptation and matching the distributions of representations may suffice in many scenarios. But in the presence of more than one nuisance variable which concurrently influence the measurements, pooling datasets poses unique challenges, e.g., variations in the data can come from both the acquisition method as well as the demographics of participants (gender, age). Invariant representation learning, by itself, is ill-suited to fully model the data generation process. In this paper, we show how bringing recent results on equivariant representation learning (for studying symmetries in neural networks) instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution. In particular, we demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
Vishnu Suresh Lokhande
Rudrasis Chakraborty
Sathya N. Ravi
Vikas Singh
Our code is available on https:// github. com/ vsingh-
1. Introduction
Observational studies in many disciplines acquire cross-
sectional/longitudinal clinical and imaging data to under-
stand diseases such as neurodegeneration and dementia
[44]. Typically, these studies are sufficiently powered for
the primary scientific hypotheses of interest. However, sec-
ondary analyses to investigate weaker but potentially inter-
esting associations between risk factors (such as genetics)
and disease outcomes are often difficult when using com-
mon statistical significance thresholds, due to the small-
/medium sample sizes.
Figure 1. Learning Invariant Representations. In our framework, input images $X$ are pooled together from multiple sites. An encoder $\Phi$ maps $X$ to the latent representations $\Phi(X)$ that correspond to high-level causal features $X_C$ that influence the label prediction. Unlike the input images $X$, $\Phi(X)$ is robust to nuisance attributes like site (scanner) and covariates (age). $\Phi$ is trained alongside predictor $h$ and decoder $\Psi$.

Over the last decade, there have been coordinated large scale multi-institutional imaging studies (e.g., ADNI [26], NIH
All of Us and HCP [19]) but the types of data collected
or the project’s scope (e.g., demographic pool of partici-
pants) may not be suited for studying specific secondary
scientific questions. A “pooled” imaging dataset obtained
from combining roughly similar studies across different in-
stitutions/sites, when possible, is an attractive alternative.
The pooled datasets provide much larger sample sizes and
improved statistical power to identify early disease bio-
markers – analyses which would not otherwise be possi-
ble [16,30]. But even when study participants are consistent
across sites, pooling poses challenges. This is true even for
linear regression [56] – improvement in statistical power is
not always guaranteed. Partly due to these as well as other
reasons, high visibility projects such as ENIGMA [47] have
reported findings using meta-analysis methods.
arXiv:2203.15234v1 [cs.LG] 29 Mar 2022

Figure 2. (a) Causal Diagram: a causal diagram listing the variables of interest and their relationships for the multi-site pooling problem. Nodes $D_{popul}$, $D_{acqui}$ and $D_{preval}$ denote the population, acquisition and prevalence biases that vary across sites. $C$'s are covariates (like age or gender). $X_C$ denotes the high-level causal features of an image $X$ that influence the labels $Y$. Nodes in red d-separate the nodes in blue and green. (b) Variation due to site (scanner) for a particular age group: MRI images of control subjects from the ADNI [26] dataset for different scanners in the age group 70-80. (c) Variation due to covariates (age) in the Siemens scanner: images obtained from the Siemens scanner (i.e., fixing site) on control subjects for three extreme age groups. The Gantt chart on top of the image indicates the respective age range in the Philips and GE scanners. As observed, different scanner groups do not share a common support on the “age” covariate, so samples outside of the common support are discarded in naïve pooling approaches.

Data pooling and fairness. Even under ideal conditions, pooling imaging datasets across sites requires care. Assume that the participants across two sites, say site1 and site2, are perfectly gender matched with the same proportion of male/female and the age distribution (as well as the proportion of diseased/healthy controls) is also identical. In this idealized setting, the only difference between sites may come from variations in scanners or acquisition (e.g., pulse sequences). When training modern neural networks for a regression/classification task with imaging data obtained in this scenario, we may ask that the representations learned by the model be invariant to the categorical variable denoting
“site”. While this is not a “solved” problem, this strategy
has been successfully deployed based on results in invariant
representation learning [3,5,34] (see Fig. 1). One may al-
ternatively view this task via the lens of fairness – we want
the model’s performance to be fair with respect to the site
variable. This approach is effective, via constraints [52] or
using adversarial modules [17,53]. This setting also permits
re-purposing tools from domain adaptation [35,50,55] or
transfer learning [12] as a pre-processing step, before anal-
ysis of the pooled data proceeds.
Nuisance variables/confounds. Data pooling problems one often encounters in scientific research typically violate many of the conditions in the aforementioned example. The measured data $X$ at each site is influenced not only by the scanner properties but also by a number of other
covariates / nuisance variables. For instance, if the age dis-
tribution of participants is not identical across sites, com-
parison of the site-wise distributions is challenging because
it is influenced both by age and the scanner. An example
of the differences introduced due to age and scanner biases
is shown in Figures 2b,2c. With multiple nuisance vari-
ables, even effective tools for invariant representation learn-
ing, when used directly, can provide limited help. The data
generation process, and the role of covariates/nuisance vari-
ables, available via a causal diagram (Figure 2a), can inform
how the formulation is designed [6,45]. Indeed, concepts from causality have benefited various deep learning models [37,41]. Specifically, recent work [31] has shown the value of integrating structural causal models for domain generalization, which is related to dataset pooling.
Causal Diagram. Dataset pooling under completely arbi-
trary settings is challenging to study systematically. So, we
assume that the site-specific imaging datasets are not sig-
nificantly different to begin with, although the distributions
for covariates such as age/disease prevalence may not be
perfectly matched and each of these factors will influence
the data. We assume access to a causal diagram describing
how these variables influence the measurements. We show
how the distribution matching criteria provided by a causal
diagram can be nicely handled for some ordinal covariates
that are not perfectly matched across sites by adapting ideas
from equivariant representation learning.
Contributions. We propose a method to pool multiple neu-
roimaging datasets by learning representations that are ro-
bust to site (scanner) and covariate (age) values (see Fig. 1
for visualization). We show that continuous nuisance co-
variates which do not have the same support and are not
identically distributed across sites, can be effectively han-
dled when learning invariant representations. We do not
require finding “closest match” participants across sites –
a strategy loosely based on covariate matching [39] from
statistics which is less feasible if the distributions for a co-
variate (e.g., age) do not closely overlap. Our model is
based on adapting recent results on equivariance together
with known concepts from group theory. When tied with
common invariant representation learners, our formulation
allows far improved analysis of pooled imaging datasets.
We first perform evaluations on common fairness datasets
and then show its applicability on two separate neuroimag-
ing tasks with multiple nuisance variables.
2. Reducing Multi-site Pooling to Infinite Dimensional Optimization
Let $X$ denote an image of a participant and let $Y$ be the corresponding (continuous or discrete) response variable or target label (such as cognitive score or disease status). For simplicity, consider only two sites, site1 and site2. Let $D$ represent the site-specific shifts, biases or covariates that we want to take into account. One possible data generation process relating these variables is shown in Figure 2a.
Site-specific biases/confounds. Observe that $Y$ is, in fact, influenced by high-level (or latent) features $X_C$ specific to the participant. The images (or image-based disease biomarkers) $X$ are simply our (lossy) measurement of the participant's brain $X_C$ [14]. Further, $X$ also includes an (unknown) confound: contribution from the scanner (or acquisition protocol). Figure 2a also lists covariates $C$, such as age and other factors which impact $X_C$ (and therefore, $X$). A few common site-specific biases $D$ are shown in Fig. 2a. These include (i) a population bias $D_{popul}$ that leads to differences in age or gender distributions of the cohort [9]; (ii) an acquisition shift $D_{acqui}$ resulting from different scanners or imaging protocols, which affects $X$ but not $X_C$; and (iii) a class prevalence bias $D_{preval}$, e.g., healthier individuals over-represented in site2 will impact the distribution of cognitive scores across sites.
For imaging data, in principle, site-invariance can be achieved via an encoder-decoder style architecture to map the images $X$ into a “site invariant” latent space $\Phi(X)$. Here, $\Phi(X)$, in the idealized setting, corresponds to the true “causal” features $X_C$ that are comparable across sites. In practice, we know that images cannot fully capture the disease, so $\Phi(X)$ is simply a surrogate, limited by the measurements we have on hand. Given these caveats, an architecture is shown in Fig. 1. Ideally, the encoder will minimize Maximum Mean Discrepancy (MMD) [20] or another discrepancy between the distributions of latent representations $\Phi(\cdot)$ of the data from site1 and site2.
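As a rough illustration of the discrepancy mentioned above, the following minimal sketch estimates a squared MMD between two sets of latent codes. The Gaussian kernel, the `bandwidth` value, the biased (V-statistic) estimator and the synthetic "site" samples are all assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=2.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2(z1, z2, bandwidth=2.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k11 = rbf_kernel(z1, z1, bandwidth).mean()
    k22 = rbf_kernel(z2, z2, bandwidth).mean()
    k12 = rbf_kernel(z1, z2, bandwidth).mean()
    return k11 + k22 - 2.0 * k12

# Synthetic stand-ins for latent codes Phi(X) at two sites (hypothetical data).
rng = np.random.default_rng(0)
site1 = rng.normal(0.0, 1.0, size=(200, 8))
site2_same = rng.normal(0.0, 1.0, size=(200, 8))      # same distribution as site1
site2_shifted = rng.normal(1.5, 1.0, size=(200, 8))   # shifted distribution

mmd_same = mmd2(site1, site2_same)
mmd_shifted = mmd2(site1, site2_shifted)
```

In this toy setting the estimate for the shifted pair is markedly larger than for the matched pair, which is the behavior an encoder trained against such a criterion would try to remove.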
The site-specific attributes $D$ are often unobserved or otherwise unavailable. For instance, we may not have full access to $D_{popul}$ from which our participants are drawn. To tackle these issues, we use a causal diagram, see Fig. 2a, similar to existing works [31,55] with minimal changes. For dealing with unobserved $D$'s, some standard approaches are known [22]. Let us see how the diagram can help here. Applying d-separation (see [22,36]) on Fig. 2a, we see that the nodes $(D_{popul}, C, X_C)$ form a so-called “head-to-tail” branch and the nodes $(D_{acqui}, X, X_C)$, $(D_{preval}, Y, X_C)$ form a “head-to-head” branch. This implies that $X_C \perp\!\!\!\perp D \mid C$. This is exactly an invariance condition: $X_C$ should not change across different sites for samples with the same value of $C$. To enforce this using $\Phi(\cdot)$, we must optimize a discrepancy between site-wise $\Phi(X)$'s at a given value of $C$,

$$\min_{\Phi} \; \mathrm{MMD}\big(P_{site1}(\Phi(X) \mid C),\; P_{site2}(\Phi(X) \mid C)\big) \qquad (1)$$
Provably solving (1)? A brief comment on the difficulty of the distributional optimization in (1) is useful. Generic tools for (worst case) convergence rates for such problems are actively being developed [51]. For the average case, [38] presents an online method for a specific class of (finite dimensional) distributionally robust optimization problems that can be defined using standard divergence measures. Observe that even these convergence guarantees are local in nature, i.e., they output a point that satisfies necessary conditions, which may not be sufficient for optimality.
In practice, the outlook is a little better. Intuitively, an optimal matching of the conditional distributions $P(\Phi(X) \mid C)$ across the two sites corresponds to a (globally) optimal solution to the probabilistic optimization task in (1). Existing works show that it is indeed possible to approach this computationally via sub-sampling methods [55] or by learning elaborate matching functions to identify image or object pairs across sites that are “similar” [31] or have the same value for $C$. Sub-sampling, by definition, reduces the number of samples from the two sites by discarding samples outside of the common support. This impacts the quality of the estimator. For instance, [55] must restrict the analysis only to that age range of $C$ which overlaps or is shared across sites. Such discarding of samples is clearly undesirable when each image acquisition is expensive. Matching functions also do not work if the support of $C$ is not identical across the sites, as briefly described next.
Example 2.1. Let $C$ denote an observed covariate, e.g., age. Consider $X_i$ at site1 with $C = c_1$ and $X_j$ at site2 with $C = c_2$. If $c_1 \approx c_2$, a matching will seek $\Phi(X_i) \approx \Phi(X_j)$ in $X_C$ space. If $c_1$ falls outside the support of $c$'s acquired at site2, one must not only estimate $\Phi(\cdot)$ but also a transport expression $\Gamma_{c_2 \to c_1}(\cdot)$ on $X_C$ such that $\Phi(X_i) \approx \Gamma_{c_2 \to c_1}(\Phi(X_j))$. The “transportation” involves estimating what a latent image acquired at age $c_2$ would look like at age $c_1$. This means that matching would need a solution to the key difficulty, obtained further upstream.
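The cost of restricting the analysis to the common support of a covariate can be made concrete with a small sketch. The ages below are made-up values, and `common_support_mask` is a hypothetical helper used only for illustration; it is not the sub-sampling procedure of [55].

```python
import numpy as np

def common_support_mask(c_site1, c_site2):
    """Return boolean masks of samples whose covariate lies in the range shared by both sites."""
    lo = max(c_site1.min(), c_site2.min())
    hi = min(c_site1.max(), c_site2.max())
    keep1 = (c_site1 >= lo) & (c_site1 <= hi)
    keep2 = (c_site2 >= lo) & (c_site2 <= hi)
    return keep1, keep2, (int(lo), int(hi))

# Hypothetical participant ages at two sites with only partial overlap.
ages_site1 = np.array([55, 60, 62, 68, 71, 74, 79, 83])
ages_site2 = np.array([70, 72, 75, 78, 81, 85, 88, 90])

keep1, keep2, support = common_support_mask(ages_site1, ages_site2)
n_discarded = int((~keep1).sum() + (~keep2).sum())
n_total = len(ages_site1) + len(ages_site2)
```

With these illustrative numbers the shared age range is 70 to 83, and 7 of the 16 samples fall outside it and would be dropped before any distribution matching takes place.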
2.1. Improved Distribution Matching via Equivariant Mappings may be possible
Ignoring $Y$ for the moment, recall that matching here corresponds to a bijection between unlabeled (finite) conditional distributions. If the conditional distributions take specific forms such as a Poisson process, it is indeed possible to use simple matching algorithms that only require access to pairwise ranking information on the corresponding empirical distributions [43], for example, the well-known Gale-Shapley algorithm [46]. Unfortunately, in applications that we consider here, such distributional assumptions may not be fully faithful with respect to site specific covariates $C$. In essence, we want representations $\Phi$ (when viewed as a function of $C$) that vary in a predictable (or say deterministic) manner. If so, we can avoid matching altogether and instead match a suitable property of the site-wise distributions of the representation $\Phi(X)$. We can make this criterion more specific. We want the site-wise distributions to vary in a manner where the “trend” is consistent across the sites. Assume that this were not true, say $P(\Phi(X) \mid C)$ is continuous and monotonically increases with $C$ for site1 but monotonically decreases for site2. A match of $P(\Phi(X) \mid C)$ across the sites at a particular value of $C = c$ implies at least one $C = c'$ where $P(\Phi(X) \mid C)$ do not match. The monotonicity argument is weak for high dimensional $\Phi$. Plus, we have multiple nuisance variables. It turns out that our requirement for $P(\Phi(X) \mid C)$ to vary in a predictable manner across sites can be handled using the idea of equivariant mappings, i.e., $P(\Phi(X) \mid C)$ must be equivariant with respect to $C$ for both sites. In addition, we will also seek invariance to scanner attributes.

Figure 3. Visualization of Stage one. First, an image pair $X_i, X_j$ is mapped onto a hypersphere using an encoder $\mathcal{E}$. The resulting pair $\ell_i, \ell_j$ is passed through the $\tau$ network to map them into the space of rotation matrices (which is the quotient group denoted by $G/H$). Fact 3 ensures that $\tau$ is a $G = SO(n)$ equivariant map. $\mathcal{G}(i, j)$ / $\mathcal{G}(i, j)^{-1}$ is the group action of transforming $\tau(\ell_i)$ to $\tau(\ell_j)$ / $\tau(\ell_j)$ to $\tau(\ell_i)$, respectively.
While we are familiar with the well-studied notion of
invariance through the MMD criterion [29], we will briefly
formalize our idea behind an equivariant mapping which is
less common in this setting.
Definition 1. A mapping $f: \mathcal{X} \to \mathcal{Y}$ defined over measurable Borel spaces $\mathcal{X}$ and $\mathcal{Y}$ is said to be $G$-equivariant under the action of group $G$ iff

$$f(g \cdot x) = g \cdot f(x), \quad \forall g \in G$$

We refer the reader to two recent surveys, Section 2.1 of [7] and Section 3.1 of [8], which provide a detailed review.
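A quick numerical check can make Definition 1 concrete. The sketch below is only an illustration under assumptions: the map `to_sphere` (normalization onto the unit sphere) and the helper `random_rotation` are chosen here for convenience and are not part of the paper's model. Normalization commutes with rotations, while a coordinate-wise absolute value does not.

```python
import numpy as np

def random_rotation(n, rng):
    """Draw a matrix in SO(n) from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    q = q * np.sign(np.diag(r))   # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]        # flip one column so that det(q) = +1
    return q

def to_sphere(x):
    """Project a nonzero vector onto the unit sphere S^{n-1}."""
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
g = random_rotation(4, rng)
x = rng.normal(size=4)

# Equivariant map: f(g . x) equals g . f(x) for the rotation action.
is_equivariant = np.allclose(to_sphere(g @ x), g @ to_sphere(x))

# Contrast: a coordinate-wise absolute value generally breaks equivariance.
abs_is_equivariant = np.allclose(np.abs(g @ x), g @ np.abs(x))
```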
Equivariance is often understood in the context of a group action (say, a matrix group) [24,28]. While the covariates $C$ form a vector (and every vector space is an abelian group), since this group will eventually act on the latent space of our images, imposing additional structure will be beneficial. To do so, we will utilize a mapping between $C$'s and a group suitable for our setting. Once this is accomplished, we will derive an equivariant encoder. We discuss these steps next.
3. Methods
The dual goals of (i) equivariance to covariates (such as age) and (ii) invariance to site involve learning multiple mappings. For simplicity, and to keep the computational effort manageable, we divide our method into two stages. Briefly, our stages are (a) Stage one: Equivariance to Covariates. We learn a mapping to a space that provides the essential flexibility to characterize changes in covariate $C$ as a group action. This enables us to construct a space satisfying the equivariance condition as per Def. 1. (b) Stage two: Invariance to Site. We learn a second encoding to a generic vector space by a priori ensuring that the equivariance properties from Stage one are preserved. Such an encoding is then tuned to optimize the MMD criterion, thus generating a latent space that is invariant to site while remaining equivariant to covariates. We describe these stages one by one in the following sections.
3.1. Stage one: Equivariance to Covariates
Given the space of images, $\mathcal{X}$, with the covariates $C$, first, we want to characterize the effect of $C$ on $\mathcal{X}$ as a group action for some group $G$. Here, an element $g \in G$ characterizes the change from covariate $c_i \in C$ to $c_j \in C$ (for short, we will use $i$ and $j$). The change in $C$ corresponds to a translation action which is difficult to instantiate in $\mathcal{X}$ without invoking expensive conditional generative models. Instead, we propose to learn a mapping to a latent space $\mathcal{L}$ such that the change in $C$ can be characterized by a group action pertaining to $G$ in the space $\mathcal{L}$ (the latent space of $\mathcal{X}$). As an example, let us say $X_i$ goes to $X_j$ in $\mathcal{X}$ as $(X_i \to X_j)$. This means that $(X_i \to X_j)$ is caused by the covariate change $(c_i \to c_j)$ in $C$. Let $\mathcal{E}$ be a mapping between the image space $\mathcal{X}$ and the latent space $\mathcal{L}$. In the latent space $\mathcal{L}$, for the moment, we want that $(\mathcal{E}X_i \to \mathcal{E}X_j)$ should correspond to the change in covariate $(c_i \to c_j)$.
Remark 2. We are mostly interested in normalized covariates, for example, in $\ell_p$ norm, while other volume based deterministic normalization functions may also be applicable. In the simplest case of the $p = 2$ norm, the corresponding group action is naturally induced by the matrix group of rotations.
Based on this choice of group, we will learn an autoencoder $(\mathcal{E}, \mathcal{D})$ with an encoder $\mathcal{E}: \mathcal{X} \to \mathcal{L}$ and a decoder $\mathcal{D}: \mathcal{L} \to \mathcal{X}$; here $\mathcal{L}$ is the encoding space. Due to Remark 2, we can choose $\mathcal{L}$ to be a hypersphere, $S^{n-1}$, and $(\mathcal{E}, \mathcal{D})$ as a hyperspherical autoencoder [54]. Then, we can characterize the “action of $C$ on $\mathcal{X}$” as the action of $G$ on $S^{n-1}$. That is to say that a covariate change (translation in $C$) is a change in angles on $\mathcal{L}$. This corresponds to a rotation due to the choice of our group $G$. Note that for $\mathcal{L} = S^{n-1}$, $G$ is the space of $n \times n$ rotation matrices, denoted by $SO(n)$, and the action of $G$ is well-defined. What remains is to encourage the latent space $\mathcal{L}$ to be $G$-equivariant. We start with some group theoretic properties that will be useful.
3.1.1 Review: Group theoretic properties of SO(n)
Let $SO(n) = \{X \in \mathbb{R}^{n \times n} \mid X^T X = I_n, \det(X) = 1\}$ be the group of $n \times n$ special orthogonal matrices. The group $SO(n)$ acts on $S^{n-1}$ with the group action “$\cdot$” given by $g \cdot \ell \mapsto g\ell$, for $g \in SO(n)$ and $\ell \in S^{n-1}$. Here we use $g\ell$ to denote the multiplication of matrix $g$ with $\ell$. Under this group action, we can identify $S^{n-1}$ with the quotient space $G/H$ with $G = SO(n)$ and $H = SO(n-1)$ (see Ch. 3 of [13] for more details). Let $\tau: S^{n-1} \to G/H$ be such an identification, i.e., $\tau(\ell) = gH$ for some $g \in G$. The identification $\tau$ is equivariant to $G$ in the following sense.
Fact 3. Given $\tau: S^{n-1} \to G/H$ as defined above, $\tau$ is equivariant with the action of $G$, i.e., $\tau(g \cdot \ell) = g\tau(\ell)$.
Next, we see that given two points $\ell_i, \ell_j$ on $S^{n-1}$ there is a unique group element in $G$ to move from $\tau(\ell_i)$ to $\tau(\ell_j)$.
Lemma 4. Given two latent space representations $\ell_i, \ell_j \in S^{n-1}$, and the corresponding cosets $g_i H = \tau(\ell_i)$ and $g_j H = \tau(\ell_j)$, $\exists!\, g_{ij} = g_j g_i^{-1} \in G$ such that $\ell_j = g_{ij} \cdot \ell_i$.
Thanks to Fact 3 and Lemma A.1, simply identifying a suitable $\tau$ will provide us the necessary equivariance property. To do so, next, we parameterize $\tau$ by a neural network and describe a loss function to learn such a $\tau$ and $(\mathcal{E}, \mathcal{D})$.
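The identification of $S^{n-1}$ with $G/H$ and the element $g_{ij} = g_j g_i^{-1}$ can be sketched numerically. The code below is an illustration under assumptions, not the learned $\tau$ of the paper: `coset_representative` is a hypothetical helper that picks one rotation sending the reference point $e_1$ to $\ell$ (a Householder reflection composed with a fixed axis flip), which is one of many valid representatives of the coset.

```python
import numpy as np

def coset_representative(ell):
    """Return a rotation g in SO(n), n >= 2, with g @ e1 = ell for a unit vector ell."""
    n = ell.shape[0]
    e1 = np.zeros(n)
    e1[0] = 1.0
    v = e1 - ell
    if np.linalg.norm(v) < 1e-12:
        return np.eye(n)                      # ell is already the reference point
    # Householder reflection that swaps e1 and ell (determinant -1).
    householder = np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)
    # A second reflection that fixes e1 restores determinant +1.
    flip = np.eye(n)
    flip[-1, -1] = -1.0
    return householder @ flip

rng = np.random.default_rng(1)
ell_i = rng.normal(size=5)
ell_i /= np.linalg.norm(ell_i)
ell_j = rng.normal(size=5)
ell_j /= np.linalg.norm(ell_j)

g_i = coset_representative(ell_i)
g_j = coset_representative(ell_j)
g_ij = g_j @ g_i.T          # g_j g_i^{-1}; the inverse of a rotation is its transpose
moved = g_ij @ ell_i        # should coincide with ell_j
```

Different representatives of the same coset differ by an element of $H$, which fixes $e_1$, so the relation $\ell_j = g_{ij} \cdot \ell_i$ holds regardless of which representative is picked in this sketch.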
3.1.2 Learning a G-equivariant τ with DNNs
Now that we have established the key components, (a) an autoencoder $(\mathcal{E}, \mathcal{D})$ to map from $\mathcal{X}$ to the latent space $S^{n-1}$ and (b) a mapping $\tau: S^{n-1} \to SO(n)$ which is $G = SO(n)$-equivariant (see Figure 3), we discuss how to learn such a $(\mathcal{E}, \mathcal{D})$ and a $G$-equivariant $\tau$.
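Let $X_i, X_j \in \mathcal{X}$ be two images with the corresponding covariates $i, j \in C$ with $i \neq j$. Let $\ell_i = \mathcal{E}(X_i)$, $\ell_j = \mathcal{E}(X_j)$. Using Lemma A.1, we can see that a $g_{ij} \in G$ to move from $\ell_i$ to $\ell_j$ does exist and is unique. Now, to learn a $\tau$ that satisfies the equivariance property (Fact 3), we will need $\tau$ to satisfy two conditions, $\tau(g_{ij} \cdot \ell_i) = g_{ij}\tau(\ell_i)$ and $\tau(g_{ji} \cdot \ell_j) = g_{ji}\tau(\ell_j)$, $g \in G$. The two conditions are captured in the following loss function,

$$L_{stage1} = \sum_{\{(X_i, i), (X_j, j)\} \subset \mathcal{X} \times C} \|\mathcal{G}(i, j) \cdot \tau(\ell_i) - \tau(\ell_j)\|^2 + \|\mathcal{G}^{-1}(i, j) \cdot \tau(\ell_j) - \tau(\ell_i)\|^2 \qquad (3)$$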
Here, $\mathcal{G}: C \times C \to G$ is a table lookup given by $(i, j) \mapsto g_{ij}$, i.e., the function that takes two values of the covariate $c$, say $i, j$, corresponding to $X_i, X_j \in \mathcal{X}$ and simply returns the group element (rotation) $g_{ij}$ needed to move from $\mathcal{E}(X_i)$ to $\mathcal{E}(X_j)$. Choice of $\mathcal{G}$: In general, learning $\mathcal{G}$ is difficult since $C$ may not be continuous. In this work, we fix $\mathcal{G}$ and learn $\tau$ by minimizing (3). We simplify the choice of $\mathcal{G}$ as follows: assuming that $C$ is a numerical/ordinal random variable, we define $\mathcal{G}$ by $(i, j) \mapsto \mathrm{expm}((i - j)\mathbf{1}_m)$. Here $m = \binom{n}{2}$ is the dimension of $G$ and expm is the matrix exponential, i.e., $\mathrm{expm}: \mathfrak{so}(n) \to SO(n)$, where $\mathfrak{so}(n)$ is the Lie algebra [21] of $SO(n)$. Since $\mathfrak{so}(n)$ is a vector space, $(i - j)\mathbf{1}_m \in \mathfrak{so}(n)$. To reduce the runtime of expm, we replace expm by a Cayley map [32,42] defined by: $\mathfrak{so}(n) \ni A \mapsto (I - A)(I + A)^{-1} \in SO(n)$. Here we used expm for parameterization (other choices are also suitable).

Algorithm 1 Learning representations that are Equivariant to Covariates and Invariant to Site
Input: Training sets from multiple sites $(X, Y)_{site1}$, $(X, Y)_{site2}$. Nuisance covariates $C$.
Stage one: Equivariance to Covariates
1: Parameterize the encoder-decoder pair $(\mathcal{E}, \mathcal{D})$ and the $\tau$ mapping with neural networks
2: Optimize over $(\mathcal{E}, \mathcal{D})$ and $\tau$ to minimize $L_{stage1} + \sum_i \|X_i - \mathcal{D}(\mathcal{E}(X_i))\|^2$
Output: First latent space mapping $\mathcal{E}$ and a supporting mapping function $\tau$. Here, $\tau$ is $G$-equivariant to the covariates $C$ (see Lemma (A.1) and (3)).
Stage two: Invariance to Site
1: Parameterize encoder $b$, predictor $h$ and decoder $\Psi$ with neural networks
2: Preserve equivariance from stage one with an equivariant mapping $\Phi$ (see Lemma (A.1))
3: Optimize $\Phi$, $b$, $h$ and $\Psi$ to minimize $L_{stage2} + \mathrm{MMD}$
Output: Second latent space mapping $\Phi$. Here, $\Phi$ is equivariant to the covariates and invariant to site.
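The fixed choice of $\mathcal{G}$ described above can be sketched as follows. This is a minimal illustration under assumptions: the vector $(i - j)\mathbf{1}_m$ is turned into a skew-symmetric matrix by filling the strict upper triangle (one possible basis of $\mathfrak{so}(n)$), the `scale` factor stands in for whatever covariate normalization is used, and `expm_skew` computes the exponential through an eigendecomposition rather than a library routine. None of these details are taken from the paper's code.

```python
import numpy as np

def skew_from_vector(v, n):
    """Place v (length n(n-1)/2) in the strict upper triangle and antisymmetrize."""
    a = np.zeros((n, n))
    a[np.triu_indices(n, k=1)] = v
    return a - a.T

def expm_skew(a):
    """Matrix exponential of a real skew-symmetric matrix, using that 1j * a is Hermitian."""
    eigvals, eigvecs = np.linalg.eigh(1j * a)
    return (eigvecs @ np.diag(np.exp(-1j * eigvals)) @ eigvecs.conj().T).real

def cayley(a):
    """Cayley map A -> (I - A)(I + A)^{-1}; I + A is invertible for skew-symmetric A."""
    eye = np.eye(a.shape[0])
    return (eye - a) @ np.linalg.inv(eye + a)

def group_element(i, j, n, scale=0.1, use_cayley=False):
    """Rotation associated with the covariate change from i to j (scale is hypothetical)."""
    m = n * (n - 1) // 2
    a = skew_from_vector(scale * (i - j) * np.ones(m), n)
    return cayley(a) if use_cayley else expm_skew(a)

g_exp = group_element(70, 75, n=4)
g_cay = group_element(70, 75, n=4, use_cayley=True)
```

Both maps return rotation matrices, and swapping the two covariate values yields the inverse rotation, which is the property the two terms of (3) rely on. The two maps are not identical, so results obtained with one parameterization need not transfer exactly to the other.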
Finally, we learn the encoder-decoder $(\mathcal{E}, \mathcal{D})$ by using a reconstruction loss constraint with $L_{stage1}$ in (3). This can also be thought of as a combined loss for this stage, $L_{stage1} + \sum_i \|X_i - \mathcal{D}(\mathcal{E}(X_i))\|^2$, where the second term is the reconstruction loss. The loss balances two terms and requires a scaling factor (see appendix § A.7). A flowchart of all steps in this stage can be seen in Fig. 3.
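To make the structure of the Stage one objective concrete, here is a small sketch that evaluates the pairwise terms of (3) and adds a weighted reconstruction term. It is written under assumptions: $\tau$ outputs are taken to be $n \times n$ rotation matrices, the norm is taken to be the Frobenius norm, $\mathcal{G}$ uses the exponential map with a hypothetical `scale`, and `lam` is a placeholder for the scaling factor mentioned in the text. No networks are trained here; the arrays are synthetic.

```python
import numpy as np

def expm_skew(a):
    """Matrix exponential of a real skew-symmetric matrix via the Hermitian matrix 1j * a."""
    eigvals, eigvecs = np.linalg.eigh(1j * a)
    return (eigvecs @ np.diag(np.exp(-1j * eigvals)) @ eigvecs.conj().T).real

def group_element(ci, cj, n, scale=0.1):
    """Fixed rotation for the covariate change ci -> cj (scale is a hypothetical normalizer)."""
    a = np.zeros((n, n))
    a[np.triu_indices(n, k=1)] = scale * (ci - cj)
    return expm_skew(a - a.T)

def stage_one_loss(tau, covariates, scale=0.1):
    """Sum over unordered pairs of the two equivariance penalties in (3)."""
    n = tau.shape[-1]
    total = 0.0
    for i in range(len(covariates)):
        for j in range(i + 1, len(covariates)):
            g = group_element(covariates[i], covariates[j], n, scale)
            total += np.sum((g @ tau[i] - tau[j]) ** 2)     # move tau_i onto tau_j
            total += np.sum((g.T @ tau[j] - tau[i]) ** 2)   # move tau_j back onto tau_i
    return total

def combined_loss(tau, covariates, x, x_rec, lam=1.0):
    """Equivariance term plus lam times the squared reconstruction error."""
    return stage_one_loss(tau, covariates) + lam * np.sum((x - x_rec) ** 2)

ages = np.array([60.0, 65.0, 72.0])
# tau outputs that follow the prescribed rotations exactly, starting from the identity.
tau_consistent = np.stack([group_element(ages[0], c, 3) for c in ages])
# tau outputs that ignore the covariate entirely.
tau_collapsed = np.stack([np.eye(3) for _ in ages])

loss_consistent = stage_one_loss(tau_consistent, ages)
loss_collapsed = stage_one_loss(tau_collapsed, ages)
```

The penalty vanishes when the $\tau$ outputs move with the covariate exactly as the fixed rotations prescribe, and is positive when they collapse to a single point.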
3.2. Stage two: Invariance to Site
Having constructed a latent space $\mathcal{L}$ that is equivariant to changes in the covariates $C$, we must now handle the site attribute, i.e., invariance with respect to site. Here, it will be convenient to project $\mathcal{L}$ onto a space that simultaneously preserves the equivariant structure from $\mathcal{L}$ and offers the flexibility to enforce site-invariance. The following lemma, inspired by the functional representations of probabilistic symmetries (§4.2 of [7]), provides us strategies to achieve this goal. Here, consider $\Phi: \mathcal{L} \to \mathcal{Z}$ to be the projection.
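Lemma 5. For a $\tau: \mathcal{L} \to G/H$ as defined above, and for any arbitrary mapping $b: \mathcal{L} \to \mathcal{Z}$, the function $\Phi: \mathcal{L} \to \mathcal{Z}$ defined by

$$\Phi(\ell) = \tau(\ell) \cdot b\big(\tau(\ell)^{-1} \cdot \ell\big) \qquad (4)$$

is $G$-equivariant, i.e., $\Phi(g \cdot \ell) = g\Phi(\ell)$.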
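The construction in (4) can be checked numerically in the simplest case. The sketch below assumes $n = 2$, so that $G = SO(2)$ acts on the unit circle and $\tau(\ell)$ can be taken as the planar rotation by the angle of $\ell$; the map `b` is an arbitrary hand-written function standing in for a neural network, and $\mathcal{Z} = \mathbb{R}^2$ with $G$ acting by matrix multiplication. These choices are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def rot(theta):
    """Planar rotation by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def tau(ell):
    """Rotation taking the reference point e1 = (1, 0) to ell on the unit circle."""
    return rot(np.arctan2(ell[1], ell[0]))

def b(v):
    """Arbitrary map with no equivariance of its own (stand-in for a network)."""
    return np.array([np.tanh(3.0 * v[0] + v[1]) + 0.5, v[0] ** 2 - 2.0 * v[1] + 1.0])

def phi(ell):
    """Phi(ell) = tau(ell) . b(tau(ell)^{-1} . ell), as in (4)."""
    t = tau(ell)
    return t @ b(t.T @ ell)

ell = np.array([np.cos(0.7), np.sin(0.7)])
g = rot(1.9)

phi_is_equivariant = np.allclose(phi(g @ ell), g @ phi(ell))
b_is_equivariant = np.allclose(b(g @ ell), g @ b(ell))
```

The check passes for `phi` even though `b` alone fails it, because $\tau(\ell)^{-1} \cdot \ell$ is unchanged when $\ell$ is rotated and the outer factor $\tau(\ell)$ carries the rotation.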
Figure 4. t-SNE plots of latent representations $\tau(\ell)$, for (a) the ADNI dataset and (b) the Adult dataset. For ADNI (left) and Adult (right), an equivariant encoder ensures that the latent features are evenly distributed and bear a monotonic trend with respect to the changes in the age covariate value. The non-equivariant space is generated from the Naïve pooling baseline. Each color denotes a discretized age group. Age was discretized only for the figure to highlight the density of samples in each age group.
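Proof is available in the appendix § A.1. Note that $\Phi$ remains equivariant for any mapping $b$. This provides us the option to parameterize $b$ as a neural network and train the entirety of $\Phi$ for the desired site invariance where equivariance will be preserved due to (9). In this work, we learn such a $\Phi: \mathcal{L} \to \mathcal{Z}$ with the help of a decoder $\Psi: \mathcal{Z} \to \mathcal{L}$ by minimizing the following loss,

$$L_{stage2} = \sum_{X \in \mathcal{X},\, Y \in \mathcal{Y}} \underbrace{\|\ell - \Psi(\Phi(\ell))\|^2}_{\text{Reconstruction loss}} + \underbrace{\|Y - h(\Phi(\ell))\|^2}_{\text{Prediction loss}} \qquad (5)$$

$$\text{subject to } \underbrace{\Phi(\ell) = \tau(\ell) \cdot b\big(\tau(\ell)^{-1} \cdot \ell\big)}_{G\text{-equivariant map}} \qquad (6)$$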
Minimizing the loss (5) with the constraint (6) allows learning the network $b: \mathcal{L} \to \mathcal{Z}$ and the decoder $\Psi: \mathcal{Z} \to \mathcal{L}$. We are now left with asking that $Z \in \mathcal{Z}$ be such that the representations are invariant across the sites. We simply use the following MMD criterion although other statistical distance measures can also be utilized. The criterion is defined using a Reproducing Kernel Hilbert Space with norm $\|\cdot\|_{\mathcal{H}}$ and kernel $K$. We combine (5), (6) and (7) as the objective function to ensure site invariance. Thus, the combined loss function $L_{stage2} + \mathrm{MMD}$ is minimized to learn $(\Phi, \Psi)$. Scaling factor details are available in the appendix § A.7.
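As a point of reference for the MMD criterion referred to as (7), the standard RKHS form of the MMD between the site-wise distributions of $Z$ is written out below. That (7) coincides with this form up to notation is an assumption on our part; the symbols $P_{site1}$ and $P_{site2}$ for the distributions of $Z$ at each site follow the notation of (1).

```latex
\mathrm{MMD}(P_{site1}, P_{site2})
  = \Big\| \mathbb{E}_{Z \sim P_{site1}}\big[K(Z, \cdot)\big]
         - \mathbb{E}_{Z' \sim P_{site2}}\big[K(Z', \cdot)\big] \Big\|_{\mathcal{H}},
\qquad
\mathrm{MMD}^2
  = \mathbb{E}\big[K(Z_1, \tilde{Z}_1)\big]
  + \mathbb{E}\big[K(Z_2, \tilde{Z}_2)\big]
  - 2\,\mathbb{E}\big[K(Z_1, Z_2)\big],
```

where $Z_1, \tilde{Z}_1$ are independent draws from $P_{site1}$ and $Z_2, \tilde{Z}_2$ are independent draws from $P_{site2}$. The second expression follows from expanding the squared norm with the reproducing property and is the quantity typically estimated from mini-batches.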
Summary of the two stages. Our overall method comprises two stages. The first stage, Section 3.1, involves learning the $\tau$ function. The function learned in this stage is $G$-equivariant by the choice of the loss $L_{stage1}$, see (3). Our next stage, Section 3.2, employs the learned $\tau$ function and a trainable mapping $b$ to generate invariant representations. This stage preserves $G$-equivariance due to the $\Phi$ mapping in (9). The loss for the second step is $L_{stage2} + \mathrm{MMD}$, see (5). Our method is summarized in Algorithm 1. Convergence behavior of the proposed optimization (of $\tau$, $\Phi$) still seems challenging to characterize exactly, but recent papers provide some hope, and opportunities. For example, if the networks are linear, then results from [18] may be applicable, which may explain our superior empirical performance.
4. Experiments
We evaluate our proposed encoder for site-invariance and robustness to changes in the covariate values $C$. Evaluations are performed on two multi-site neuroimaging datasets, where algorithmic developments are likely to be most impactful. Prior to neuroimaging datasets, we also conduct experiments on two standard fairness datasets, German and Adult. The inclusion of fairness datasets in our analysis provides us a means for sanity tests and optimization feasibility on an established problem. Here, the goal of achieving fair representations is treated as pooling multiple subsets of data indexed by separate sensitive attributes. We begin our analysis by first describing our measures of evaluation and then reporting baselines for comparisons.
Measures of Evaluation. Recall that our method involves learning $\tau$ as in (3) to satisfy the equivariance property. Moreover, we need to learn $\Phi$ as in (9)–(5) to achieve site invariance. Our measures assess the structure of the latent space $\tau(\ell)$ and $\Phi(\ell)$. The measures are: (a) Eq: This metric evaluates the $\ell_2$ distance between $\tau(\ell_i)$ and $\tau(\ell_j)$ for all pairs $i, j$. Formally, it is computed as

$$\mathrm{Eq} = \sum_{\{(X_i, i), (X_j, j)\} \subset \mathcal{X} \times C} \|\tau(\ell_i) - \tau(\ell_j)\|_2$$

A higher value of this metric indicates that $\tau(\ell_i)$ and $\tau(\ell_j)$ are related by the group action $g_{ij}$. Additionally, we use t-SNE [48] to qualitatively visualize the effect of $\tau$. (b) Adv: This metric quantifies the site-invariance achieved by the encoder $\Phi$. We evaluate if $\Phi(\ell)$ for a learned $\ell \in \mathcal{L}$ has any information about the site. A three-layer fully connected network (see appendix § A.6) is trained as an adversary to predict site from $\Phi(\ell)$, similar to [49]. A lower value of Adv, that is close to random chance, is desirable. (c) M: Here, we compute the MMD measure, as in (7), on the test set. A smaller value of M indicates better invariance to site. Lastly, (d) ACC: This metric notes the test set accuracy in predicting the target variable $Y$.

Figure 5. Statistical Analysis on the reconstructed outputs. The voxels that are significantly associated with Alzheimer's disease ($p < 0.001$) are shown. Adjustments for multiple comparisons were made using Bonferroni correction. A high density of significant voxels indicates that our method preserves disease related signal after pooling across scanners.
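The Eq measure can be sketched in a few lines. The snippet assumes that each $\tau(\ell_i)$ is available as a matrix and that the distance is the Frobenius norm of the difference, summed over unordered pairs; the planar rotations used as inputs are made-up examples rather than outputs of a trained model.

```python
import numpy as np

def eq_metric(tau_outputs):
    """Sum of pairwise distances between tau outputs over all unordered pairs."""
    total = 0.0
    k = len(tau_outputs)
    for i in range(k):
        for j in range(i + 1, k):
            # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
            total += np.linalg.norm(tau_outputs[i] - tau_outputs[j])
    return total

def rot(theta):
    """Planar rotation by angle theta, used as a toy tau output."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

collapsed = [np.eye(2), np.eye(2), np.eye(2)]   # tau ignores the covariate
spread = [rot(0.0), rot(0.5), rot(1.0)]         # tau moves with the covariate

eq_collapsed = eq_metric(collapsed)
eq_spread = eq_metric(spread)
```

A latent space in which different covariate values map to well separated points yields a larger value than one in which they collapse, which matches the qualitative reading of the metric given above.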
Baselines for Comparison. We contrast our method's performance with respect to a few well-known baselines. (i) Naïve: This method indicates a naïve approach of pooling data from multiple sites without any scheme to handle nuisance variables. (ii) MMD [29]: This method minimizes the distribution differences across the sites without any requirements for equivariance to the covariates. The latent representations, being devoid of the equivariance property, result in lower accuracy values as we will see shortly. (iii) CAI [49]: This method introduces a discriminator to train the encoder in a minimax adversarial fashion. The training routine directly optimizes the Adv measure above. While being a powerful implicit data model, adversarial methods are known to have unstable training and lack convergence guarantees [40]. (iv) SS [55]: This method adopts a Sub-sampling (SS) framework to divide the images across the sites by the covariate values $C$. An MMD criterion is minimized individually for each of the sub-sampled groups and an average estimate is computed. Lastly, (v) RM [33]: Also used in [31], RandMatch (RM) learns invariant representations on samples across sites that “match” in terms of the class label (we match based on both $Y$ and $C$ values). Below, we summarize each method and the nuisance attribute correction adopted by them.
Correction    Naïve   MMD [29]   CAI [49]   SS [55]   RM [33]   Ours
Site          ✗       ✓          ✓          ✓         ✓         ✓
Covariates    ✗       ✗          ✗          ✓         ✓         ✓
Table 1. Baselines in the paper and their nuisance attribute correction.
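To make the cost of matching-based baselines concrete, the following sketch pairs samples across two sites only when they share the same label Y and the same covariate bin, and counts what is left unmatched. The 10-year age bins, the record layout, and the function name are assumptions for illustration; this is not the procedure of [33] or [55].

```python
from collections import defaultdict

def match_across_sites(site_a, site_b, bin_width=10):
    """Pair records from two sites that agree on label and age bin.

    Each record is a (label, age) tuple. Returns the matched pairs and the
    records from either site that could not be paired (and would be discarded).
    """
    def key(record):
        label, age = record
        return (label, int(age // bin_width))

    buckets_b = defaultdict(list)
    for rec in site_b:
        buckets_b[key(rec)].append(rec)

    pairs, unmatched = [], []
    for rec in site_a:
        candidates = buckets_b[key(rec)]
        if candidates:
            pairs.append((rec, candidates.pop()))
        else:
            unmatched.append(rec)
    # Whatever remains in site B's buckets had no partner in site A.
    for leftovers in buckets_b.values():
        unmatched.extend(leftovers)
    return pairs, unmatched

# Toy data: site B has no subjects under 70, so younger site-A subjects are lost.
site_a = [("AD", 62), ("CN", 65), ("AD", 74), ("CN", 78)]
site_b = [("AD", 71), ("CN", 76), ("CN", 88)]

pairs, unmatched = match_across_sites(site_a, site_b)
print(len(pairs), "matched pairs;", len(unmatched), "samples discarded")
```

When the covariate supports of the sites overlap only partially, the discarded set grows, which is the sample loss that the Covariates row of Table 1 implicitly trades against.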
We evaluate methods on the test partition provided with the datasets. The mean of the metrics over three random seeds is reported. Hyper-parameter selection is done on a validation split from the training set, such that the prediction accuracy falls within a 5% window relative to the best performing model [10] (more details in appendix § A.2).
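The selection rule can be sketched as follows: keep every candidate whose validation accuracy lies within the 5% window of the best one, then choose among the survivors. The secondary criterion used here (lowest validation M), the relative reading of the window, and all names and numbers are assumptions for this example.

```python
def select_model(candidates, window=0.05):
    """Pick a configuration from those within `window` of the best accuracy.

    `candidates` is a list of dicts with keys 'name', 'acc' and 'mmd'
    measured on the validation split. Among the candidates whose accuracy
    is at least (1 - window) * best accuracy, return the one with the
    lowest MMD (an assumed secondary criterion).
    """
    best_acc = max(c["acc"] for c in candidates)
    admissible = [c for c in candidates if c["acc"] >= (1.0 - window) * best_acc]
    return min(admissible, key=lambda c: c["mmd"])

# Hypothetical validation results for three hyper-parameter settings.
candidates = [
    {"name": "cfg_a", "acc": 0.84, "mmd": 9.8},
    {"name": "cfg_b", "acc": 0.82, "mmd": 3.1},
    {"name": "cfg_c", "acc": 0.70, "mmd": 1.0},  # too inaccurate to be admissible
]
print(select_model(candidates)["name"])  # cfg_b
```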
4.1. Obtaining Fair Representations
We approach the problem of learning fair representations through our multi-site pooling formulation. Specifically, we consider each sensitive attribute value as a separate site. Results on two benchmark datasets, German and Adult [11], are described below.
German Dataset. This dataset poses a classification problem: predicting defaults on consumer loans in the German market. Among the several features in the dataset, the attribute foreigner is chosen as the sensitive attribute. We train our encoder while maintaining equivariance with respect to the continuous-valued age feature. Table 2 provides a summary of the results in comparison to the baselines. Our equivariant encoder maximizes the Eq metric, indicating that the latent space τ(ℓ) is well separated for different values of age. Further, the invariance constraint improves the Adv metric, signifying a better elimination of sensitive attribute information from the representations. The M metric is higher than for the MMD, CAI and SS baselines. The ACC values for all the methods are within a 2% range.
Adult Dataset. In the Adult dataset, the task is to predict whether a person has an income higher (or lower) than $50K per year. The dataset is biased with respect to gender: roughly 1-in-5 women (in contrast to 1-in-3 men) are reported to make over $50K. Thus, the female/male genders are considered as two separate sites, with age as a nuisance covariate. As shown in Table 2, our equivariant encoder improves on the Eq and Adv metrics relative to all the baselines, similar to the German dataset. In addition to the quantitative metrics, we visualize t-SNE plots of the representations τ(ℓ) in Fig. 4 (right). The figure shows that an equivariant encoder imposes a monotonic trend as the Age values are varied.
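As a small illustration of this setup, the sketch below splits records into sites by the sensitive attribute and reports the positive-label rate per site. The record fields and toy values are hypothetical; only the idea of treating each sensitive attribute value as a site, with age kept as the covariate, comes from the text.

```python
from collections import defaultdict

def split_into_sites(records, sensitive_key):
    """Group records so that each value of the sensitive attribute is a site."""
    sites = defaultdict(list)
    for rec in records:
        sites[rec[sensitive_key]].append(rec)
    return dict(sites)

def positive_rate(site_records, label_key="income_over_50k"):
    """Fraction of records in a site with a positive label."""
    return sum(rec[label_key] for rec in site_records) / len(site_records)

# Toy records mimicking the reported imbalance (about 1-in-5 vs 1-in-3).
records = (
    [{"gender": "female", "age": 30 + i, "income_over_50k": int(i % 5 == 0)} for i in range(15)]
    + [{"gender": "male", "age": 30 + i, "income_over_50k": int(i % 3 == 0)} for i in range(15)]
)

sites = split_into_sites(records, "gender")
for name, recs in sites.items():
    ages = [r["age"] for r in recs]  # age stays available as the covariate C
    print(name, round(positive_rate(recs), 2), (min(ages), max(ages)))
```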
4.2. Pooling Brain Images across Scanners
For our motivating application, we focus on pooling tasks for two different brain imaging datasets, where the problem is to classify individuals diagnosed with Alzheimer’s disease (AD) and healthy controls (CN).
Setup. Images are pre-processed by first normalizing and then skull-stripping using Freesurfer [15]. A linear (affine) registration is performed to register each image to MNI template space. Models are trained using 3D convolutions with a ResNet [23] backbone (details in the appendix § A.6). Since the datasets are small, we report results over five random training-validation splits.
Figure 6. Distribution of the age covariate in the ADNI dataset. Two settings are considered: (left) the intersection of the support is large, and (right) the common support is smaller. Despite the mismatch of support across scanner attributes, our approach minimizes the MMD measure (desirable) on the test set relative to the naïve pooling method.
Eq: Equivariance Gap, Adv: Adversarial Test Accuracy, M: Test MMD measure, ACC: Test prediction accuracy
↑: Higher value is preferred, ↓: Lower value is preferred
German
Method      Eq ↑       Adv ↓        M ↓        ACC ↑
Naïve       4.6(0.7)   0.62(0.03)   7.7(0.8)   74(0.9)
MMD [29]    4.5(1.0)   0.66(0.04)   1.5(0.3)   73(1.5)
CAI [49]    1.9(0.6)   0.65(0.01)   1.2(0.2)   76(1.3)
SS [55]     3.8(0.5)   0.70(0.07)   1.5(0.6)   76(0.9)
RM [33]     3.4(0.4)   0.66(0.04)   7.5(0.9)   74(2.1)
Ours        6.4(0.6)   0.54(0.01)   2.7(0.6)   75(3.3)
Adult
Method      Eq ↑       Adv ↓     M ↓        ACC ↑
Naïve       3.4(0.7)   83(0.1)   9.8(0.3)   84(0.1)
MMD [29]    3.4(0.9)   83(0.1)   3.1(0.3)   84(0.1)
CAI [49]    0.1(0.0)   81(0.7)   4.2(2.4)   84(0.04)
SS [55]     2.8(0.5)   83(0.2)   1.5(0.2)   84(0.1)
RM [33]     0.8(0.1)   82(0.4)   4.8(0.7)   84(0.3)
Ours        5.3(0.9)   75(1.4)   7.1(0.6)   83(0.1)
ADNI
Method      Eq ↑       Adv ↓     M ↓       ACC ↑
Naïve       3.1(1.0)   59(2.9)   27(1.6)   80(2.6)
MMD [29]    3.1(1.0)   59(3.3)   27(1.7)   80(2.6)
CAI [49]    2.4(0.7)   61(2.1)   27(1.5)   74(3.6)
SS [55]     3.7(0.5)   57(2.1)   26(1.6)   81(3.7)
RM [33]     0.8(0.9)   52(5.4)   22(0.6)   78(3.8)
Ours        5.1(1.2)   50(4.2)   16(7.2)   77(4.8)
ADCP
Method      Eq ↑       Adv ↓      M ↓        ACC ↑
Naïve       4.1(0.9)   49(8.4)    90(8.7)    83(4.4)
MMD [29]    3.6(1.0)   49(11.9)   86(11.0)   84(6.5)
CAI [49]    2.8(1.6)   56(6.9)    85(12.3)   82(5.1)
SS [55]     3.4(1.3)   51(6.7)    88(14.6)   82(3.5)
RM [33]     0.4(0.5)   40(4.7)    77(13.8)   84(5.3)
Ours        7.5(1.2)   49(7.3)    70(22.3)   81(1.8)
Table 2. Quantitative Results. We show Mean(Std) results over multiple runs. For our baselines, we consider a Naïve encoder-decoder model, learning representations via minimizing the MMD criterion [29], and adversarial training [49], termed CAI. We also compare against Sub-sampling (SS) [55], which minimizes the MMD criterion separately for every age group, and the RandMatch (RM) [33] baseline, which generates matching input pairs based on the Age and target label values. The SS and RM baselines discard a subset of samples if a match across sites is not available. The measure Adv represents the adversarial test accuracy, except for the German dataset where ROC-AUC is used due to the high degree of skew in the data.
ADNI Dataset. The data for this experiment has been downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We have three scanner types in the dataset, namely GE, Siemens and Philips. Similar to the fairness experiments, equivariance is sought relative to the covariate Age. The values of Age are in the range 50-95, as indicated in the density plot of Fig. 6 (left). The Age distribution is observed to vary across different scanners, albeit minimally, in the full dataset. In the t-SNE plot, Fig. 4 (left), we see that the latent space has an equivariant structure. Closer inspection of the plot shows that the representations vary in the same order as Age. Different colors indicate different Age sub-groups. Next, in Fig. 5, we present the t-statistics in the template space on the reconstructed images after pooling. Here, the t-statistics measure the association with the AD/CN target labels. As seen in the figure, a considerable number of voxels are significantly associated with Alzheimer’s disease (p < 0.001). This result supports our goal of combining datasets to increase sample size and obtain high power in statistical analysis. Next, in Fig. 6 (right), we increase the difficulty of our problem by randomly sub-sampling each scanner group such that the intersection of support is minimized. In such an extreme case, our method attains a better M metric relative to the Naïve method, thus justifying its applicability to situations where there is a mismatch of support across the sites. Lastly, we inspect the performance on the quantitative metrics on the entire dataset in Table 2. All of the metrics Eq, Adv and M improve relative to the baselines, with a small drop in ACC.
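The multiple-comparison step behind Fig. 5 can be illustrated with a short sketch of Bonferroni adjustment over voxel-wise p-values. The p-values below are made up, and applying the 0.001 threshold to the adjusted values is an assumption about how the reported threshold was used.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def significant_voxels(p_values, alpha=0.001):
    """Indices of voxels whose Bonferroni-adjusted p-value is below alpha."""
    adjusted = bonferroni_adjust(p_values)
    return [i for i, p in enumerate(adjusted) if p < alpha]

# Hypothetical voxel-wise p-values from a test of association with AD/CN.
p_values = [1e-6, 4e-4, 2e-5, 0.03, 0.5]

print(bonferroni_adjust(p_values))
print(significant_voxels(p_values))  # voxels 0 and 2 survive the correction
```

With hundreds of thousands of voxels the per-voxel threshold becomes very strict, which is why a dense map of surviving voxels is read as evidence that the disease signal is retained after pooling.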
ADCP Dataset. This experiment’s data was collected as part of the NIH-sponsored Alzheimer’s Disease Connectome Project (ADCP) [1,25]. It is a two-center MRI, PET, and behavioral study of brain connectivity in AD. Study inclusion criteria for AD / MCI (Mild Cognitive Impairment) patients consisted of age 55-90 years, retaining decisional capacity at the initial visit, and meeting criteria for probable AD or MCI. MRI images were acquired at three sites. The three sites differ primarily in terms of the patient demographics. We report the quantitative results of this experiment in Tab. 2 and place the qualitative results in the appendix § A.4, A.5. The table shows considerable improvements in the Eq and M metrics relative to the Naïve method, while Adv is on par.
Limitations. Currently, our formulation assumes that the to-be-pooled imaging datasets are roughly similar; there is definitely a role for new developments in domain alignment to facilitate deployment in a broader range of applications. Secondly, larger latent space dimensions may cause compute overhead due to the matrix exponential parameterization. Finally, algorithmic improvements can potentially reduce the overhead of the two-stage training.
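To give a sense of where this overhead comes from, the sketch below maps a skew-symmetric matrix to a rotation via a truncated Taylor series of the matrix exponential. Treating the parameterization as the exponential of a skew-symmetric generator, the truncation length, and all function names are assumptions for this example; each matrix product costs O(n^3), which is what grows with the latent dimension n.

```python
def matmul(a, b):
    """Dense matrix product; O(n^3) for n x n inputs."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def expm_taylor(a, terms=20):
    """Approximate exp(A) by the first `terms` terms of its Taylor series."""
    n = len(a)
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = matmul(term, a)
        term = [[x / k for x in row] for row in term]  # now A^k / k!
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def transpose(a):
    return [list(col) for col in zip(*a)]

# A 3 x 3 skew-symmetric generator (A^T = -A); exp(A) should be orthogonal.
A = [[0.0, -0.3, 0.2],
     [0.3, 0.0, -0.1],
     [-0.2, 0.1, 0.0]]
R = expm_taylor(A)
RtR = matmul(transpose(R), R)
print([[round(x, 6) for x in row] for row in RtR])  # close to the identity
```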
5. Conclusions
Retrospective analysis of data pooled from previous / ongoing studies can have a sizable influence on identifying early disease processes, not otherwise possible to glean from analysis of small neuroimaging datasets. Our development based on recent results in equivariant representation learning offers a strategy to perform such analysis when covariates/nuisance attributes are not identically distributed across sites. Our current work is limited to a few such variables but suggests that this direction is promising and can potentially lead to more powerful algorithms.
Acknowledgments. The authors are grateful to Vibhav Vineet (Microsoft Research) for discussions on the causal diagram used in the paper. Thanks to Amit Sharma (Microsoft Research) for the conversation on their MatchDG project. Special thanks to Veena Nair and Vivek Prabhakaran from UW Health for helping with the ADCP dataset. Research supported by NIH grants to UW CPCP (U54AI117924), RF1AG059312, Alzheimer’s Disease Connectome Project (ADCP) U01 AG051216, and RF1AG059869, as well as NSF award CCF 1918211. Sathya Ravi was also supported by UIC-ICR start-up funds.
References
[1] Nagesh Adluru, Veena A Nair, Vivek Prabhakaran, Shi-Jiang Li, Andrew L Alexander, and Barbara B Bendlin. Geodesic path differences in neural networks in the alzheimer’s disease connectome project: Developing topics. Alzheimer’s & Dementia, 16:e047284, 2020. 8
[2] Paul S Aisen, Jeffrey Cummings, Clifford R Jack, John C
Morris, Reisa Sperling, Lutz Frölich, Roy W Jones, Sherie A
Dowsett, Brandy R Matthews, Joel Raskin, et al. On the path
to 2025: understanding the alzheimer’s disease continuum.
Alzheimer’s research & therapy, 9(1):1–10, 2017. 11,12
[3] Aditya Kumar Akash, Vishnu Suresh Lokhande, Sathya N
Ravi, and Vikas Singh. Learning invariant represen-
tations using inverse contrastive loss. arXiv preprint
arXiv:2102.08343, 2021. 2
[4] Jesper LR Andersson, Mark Jenkinson, Stephen Smith, et al.
Non-linear registration, aka spatial normalisation fmrib tech-
nical report tr07ja2. FMRIB Analysis Group of the University
of Oxford, 2(1):e21, 2007. 12
[5] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David
Lopez-Paz. Invariant risk minimization. arXiv preprint
arXiv:1907.02893, 2019. 2
[6] Elias Bareinboim and Judea Pearl. Causal inference and the
data-fusion problem. Proceedings of the National Academy
of Sciences, 113(27):7345–7352, 2016. 2
[7] Benjamin Bloem-Reddy and Yee Whye Teh. Probabilistic
symmetries and invariant neural networks. Journal of Ma-
chine Learning Research, 21(90):1–61, 2020. 4,5
[8] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar
Veličković. Geometric deep learning: Grids, groups, graphs,
geodesics, and gauges. arXiv preprint arXiv:2104.13478,
2021. 4
[9] Daniel C Castro, Ian Walker, and Ben Glocker. Causal-
ity matters in medical imaging. Nature Communications,
11(1):1–10, 2020. 3
[10] Michele Donini, Luca Oneto, Shai Ben-David, John S
Shawe-Taylor, and Massimiliano Pontil. Empirical risk min-
imization under fairness constraints. In S. Bengio, H. Wal-
lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R.
Garnett, editors, Advances in Neural Information Processing
Systems, volume 31. Curran Associates, Inc., 2018. 7
[11] Dheeru Dua, Casey Graff, et al. Uci machine learning repos-
itory. 2017. 7
[12] Abhimanyu Dubey, Vignesh Ramanathan, Alex Pentland,
and Dhruv Mahajan. Adaptive methods for real-world do-
main generalization. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
14340–14349, 2021. 2
[13] David S. Dummit and Richard M. Foote. Abstract algebra.
Wiley, 3rd ed edition, 2004. 5
[14] Simon F Eskildsen, Pierrick Coupé, Vladimir S Fonov,
Jens C Pruessner, D Louis Collins, Alzheimer’s Dis-
ease Neuroimaging Initiative, et al. Structural imaging
biomarkers of alzheimer’s disease: predicting disease pro-
gression. Neurobiology of aging, 36:S23–S31, 2015. 2
[15] Bruce Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.
[16] Jean-Philippe Fortin, Drew Parker, Birkan Tunç, Takanori
Watanabe, Mark A Elliott, Kosha Ruparel, David R Roalf,
Theodore D Satterthwaite, Ruben C Gur, Raquel E Gur, et al.
Harmonization of multi-site diffusion tensor imaging data.
Neuroimage, 161:149–170, 2017. 1
[17] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario
Marchand, and Victor Lempitsky. Domain-adversarial train-
ing of neural networks. The journal of machine learning
research, 17(1):2096–2030, 2016. 2
[18] Avishek Ghosh and Ramchandran Kannan. Alternating min-
imization converges super-linearly for mixed linear regres-
sion. In International Conference on Artificial Intelligence
and Statistics, pages 1093–1103. PMLR, 2020. 6
[19] Matthew F Glasser, Stamatios N Sotiropoulos, J Anthony
Wilson, Timothy S Coalson, Bruce Fischl, Jesper L Anders-
son, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R
Polimeni, et al. The minimal preprocessing pipelines for the
human connectome project. Neuroimage, 80:105–124, 2013.
[20] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. Advances in neural information processing
systems, 19:513–520, 2006. 3
[21] Marshall Hall. The theory of groups. Courier Dover Publi-
cations, 2018. 5
[22] Moritz Hardt and Benjamin Recht. Patterns, predictions,
and actions: A story about machine learning.https:
//, 2021. 3
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Identity mappings in deep residual networks. In European
conference on computer vision, pages 630–645. Springer,
2016. 7
[24] Mark Hunacek. Lie groups, lie algebras, and representa-
tions: An elementary introduction, by brian hall. pp. 351.£
50. 2003. isbn 0 387 401229 (springer-verlag). The Mathe-
matical Gazette, 89(514):149–151, 2005. 4
[25] Gyujoon Hwang, Cole John Cook, Veena A Nair, Andrew L
Alexander, Piero G Antuono, Sanjay Asthana, Rasmus Birn,
Cynthia M Carlsson, Guangyu Chen, Dorothy Farrar Ed-
wards, et al. Ic-p-161: Characterizing structural brain alter-
ations in alzheimer’s disease patients with machine learning.
Alzheimer’s & Dementia, 14(7S Part 2):P135–P136, 2018.
[26] Clifford R Jack Jr, Matt A Bernstein, Nick C Fox,
Paul Thompson, Gene Alexander, Danielle Harvey, Bret
Borowski, Paula J Britson, Jennifer L. Whitwell, Chadwick
Ward, et al. The alzheimer’s disease neuroimaging initia-
tive (adni): Mri methods. Journal of Magnetic Resonance
Imaging: An Official Journal of the International Society for
Magnetic Resonance in Medicine, 27(4):685–691, 2008. 1,
[27] Mark Jenkinson, Christian F Beckmann, Timothy EJ
Behrens, Mark W Woolrich, and Stephen M Smith. Fsl. Neu-
roimage, 62(2):782–790, 2012. 12
[28] Anthony W Knapp and AW Knapp. Lie groups beyond an
introduction, volume 140. Springer, 1996. 4
[29] Yujia Li, Kevin Swersky, and Richard Zemel. Learning un-
biased features. arXiv preprint arXiv:1412.5244, 2014. 4,7,
[30] Jingqin Luo, Folasade Agboola, Elizabeth Grant, Colin L
Masters, Marilyn S Albert, Sterling C Johnson, Eric M McDade, Jonathan Vöglein, Anne M Fagan, Tammie Benzinger,
et al. Sequence of alzheimer disease biomarker changes in
cognitively normal adults: A cross-sectional study. Neurol-
ogy, 95(23):e3104–e3116, 2020. 1
[31] Divyat Mahajan, Shruti Tople, and Amit Sharma. Domain
generalization using causal matching. In International Con-
ference on Machine Learning, pages 7313–7324. PMLR,
2021. 2,3,7
[32] Ronak Mehta, Rudrasis Chakraborty, Yunyang Xiong, and
Vikas Singh. Scaling recurrent models via orthogonal
approximations in tensor trains. In Proceedings of the
IEEE/CVF International Conference on Computer Vision,
pages 10571–10579, 2019. 5
[33] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gi-
anfranco Doretto. Unified deep supervised domain adapta-
tion and generalization. In Proceedings of the IEEE inter-
national conference on computer vision, pages 5715–5725,
2017. 7,8
[34] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Gal-
styan, and Greg Ver Steeg. Invariant representations with-
out adversarial training. In S. Bengio, H. Wallach, H.
Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,
editors, Advances in Neural Information Processing Systems,
volume 31. Curran Associates, Inc., 2018. 2,11
[35] Prashant Pandey, Mrigank Raman, Sumanth Varambally,
and Prathosh AP. Domain generalization via inference-
time label-preserving target projections. arXiv preprint
arXiv:2103.01134, 2021. 2
[36] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell.
Causal inference in statistics: A primer. John Wiley & Sons,
2016. 3
[37] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017. 2
[38] Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An
online method for distributionally deep robust optimization,
2020. 3
[39] Paul R Rosenbaum and Donald B Rubin. Constructing a con-
trol group using multivariate matched sampling methods that
incorporate the propensity score. The American Statistician,
39(1):33–38, 1985. 2
[40] Florian Schaefer and Anima Anandkumar. Competitive gra-
dient descent. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019. 7
[41] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer,
Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and
Yoshua Bengio. Toward causal representation learning. Pro-
ceedings of the IEEE, 109(5):612–634, 2021. 2
[42] Jon M Selig. Cayley maps for se (3). In 12th International
Federation for the Promotion of Mechanism and Machine
Science World Congress, page 6. London South Bank Uni-
versity, 2007. 5
[43] Nihar B Shah and Martin J Wainwright. Simple, robust and
optimal ranking from pairwise comparisons. The Journal of
Machine Learning Research, 18(1):7246–7283, 2017. 3
[44] Anja Soldan, Corinne Pettigrew, Anne M Fagan, Suzanne E
Schindler, Abhay Moghekar, Christopher Fowler, Qiao-Xin
Li, Steven J Collins, Cynthia Carlsson, Sanjay Asthana, et al.
Atn profiles among cognitively normal individuals and lon-
gitudinal cognitive outcomes. Neurology, 92(14):e1567–
e1579, 2019. 1
[45] Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. Pre-
venting failures due to dataset shift: Learning predictive
models that transport. In The 22nd International Conference
on Artificial Intelligence and Statistics, pages 3118–3127.
PMLR, 2019. 2
[46] Chung-Piaw Teo, Jay Sethuraman, and Wee-Peng Tan. Gale-
shapley stable marriage problem revisited: Strategic issues
and applications. Management Science, 47(9):1252–1267,
2001. 3
[47] Paul M Thompson, Jason L Stein, Sarah E Medland, Der-
rek P Hibar, Alejandro Arias Vasquez, Miguel E Renteria,
Roberto Toro, Neda Jahanshad, Gunter Schumann, Barbara
Franke, et al. The enigma consortium: large-scale collab-
orative analyses of neuroimaging and genetic data. Brain
imaging and behavior, 8(2):153–182, 2014. 1
[48] Laurens Van der Maaten and Geoffrey Hinton. Visualiz-
ing data using t-sne. Journal of machine learning research,
9(11), 2008. 6
[49] Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Gra-
ham Neubig. Controllable invariance through adversarial
feature learning. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed-
itors, Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc., 2017. 6,7,8,11
[50] Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Her-
ranz, and Shangling Jui. Generalized source-free domain
adaptation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 8978–8987, 2021. 2
[51] Zhuoran Yang, Yufeng Zhang, Yongxin Chen, and Zhaoran Wang. Variational transport: A convergent particle-based algorithm for distributional optimization. arXiv
preprint arXiv:2012.11554, 2020. 3
[52] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cyn-
thia Dwork. Learning fair representations. In International
conference on machine learning, pages 325–333. PMLR,
2013. 2
[53] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell.
Mitigating unwanted biases with adversarial learning. In
Proceedings of the 2018 AAAI/ACM Conference on AI,
Ethics, and Society, pages 335–340, 2018. 2
[54] Deli Zhao, Jiapeng Zhu, and Bo Zhang. Latent variables on
spheres for autoencoders in high dimensions. arXiv preprint
arXiv:1912.10233, 2019. 4
[55] Hao Henry Zhou, Vikas Singh, Sterling C Johnson, Grace
Wahba, Alzheimer’s Disease Neuroimaging Initiative, et al.
Statistical tests and identifiability conditions for pooling and
analyzing multisite datasets. Proceedings of the National
Academy of Sciences, 115(7):1481–1486, 2018. 2,3,7,8
[56] Hao Henry Zhou, Yilin Zhang, Vamsi K. Ithapu, Sterling C.
Johnson, Grace Wahba, and Vikas Singh. When can multi-site datasets be pooled for regression? Hypothesis tests, ℓ2-consistency and neuroscience applications. In Doina Precup
and Yee Whye Teh, editors, Proceedings of the 34th Interna-
tional Conference on Machine Learning, volume 70 of Pro-
ceedings of Machine Learning Research, pages 4170–4179.
PMLR, 06–11 Aug 2017. 1
A. Appendix
A.1. Proofs of theoretical results
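In this section, we provide the proofs of Lemma 4 and Lemma 5 discussed in the main paper.
Lemma. Given two latent space representations $\ell_i, \ell_j \in S^{n-1}$, and the corresponding cosets $g_i H = \tau(\ell_i)$ and $g_j H = \tau(\ell_j)$, $\exists!\, g_{ij} = g_j g_i^{-1} \in G$ such that $\ell_j = g_{ij} \cdot \ell_i$.
Proof. Given $g_i H = \tau(\ell_i)$ and $g_j H = \tau(\ell_j)$, we use $g_{ij} = g_j g_i^{-1} \in G$ such that $g_j H = g_{ij}\, g_i H$. Now using the equivariance fact (3), we get
$$
\begin{aligned}
& g_j H = g_{ij}\, g_i H \\
\implies\; & \tau(\ell_j) = g_{ij}\, \tau(\ell_i) \\
\implies\; & \tau(\ell_j) = \tau(g_{ij} \cdot \ell_i)
\end{aligned}
$$
Now as $\tau$ is an identification, i.e., a diffeomorphism, we get $\ell_j = g_{ij} \cdot \ell_i$. Note that $S^{n-1}$ is a Riemannian homogeneous space and the group $G$ acts transitively on $S^{n-1}$, i.e., given $x, y \in S^{n-1}$, $\exists\, g \in G$ such that $y = g \cdot x$. Hence from $\ell_j = g_{ij} \cdot \ell_i$ and the transitivity property we can conclude that $g_{ij}$ is unique.
Lemma. For a $\tau : \mathcal{L} \to G/H$ as defined above, and a mapping $b : \mathcal{L} \to \mathcal{Z}$, the function $\Phi : \mathcal{L} \to \mathcal{Z}$ defined by
$$
\Phi(\ell) = \tau(\ell) \cdot b\left(\tau(\ell)^{-1} \cdot \ell\right) \tag{9}
$$
is $G$-equivariant, i.e., $\Phi(g \cdot \ell) = g\, \Phi(\ell)$.
Proof. Let $\ell \in \mathcal{L}$. Consider the $\Phi$ mapping of $g \cdot \ell$, that is
$$
\Phi(g \cdot \ell) = \tau(g \cdot \ell) \cdot b\left(\tau(g \cdot \ell)^{-1} \cdot (g \cdot \ell)\right).
$$
Using the fact (3) from the main paper, we have $\tau(g \cdot \ell) = g\, \tau(\ell)$ and $\tau(g \cdot \ell)^{-1} = \tau(\ell)^{-1} g^{-1}$. Substituting these in $\Phi(g \cdot \ell)$, we get
$$
\begin{aligned}
\Phi(g \cdot \ell) &= g\, \tau(\ell) \cdot b\left(\tau(\ell)^{-1} g^{-1} g \cdot \ell\right) \\
&= g\, \tau(\ell) \cdot b\left(\tau(\ell)^{-1} \cdot \ell\right)
\end{aligned}
$$
Thus, $\Phi(g \cdot \ell) = g\, \Phi(\ell)$.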
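Both identities can be checked numerically in the simplest possible setting. The sketch below works on the circle S^1, identified with unit complex numbers, with G = SO(2) acting by multiplication and a trivial stabilizer H, so that τ maps a point directly to a group element. This special case, the particular map b, and all function names are assumptions chosen for illustration; the lemmas themselves concern the general homogeneous-space setting.

```python
import cmath

def tau(ell):
    """Rotation taking the base point 1 to `ell` (S^1 identified with SO(2))."""
    return ell  # a unit complex number acts on S^1 by multiplication

def b(point):
    """An arbitrary map applied to the canonical representative (hypothetical choice)."""
    return point ** 2 + 0.5j

def phi(ell):
    """Phi(l) = tau(l) . b(tau(l)^{-1} . l), following the form of (9)."""
    g = tau(ell)
    return g * b(ell / g)  # dividing by a unit complex number applies g^{-1}

def unit(angle):
    """Point on the unit circle at the given angle."""
    return cmath.exp(1j * angle)

ell_i, ell_j = unit(0.4), unit(2.1)

# First lemma in this toy setting: g_ij = g_j g_i^{-1} maps l_i to l_j.
g_ij = tau(ell_j) / tau(ell_i)
assert abs(g_ij * ell_i - ell_j) < 1e-12

# Second lemma in this toy setting: Phi(g . l) = g . Phi(l) for a rotation g.
g = unit(1.3)
assert abs(phi(g * ell_i) - g * phi(ell_i)) < 1e-12
print("both identities hold on S^1")
```

The equivariance check passes for any choice of b, because b is only ever evaluated at the canonical representative τ(ℓ)^{-1}·ℓ, which does not change when ℓ is moved by g.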
A.2. Details on Evaluation Metrics
Recall from Section 4 of the paper our discussion of three metrics: Eq, Adv and M. While Eq and M are variants of distance measures on the latent space, Adv assesses the ability to predict the nuisance attributes from the latent representation (and is therefore probabilistic in nature). Observe that Eq and M are (Euclidean) distance measures and could be very different depending on the normalization of the vectors. For our purposes of evaluating these latent vectors/features in downstream tasks, we perform a simple feature normalization in order to obtain 0-1 latent vectors, given by equation (10).
Our feature normalization is composed of two steps: (i) centering: the numerator in (10) ensures that the mean of $z$ (along its coordinates) is 0; and (ii) scaling: the denominator projects the features $z$ on the sphere at the origin with radius $\max_i(z_i) - \min_i(z_i) \geq 0$. Note that our scaling step can be thought of as the usual projection in a special case: when $z_i$ is guaranteed to be nonnegative (for example, when the $z_i$ represent activations), this radius simply corresponds to a lower bound of the usual infinity norm $\|z\|_\infty$ (hence projection on a scaled $\ell_\infty$ ball). We adopt this normalization only to compute the Eq and M measures, and not for model training.
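One reading of this two-step normalization, written out as a sketch: subtract the coordinate-wise mean of z, then divide by max(z_i) - min(z_i). The function name and the handling of a constant vector are assumptions; equation (10) in the paper is the authoritative form.

```python
def normalize_features(z):
    """Center z to zero mean, then scale by its range max(z) - min(z)."""
    mean = sum(z) / len(z)
    centered = [v - mean for v in z]
    radius = max(z) - min(z)
    if radius == 0:
        # A constant vector has no spread; return the centered (all-zero) vector.
        return centered
    return [v / radius for v in centered]

z = [2.0, 4.0, 6.0, 8.0]
z_norm = normalize_features(z)
print(z_norm)                     # coordinates now sum to zero
print(max(z_norm) - min(z_norm))  # the normalized coordinates span a unit range
```

Because the range is unchanged by centering, the normalized vector always spans an interval of length one, which makes distances such as Eq and M comparable across encoders with different feature scales.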
For computing the Adv measure, we follow [49] and train an adversarial neural network that predicts the nuisance attributes. We use a three-layer fully connected network with batch normalization and train for 150 epochs. [34] uses a similar architecture for the adversaries, with the number of hidden layers varying over 0, 1, 2, 3. We found that a three-layer adversary is powerful enough to predict the nuisance attributes and hence we use it to report the Adv measure.
A.3. Understanding ADNI dataset
Dataset. The data was downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. ADNI was set up with the objective of measuring the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD) using serial magnetic resonance imaging (MRI), positron emission tomography (PET), and other biological markers. We have three imaging protocol (scanner) types in the dataset, namely GE, Siemens and Philips. The counts of AD/CN samples for each of these imaging protocols are provided in Table 3. An example illustration (borrowed from [2]) of the effect of using different scanners on the images is shown in Figure 8.
Preprocessing. All images were first normalized and skull-stripped using Freesurfer [15]. A linear (affine) registration was performed to register each image to MNI template space.
(a) Variation due to scanner for a particular age group. (b) Variation due to covariates (age) in scanner 3.
Figure 7. Sample images from the ADCP dataset. (a) MRI images of control subjects from the ADCP dataset for different sites in the age group 70-80. (b) Images obtained from Site 3 for three extreme age groups. The Gantt chart on top of the image indicates the respective age range in the other sites.
(a) GE (b) Siemens
Figure 8. Scanner effects on images. Two imaging protocols are shown: (a) Siemens, (b) GE. The yellow region is the cortical ribbon segmentation, and the green circle shows that the imaging protocols from different manufacturers have an effect on the scan. Image borrowed from [2].
A.4. Understanding ADCP dataset
Participants. The data for ADCP was collected through the NIH-sponsored Alzheimer’s Disease Connectome Project (ADCP), U01 AG051216. The study inclusion criteria for AD (Alzheimer’s disease) / MCI (Mild Cognitive Impairment) patients were: age between 55 and 90 years, willing and able to undergo all procedures, retains decisional capacity at the initial visit, and meets criteria for probable AD or for MCI.
Scanners. MRI images were acquired at three distinct sites on GE scanners. T1-weighted structural images were acquired using a 3D gradient-echo pulse sequence (repetition time (TR) = 604 ms, echo time (TE) = 2.516 ms, inversion time = 1060 ms, flip angle = 8°, field of view (FOV) = 25.6 cm, 0.8 mm isotropic). T2-weighted structural images were acquired using a 3D fast spin-echo sequence (TR = 2500 ms, TE = 94.398 ms, flip angle = 90°, FOV = 25.6 cm, 0.8 mm isotropic).
Table 3. Sample counts for ADNI dataset
Imaging Protocol                        AD   CN
Manufacturer=GE Medical Systems         44   78
Manufacturer=Philips Medical Systems    32   50
Manufacturer=Siemens                    83   162
Preprocessing. The Human Connectome Project (HCP) minimal preprocessing pipeline version 3.4.0 [19] was followed for data processing. This pipeline is based on the FMRIB Software Library [27]. Next, the T1w and T2w images are aligned, a B1 (bias field) correction is performed, and the subject’s image in native structural volume space is registered to MNI space using FSL’s FNIRT [4]. Only T1w
images in the MNI space were used for further analysis and
Data Statistics. We plot the distributions of several attributes in this dataset conditioned on the site. In Figure 10, we show that the values of age and cognitive scores differ across the three sites in this dataset. Cognitive scores are computed based on a test assigned to the patients. Higher scores indicate higher cognitive functioning in the patient. Table 4 shows the sample counts for the target variable of prediction, AD (Alzheimer’s disease) versus Control group.
A.5. Visualizing the latent space
In Figure 4 of the paper, we have seen the latent space τ(ℓ) for the samples in the ADNI and the Adult datasets. Here, we show similar qualitative results for the German and the ADCP datasets in Figure 9 of the supplement. In the plots, the latent representations for a non-equivariant encoder are spread throughout the latent space. In contrast, the representations of an equivariant encoder, for a discretized value of Age, are localized to specific regions. Further, these representations have a monotonic behaviour with respect to the values of Age.
Table 4. Sample counts for ADCP dataset
          AD   Control   Female   Male
site 1    10   39        29       20
site 2    10   33        30       13
site 3    5    19        14       10
(a) ADCP Dataset (b) German Dataset
Figure 9. t-SNE plots of latent representations τ(ℓ). For both ADCP (left) and German (right), the latent vectors of the equivariant encoder are evenly distributed with respect to the age covariate value. The non-equivariant space is generated from the naïve pooling model. Different colors denote the discretized set of age covariate values present in the data.
(a) Age (b) Cognitive Score
Figure 10. Distribution of attributes in the ADCP dataset. On the left, we observe the distribution of age for the three different sites present in the ADCP dataset. On the right, we see the distribution of the cognitive scores. The cognitive scores are computed based on a test that assesses executive function. Higher scores indicate a higher level of cognitive flexibility. Both age and cognitive scores are observed to vary across the sites.
Listing 1. Residual Block
Swish
Swish
Listing 2. Fully Connected Block
Linear
Swish
Linear
A.6. Hyper-parameters and NN Architectures
For tabular datasets such as German and Adult, our encoders and decoders consist of fully connected networks with a hidden layer of 64 nodes. The dimension of the quotient latent space τ(ℓ_i) is 30. Adam is used as the default optimizer and the learning rate is adjusted based on the validation set.
Imaging datasets like ADNI and ADCP require 3D convolutions and a ResNet architecture as the backbone. The last layer is used to describe the quotient space τ(ℓ_i). The residual and the fully connected blocks are given in Listings 1 and 2. Detailed architectures can be viewed in the code.
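A minimal sketch of a fully connected block in the Linear, Swish, Linear order of Listing 2, using the sizes quoted for tabular data (a hidden layer of 64 nodes and a 30-dimensional output). The input width, the random initialization, and the plain-Python implementation are assumptions for illustration; the actual architecture is the one in the authors' code.

```python
import math
import random

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def make_linear(in_dim, out_dim, rng):
    """Randomly initialised weights and zero biases for a linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

def linear(x, layer):
    """Affine map W x + b for a single input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def fully_connected_block(x, layer1, layer2):
    """Linear -> Swish -> Linear."""
    hidden = [swish(h) for h in linear(x, layer1)]
    return linear(hidden, layer2)

rng = random.Random(0)
in_dim, hidden_dim, latent_dim = 20, 64, 30  # in_dim is an assumed feature count
layer1 = make_linear(in_dim, hidden_dim, rng)
layer2 = make_linear(hidden_dim, latent_dim, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(in_dim)]
out = fully_connected_block(x, layer1, layer2)
print(len(out))  # 30
```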
A.7. Scaling factors
Recall from Algorithm 1 of the main paper that our loss function for each stage consists of reconstruction and prediction losses in addition to the objectives concerning equivariance and invariance. These multi-objective loss functions require scaling factors that upweight one objective over the other. These scaling factors act as hyper-parameters for the algorithm. In our experiments, we observed that the results were robust to a range of scaling factor choices. For the results reported in Table 2 of the paper, they were identified through cross-validation. Here we provide an example of the scaling factors used for the Adult dataset; please refer to the bash scripts available in the code for the scaling factors of other datasets.
Stage one: Equivariance to Covariates
Equivariance Loss: $L_{\text{stage1}}$
Scaling Factor: 1.0
Reconstruction Loss: $\sum_i \|X_i - D(E(X_i))\|$
Scaling Factor: 0.02
Stage two: Invariance to Site
Invariance Loss: MMD
Scaling Factor: 0.1
Prediction Loss: $\|Y - h(\Phi(\ell))\|^2$
Scaling Factor: 1.0
Reconstruction Loss: $\|\ell - \Psi(\Phi(\ell))\|^2$
Scaling Factor: 0.1
We refer the reader to Algorithm 1 and Section 3 of the main paper for details on the notation used above.
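The way such factors enter training can be sketched as a weighted sum per stage. The numbers are the Adult-dataset scaling factors quoted in this section; the dictionary layout, the function name, and the example loss values are assumptions for illustration.

```python
# Scaling factors for the Adult dataset, as quoted in this section.
SCALING = {
    "stage1": {"equivariance": 1.0, "reconstruction": 0.02},
    "stage2": {"invariance_mmd": 0.1, "prediction": 1.0, "reconstruction": 0.1},
}

def total_loss(stage, losses):
    """Weighted sum of the individual loss terms for one training stage."""
    factors = SCALING[stage]
    missing = set(factors) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms for {stage}: {sorted(missing)}")
    return sum(factors[name] * losses[name] for name in factors)

# Hypothetical per-batch loss values.
stage1 = total_loss("stage1", {"equivariance": 0.5, "reconstruction": 10.0})
stage2 = total_loss("stage2", {"invariance_mmd": 2.0, "prediction": 0.3, "reconstruction": 1.0})
print(stage1, stage2)
```

Keeping the factors in one place makes it easy to swap in the values for another dataset without touching the training loop.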
Neural networks derived from diffusion‐weighted MRI (DW‐MRI) may shed light on disease progression and pathology propagation in Alzheimer’s disease in vivo. In this work analysis of the path properties of such networks are presented. Geodesic or shortest paths are fundamental in understanding key network phenomenon such as the propagation rates of information, infection or pathology. For example, the ubiquitous small‐worldness property of natural occurring biological and social networks is based on having short paths between any pair of entities in the network. Connectome protocol based DW‐MRI data acquired from n=68 participants (Table 1) were analyzed. Neural networks were extracted from the data using state‐of‐the‐art image processing and tractography algorithms available in MRtrix3 (overview panel in Figure‐1). The average geodesic path lengths between frontal, temporal, parietal, occipital, subcortical regions were computed using Dijkstra algorithm. The regions were identified based on the IIT‐Desikan gray matter atlas. The paths can be used to reason about the average efficiency of communication of electrical signals or propagation of pathology between brain lobes. Statistical analysis was performed to test the geodesic path length differences between the CU, MCI and the AD groups controlling for age and sex. The path lengths were normalized so that they were at unity for the AD group, and differences were considered significant at p<=0.05. Statistical effects of consensus diagnosis on the relative geodesic path length (RGPL) differences between lobes are shown in Figure‐2. 80% of the connections showed statistically significant higher path lengths in AD compared to CU and 60% when compared to MCI. The distributions of the mean geodesic path lengths are shown in Figure‐3. For all the different pairs of lobes, the mean length was higher for the AD compared to the CU and MCI groups. 
The path lengths between all of the major lobes are higher for the AD group compared to both the CU and MCI groups. These findings suggest that network efficiency is reduced in AD and may explain cognitive dysfunction observed in the Alzheimer’s clinical syndrome. Future work entails incorporating better constraints on tractography using structural T1‐weighted images and separating disease groups by AD‐biomarker status.
Objective To determine the ordering of changes in Alzheimer disease (AD) biomarkers among cognitively normal individuals. Methods Cross-sectional data, including cerebrospinal fluid (CSF) analytes, molecular imaging of cerebral fibrillar β-amyloid with positron emission tomography (PET) using the [ ¹¹ C] benzothiazole tracer, Pittsburgh Compound-B (PiB), magnetic resonance imaging (MRI)-based brain structures, and clinical/cognitive outcomes harmonized from 8 studies, collectively involving 3,284 cognitively normal individuals of 18–101 years, were analyzed. The age at which each marker exhibited an accelerated change (called the change-point) was estimated, and compared across the markers. Results Accelerated changes in CSF Aβ1-42 (Aβ 42 ) occurred at 48.28 years of age and Aβ 42 /Aβ 40 ratio at 46.02 years, followed by PiB mean cortical standardized uptake value ratio (SUVR) with a change-point at 54.47 years. CSF total tau (Tau) and tau phosphorylated at threonine 181 (Ptau) had a change-point at about 60 years, similar to those for MRI hippocampal volume and cortical thickness. The change-point for a cognitive composite occurred at 62.41 years. The change-points for CSF Aβ 42 and Aβ 42 /Aβ 40 ratio, albeit not significantly different from that for PiB SUVR, occurred significantly earlier than that for CSF Tau, Ptau, MRI markers and the cognitive composite. Adjusted analyses confirmed that accelerated changes in CSF Tau, Ptau, MRI markers, and the cognitive composite occurred at ages not significantly different from each other. Conclusions Our findings support the hypothesized early changes of amyloid in preclinical AD, and suggest that changes in neuronal injury and neurodegeneration markers occur close in time to cognitive decline.
Conference Paper
Modern deep networks have proven to be very effective for analyzing real world images. However, their application in medical imaging is still in its early stages, primarily due to the large size of three-dimensional images, requiring enormous convolutional or fully connected layers - if we treat an image (and not image patches) as a sample. These issues only compound when the focus moves towards longitudinal analysis of 3D image volumes through recurrent structures, and when a point estimate of model parameters is insufficient in scientific applications where a reliability measure is necessary. Using insights from differential geometry, we adapt the tensor train decomposition to construct networks with significantly fewer parameters, allowing us to train powerful recurrent networks on whole brain image volume sequences. We describe the "orthogonal" tensor train, and demonstrate its ability to express a standard network layer both theoretically and empirically. We show its ability to effectively reconstruct whole brain volumes with faster convergence and stronger confidence intervals compared to the standard tensor train decomposition. We provide code and show experiments on the ADNI dataset using image sequences to regress on a cognition related outcome.
We introduce a new algorithm for the numerical computation of Nash equilibria of competitive two-player games. Our method is a natural generalization of gradient descent to the two-player setting where the update is given by the Nash equilibrium of a regularized bilinear local approximation of the underlying game. It avoids oscillatory and divergent behaviors seen in alternating gradient descent. Using numerical experiments and rigorous analysis, we provide a detailed comparison to methods based on \emph{optimism} and \emph{consensus} and show that our method avoids making any unnecessary changes to the gradient dynamics while achieving exponential (local) convergence for (locally) convex-concave zero sum games. Convergence and stability properties of our method are robust to strong interactions between the players, without adapting the stepsize, which is not the case with previous methods. In our numerical experiments on non-convex-concave problems, existing methods are prone to divergence and instability due to their sensitivity to interactions among the players, whereas we never observe divergence of our algorithm. The ability to choose larger stepsizes furthermore allows our algorithm to achieve faster convergence, as measured by the number of model evaluations.
Objective: To examine the long-term cognitive trajectories of individuals with normal cognition at baseline and distinct amyloid/tau/neurodegeneration (ATN) profiles. Methods: Pooling data across 4 cohort studies, 814 cognitively normal participants (mean baseline age = 59.6 years) were classified into 8 ATN groups using baseline CSF levels of β-amyloid 1-42 as a measure of amyloid (A), phosphorylated tau 181 as a measure of tau (T), and total tau as a measure of neurodegeneration (N). Cognitive performance was measured using a previously validated global factor score and with the Mini-Mental State Examination. We compared the cognitive trajectories across groups using growth curve models (mean follow-up time = 7 years). Results: Using different model formulations and cut points for determining biomarker abnormality, only the group with abnormal levels of amyloid, tau, and neurodegeneration (A+T+N+) showed consistently greater cognitive decline than the group with normal levels of all biomarkers (A-T-N-). Replicating prior findings using the 2011 National Institute on Aging-Alzheimer's Association/suspected non-Alzheimer disease pathophysiology schema, only individuals with abnormal levels of both amyloid and phosphorylated tau 181 or total tau (stage 2) showed greater cognitive decline than those with normal biomarker levels (stage 0). Conclusion: The results are consistent with the hypothesis that both elevated brain amyloid and neurofibrillary tangles are necessary to observe accelerated neurodegeneration, which in turn leads to cognitive decline.