Functional connectivity signatures of major depressive disorder: machine learning analysis of two multicenter neuroimaging studies


The promise of machine learning has fueled the hope for developing diagnostic tools for psychiatry. Initial studies showed high accuracy for the identification of major depressive disorder (MDD) with resting-state connectivity, but progress has been hampered by the absence of large datasets. Here we used regular machine learning and advanced deep learning algorithms to differentiate patients with MDD from healthy controls and identify neurophysiological signatures of depression in two of the largest resting-state datasets for MDD. We obtained resting-state functional magnetic resonance imaging data from the REST-meta-MDD (N = 2338) and PsyMRI (N = 1039) consortia. Classification of functional connectivity matrices was done using support vector machines (SVM) and graph convolutional neural networks (GCN), and performance was evaluated using 5-fold cross-validation. Features were visualized using GCN-Explainer, an ablation study and univariate t-testing. The results showed a mean classification accuracy of 61% for MDD versus controls. Mean accuracy for classifying (non-)medicated subgroups was 62%. Sex classification accuracy was substantially better across datasets (73–81%). Visualization of the results showed that classifications were driven by stronger thalamic connections in both datasets, while nearly all other connections were weaker with small univariate effect sizes. These results suggest that whole brain resting-state connectivity is a reliable though poor biomarker for MDD, presumably due to disease heterogeneity as further supported by the higher accuracy for sex classification using the same methods. Deep learning revealed thalamic hyperconnectivity as a prominent neurophysiological signature of depression in both multicenter studies, which may guide the development of biomarkers in future studies.
Selene Gallo, Ahmed El-Gazzar, Paul Zhutovsky, Rajat M. Thomas, Nooshin Javaheripour, Meng Li, Lucie Bartova, Deepti Bathula, Udo Dannlowski, Christopher Davey, Thomas Frodl, Ian Gotlib, Simone Grimm, Dominik Grotegerd, Tim Hahn, Paul J. Hamilton, Ben J. Harrison, Andreas Jansen, Tilo Kircher, Bernhard Meyer, Igor Nenadić, Sebastian Olbrich, Elisabeth Paul, Lukas Pezawas, Matthew D. Sacchet, Philipp Sämann, Gerd Wagner, Henrik Walter, Martin Walter, PsyMRI* and Guido van Wingen
Molecular Psychiatry
With more than 163 million people affected [1], major depressive
disorder (MDD) is the most common psychiatric disorder in the
world. This number keeps increasing every year, adding urgency
to the question of how to diagnose, prevent, and treat it [2]. The
promise of artificial intelligence for medicine has also sparked
interest in using machine learning techniques for the development
of biomarkers in psychiatry [3]. A meta-analysis of initial
small-scale studies suggested that resting-state functional
magnetic resonance imaging (fMRI) may provide highly accurate
biomarkers for MDD [4]. However, neuroimaging biomarkers
showed lower accuracies for other psychiatric disorders when
based on large scale datasets, presumably due to increased
heterogeneity within the patient group [5]. Until now, large scale
resting-state cohorts for MDD have not been available, limiting the
progress of the development of biomarkers for MDD.
In this work, we used data from two of the largest consortia,
REST-meta-MDD [6] and PsyMRI (from now on mddrest and psymri), which
obtained resting-state fMRI data across different research centers
from patients with MDD and matched healthy controls (HC), to
evaluate the potential of resting-state functional connectivity (FC)
as a biomarker for MDD.
FC between brain regions refers to the statistical dependence of
neurophysiological signals [7], typically measured as Pearson
correlation [8]. Until recently, the gold standard for exploring brain
differences was univariate group analysis, which interrogates one
voxel at a time and has revealed consistent FC differences in
MDD [9]. However, univariate analysis potentially misses more
complex patterns and is only able to detect average group
differences. In the last few years, the increasing availability of
machine learning (ML) and deep learning (DL) techniques [10] has
enabled researchers to look into multivariate patterns. Recent
results using the popular ML classifier support vector machine
(SVM) obtained up to 95% classification accuracy in small datasets
[11,12]. DL algorithms are advanced ML techniques that learn
abstract representations of the input data as an integral part of the
training process. DL may have huge potential for high-dimensional
data such as neuroimaging [13]. DL has shown
convincing early results in many tasks involving image analysis,
including classification of psychiatric disorders (see [14] for a
review). Specific deep learning models on graphs (i.e., graph
convolutional networks; GCN) have recently emerged and
demonstrated powerful performance on various tasks. Generally
speaking, GCN models are a type of neural network architecture
that can specifically leverage the graph structure that is typical for
FC [15]. GCNs also enable the visualization of the important
features, countering the typical criticism that ML models are black
boxes [16], and enable their use for uncovering the neural
signatures of psychiatric disorders.
In the research reported here, we trained linear and nonlinear
(rbf) SVM and spatial GCN classifiers on the mddrest (selected
N = 2338) and psymri (selected N = 1039) datasets separately as
well as combined. We performed two complementary post-hoc
visualization experiments: GCN-Explainer [17], which highlights
the connections between brain regions that are
necessary for the classifier to distinguish between MDD and
controls; and an ablation study in which each brain region is
systematically excluded (virtually ablated) one by one from the
model. The consequent drop in accuracy from the original model
accuracy indicates the contribution of the excluded region to the
overall performance. To assess whether identified connections were
stronger or weaker in MDD, we used group-level t-tests. Furthermore,
as clinical heterogeneity is expected to have a large influence
on the classification accuracy, we performed additional classifications
for medicated and non-medicated patients separately.
The psymri consortium consists of 23 cohorts from across the world,
including raw data from 531 patients (60% males, 33.7 ± 11.6 years old)
and 508 controls (65% males, 35.1 ± 12.2 years old). The mddrest dataset
collected by the REST-meta-MDD Project is currently the largest resting-state
fMRI database for MDD, including 1255 patients (57% males, 36.6 ± 15.7
years old) and 1083 HC (62% males, 35.1 ± 14.7 years old) from 25
cohorts in China. Supplementary Fig. S1 in the Supplementary Materials
shows the distribution of participants across the sites of each dataset.
Demographic data are reported in Supplementary Table S1 for each of the
classification tasks separately (see Supplementary Information for more
details about sample composition).
We also used samples from two external rs-fMRI datasets that do
not target MDD to benchmark classification performance on an
independent task. Abide [18] is a comparable retrospective multicenter
neuroimaging consortium, but with patients with autism spectrum
disorders (ASD) instead of MDD. In this study we used a sample from
both the first and second releases (N = 2000; 1590 M/410 F;
1030 ASD/970 TD). The UK Biobank [19] is a prospective population cohort
with harmonized data acquisition. We used a randomly sampled subset
of the resting-state fMRI dataset with a sample size comparable to our
MDD consortia (N = 2000; 1000 M/1000 F).
Anonymized data were made available for these consortia from studies
that were approved by local Institutional Review Boards. All study
participants provided written informed consent at their local institution.
Data processing
Standard preprocessing of the psymri dataset was done in house using FSL
and ANTs (see Supplementary Information). Standard preprocessing of the
mddrest data was done at each site using the Data Processing Assistant for
Resting-State fMRI (DPARSF), which is based on SPM [20,21] (see
Supplementary Information for preprocessing of Abide and UK Biobank).
Time courses of cortical and subcortical regions as defined by the Harvard-
Oxford atlas [22] were extracted for all datasets (112 regions in total, see
Supplementary Information for analyses on a functional atlas). Correlations
between all brain regions were estimated and the resulting correlation
matrices were used as features to predict class membership (Fig. 1). We
used medication status to define more homogeneous groups, and
included sex to benchmark classification performance for a task that is
not dependent on the psychiatric diagnosis, resulting in the following
classification tasks:
I. MDD vs HC
II. Non-medicated MDD vs HC
III. Medicated MDD vs HC
IV. Medicated MDD vs non-medicated MDD
V. Male vs female
For each contrast separately, we subsampled the classes so that the
number of participants per class was equal. Supplementary Table S1
describes the sample compositions of the groups for each contrast.
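The subsampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `balance_classes`, the fixed seed, and the toy labels are hypothetical, and the text does not say how participants were chosen for removal, so uniform random selection is assumed here.

```python
import random
from collections import defaultdict


def balance_classes(labels, seed=0):
    """Return sorted indices that keep an equal number of participants per class.

    Assumption: surplus participants of the larger class are dropped at random.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # The smallest class determines how many participants each class keeps.
    n_keep = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_keep))
    return sorted(kept)


# Toy example: 7 patients and 4 controls are reduced to 4 per class.
labels = ["MDD"] * 7 + ["HC"] * 4
kept = balance_classes(labels)
counts = {c: sum(labels[i] == c for i in kept) for c in set(labels)}
```

Balancing the classes in this way makes chance level 50% for every contrast, which simplifies the comparison of accuracies across tasks.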
Classifier models
Three classes of models, linear SVM, non-linear rbf SVM, and GCN, were
used to evaluate prediction performance for all tests. To assess the
generalizability of our results to data that have not been used to train the
model, we used a 5-fold cross-validation (CV) scheme. Hyperparameter
search for each model was based on best practices from the literature, and
chosen empirically on the basis of a relative prediction accuracy on 20% of
the training set [10] (see Supplementary Information for details). After the
best hyperparameter combination had been determined, the actual
performance of the classifiers was assessed on the test set [23]. The
overall performance was calculated by averaging the balanced accuracy
over the five test splits. Other evaluation metrics,
namely F1-score, specificity and sensitivity, are reported in the
Supplementary Information. All performances were compared against chance
level using a random permutation test, and Bonferroni correction was
used to adjust for the number of comparisons (Supplementary
Information). Finally, for the contrast of MDD vs HC, to assess model generalization
between datasets, we trained a model on one dataset and evaluated the
model on the other dataset. For GCN we used a 5-fold cross-validation (CV)
scheme to perform model selection on 20% of the test set. The SVMs do
not need model selection and we applied a one-shot procedure.
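The evaluation metrics named above can be illustrated with a short sketch. It is a simplified illustration, not the procedure from the Supplementary Information: the function names are hypothetical, and the permutation test shown here shuffles the true labels against fixed predictions, whereas a full permutation test would typically retrain the classifier on each permuted label set.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (for two classes: mean of sensitivity and specificity)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def permutation_p_value(y_true, y_pred, n_perm=1000, seed=0):
    """Share of label shuffles scoring at least as well as the observed labels.

    Simplification (assumption): predictions stay fixed, no retraining.
    """
    rng = random.Random(seed)
    observed = balanced_accuracy(y_true, y_pred)
    shuffled = list(y_true)
    n_at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if balanced_accuracy(shuffled, y_pred) >= observed:
            n_at_least += 1
    # Add-one correction so the p-value is never exactly zero.
    return (n_at_least + 1) / (n_perm + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
p_value = permutation_p_value(y_true, y_pred)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

In the toy example three of four participants are recovered in each class, so the balanced accuracy is 0.75.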
Linear and rbf SVM. We used linear and rbf SVM, popular classifiers that
find respectively linear and non-linear combinations of features that best
separate classes among the observations [24,25]. The upper triangular
portion of the FC matrix was used as input for both linear and rbf SVM
(Fig. 2).
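As a rough illustration of how an FC matrix becomes an SVM feature vector, the sketch below computes Pearson correlations between ROI time courses and flattens the upper triangle. It uses only the Python standard library; the function names and toy time courses are hypothetical, and excluding the diagonal is an assumption, since the text does not say whether it was kept.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long time courses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fc_matrix(time_courses):
    """Symmetric ROI-by-ROI correlation matrix with ones on the diagonal."""
    n = len(time_courses)
    fc = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            fc[i][j] = fc[j][i] = pearson(time_courses[i], time_courses[j])
    return fc


def upper_triangle(fc):
    """Flatten the strict upper triangle into a 1D feature vector."""
    n = len(fc)
    return [fc[i][j] for i in range(n) for j in range(i + 1, n)]


# Toy example with three ROIs and four time points.
rois = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
fc = fc_matrix(rois)
features = upper_triangle(fc)
```

Under the same assumption, the 112 regions used here would give 112 × 111 / 2 = 6216 features per participant.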
GCN. Spatial GCN, referred to simply as GCN here, is a particular class of
GCN. The first step consisted of transforming the FC matrices into graph
representations. A graph representation is composed of nodes, nodal
features or embeddings, and edges connecting the nodes. In our case,
each node represents a region of interest (ROI). To construct the edges of
the graphs, we thresholded the FC and binarized it so that the top 50% of
connections in terms of connectivity strength were transformed into ones
and the rest into zeros, regulating the sparsity of the graph. This threshold
was derived from previous studies, including a systematic search for
optimal graph sparsity by our own group. The nodal features were defined
by the connectivity profile of that region to other regions, meaning the
corresponding row in the FC matrix before thresholding (Fig. 1). This
allowed the model to abstract information from each group of regions that
have high temporal correlation. The GCN architecture used for this work
was optimized for each contrast and dataset. See Supplementary Fig. S2
for a visual representation of the model and related concepts and
Supplementary Information for details about the architecture. We used a
binary cross-entropy loss function and optimized the weights using the Adam
optimizer. The model was trained for 100 epochs with an initial learning rate
of 0.001, decaying by a factor of 10 every 30 epochs.
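The graph construction described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function name `build_graph` and the toy matrix are hypothetical, and ranking connections by absolute correlation follows the Fig. 1 caption (the main text says only "connectivity strength").

```python
def build_graph(fc, keep_fraction=0.5):
    """Binarize an FC matrix into an adjacency matrix and keep rows as node features.

    Assumption: connections are ranked by absolute correlation and the
    strongest `keep_fraction` of off-diagonal pairs become edges.
    """
    n = len(fc)
    pairs = [(abs(fc[i][j]), i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    n_keep = int(len(pairs) * keep_fraction)
    adjacency = [[0] * n for _ in range(n)]
    for _, i, j in pairs[:n_keep]:
        adjacency[i][j] = adjacency[j][i] = 1
    # Node features: the unthresholded connectivity profile (row) of each ROI.
    node_features = [list(row) for row in fc]
    return adjacency, node_features


# Toy 4-ROI example: 6 unique connections, so the 3 strongest become edges.
fc = [
    [1.0, 0.8, 0.1, -0.6],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, -0.7],
    [-0.6, 0.2, -0.7, 1.0],
]
adjacency, node_features = build_graph(fc)
```

Here the retained edges are ROI 0 with 1 (0.8), ROI 2 with 3 (−0.7) and ROI 0 with 3 (−0.6); the weaker connections are set to zero in the adjacency matrix but remain available to the model through the node features.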
GCN-Explainer and ablation study
Two complementary experiments were carried out on the main MDD vs HC
contrast. To assess the consistency of the results (e.g., replication), we
performed the experiment on the psymri and the mddrest datasets
independently. Regions highlighted by both datasets are reported in the
results section. We focused on visualization of the GCN results because the
methods allowed us to use complementary visualization techniques and
strategies to enhance reliability of the results.
GCN-Explainer shows the manner in which the GCN classifier made
the predictions. These explanations are in the form of a subgraph of the
entire graph the GCN was trained on, so that the subgraph maximizes
the mutual information with the GCN prediction. This is achieved by
formulating a mean field variational approximation and learning a real-
valued graph mask which selects the important subgraph of the GCN's
computation graph.
We additionally performed an ablation study to identify the regions that
influenced the performance of the GCN model in separating HC and MDD
patients. This was done by masking the connectivity profile of each region,
i.e., deleting the corresponding row from the connectivity matrix of the test
set. The resultant drop in accuracy from the performance of the model
trained on the full connectivity matrix is attributed to the region. We
repeated the train-test process masking each region 10 times and calculated
the mean drop in accuracy. The repetition leverages the stochastic nature of
the GCN classifier to enhance the replicability of the results.
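The logic of the ablation experiment can be sketched with a toy stand-in for the trained model. Everything named here is hypothetical: `toy_predict` replaces the GCN, masking is implemented by zeroing a region's row (an assumption about what "deleting the row" means in practice), plain accuracy is used because the toy classes are balanced, and the 10 repetitions over retrained models are omitted.

```python
def accuracy(predict, samples, labels):
    """Fraction of samples for which the classifier returns the true label."""
    return sum(predict(s) == y for s, y in zip(samples, labels)) / len(labels)


def mask_region(fc, region):
    """Copy of fc with the connectivity profile (row) of one region set to zero."""
    return [[0.0] * len(row) if i == region else list(row)
            for i, row in enumerate(fc)]


def ablation_drops(predict, test_fcs, labels):
    """Accuracy drop attributed to each region when its row is masked at test time."""
    baseline = accuracy(predict, test_fcs, labels)
    drops = []
    for region in range(len(test_fcs[0])):
        masked = [mask_region(fc, region) for fc in test_fcs]
        drops.append(baseline - accuracy(predict, masked, labels))
    return drops


def toy_predict(fc):
    """Hypothetical classifier that relies only on the ROI 0 to ROI 1 connection."""
    return 1 if fc[0][1] > 0.5 else 0


def toy_fc(strength):
    """Three-ROI matrix whose ROI 0 to ROI 1 connection has the given strength."""
    return [[1.0, strength, 0.2], [strength, 1.0, 0.2], [0.2, 0.2, 1.0]]


test_fcs = [toy_fc(0.9), toy_fc(0.9), toy_fc(0.1), toy_fc(0.1)]
labels = [1, 1, 0, 0]
drops = ablation_drops(toy_predict, test_fcs, labels)
```

Because the toy classifier reads only the first row, masking ROI 0 reduces accuracy from 1.0 to 0.5, while masking the other regions has no effect; regions with the largest drops are the ones the model depends on.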
Univariate group analyses
Univariate independent sample t-tests were performed on FC for the
psymri and mddrest datasets separately. Sex, age, recording site and
movement during scanning (average framewise displacement according to
Jenkinson) were regressed out before testing. For each contrast and
dataset, results were FDR corrected for multiple comparisons (p < 0.05).
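A minimal sketch of the per-connection test and the multiple-comparison step is given below. It is illustrative only: the function names are hypothetical, the covariate regression is omitted, the conversion from t to p is left out because it needs the t distribution, and the Benjamini-Hochberg procedure is assumed as the FDR method since the text does not name one.

```python
import math


def t_statistic(group_a, group_b):
    """Independent-samples t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    ssa = sum((x - ma) ** 2 for x in group_a)
    ssb = sum((x - mb) ** 2 for x in group_b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one flag per test: True if it survives FDR control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p-value lies under the rising BH threshold.
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


t_value = t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
survives = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

In the toy example only the two smallest p-values survive, because the third (0.039) exceeds its threshold of 0.05 × 3 / 5 = 0.03 and no larger p-value falls under its own threshold.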
Fig. 1 Pipeline from 4D rs-fMRI data to input for the classification task. Visual representation of our pipeline. For the psymri dataset,
preprocessing of the raw 4D rs-fMRI and parcellation of the brain in regions of interest (ROIs) according to the Harvard-Oxford atlas was
performed in house, while the mddrest consortium provided us directly with the time course of the same ROIs. The functional connectivity
(FC) matrix was calculated using Pearson correlation between ROIs. Each entry in the FC represented the strength of functional connectivity
between two ROIs; each row represented the correlation profile between one ROI and the other ROIs. Since the FC is symmetrical, only one of the
triangles was used as input for the SVM classifiers. From the FC we constructed the graph, which was used as GCN input. The ROIs were used
as the nodes of the graphs. To construct the edges between nodes, i.e., the FC between ROIs, we first binarized the FC matrix so that only the
50% highest absolute values of the correlations of the matrix were transformed into ones, while the rest were transformed into zeros. We then
drew an edge between ROIs whose correlation survived the binarization process. A feature was assigned to each node. The features were the
original (i.e., before binarization) correlation profile of the node itself with the rest of the ROIs in the brain, therefore an entire row of the FC.
SVM support vector machine, GCN graph convolutional network.
Classification performance
Classifications for the main comparison between MDD vs HC were
significantly better than chance level after correction for multiple
comparisons (with the exception of linear SVM classification of
psymri), though balanced accuracies averaged across folds were
low, with a mean of 61% across datasets and classification models
(range 57–63%; see Fig. 2 and Supplementary Table S2). Average
balanced accuracies for the comparisons between medicated
MDD vs HC, non-medicated MDD vs HC, and medicated MDD vs
non-medicated MDD were comparable, with a mean of 62% (range
54–67%). At least one classification model was significantly better
than chance for each of these three comparisons for mddrest and
the combination of mddrest + psymri, while none of the classifications
for psymri were significant. The Supplementary Information
provides additional evaluation metrics showing that sensitivity
and specificity were balanced (Supplementary Tables S3–S5), and
that site harmonization using Combat had little influence on the
results (Supplementary Table S15). Comparable classification
results were obtained when using a fully connected deep learning
model or when using a functional instead of structural parcellation
atlas (see Supplementary Information).
The cross-dataset training procedure for the contrast MDD vs
HC resulted in lower performance. A GCN trained on psymri and
tested on mddrest reached a mean accuracy of 54.16%
(sd = 0.66), while a GCN trained on mddrest and tested on psymri
reached a mean accuracy of 56.38% (sd = 0.84). A linear SVM
reached accuracies of 55.7% and 54.8%, respectively, on the
same contrasts, and an rbf SVM reached accuracies of 53.1%
and 56.1%.
To investigate the influence of subject and research site
characteristics on classification performance, we assessed the
accuracy for the different sexes, diagnostic statuses, scanner
manufacturers and recording sites for the SVM-rbf that performed
best. Particularly the variability in accuracy across sites was
appreciable (range 48–87%), but was not significantly associated
with sample size (r = 0.25, p = 0.25). Additional univariate t-testing
revealed no significant FC difference between correctly and
incorrectly classified participants (see Supplementary Information).
Symptom severity. To evaluate whether symptom severity could
be predicted from the FC matrices, we used GCN and support
vector regression (SVR) with the rbf kernel to predict Hamilton
depression scores (HAM-D) for 1113 patients in mddrest and 333
patients in psymri. SVR could only explain 3.5–7% of the variance
and GCN only predicted the training mean, indicating that
symptom severity could not be predicted reliably.
Fig. 2 Performance of each classifier for each comparison, expressed as average balanced accuracy across five folds. Error bars indicate
standard deviation across folds; * indicates classification results better than chance level after permutation testing. Significance level was
corrected for the number of experiments performed, using the Bonferroni procedure.
GCN-Explainer and ablation study results
To gain insight into the most important connections for the
classification of MDD vs HC, we performed two complementary
visualization experiments on the mddrest and psymri datasets
separately to assess whether results would be consistent. Visualization
of the GCN using GCN-Explainer identified the connections
between 1) the left and right thalamus, 2) the right lingual gyrus and
right supracalcarine cortex, 3) the left and right anterior divisions of
the supramarginal gyrus, and 4) the left and right medial frontal cortex.
These connections were amongst the ten most influential connections
and were present in both datasets (Fig. 3A). An ablation study
showed the highest drops in balanced accuracy that were present
in both datasets for the thalamus (mean (sd) over 10 repetitions per
fold: 6.27 (2.17)% for psymri and 4.62 (1.08)% for mddrest) and
Heschl's gyrus (5.99 (3.88)% for psymri, 4.12 (1.47)% for mddrest).
The results for psymri and mddrest are presented in Fig. 3B and
Supplementary Table S6.
Univariate group analyses
In the mddrest dataset, 28% of the connections showed significant
differences between MDD and HC, with predominantly
reduced FC in MDD. For example, the amygdala showed reduced
connectivity with 154 other regions, the insula showed reduced
connectivity with 126 other regions, and the anterior cingulate
cortex showed reduced connectivity with 100 other regions, but
increased connectivity with the right precentral and postcentral
gyri. In contrast, the thalamus showed increased connectivity with
199 other brain regions, primarily with frontal and insular regions,
but decreased connectivity between interhemispheric homologs
(Fig. 4). Effect sizes were low [26], with an average Cohen's d of
−0.14 (range −0.34, −0.08) across significantly reduced connections,
and an average Cohen's d of 0.12 (range 0.08, 0.18) for
significantly increased thalamus connections.
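For readers unfamiliar with the effect-size measure, a minimal sketch of Cohen's d for a single connection follows. The function name, the toy FC values and the sign convention (MDD minus HC, so that negative values mean reduced connectivity in MDD) are assumptions made for illustration; the pooled standard deviation is the conventional choice but is not confirmed by the text.

```python
import math


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b) with pooled SD."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    var_a = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    var_b = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (ma - mb) / pooled_sd


# Toy FC values of one connection in patients and controls.
mdd = [0.2, 0.3, 0.4]
hc = [0.3, 0.4, 0.5]
d = cohens_d(mdd, hc)
```

The toy groups differ by exactly one pooled standard deviation, far more than the averages of about 0.1 reported here, which correspond to group distributions that overlap almost completely.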
In the psymri dataset, only the decreased connectivity between
the left and right supracalcarine cortices survived correction for
multiple comparisons for the comparison between MDD and HC.
Inspection of the uncorrected results showed a pattern
comparable to that of the mddrest dataset (Fig. 4). Full results for
each contrast and dataset (FDR corrected) are presented in
Supplementary Tables S8–S13.
Only two comparisons, MDD vs HC and MDD-med vs HC,
showed replicable differences in the two datasets (Table 1). In the
first contrast, connectivity between the left and right
supracalcarine cortex showed a reduction in patients (psymri t:
4.73, p-corr < 0.05; mddrest t: 3.69, p-corr < 0.0005), out of the
3536 significant connections in the mddrest dataset. For the
second comparison, connectivity between the left thalamus and
left precentral gyrus showed increased connectivity in medicated
patients (psymri t: 3.56, p-corr < 0.05; mddrest t: 3.03, p-corr < 0.05),
while another nine FCs showed decreased FC in medicated
patients, out of the 3527 significant connections in the psymri
dataset and 596 in the mddrest.
Fig. 3 GCN-Explainer and ablation results for the classification of MDD and HC. A Results of the GCN-Explainer experiment obtained using
the psymri dataset (left panel) and the mddrest dataset (right panel): on top is the graphic representation of the functional connections
between areas identified as necessary to discriminate MDD from HC, which are listed. Connections identified by experiments in both
datasets are shaded in gray. B Results of the ablation experiment obtained using the psymri dataset (left panel) and the mddrest dataset (right
panel). Regions identified by experiments in both datasets are shaded in gray. L left, R right, ant anterior, inf inferior, post posterior, acc
balanced accuracy.
Panel text recovered from Fig. 3, in extraction order:
A GCN-Explainer results, connection between (first list): Thalamus L - Thalamus R; Lingual Gyrus R - Supracalcarine Cortex R; Supramarginal Gyrus, ant, L - Supramarginal Gyrus, ant, R; Frontal Medial Cortex L - Frontal Medial Cortex R; Paracingulate Gyrus L - Paracingulate Gyrus R; Inf Frontal Gyrus, R - Inferior Frontal Gyrus, R; Paracingulate Gyrus R - Cingulate Gyrus, post, R; Cingulate Gyrus, ant, L - Cingulate Gyrus, R; Putamen L - Putamen R; Intracalcarine Cortex L - Intracalcarine Cortex R.
A GCN-Explainer results, connection between (second list): Thalamus L - Thalamus R; Lingual Gyrus R - Supracalcarine Cortex R; Supramarginal Gyrus, ant, L - Supramarginal Gyrus, ant, R; Frontal Medial Cortex L - Frontal Medial Cortex R; Lingual Gyrus L - Lingual Gyrus R; Planum Temporale L - Planum Temporale R; Occipital Pole L - Occipital Pole R; Insular Cortex L - Insular Cortex R; Inf Frontal Gyrus, triangularis L - Inf Frontal Gyrus, opercularis L; Brain-Stem L - Brain-Stem R.
B Ablation study results, mean acc drop, axis from -10% to 0% (first list): Precentral Gyrus R; Thalamus L; Sup. Parietal Lobule R; Heschl's Gyrus L; Occipital Pole R; Inf. Frontal Gyrus R; Temporal Pole R; Lingual Gyrus R; Pallidum R; Middle Temporal Gyrus, ant. L.
B Ablation study results, mean acc drop, axis from -10% to 0% (second list): Thalamus R; Middle Frontal Gyrus L; Frontal Medial Cortex R; Postcentral Gyrus R; Heschl's Gyrus L; Insular Cortex L; Frontal Orbital Cortex L; Cuneal Cortex L; Caudate R; Frontal Pole R.
Sex classification
Classification of sex was beyond chance level for all the classifiers
and datasets, with a mean across datasets and models of 68%
(range 65–71%). To assess whether sex classification accuracy is
comparable in other datasets, we performed similar analyses with
comparable sample sizes (N = 2000) in the Abide and UK Biobank
datasets. Sex classification accuracy was comparable in the
retrospective Abide cohort (73%) and higher in the prospective
harmonized UK Biobank cohort (81%).
The results showed that ML and DL classifiers were able to
distinguish patients from controls beyond chance level, but that
classification performance was low. Classification accuracies for
(non-)medicated patients separately were comparable, suggesting
that medication use had little influence on the results. Visualization
of the functional connections that were most influential
revealed hyperconnectivity of the thalamus. This was
corroborated by two distinct visualization techniques and
replicated in two datasets, suggesting that thalamic hyperconnectivity
may be the most prominent neurophysiological characteristic
of MDD. Interestingly, thalamic hyperconnectivity was
rather specific, as MDD was mainly associated with widespread
hypoconnectivity.
The 61% accuracy in these two datasets is considerably lower
than the average 84% accuracy across small-scale studies in a
recent meta-analysis [4]. Our results corroborate those from a
recent Japanese multicenter study that reported a balanced
accuracy of 67–69% [27]. The lower accuracy with larger sample
sizes is paradoxical, as ML and DL models generally become better
when trained on larger samples [28]. However, neuroimaging
research has actually shown that prediction accuracy tends to
decline with increasing sample size [29,30]. This is presumably
due to the increase in clinical heterogeneity when recruiting larger
samples, as sample heterogeneity reduces model performance
[5,27]. We used data from two consortia that both consist of small
samples obtained at many different research centers, and
performance varied considerably across sites (see Supplementary
Information). Accordingly, the large total sample size came
together with large heterogeneity, which is probably responsible
for the poor accuracy of our model. Strategies to mitigate site
effects were not successful (see Supplementary Information).
Heterogeneity is maximal when training and testing are performed
on two different datasets, and indeed the lowest accuracies, obtained
in the cross-dataset experiments, confirmed the role of
heterogeneity in compromising the final performance, which may be
related to the distinct ancestry of the Chinese and European
cohorts. Nevertheless, combining both datasets led to
comparable performance, indicating that it is possible to construct
an MDD model that can generalize across cohorts. Whereas
sample homogeneity can lead to optimal model performance,
sample heterogeneity ensures optimal generalization of the
model to new data [5], suggesting that our results could form a
lower bound on classification accuracy.
Fig. 4 Univariate t-test results and Cohen's d. Left: Results of the univariate t-test for the classification task MDD vs HC for the mddrest
dataset (top) and for the psymri dataset (bottom). The mddrest results are corrected for multiple comparisons and thresholded using FDR < 0.05. The
red lines correspond to the left and the right thalami. For the psymri dataset, t-tests did not survive correction for multiple comparisons, and
the results are thresholded at p-uncorr < 0.05 to illustrate the pattern comparable to that of mddrest. The clustering for lobes (frontal, limbic,
occipital, parietal, subcortical, temporal) is done merely for illustration purposes. Right: Cohen's d for the classification task MDD vs HC for
the mddrest dataset (top) and for the psymri dataset (bottom), calculated for each voxel group comparison.
One way to reduce sample heterogeneity is to take clinical
variability into account. We therefore split the sample into
medicated and non-medicated patients. Although antidepressants
are known to affect resting-state connectivity [31], this did not
increase classification performance, suggesting that medication use
had little influence on the classification results. Attempts to classify
patients based on symptom severity or demographics were
unsuccessful (see Supplementary Fig. S3). The diagnosis of MDD
depends on the subjective evaluation of nine different symptoms,
and as few as one symptom may overlap between two patients
[32]; comorbidity is common, and symptoms may overlap with
other disorders [33], leading to low interrater reliability of the
diagnosis [34]. Such uncertainty associated with the diagnosis can
obscure the relationship between a patient's data and the category
it belongs to [35–38], and thereby decrease accuracy [39]. Data-driven
definition of the disorder and the use of biotypes could help
arrive at more homogeneous psychiatric groups. The search for
MDD biotypes triggered a flurry of publications [40–45] and
discussions in the last few years, but no consensus has emerged yet.
Another way to reduce heterogeneity of the dataset is to
statistically harmonize data across centers. We performed data
harmonization using Combat, which had little influence on the
results. While this procedure can increase the power to detect group
differences, it also had little influence on the classification of
Alzheimer's disease in the ADNI dataset [46]. While site harmonization
will have a large effect on the ability to distinguish centers from
the data, we expect that it will not have a large influence on the
classification of interest when the dataset is balanced and site
information is independent of group membership.
Despite our hypothesis that exploiting the graph-like structure
of FC would be beneficial for the classification tasks, we found no
clear advantage in using GCN over SVM, nor in using another tested DL
model (Supplementary Information). DL methods like GCNs
perform especially well when applied to very large samples such
as the 14 million images in ImageNet [47]. However, in
applications like ours, the size of the dataset is limited.
Even this largest sample of MDD fMRI data might simply not be
enough to exploit the potential of GCNs [48].
Although the results did not meet the accuracy criteria for a
clinical diagnostic tool, the insights they provide go beyond
"mere" prediction and can help connect neurobiological
processes to their psychiatric consequences (but see ref. [49]). In the
classification between MDD and HC, the thalamus stood out as the
only region whose importance is supported both by two different
visualization techniques and by replication across two datasets.
The GCN-Explainer identified the inter-hemispheric connections
between left and right thalamus as most important to the
classification. Of note, the prominent inter-hemispheric connections
in the results (Fig. 3A) may reflect the way we
constructed the GCN layers (Supplementary Information). We
therefore assume that the entire FC profile of the thalamus is
driving the result, rather than only the inter-hemispheric
connectivity. This interpretation is supported by the ablation study
performed for the GCN, in which we removed each thalamus and
its FC with the rest of the brain and observed a ~5% drop in
accuracy, confirming its role in discriminating between MDD and
HC. Importantly, the relatively low accuracy of the classifiers
influences the reliability of visualization techniques. For the GCN
model, we were able to take advantage of the stochastic nature of
the algorithm to increase the replicability of the results by
repeating the training-test procedure 10 times. SVM algorithms
are deterministic and this augmentation is not possible, limiting
the reliability of the results even further (reported in the
Supplementary Information).
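The ablation logic described above can be sketched as follows. This is a hedged illustration rather than the study's code: it assumes that "removing" a region means setting its row and column in the FC matrix to zero, and it takes a hypothetical `evaluate` callable that trains and tests a classifier for a given random seed and returns its accuracy. Averaging over `n_repeats` seeds mirrors the repeated training-test runs used for the stochastic GCN.

```python
from statistics import mean


def ablate_region(fc, region):
    """Return a copy of an FC matrix with one region's row and column set to zero."""
    return [
        [0.0 if region in (i, j) else value for j, value in enumerate(row)]
        for i, row in enumerate(fc)
    ]


def accuracy_drop(evaluate, fc_matrices, labels, region, n_repeats=10):
    """Mean accuracy on intact data minus mean accuracy with `region` ablated.

    evaluate(fc_matrices, labels, seed) is a hypothetical callable that runs one
    training-test cycle and returns an accuracy between 0 and 1.
    """
    ablated = [ablate_region(fc, region) for fc in fc_matrices]
    intact_acc = mean([evaluate(fc_matrices, labels, seed) for seed in range(n_repeats)])
    ablated_acc = mean([evaluate(ablated, labels, seed) for seed in range(n_repeats)])
    return intact_acc - ablated_acc
```

A positive return value indicates that the classifier relied on the ablated region. For a deterministic model, every seed yields the same accuracy, so the repetition adds no information.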
Additional univariate t-testing showed thalamic hyperconnectivity
in mddrest, while most other brain regions showed
hypoconnectivity. This pattern was also observed in psymri, but
there it did not withstand correction for multiple comparisons.
Univariate effect sizes in the two datasets were comparable,
suggesting that the psymri results could have been penalized by
the smaller sample size. In general, the univariate effect sizes were
negligible to small [26]. This highlights the usefulness of
multivariate analysis, as the obtained ~60% accuracy translates into a
medium effect size [50].
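The step from ~60% accuracy to a medium effect size can be illustrated with one common conversion. Assuming two equally sized groups whose scores are normally distributed with equal variance, the best achievable accuracy is Φ(d/2), so d = 2·Φ⁻¹(accuracy). Whether ref. [50] uses exactly this conversion is an assumption; the function names below are hypothetical.

```python
from statistics import NormalDist


def accuracy_to_cohens_d(accuracy):
    """Cohen's d implied by a balanced two-class accuracy (equal-variance normal model)."""
    return 2.0 * NormalDist().inv_cdf(accuracy)


def cohens_d_to_accuracy(d):
    """Best achievable balanced accuracy for a given Cohen's d under the same model."""
    return NormalDist().cdf(d / 2.0)
```

Under these assumptions an accuracy of 0.60 corresponds to d ≈ 0.51, which sits at the conventional threshold of 0.5 for a medium effect.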
Table 1. Replicated univariate t-test results in the psymri and mddrest datasets.

| Connection between (region 1) | Connection between (region 2) | psymri t | psymri p-value | mddrest t | mddrest p-value |
|---|---|---|---|---|---|
| R Supracalcarine Cortex | L Supracalcarine Cortex | 4.73 | 0.017 | #### | ###### |
| MDD med vs HC | | | | | |
| L Thalamus | L Precentral Gyrus | 3.56 | 0.048 | 3.03 | 0.034 |
| L Temporal Occipital Fusiform Cortex | L Lateral Occipital Cortex, infer. | 4.52 | 0.005 | #### | 0.037 |
| R Supracalcarine Cortex | L Intracalcarine Cortex | 4.29 | 0.009 | #### | 0.021 |
| R Superior Temporal Gyrus, pos. | R Superior Temporal Gyrus, ant. | 4.06 | 0.018 | #### | 0.020 |
| R Planum Temporale | R Precentral Gyrus | 3.93 | 0.021 | #### | 0.015 |
| L Frontal Orbital Cortex | R Frontal Medial Cortex | 3.75 | 0.032 | #### | 0.049 |
| L Heschl's Gyrus (includes H1 and H2) | R Postcentral Gyrus | 3.74 | 0.032 | #### | 0.044 |
| R Supracalcarine Cortex | R Intracalcarine Cortex | 3.73 | 0.032 | #### | 0.016 |
| R Putamen | R Middle Frontal Gyrus | 3.57 | 0.048 | #### | 0.046 |
| L Central Opercular Cortex | R Precentral Gyrus | 3.56 | 0.048 | #### | 0.012 |

Functional connectivity results showing differences between the MDD and HC groups and between the medicated MDD and HC groups that are replicated in the
two datasets. T-values and p-values, FDR corrected, are reported separately for the psymri and mddrest datasets.
R, right; L, left.
S. Gallo et al.
Molecular Psychiatry
Initial studies [51,52] as well as recent meta-analyses [53–57]
have already pointed to thalamic hyperactivity, during rest as well
as during cognitive and emotion processing. Other studies have
suggested metabolic abnormalities in the thalamus of patients
with depression [58–60], and specifically the mediodorsal
thalamus was implicated in the onset of depression [61,62]. This
nucleus is responsible for integrating sensory, motor, visceral and
olfactory information and subsequently relating it to the
individual's emotional state [63], and its connectivity profile is
congruent with the increased connectivity pattern we find in our
study. This suggests that our results may be driven by
hyperconnectivity of the mediodorsal thalamus. Hypervigilant
brain states in MDD have been observed with
electroencephalography [64,65] and, inversely, thalamic deactivation precedes
sleep onset. EEG-fMRI studies in healthy subjects have reported a
thalamic BOLD signal decrease in lower vigilance states [64] and,
directly relevant to our result, thalamocortical uncoupling as a
general hallmark of (light) sleep [66–68]. Thalamic hyperconnectivity
during MRI scanning (which represents a mild stress
experiment) may well hint towards a general MDD-related
dysfunction within the larger brain network, as recent views on
the thalamus hold that it is not a passive relay station but that it
has a central role in ongoing cortical functioning [69,70]. Overall,
this suggests the hypothesis that corticothalamic hyperconnectivity
may "hijack" the corticocortical connectivity that was reduced
throughout the brain in our study.
This study has to be considered in light of its strengths and
limitations. Its main strength is the use of two of the largest
resting-state fMRI consortia with clinically confirmed MDD, which
show converging evidence for poor discrimination of MDD and
for the importance of thalamic hyperconnectivity. At the same time,
these large datasets come with the limitation of large clinical (e.g.,
differences in severity and chronicity) and technological (e.g.,
differences in scanners and MRI acquisitions) heterogeneity,
presumably reducing classification accuracy. To evaluate whether
technological heterogeneity could have influenced the results, we
compared sex classification in our MDD cohorts with sex
classification in a comparable cohort for ASD (ABIDE) and a
high-quality dataset with prospective data acquisition harmonization
(UK Biobank). The results show that sex classification accuracy can
increase from 71–73% in the MDD and ASD datasets to 81% in
the UK Biobank. This suggests that our MDD classification result is
as good as can be obtained from heterogeneous retrospective
multicenter cohorts, although the higher sex classification accuracy
in the UK Biobank suggests that the accuracy for MDD may
improve if all data were collected on the same scanner. A
further limitation is that we analyzed FC as a stationary feature
even though it consists of dynamic changes in neural activity over
time, which may be important for the classification of MDD [71].
Finally, given the domain heterogeneity of psychiatric
disorders, classifying a disease based on one brain imaging
modality only is reductive. Integrative modeling of multimodal
data, such as molecular, genomic, clinical, medical imaging,
physiological signal and behavioral data, means comprehensively
considering different aspects of the disease, and is thus likely to enhance
the classification performance [13,72].
In conclusion, our study provides a realistic and possibly lower
bound estimate of the classification performance that can be
obtained with the application of FC on a large, ecologically valid,
multi-site sample of MDD patients. Our findings show that FC can
distinguish between MDD patients and HCs, but that it is not
sufficiently accurate for clinical use. Despite the low accuracy,
visualization of the DL classifier enabled important insights into
the neural basis of MDD, and revealed consistent and reproducible
thalamic hyperconnectivity as the most prominent
neurophysiological characteristic of MDD.
Deidentified and anonymized data were contributed from studies approved by local
Institutional Review Boards. All study participants provided written informed consent
at their local institution. Data of the PsyMRI project are available at
Data of the REST-meta-MDD project are available at:
Data that were generated and the graph convolutional models used for this study are
available on request to the corresponding author.
72. Yang J, Yin Y, Zhang Z, Long J, Dong J, Zhang Y, et al. Predictive brain networks for
major depression in a semi-multimodal fusion hierarchical feature reduction framework.
Neurosci Lett. 2018;665:163–9.
Nooshin Javaheripour3, Meng Li3, Lucie Bartova 4, Udo Dannlowski 6, Christopher Davey 7, Thomas Frodl 8,9, Ian Gotlib 10,
Simone Grimm11, Dominik Grotegerd6, Tim Hahn 6, Paul J. Hamilton 12, Ben J. Harrison7, Andreas Jansen13, Tilo Kircher13,
Bernhard Meyer4, Igor Nenadić13, Sebastian Olbrich14, Elisabeth Paul 12, Lukas Pezawas 4, Matthew D. Sacchet15,
Philipp Sämann 16, Gerd Wagner 3, Henrik Walter 17, Martin Walter 8,9 and Guido van Wingen1,2
