High-resolution image reconstruction with latent diffusion models from human brain activity
Yu Takagi 1,2,* and Shinji Nishimoto 1,2
* Corresponding author
1 Graduate School of Frontier Biosciences, Osaka University, Japan
2 CiNet, NICT, Japan
{takagi.yuu.fbs, nishimoto.shinji.fbs}@osaka-u.ac.jp
Figure 1. Presented images (red box, top row) and images reconstructed from fMRI signals (gray box, bottom row) for one subject (subj01).
Abstract
Reconstructing visual experiences from human brain activity offers a unique way to understand how the brain represents the world, and to interpret the connection between computer vision models and our visual system. While deep generative models have recently been employed for this task, reconstructing realistic images with high semantic fidelity is still a challenging problem. Here, we propose a new method based on a diffusion model (DM) to reconstruct images from human brain activity obtained via functional magnetic resonance imaging (fMRI). More specifically, we rely on a latent diffusion model (LDM) termed Stable Diffusion. This model reduces the computational cost of DMs, while preserving their high generative performance. We also characterize the inner mechanisms of the LDM by studying how its different components (such as the latent vector of image Z, conditioning inputs C, and different elements of the denoising U-Net) relate to distinct brain functions. We show that our proposed method can reconstruct high-resolution images with high fidelity in straightforward fashion, without the need for any additional training and fine-tuning of complex deep-learning models. We also provide a quantitative interpretation of different LDM components from a neuroscientific perspective. Overall, our study proposes a promising method for reconstructing images from human brain activity, and provides a new framework for understanding DMs. Please check out our webpage at this https URL.
1. Introduction
A fundamental goal of computer vision is to construct artificial systems that see and recognize the world as human visual systems do. Recent developments in the measurement of population brain activity, combined with advances in the implementation and design of deep neural network models, have allowed direct comparisons between latent representations in biological brains and architectural characteristics of artificial networks, providing important insights into how these systems operate [3, 8-10, 13, 18, 19, 21, 42, 43, 54, 55]. These efforts have included the reconstruction of visual experiences
(perception or imagery) from brain activity, and the examination of potential correspondences between the computational processes associated with biological and artificial systems [2, 5, 7, 24, 25, 27, 36, 44-46].
Reconstructing visual images from brain activity, such as that measured by functional Magnetic Resonance Imaging (fMRI), is an intriguing but challenging problem, because the underlying representations in the brain are largely unknown, and the sample size typically associated with brain data is relatively small [17, 26, 30, 32]. In recent years, researchers have started addressing this task using deep-learning models and algorithms, including generative adversarial networks (GANs) and self-supervised learning [2, 5, 7, 24, 25, 27, 36, 44-46]. Additionally, more recent studies have increased semantic fidelity by explicitly using the semantic content of images as auxiliary inputs for reconstruction [5, 25]. However, these studies require training new generative models with fMRI data from scratch, or fine-tuning toward the specific stimuli used in the fMRI experiment. These efforts have shown impressive but limited success in pixel-wise and semantic fidelity, partly because the number of samples in neuroscience is small, and partly because learning complex generative models poses numerous challenges.
Diffusion models (DMs) [11, 47, 48, 53] are deep generative models that have been gaining attention in recent years. DMs have achieved state-of-the-art performance in several tasks involving conditional image generation [4, 39, 49], image super-resolution [40], image colorization [38], and other related tasks [6, 16, 33, 41]. In addition, recently proposed latent diffusion models (LDMs) [37] have further reduced computational costs by utilizing the latent space generated by their autoencoding component, enabling more efficient computations in the training and inference phases. Another advantage of LDMs is their ability to generate high-resolution images with high semantic fidelity. However, because LDMs have been introduced only recently, we still lack a satisfactory understanding of their internal mechanisms. Specifically, we still need to discover how they represent latent signals within each layer of DMs, how the latent representation changes throughout the denoising process, and how adding noise affects conditional image generation.
Here, we attempt to tackle the above challenges by reconstructing visual images from fMRI signals using an LDM named Stable Diffusion. This architecture is trained on a large dataset and exhibits high text-to-image generative performance. We show that our simple framework can reconstruct high-resolution images with high semantic fidelity without any training or fine-tuning of complex deep-learning models. We also provide biological interpretations of each component of the LDM, including the forward/reverse diffusion processes, the U-Net, and latent representations with different noise levels.
Our contributions are as follows: (i) We demonstrate that our simple framework can reconstruct high-resolution (512 x 512) images from brain activity with high semantic fidelity, without the need for training or fine-tuning of complex deep generative models (Figure 1); (ii) We quantitatively interpret each component of an LDM from a neuroscience perspective, by mapping specific components to distinct brain regions; (iii) We present an objective interpretation of how the text-to-image conversion process implemented by an LDM incorporates the semantic information expressed by the conditional text, while at the same time maintaining the appearance of the original image.
2. Related Work
2.1. Reconstructing visual images from fMRI
Decoding visual experiences from fMRI activity has been studied in various modalities. Examples include explicitly presented visual stimuli [17, 26, 30, 32], the semantic content of the presented stimuli [15, 31, 52], imagined content [13, 29], perceived emotions [12, 20, 51], and many other related applications [14, 28]. In general, these decoding tasks are made difficult by the low signal-to-noise ratio and the relatively small sample size associated with fMRI data.
While early attempts used handcrafted features to reconstruct visual images from fMRI [17, 26, 30, 32], recent studies have begun to use deep generative models trained on a large number of naturalistic images [2, 5, 7, 24, 25, 27, 36, 44-46]. Additionally, a few studies have used semantic information associated with the images, including categorical or text information, to increase the semantic fidelity of the reconstructed images [5, 25]. To produce high-resolution reconstructions, these studies require training and possibly fine-tuning of generative models, such as GANs, with the same dataset used in the fMRI experiments. These requirements impose serious limitations, because training complex generative models is in general challenging, and the number of samples in neuroscience is relatively small. Thus, even modern implementations struggle to produce images (at most 256 x 256 resolution) with high semantic fidelity unless they are augmented with numerous tools and techniques. DMs and LDMs are recent algorithms for image generation that could potentially address these limitations, thanks to their ability to generate diverse high-resolution images with high semantic fidelity from text conditioning and with high computational efficiency. However, to the best of our knowledge, no prior studies have used DMs for visual reconstruction.
2.2. Encoding Models
To understand deep-learning models from a biological perspective, neuroscientists have employed encoding models: a predictive model of brain activity is built out of features extracted from different components of the deep-learning models, followed by examination of the potential link between model representations and corresponding brain processes [3, 8-10, 13, 18, 19, 21, 42, 43, 54, 55]. Because brains and deep-learning models share similar goals (e.g., recognition of the world) and thus could implement similar functions, the ability to establish connections between these two structures provides us with biological interpretations of the architecture underlying deep-learning models, otherwise viewed as black boxes. For example, the activation patterns observed within early and late layers of a CNN correspond to the neural activity patterns measured from early and late layers of visual cortex, suggesting the existence of a hierarchical correspondence between latent representations of a CNN and those present in the brain [9, 10, 13, 19, 54, 55]. This approach has been applied primarily to vision science, but it has recently been extended to other sensory modalities and higher functions [3, 8, 18, 21, 42, 43].
Compared with biologically inspired architectures such as CNNs, the correspondence between DMs and the brain is less obvious. By examining the relationship between each component and process of DMs and the corresponding brain activity, we were able to obtain biological interpretations of DMs, for example in terms of how latent vectors, denoising processes, conditioning operations, and U-Net components may correspond to our visual streams. To our knowledge, no prior study has investigated the relationship between DMs and the brain.
Together, our overarching goal is to use DMs for high-resolution visual reconstruction and to use a brain encoding framework to better understand the underlying mechanisms of DMs and their correspondence to the brain.
3. Methods
Figure 2 presents an overview of our methods.
3.1. Dataset
We used the Natural Scenes Dataset (NSD) for this project [1]. Please visit the NSD website (http://naturalscenesdataset.org/) for more details. Briefly, NSD provides data acquired from a 7-Tesla fMRI scanner over 30-40 sessions during which each subject viewed three repetitions of 10,000 images. We analyzed data for four of the eight subjects who completed all imaging sessions (subj01, subj02, subj05, and subj07). The images used in the NSD experiments were retrieved from MS COCO and cropped to 425 x 425 (if needed).
Figure 2. Overview of our methods. (Top) Schematic of the LDM used in this study. ε denotes an image encoder, D an image decoder, and τ a text encoder (CLIP). (Middle) Schematic of the decoding analysis. We decoded latent representations of the presented image (z) and associated text (c) from fMRI signals within early (blue) and higher (yellow) visual cortices, respectively. These latent representations were used as input to produce a reconstructed image Xzc. (Bottom) Schematic of the encoding analysis. We built encoding models to predict fMRI signals from different components of the LDM, including z, c, and zc.
We used 27,750 trials from NSD for each subject (2,250 trials out of the total 30,000 trials were not publicly released by NSD). For a subset of those trials (N = 2,770 trials), 982 images were viewed by all four subjects. Those trials were used as the test dataset, while the remaining trials (N = 24,980) were used as the training dataset.
For functional data, we used the preprocessed scans (resolution of 1.8 mm) provided by NSD. See Appendix A for details of the preprocessing protocol. We used single-trial
beta weights estimated from generalized linear models and regions of interest (ROIs) for early and higher (ventral) visual regions provided by NSD. For the test dataset, we used the average of the three trials associated with each image. For the training dataset, we used the three separate trials without averaging.
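As a concrete illustration of this split, the sketch below organizes single-trial responses into training and test sets, averaging the repetitions only for the shared test images. All array names, shapes, and the random placeholder data are hypothetical and stand in for the NSD beta weights; this is not NSD's official interface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arrays, one row per trial:
# betas: (n_trials, n_voxels) single-trial GLM beta weights
# image_ids: (n_trials,) index of the image shown on each trial
# is_shared: True for trials showing the 982 images viewed by all four subjects
betas = rng.standard_normal((27750, 2000)).astype(np.float32)
image_ids = rng.integers(0, 10000, size=27750)
shared_images = rng.choice(10000, 982, replace=False)
is_shared = np.isin(image_ids, shared_images)

# Training set: keep each repetition of an image as a separate sample.
train_X = betas[~is_shared]
train_ids = image_ids[~is_shared]

# Test set: average the repetitions of each shared image.
test_ids = np.unique(image_ids[is_shared])
test_X = np.stack([betas[image_ids == i].mean(axis=0) for i in test_ids])

print(train_X.shape, test_X.shape)
```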
3.2. Latent Diffusion Models
DMs are probabilistic generative models that restore a sampled variable from Gaussian noise to a sample of the learned data distribution via iterative denoising. Given training data, the diffusion process destroys the structure of the data by gradually adding Gaussian noise. The sample at each time point is defined as $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon_t$, where $x_t$ is a noisy version of the input $x_0$, $t \in \{1, \dots, T\}$, $\alpha_t$ is a hyperparameter, and $\epsilon_t$ is Gaussian noise. The inverse diffusion process is modeled by applying a neural network $f_\theta(x_t, t)$ to the samples at each step to recover the original input. The learning objective is $f_\theta(x_t, t) \simeq \epsilon_t$ [11, 47]. A U-Net is commonly used for the neural network $f_\theta$.
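A minimal numerical sketch of the closed-form forward (noising) step and the epsilon-prediction objective is given below. The linear beta schedule is assumed for illustration only; Stable Diffusion uses its own schedule and a trained U-Net rather than the placeholder arrays used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear beta schedule and its cumulative product alpha_bar.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward diffusion: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((4, 64, 64))   # stand-in for a latent representation
eps = rng.standard_normal(x0.shape)     # Gaussian noise
x_t = q_sample(x0, t=500, eps=eps)

# The denoising network f_theta(x_t, t) is trained so that its output
# approximates eps, i.e. the loss is || f_theta(x_t, t) - eps ||^2.
print(x_t.shape)
```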
This method can be generalized to learning conditional distributions by feeding an auxiliary input c into the neural network. If we set c to the latent representation of a text sequence, the model can implement text-to-image generation. Recent studies have shown that, by using large language and image models, DMs can create realistic, high-resolution images from text inputs. Furthermore, when we start from a source image together with an input text, we can generate new text-conditioned images by editing that image. In this image-to-image translation, the degree of degradation from the original image is controlled by a parameter that can be adjusted to preserve either the semantic content or the appearance of the original image.
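One simple way to see how a single degradation parameter trades off appearance against semantics is to look at how much of the original latent survives the forward process as a function of the noising step. The mapping of a "strength" value to a step index below is an assumption made for illustration (image-to-image pipelines differ in the exact mapping); it reuses the closed-form expression from the previous sketch.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed schedule, as above
alpha_bar = np.cumprod(1.0 - betas)

# Assumption: map a strength in [0, 1] to the forward-diffusion step at which
# the source latent is injected before denoising begins.
for strength in (0.2, 0.5, 0.8):
    t = int(strength * (T - 1))
    signal = np.sqrt(alpha_bar[t])       # weight on the original latent
    noise = np.sqrt(1.0 - alpha_bar[t])  # weight on the injected Gaussian noise
    print(f"strength={strength:.1f}  t={t:4d}  signal={signal:.3f}  noise={noise:.3f}")
```

Larger strength leaves less of the original latent intact, so the conditioning text has more influence on the final image.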
DMs that operate in pixel space are computationally expensive. LDMs overcome this limitation by compressing the input using an autoencoder (Figure 2, top). Specifically, the autoencoder is first trained with image data, and the diffusion model is then trained to generate its latent representation z using a U-Net architecture. In doing so, it refers to conditional inputs via cross-attention. This allows for lightweight inference compared with pixel-based DMs, and for very high-quality text-to-image and image-to-image implementations.
In this study, we used an LDM called Stable Diffusion, which was built on LDMs and trained on a very large dataset. The model can generate and modify images based on text input. Text input is projected to a fixed latent representation by a pretrained text encoder (CLIP) [34]. We used version 1.4 of the model. See Appendix A for details on the training protocol.
We define z as the latent representation of the original image compressed by the autoencoder, c as the latent representation of the text (the average of the five text annotations associated with each MS COCO image), and zc as the generated latent representation of z modified by the model with c. We used these representations in the decoding/encoding models described below.
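A sketch of how an averaged text representation c could be computed with the CLIP text encoder used by Stable Diffusion v1.x (assumed here to be openai/clip-vit-large-patch14) is shown below. The five captions are hypothetical stand-ins for the MS COCO annotations, and the exact pooling used in the paper is described in its appendix.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

captions = [
    "a cat sitting on a wooden table",
    "a small cat rests on a table indoors",
    "a kitten on top of a table",
    "a cat on a table next to a window",
    "a gray cat sitting on furniture",
]  # hypothetical five COCO captions for one image

with torch.no_grad():
    tokens = tokenizer(captions, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    # (5, 77, 768) token-level embeddings used as conditioning
    embeddings = text_encoder(**tokens).last_hidden_state

c = embeddings.mean(dim=0)  # average over the five captions -> (77, 768)
print(c.shape)
```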
3.3. Decoding: reconstructing images from fMRI
We performed visual reconstruction from fMRI signals using the LDM in three simple steps, as follows (Figure 2, middle). The only training required in our method is to construct linear models that map fMRI signals to each LDM component; no training or fine-tuning of deep-learning models is needed. We used the default parameters of the image-to-image and text-to-image code provided by the authors of LDM (https://github.com/CompVis/stable-diffusion/blob/main/scripts/), including the parameters used for the DDIM sampler. See Appendix A for details.
(i) First, we predicted a latent representation z of the presented image X from fMRI signals within early visual cortex. z was then processed by the decoder of the autoencoder to produce a coarse decoded image Xz with a size of 320 x 320, which was then resized to 512 x 512.
(ii) Xz was then processed by the encoder of the autoencoder, and noise was added through the diffusion process.
(iii) We decoded latent text representations c from fMRI signals within higher (ventral) visual cortex. The noise-added latent representation zT of the coarse image and the decoded c were used as input to the denoising U-Net to produce zc. Finally, zc was used as input to the decoding module of the autoencoder to produce a final reconstructed image Xzc with a size of 512 x 512.
To construct models from fMRI to the components of the LDM, we used L2-regularized linear regression, and all models were built on a per-subject basis. Weights were estimated from the training data, and regularization parameters were explored during training using 5-fold cross-validation. We resized the original images from 425 x 425 to 320 x 320, but confirmed that resizing them to a larger size (448 x 448) does not affect the quality of reconstruction.
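The only training step in this pipeline can be sketched as cross-validated ridge regression from voxel patterns to a flattened LDM latent. The data below are random placeholders with deliberately small shapes; the actual regression settings (alpha grid, exact cross-validation scheme) follow the paper's appendix, not this sketch.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Placeholder data (real sizes: ~24,980 training trials, thousands of voxels,
# and a 6,400-dimensional flattened latent z).
train_fmri = rng.standard_normal((2000, 500)).astype(np.float32)
train_z = rng.standard_normal((2000, 640)).astype(np.float32)
test_fmri = rng.standard_normal((100, 500)).astype(np.float32)

# L2-regularized linear regression; the regularization strength is chosen by
# cross-validation on the training set (the paper uses 5-fold CV).
model = RidgeCV(alphas=np.logspace(1, 5, 9), cv=5)
model.fit(train_fmri, train_z)

# Predicted latent for each test image, to be decoded by the LDM autoencoder.
pred_z = model.predict(test_fmri)
print(pred_z.shape)
```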
As control analyses, we also generated images using only z or only c. To generate these control images, we simply omitted c or z from step (iii) above, respectively.
The accuracy of image reconstruction was evaluated objectively (perceptual similarity metrics, PSMs) and subjectively (human raters, N = 6) by assessing whether the original test images (N = 982 images) could be identified from the generated images. As similarity metrics for PSMs, we used early/middle/late layers of CLIP and a CNN (AlexNet) [22]. Briefly, we conducted two-way identification experiments: we examined whether the image reconstructed from fMRI was more similar to the corresponding original image than a randomly picked reconstructed image was. See Appendix B for details and additional results.
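The comparison logic of this two-way identification test can be sketched as follows. Feature vectors here are random placeholders standing in for a flattened CLIP or AlexNet layer; the exact similarity measure and number of random comparisons follow the paper's appendix, not this sketch.

```python
import numpy as np

def two_way_identification(orig_feats, recon_feats, n_repeats=200, seed=0):
    """For each original image, test whether its own reconstruction is more
    similar (Pearson r) to it than a randomly picked other reconstruction."""
    rng = np.random.default_rng(seed)
    n = len(orig_feats)

    def corr(a, b):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float(np.mean(a * b))

    wins, total = 0, 0
    for i in range(n):
        r_true = corr(orig_feats[i], recon_feats[i])
        for _ in range(n_repeats):
            j = rng.integers(n)
            if j == i:
                continue
            wins += r_true > corr(orig_feats[i], recon_feats[j])
            total += 1
    return wins / total  # chance level = 0.5

# Placeholder features, e.g. one flattened CLIP/AlexNet layer per image.
rng = np.random.default_rng(1)
orig = rng.standard_normal((50, 256))
recon = orig + 0.8 * rng.standard_normal((50, 256))  # noisy "reconstructions"
print(two_way_identification(orig, recon))
```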
3.4. Encoding: Whole-brain Voxel-wise Modeling
Next, we tried to interpret the internal operations of LDMs by mapping them to brain activity. For this purpose, we constructed whole-brain voxel-wise encoding models for the following four settings (see Figure 2, bottom, and Appendix A for implementation details):
(i) We first built linear models to predict voxel activity from the following three latent representations of the LDM independently: z, c, and zc.
(ii) Although zc and z produce different images, they result in similar prediction maps on the cortex (see 4.2.1). Therefore, we incorporated them into a single model, and further examined how they differ by mapping the unique variance explained by each feature onto the cortex [23] (a simplified sketch of this variance partitioning follows this list). To control the balance between the appearance of the original image and the semantic fidelity of the conditional text, we varied the level of noise added to z. This analysis enabled a quantitative interpretation of the image-to-image process.
(iii) While LDMs are characterized as an iterative denoising process, the internal dynamics of the denoising process are poorly understood. To gain some insight into this process, we examined how zc changes through the denoising process. To do so, we extracted zc from the early, middle, and late steps of the denoising. We then constructed combined models with z as in analysis (ii) above, and mapped their unique variance onto the cortex.
(iv) Finally, to inspect the last black box associated with LDMs, we extracted features from different layers of the U-Net. For different steps of the denoising process, encoding models were constructed independently with different U-Net layers: two from the first stage, one from the bottleneck stage, and two from the second stage. We then identified the layer with the highest accuracy for each voxel and for each step.
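The variance partitioning used in (ii) and (iii) can be sketched in simplified form: fit the combined model and each single-feature model, and take each feature's unique contribution as the drop in held-out R² when it is removed. The paper itself uses banded ridge regression [23] rather than this plain-ridge simplification, and all data below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Placeholder feature spaces (z and zc) and voxel responses.
n_train, n_test, n_vox = 2000, 200, 300
z_feat = rng.standard_normal((n_train + n_test, 100))
zc_feat = 0.6 * z_feat + rng.standard_normal((n_train + n_test, 100))
y = (z_feat @ rng.standard_normal((100, n_vox)) * 0.05
     + zc_feat @ rng.standard_normal((100, n_vox)) * 0.05
     + rng.standard_normal((n_train + n_test, n_vox)))

def fit_r2(X):
    """Fit ridge on the training split and return per-voxel test-set R^2."""
    model = Ridge(alpha=100.0).fit(X[:n_train], y[:n_train])
    pred = model.predict(X[n_train:])
    return r2_score(y[n_train:], pred, multioutput="raw_values")

r2_full = fit_r2(np.hstack([z_feat, zc_feat]))
r2_z_only = fit_r2(z_feat)
r2_zc_only = fit_r2(zc_feat)

unique_zc = r2_full - r2_z_only   # variance only zc can explain, per voxel
unique_z = r2_full - r2_zc_only   # variance only z can explain, per voxel
print(unique_zc.mean(), unique_z.mean())
```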
Model weights were estimated from the training data using L2-regularized linear regression, and subsequently applied to the test data (see Appendix A for details). For evaluation, we used Pearson's correlation coefficients between predicted and measured fMRI signals. We computed statistical significance (one-sided) by comparing the estimated correlations to the null distribution of correlations between two independent Gaussian random vectors of the same length (N = 982). The statistical threshold was set at P < 0.05 and corrected for multiple comparisons using the FDR procedure. We show results from a single random seed, but we verified that different random seeds produced nearly identical results (see Appendix C). We reduced all feature dimensions to 6,400 by applying principal component analysis, estimating the components within the training data.
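The significance procedure can be sketched as follows: build a null distribution of Pearson correlations between independent Gaussian vectors of length 982, convert observed voxel-wise accuracies to one-sided p-values, and apply Benjamini-Hochberg FDR correction. The number of null samples and the placeholder accuracies are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Null distribution: correlation between two independent Gaussian vectors
# of the same length as the test set (N = 982).
n, n_null = 982, 10000
null_r = np.array([np.corrcoef(rng.standard_normal(n),
                               rng.standard_normal(n))[0, 1]
                   for _ in range(n_null)])

def p_one_sided(r):
    """One-sided p-value: fraction of null correlations >= the observed r."""
    return (np.sum(null_r >= r) + 1) / (len(null_r) + 1)

# Placeholder observed voxel-wise prediction accuracies.
observed_r = rng.uniform(-0.05, 0.4, size=5000)
pvals = np.array([p_one_sided(r) for r in observed_r])

# FDR correction (Benjamini-Hochberg) at q = 0.05.
significant, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(significant.sum(), "of", len(observed_r), "voxels significant")
```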
Figure 3. Presented (red box) and reconstructed images for a single subject (subj01) using z, c, and zc.
4. Results
4.1. Decoding
Figure 3 shows the results of visual reconstruction for one subject (subj01). We generated five images for each test image and selected the generated images with the highest PSMs. On the one hand, images reconstructed using only z were visually consistent with the original images, but failed to capture their semantic content. On the other hand, images reconstructed using only c showed high semantic fidelity but were visually inconsistent. Finally, images reconstructed using zc achieved both high resolution and high semantic fidelity (see Appendix B for more examples).
Figure 4 shows reconstructed images from all subjects for the same image (all images were generated using zc; other examples are available in Appendix B). Overall, reconstruction quality was stable and accurate across subjects.
We note that the lack of agreement regarding specific details of the reconstructed images may reflect differences in perceived experience across subjects, rather than failures of reconstruction. Alternatively, it may simply reflect differences in data quality among subjects. Indeed, the subjects with high (subj01) and low (subj07) decoding accuracy from fMRI were the subjects with high and low data quality metrics, respectively (see Appendix B).
Figure 5 plots results for the quantitative evaluation. In
Figure 4. Example results for all four subjects.
Figure 5. Identification accuracy calculated using objective (left)
and subjective (right) criteria (pooled across four subjects; chance
level corresponds to 50%). Error bars indicate standard error of
the mean.
the objective evaluation, images reconstructed using zc are generally associated with higher accuracy values across different metrics than images reconstructed using only z or c. When only z was used, accuracy values were particularly high for PSMs derived from early layers of CLIP and the CNN. On the other hand, when only c was used, accuracy values were higher for PSMs derived from late layers. In the subjective evaluation, accuracy values of images obtained from c are higher than those obtained from z, while zc resulted in the highest accuracy compared with the other two methods (P < 0.01 for all comparisons, two-sided signed-rank test, FWE corrected). Together, these results suggest that our method captures not only low-level visual appearance, but also high-level semantic content of the original stimuli.
It is difficult to compare our results with those reported by most previous studies, because they used different datasets. The datasets used in previous studies contain far fewer images, much less image complexity (typically individual objects positioned in the center of the image), and lack the full-text annotations available from NSD. Only one study to date [25] used NSD for visual reconstruction, and they reported accuracy values of 78 ± 4.5% for one subject (subj01) using a PSM based on Inception V3. It is difficult to draw a direct comparison with this study, because it differed from ours in several respects (for example, it used different training and test sample sizes, and different image resolutions). Notwithstanding these differences, their reported values fall within a similar range to ours for the same subject (77% using CLIP, 83% using AlexNet, and 76% using Inception V3). However, this prior study relied on extensive model training and feature engineering with many more hyperparameters than those adopted in our study, including the necessity to train complex generative models, fine-tuning toward MS COCO, data augmentation, and arbitrary thresholding of features. We did not use any of the above techniques; rather, our simple pipeline only requires the construction of two linear regression models from fMRI activity to latent representations of the LDM.
Furthermore, we observed a reduction in semantic fidelity when we used categorical information associated with the images, rather than full-text annotations, for c. We also found an increase in semantic fidelity when we used semantic maps instead of the original images for z, though visual similarity was decreased in this case (see Appendix B).
4.2. Encoding Model
4.2.1 Comparison among Latent Representations
Figure 6 shows the prediction accuracy of the encoding models for three types of latent representations associated with the LDM: z, a latent representation of the original image; c, a latent representation of the image's text annotation; and zc, a noise-added latent representation of z after the reverse diffusion process with cross-attention to c.
Although all three components produced high prediction performance at the back of the brain (visual cortex), they showed a stark contrast. Specifically, z produced high prediction performance in the posterior part of visual cortex, namely early visual cortex. It also showed significant prediction values in the anterior part of visual cortex, namely higher visual cortex, but smaller values in other regions. On the other hand, c produced the highest prediction performance in higher visual cortex. The model also showed high prediction performance across a wide span of cortex. zc carries a representation that is very similar to z, showing high prediction performance for early visual cortex. Although this is somewhat predictable given their intrinsic similarity, it is nevertheless intriguing because these representations correspond to visually different generated images.
Figure 6. Prediction performance (measured using Pearson's correlation coefficients) for the voxel-wise encoding model applied to held-out test images in a single subject (subj01), projected onto the inflated (top, lateral and medial views) and flattened cortical surface (bottom, occipital areas are at the center), for both left and right hemispheres. Brain regions with significant accuracy are colored (all colored voxels P < 0.05, FDR corrected).
We also observed that using zc with a reduced noise level injected into z produces a prediction map more similar to that obtained from z, as expected (see Appendix C). This similarity prompted us to conduct an additional analysis comparing the unique variance explained by these two models, detailed in the following section. See Appendix C for the results of all subjects.
4.2.2 Comparison across different noise levels
While the previous results showed that the prediction accuracy maps for z and zc present similar profiles, they do not tell us how much unique variance is explained by each feature as a function of the noise level. To enhance our understanding of this issue, we next constructed encoding models that simultaneously incorporated both z and zc into a single model, and studied the unique contribution of each feature. We also varied the level of noise added to z for generating zc.
Figure 7 shows that, when a small amount of noise was added, z predicted voxel activity better than zc across cortex. Interestingly, when we increased the level of noise, zc predicted voxel activity within higher visual cortex better than z, indicating that the semantic content of the image was gradually emphasized.
This result is intriguing because, without analyses like this, we can only observe randomly generated images, and we cannot examine how the text-conditioned image-to-image process balances semantic content against the original visual appearance.
4.2.3 Comparison across different diffusion stages
We next asked how the noise-added latent representation changes over the iterative denoising process.
Figure 8 shows that, during the early stages of the denoising process, z signals dominated the prediction of fMRI signals.
Figure 7. Unique variance accounted for by zc compared with z in one subject (subj01), obtained by splitting accuracy values from the combined model. While fixing z, we used zc with varying amounts of noise added to the latent representation of the stimuli, from low (top) to high (bottom) noise levels. All colored voxels P < 0.05, FDR corrected.
During the middle step of the denoising process, zc predicted activity within higher visual cortex much better than z, indicating that the bulk of the semantic content emerges at this stage. These results show how the LDM refines and generates images from noise.
Figure 8. Unique variance accounted for by zc compared with z in one subject (subj01), obtained by splitting accuracy values from the combined model. While fixing z, we used zc from different denoising stages (0%, 44%, and 86% progress), from early (top) to late (bottom) steps. All colored voxels P < 0.05, FDR corrected.
4.2.4 Comparison across different U-Net Layers
Finally, we asked what information is being processed at
each layer of U-Net.
Figure 9 shows the results of encoding models for different steps of the denoising process (early, middle, late) and for the different layers of the U-Net. During the early phase of the denoising process, the bottleneck layer of the U-Net (colored orange) produces the highest prediction performance across cortex. However, as denoising progresses, the early layer of the U-Net (colored blue) predicts activity within early visual cortex, and the bottleneck layer shifts toward superior predictive power for higher visual cortex.
These results suggest that, at the beginning of the reverse diffusion process, image information is compressed within the bottleneck layer. As denoising progresses, a functional dissociation among U-Net layers emerges within visual cortex: i.e., the first layer tends to represent fine-scale details in early visual areas, while the bottleneck layer corresponds to higher-order information in more ventral, semantic areas.
Figure 9. Selective engagement of different U-Net layers for different voxels across the brain. Colors represent the most predictive U-Net layer, for early (top, 20% progress) to late (bottom, 100% progress) denoising steps. All colored voxels P < 0.05, FDR corrected.
5. Conclusions
We propose a novel visual reconstruction method using LDMs. We show that our method can reconstruct high-resolution images with high semantic fidelity from human brain activity. Unlike previous studies of image reconstruction, our method does not require training or fine-tuning of complex deep-learning models: it only requires simple linear mappings from fMRI to latent representations within LDMs.
We also provide a quantitative interpretation of the internal components of the LDM by building encoding models. For example, we demonstrate the emergence of semantic content throughout the inverse diffusion process, we perform a layer-wise characterization of the U-Net, and we provide a quantitative interpretation of image-to-image transformations with different noise levels. Although DMs are developing rapidly, their internal processes remain poorly understood. This study is the first to provide a quantitative interpretation from a biological perspective.
Acknowledgements
We would like to thank Stability AI for providing the code and models for Stable Diffusion, and NSD for providing the neuroimaging dataset. YT was supported
by JSPS KAKENHI (19H05725). SN was supported
by MEXT/JSPS KAKENHI JP18H05522 as well as JST
CREST JPMJCR18A5 and ERATO JPMJER1801.
References
[1] Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L.
Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias
Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin
Hutchinson, Thomas Naselaris, and Kendrick Kay. A mas-
sive 7t fmri dataset to bridge cognitive neuroscience and ar-
tificial intelligence. Nature Neuroscience, 25:116–126, 1
2022. 3
[2] Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini,
Tal Golan, and Michal Irani. From voxels to pixels and
back: Self-supervision in natural-image reconstruction from
fmri. Advances in Neural Information Processing Systems,
32, 2019. 2
[3] Charlotte Caucheteux and Jean-Rémi King. Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1):1–10, 2022. 1, 3
[4] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. Advances in Neural Informa-
tion Processing Systems, 34:8780–8794, 2021. 2
[5] Tao Fang, Yu Qi, and Gang Pan. Reconstructing perceptive
images from brain activity by shape-semantic gan. Advances
in Neural Information Processing Systems, 33:13038–13048,
2020. 2
[6] Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan
Shelhamer, and Dequan Wang. Back to the source:
Diffusion-driven test-time adaptation. arXiv preprint
arXiv:2207.03442, 2022. 2
[7] Guy Gaziv, Roman Beliy, Niv Granot, Assaf Hoogi,
Francesca Strappini, Tal Golan, and Michal Irani. Self-
supervised natural image reconstruction and large-scale se-
mantic classification from brain activity. NeuroImage, 254,
7 2022. 2
[8] Ariel Goldstein, Zaid Zada, Eliav Buchnik, Mariano Schain,
Amy Price, Bobbi Aubrey, Samuel A Nastase, Amir Feder,
Dotan Emanuel, Alon Cohen, et al. Shared computational
principles for language processing in humans and deep lan-
guage models. Nature neuroscience, 25(3):369–380, 2022.
1,3
[9] Iris IA Groen, Michelle R Greene, Christopher Baldassano,
Li Fei-Fei, Diane M Beck, and Chris I Baker. Distinct con-
tributions of functional and deep neural network features to
representational similarity of scenes in human brain and be-
havior. Elife, 7, 2018. 1,3
[10] Umut Güçlü and Marcel AJ van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015. 1, 3
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
sion probabilistic models. Advances in Neural Information
Processing Systems, 33:6840–6851, 2020. 2,4
[12] Tomoyasu Horikawa, Alan S Cowen, Dacher Keltner, and
Yukiyasu Kamitani. The neural representation of visu-
ally evoked emotion is high-dimensional, categorical, and
distributed across transmodal brain regions. Iscience,
23(5):101060, 2020. 2
[13] Tomoyasu Horikawa and Yukiyasu Kamitani. Generic de-
coding of seen and imagined objects using hierarchical vi-
sual features. Nature communications, 8(1):1–15, 2017. 1,
2,3
[14] Tomoyasu Horikawa, Masako Tamaki, Yoichi Miyawaki,
and Yukiyasu Kamitani. Neural decoding of visual imagery
during sleep. Science, 340(6132):639–642, 2013. 2
[15] Alexander G Huth, Tyler Lee, Shinji Nishimoto, Natalia Y
Bilenko, An T Vu, and Jack L Gallant. Decoding the se-
mantic content of natural movies from human brain activity.
Frontiers in systems neuroscience, 10:81, 2016. 2
[16] Bahjat Kawar, Jiaming Song, Stefano Ermon, and Michael
Elad. Jpeg artifact correction using denoising diffusion
restoration models. arXiv preprint arXiv:2209.11888, 2022.
2
[17] Kendrick N Kay, Thomas Naselaris, Ryan J Prenger, and
Jack L Gallant. Identifying natural images from human brain
activity. Nature, 452(7185):352–355, 2008. 2
[18] Alexander JE Kell, Daniel LK Yamins, Erica N Shook,
Sam V Norman-Haignere, and Josh H McDermott. A task-
optimized neural network replicates human auditory behav-
ior, predicts brain responses, and reveals a cortical process-
ing hierarchy. Neuron, 98(3):630–644, 2018. 1,3
[19] Tim C Kietzmann, Courtney J Spoerer, Lynn KA Sörensen, Radoslaw M Cichy, Olaf Hauk, and Nikolaus Kriegeskorte. Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences, 116(43):21854–21863, 2019. 1, 3
[20] Naoko Koide-Majima, Tomoya Nakai, and Shinji Nishi-
moto. Distinct dimensions of emotion in the human brain
and their representation on the cortical surface. NeuroImage,
222:117258, 2020. 2
[21] Takuya Koumura, Hiroki Terashima, and Shigeto Furukawa.
Cascaded tuning to amplitude modulation for natural sound
recognition. Journal of Neuroscience, 39(28):5517–5533,
2019. 1,3
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural net-
works. Communications of the ACM, 60(6):84–90, 2017. 4,
13
[23] T. D. la Tour, M. Eickenberg, A. O. Nunez-Elizalde, and J. L.
Gallant. Feature-space selection with banded ridge regres-
sion. NeuroImage, page 119728, Nov 2022. 5
[24] Lynn Le, Luca Ambrogioni, Katja Seeliger, Yağmur Güçlütürk, Marcel van Gerven, and Umut Güçlü. Brain2pix: Fully convolutional naturalistic video reconstruction from brain activity. bioRxiv, 2021. 2
[25] Sikun Lin, Thomas Sprague, and Ambuj K Singh. Mind reader: Reconstructing complex images from brain activities. Advances in Neural Information Processing Systems, 2022. 2, 6
[26] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C. Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from
human brain activity using a combination of multiscale local
image decoders. Neuron, 60:915–929, 12 2008. 2
[27] Milad Mozafari, Leila Reddy, and Rufin VanRullen. Recon-
structing natural scenes from fMRI patterns using bigbigan.
In 2020 International joint conference on neural networks
(IJCNN), pages 1–8. IEEE, 2020. 2
[28] Tomoya Nakai and Shinji Nishimoto. Quantitative models
reveal the organization of diverse cognitive functions in the
brain. Nature communications, 11(1):1–12, 2020. 2
[29] Thomas Naselaris, Cheryl A Olman, Dustin E Stansbury,
Kamil Ugurbil, and Jack L Gallant. A voxel-wise encod-
ing model for early visual areas decodes mental images of
remembered scenes. Neuroimage, 105:215–228, 2015. 2
[30] Thomas Naselaris, Ryan J Prenger, Kendrick N Kay, Michael
Oliver, and Jack L Gallant. Bayesian reconstruction of nat-
ural images from human brain activity. Neuron, 63(6):902–
915, 2009. 2
[31] Satoshi Nishida and Shinji Nishimoto. Decoding naturalistic
experiences from human brain activity via distributed repre-
sentations of words. Neuroimage, 180:232–242, 2018. 2
[32] Shinji Nishimoto, An T. Vu, Thomas Naselaris, Yuval Ben-
jamini, Bin Yu, and Jack L. Gallant. Reconstructing visual
experiences from brain activity evoked by natural movies.
Current Biology, 21:1641–1646, 10 2011. 2
[33] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima
Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion prob-
abilistic model for text-to-speech. In International Confer-
ence on Machine Learning, pages 8599–8608. PMLR, 2021.
2
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
ing transferable visual models from natural language super-
vision. In International Conference on Machine Learning,
pages 8748–8763. PMLR, 2021. 4
[35] Zarina Rakhimberdina, Quentin Jodelet, Xin Liu, and
Tsuyoshi Murata. Natural image reconstruction from fmri
using deep learning: A survey. Frontiers in Neuroscience,
15, 2021. 13
[36] Ziqi Ren, Jie Li, Xuetong Xue, Xin Li, Fan Yang, Zhicheng
Jiao, and Xinbo Gao. Reconstructing seen image from brain
activity by visually-guided cognitive representation and ad-
versarial learning. NeuroImage, 228:117602, 2021. 2
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 2
[38] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee,
Jonathan Ho, Tim Salimans, David Fleet, and Mohammad
Norouzi. Palette: Image-to-image diffusion models. In
ACM SIGGRAPH 2022 Conference Proceedings, pages 1–
10, 2022. 2
[39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi,
Rapha Gontijo Lopes, et al. Photorealistic text-to-image dif-
fusion models with deep language understanding. Advances
in Neural Information Processing Systems, 2022. 2
[40] Chitwan Saharia, Jonathan Ho, William Chan, Tim Sali-
mans, David J Fleet, and Mohammad Norouzi. Image super-
resolution via iterative refinement. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2022. 2
[41] Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon.
Unit-ddpm: Unpaired image translation with denois-
ing diffusion probabilistic models. arXiv preprint
arXiv:2104.05358, 2021. 2
[42] Lea-Maria Schmitt, Julia Erb, Sarah Tune, Anna U Rysop,
Gesa Hartwigsen, and Jonas Obleser. Predicting speech from
a cortical hierarchy of event-based time scales. Science Ad-
vances, 7(49):eabi6070, 2021. 1,3
[43] Martin Schrimpf, Idan Asher Blank, Greta Tuckute, Ca-
rina Kauf, Eghbal A Hosseini, Nancy Kanwisher, Joshua B
Tenenbaum, and Evelina Fedorenko. The neural architecture
of language: Integrative modeling converges on predictive
processing. Proceedings of the National Academy of Sci-
ences, 118(45):e2105646118, 2021. 1,3
[44] Katja Seeliger, Umut Güçlü, Luca Ambrogioni, Yağmur Güçlütürk, and Marcel AJ van Gerven. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181:775–785, 2018. 2
[45] Guohua Shen, Kshitij Dwivedi, Kei Majima, Tomoyasu
Horikawa, and Yukiyasu Kamitani. End-to-end deep image
reconstruction from human brain activity. Frontiers in Com-
putational Neuroscience, page 21, 2019. 2
[46] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and
Yukiyasu Kamitani. Deep image reconstruction from human
brain activity. PLoS Computational Biology, 15, 2019. 2
[47] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In International Confer-
ence on Machine Learning, pages 2256–2265. PMLR, 2015.
2,4
[48] Yang Song and Stefano Ermon. Generative modeling by esti-
mating gradients of the data distribution. Advances in Neural
Information Processing Systems, 32, 2019. 2
[49] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab-
hishek Kumar, Stefano Ermon, and Ben Poole. Score-based
generative modeling through stochastic differential equa-
tions. International Conference on Learning Representa-
tions, 2020. 2
[50] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon
Shlens, and Zbigniew Wojna. Rethinking the inception archi-
tecture for computer vision. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
2818–2826, 2016. 13
[51] Yu Takagi, Yuki Sakai, Yoshinari Abe, Seiji Nishida, Ben J Harrison, Ignacio Martínez-Zalacaín, Carles Soriano-Mas, Jin Narumoto, and Saori C Tanaka. A common brain network among state, trait, and pathological anxiety from whole-brain functional connectivity. NeuroImage, 172:506–516, 2018. 2
[52] Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G
Huth. Semantic reconstruction of continuous language from
non-invasive brain recordings. bioRxiv, 2022. 2
[53] Pascal Vincent. A connection between score matching and
denoising autoencoders. Neural computation, 23(7):1661–
1674, 2011. 2
[54] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Ji-
ayue Cao, and Zhongming Liu. Neural encoding and decod-
ing with deep learning for dynamic natural vision. Cerebral
cortex, 28(12):4136–4160, 2018. 1,3
[55] Daniel LK Yamins, Ha Hong, Charles F Cadieu,
Ethan A Solomon, Darren Seibert, and James J DiCarlo.
Performance-optimized hierarchical models predict neural
responses in higher visual cortex. Proceedings of the na-
tional academy of sciences, 111(23):8619–8624, 2014. 1,
3
... In the early stages of the research, DDPM performed the generation process in pixel space, resulting in high computational complexity and slow sampling speed, making them less prominent than GANs due to low generation speed. To overcome these two challenges, Latent Diffusion Model (LDM) [8] and progressive distillation [9] are common methods often emplyed. LDM transforms the diffusion model from pixel space to latent space [10], reducing the dimension of the intermediate process by lossy compression [11]. ...
... Considering the inherent issue of slow convergence [13] in diffusion Fig. 1 Comparison of Generated Images. Upper Section: LDM [8] was utilized for image sampling with a compression factor of 8, class label guidance [14] was incorporated (Guidance Scale = 7.5), and DDIM [15] sampling with 50 time steps was employed. However, noticeable distortions were observed in the generated images. ...
... 1) The introduction of lossy compression methods has made it difficult for diffusion models based on latent spaces to ensure the quality of details while aiming to improve train-ing efficiency and generalization. This challenge arises from high-dimensional latent variables, which typically contain more abstract and elusive image features [8], while lossy compression methods can result in feature omission [16]. 2) In addition to the lossy compression issue, most current diffusion models still use Mean Squared Error (MSE) [1,17] as their loss function. ...
Article
Full-text available
Diffusion models have achieved remarkable results in image generation. However, due to the slow convergence speed, room for enhancement remains in existing loss weight strategies. In one aspect, the predefined loss weight strategy based on signal-to-noise ratio (SNR) transforms the diffusion process into a multi-objective optimization problem. However, it takes a long time to reach the Pareto optimal. In contrast, the unconstrained optimization weight strategy can achieve lower objective values, but the loss weights of each task change unstably, resulting in low training efficiency. In addition, the imbalance of lossy compression and semantic information in latent space diffusion also leads to missing image details. To solve these problems, a new loss weight strategy combining the advantages of predefined and learnable loss weights is proposed, effectively balancing the gradient conflict of multi-objective optimization. A high-dimensional multi-space diffusion method called Multi-Space Diffusion is also introduced, and a loss function that considers both structural information and robustness is designed to achieve a good balance between lossy compression and fidelity. The experimental results indicate that the proposed model and strategy significantly enhance the convergence speed, being 3.7 times faster than the Const strategy, and achieve an advanced FID = 3.35 score on the ImageNet512.
... Additionally, sparse linear regression has been able to predict CNN features for natural images from fMRI data [5]. Recently, diffusion models, noted for their excellent image generation abilities, have become integral to decoding, often employing semantic techniques and multi-step decoding processes [6,7,8,9,10]. ...
... The fMRI module's decoding performance was particularly notable, achieving a CLIP 2way accuracy of 93% with the NSD dataset. Our results align well and even outperform findings from recent literature [6,10,9,7,24], though direct comparisons are challenging due to the varied focus and methodologies of these studies. Many of these works concentrate on the detailed reconstruction of stimuli using complex pipelines that involve regressing fMRI data to a latent space, generating images, and then computing CLIP 2-way accuracy between the generated and actual images. ...
Preprint
Full-text available
This paper presents a novel approach towards creating a foundational model for aligning neural data and visual stimuli across multimodal representationsof brain activity by leveraging contrastive learning. We used electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI) data. Our framework's capabilities are demonstrated through three key experiments: decoding visual information from neural data, encoding images into neural representations, and converting between neural modalities. The results highlight the model's ability to accurately capture semantic information across different brain imaging techniques, illustrating its potential in decoding, encoding, and modality conversion tasks.
... [9] uses conditioned GAN to reconstruct images that are consistent with groundtruth in terms of semantic meanings. [13] apply diffusion model guided by semantic information of image content to reconstruct image from fMRI. [1] manages to reconstruct video from fMRI by using multimodal alignment to extract semantically rich representations as guidance for diffusionbased decoding. ...
Preprint
Full-text available
Decoding visual-semantic information from brain signals, such as functional MRI (fMRI), across different subjects poses significant challenges, including low signal-to-noise ratio, limited data availability, and cross-subject variability. Recent advancements in large language models (LLMs) show remarkable effectiveness in processing multimodal information. In this study, we introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. Specifically, we employ fine-tuning techniques on an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli. Subsequently, these representations are mapped to textual modality by LLM. In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses. Our proposed method achieves good results using various quantitative semantic metrics, while yielding similarity with ground-truth information.
Preprint
Full-text available
The history of art has seen significant shifts in the manner in which artworks are created, making understanding of creative processes a central question in technical art history. In the Renaissance and Early Modern period, paintings were largely produced by master painters directing workshops of apprentices who often contributed to projects. The masters varied significantly in artistic and managerial styles, meaning different combinations of artists and implements might be seen both between masters and within workshops or even individual canvases. Information on how different workshops were managed and the processes by which artworks were created remains elusive. Machine learning methods have potential to unearth new information about artists' creative processes by extending the analysis of brushwork to a microscopic scale. Analysis of workshop paintings, however, presents a challenge in that documentation of the artists and materials involved is sparse, meaning external examples are not available to train networks to recognize their contributions. Here we present a novel machine learning approach we call pairwise assignment training for classifying heterogeneity (PATCH) that is capable of identifying individual artistic practice regimes with no external training data, or "ground truth." The method achieves unsupervised results by supervised means, and outperforms both simple statistical procedures and unsupervised machine learning methods. We apply this method to two historical paintings by the Spanish Renaissance master, El Greco: The Baptism of Christ and Christ on the Cross with Landscape, and our findings regarding the former potentially challenge previous work that has assigned the painting to workshop members. Further, the results of our analyses create a measure of heterogeneity of artistic practice that can be used to characterize artworks across time and space.
Article
Full-text available
This paper introduces a neural network-based model designed for classifying emotional states by leveraging multimodal physiological signals. The model utilizes data from the AMIGOS and SEED-V databases. The AMIGOS database integrates inputs from electroencephalogram (EEG), electrocardiogram (ECG), and galvanic skin response (GSR) to analyze emotional responses, while the SEED-V database continuously updates EEG signals. We implemented a sequential neural network architecture featuring two hidden layers, which underwent substantial hyperparameter tuning to achieve optimal performance. Our model’s effectiveness was tested through binary classification tasks focusing on arousal and valence, as well as a more complex four-class classification that delineates emotional quadrants for the emotional tags: happy, sad, neutral, and disgust. In these varied scenarios, the model consistently demonstrated accuracy levels ranging from 79% to 86% in the AMIGOS database and up to 97% in SEED-V. A notable aspect of our approach is the model’s ability to accurately recognize emotions without the need for extensive signal preprocessing, a common challenge in multimodal emotion analysis. This feature enhances the practical applicability of our model in real-world scenarios where rapid and efficient emotion recognition is essential.
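For orientation, a sequential network with two hidden layers as described above can be written in a few lines; the layer widths, input dimensionality, and activation choice below are assumptions for illustration only, not the authors' tuned hyperparameters.

```python
import torch.nn as nn

n_features = 128   # assumed size of the concatenated EEG/ECG/GSR feature vector
n_classes = 4      # happy, sad, neutral, disgust

# Two hidden layers followed by a logit layer for the four emotion classes.
emotion_classifier = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, n_classes),
)
```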
Article
Full-text available
Reconstructing visual stimulus information from evoked brain activity is a significant task in visual decoding of the human brain. However, due to the limited size of published functional magnetic resonance imaging (fMRI) datasets, it is difficult to adequately train a complex network with a large number of parameters. Furthermore, the dimensions of fMRI data in existing datasets are extremely high, and the signal-to-noise ratio of the data is relatively low. To address these issues, we design an fMRI-based visual decoding framework that incorporates additional self-supervised training on the encoder and decoder to alleviate the problem of insufficient model training due to limited datasets. We also propose an iterative Two-Part, Two-Stage learning method involving a teacher (supervised)-student (self-supervised) setup and an asynchronous encoder-decoder update strategy. This approach allows the encoder and decoder to mutually reinforce and iteratively update each other under the guidance of a teacher model. The ablation experiments demonstrate that the proposed framework can effectively improve reconstruction accuracy. The experimental results show that the proposed method achieves better visual reconstruction from evoked human brain activity and that its reconstruction accuracy is superior to that of existing methods.
Article
Full-text available
In this paper, images containing mathematical ideas that were generated by different image-generating AI models (DALL-E 2, Leonardo.Ai, and Midjourney) are comparatively interpreted from a mathematics aesthetical perspective. Our exploratory study aimed to scrutinize how AI image models visualize abstract mathematical ideas and whether this imagination holds any potential for use in the context of learning mathematics. Two different mathematics-containing prompts, representing different aspects of mathematics, were used as input. Our results show that based on our examples, currently, Midjourney generates the most detailed outcome that can be used as an initial point for discussing mathematical ideas with students in the classroom.
Article
Full-text available
A brain–computer interface that decodes continuous language from non-invasive recordings would have many scientific and practical applications. Currently, however, non-invasive language decoders can only identify stimuli from among a small set of words or phrases. Here we introduce a non-invasive decoder that reconstructs continuous language from cortical semantic representations recorded using functional magnetic resonance imaging (fMRI). Given novel brain recordings, this decoder generates intelligible word sequences that recover the meaning of perceived speech, imagined speech and even silent videos, demonstrating that a single decoder can be applied to a range of tasks. We tested the decoder across cortex and found that continuous language can be separately decoded from multiple regions. As brain–computer interfaces should respect mental privacy, we tested whether successful decoding requires subject cooperation and found that subject cooperation is required both to train and to apply the decoder. Our findings demonstrate the viability of non-invasive language brain–computer interfaces.
Article
Full-text available
Reconstructing complex and dynamic visual perception from brain activity remains a major challenge in machine learning applications to neuroscience. Here, we present a new method for reconstructing naturalistic images and videos from very large single-participant functional magnetic resonance imaging data that leverages the recent success of image-to-image transformation networks. This is achieved by exploiting spatial information obtained from retinotopic mappings across the visual system. More specifically, we first determine what position each voxel in a particular region of interest would represent in the visual field based on its corresponding receptive field location. Then, the 2D image representation of the brain activity on the visual field is passed to a fully convolutional image-to-image network trained to recover the original stimuli using VGG feature loss with an adversarial regularizer. In our experiments, we show that our method offers a significant improvement over existing video reconstruction techniques.
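The VGG feature (perceptual) loss mentioned in this abstract compares network activations rather than raw pixels. A minimal sketch, assuming a torchvision VGG-16 backbone and an arbitrary cutoff layer (both assumptions, not the authors' exact configuration):

```python
import torch.nn.functional as F
import torchvision.models as models

# Frozen VGG-16 feature extractor; the cutoff at layer 16 is an assumption.
vgg_features = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def vgg_feature_loss(reconstruction, target):
    """L2 distance between VGG activations of reconstructed and target images.

    Both inputs are (batch, 3, H, W) tensors normalized as VGG expects.
    """
    return F.mse_loss(vgg_features(reconstruction), vgg_features(target))
```

In the cited pipeline this term would be combined with an adversarial regularizer, which is not sketched here.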
Article
Full-text available
Encoding models provide a powerful framework to identify the information represented in brain recordings. In this framework, a stimulus representation is expressed within a feature space and is used in a regularized linear regression to predict brain activity. To account for a potential complementarity of different feature spaces, a joint model is fit on multiple feature spaces simultaneously. To adapt regularization strength to each feature space, ridge regression is extended to banded ridge regression, which optimizes a different regularization hyperparameter per feature space. The present paper proposes a method to decompose over feature spaces the variance explained by a banded ridge regression model. It also describes how banded ridge regression performs a feature-space selection, effectively ignoring non-predictive and redundant feature spaces. This feature-space selection leads to better prediction accuracy and to better interpretability. Banded ridge regression is then mathematically linked to a number of other regression methods with similar feature-space selection mechanisms. Finally, several methods are proposed to address the computational challenge of fitting banded ridge regressions on large numbers of voxels and feature spaces. All implementations are released in an open-source Python package called Himalaya.
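Concretely, banded ridge regression fits all feature spaces jointly but assigns each space its own regularization strength, so the penalty matrix is block-diagonal. A minimal closed-form sketch (variable names are illustrative, and the per-space hyperparameter search that Himalaya automates is omitted):

```python
import numpy as np

def banded_ridge(feature_spaces, Y, lambdas):
    """Closed-form banded ridge solution.

    feature_spaces: list of (n_samples, n_features_i) design matrices.
    Y:              (n_samples, n_voxels) response matrix.
    lambdas:        one regularization strength per feature space.

    Returns stacked weights of shape (sum_i n_features_i, n_voxels).
    """
    X = np.hstack(feature_spaces)
    # Each feature space gets its own penalty on its block of coefficients.
    penalties = np.concatenate(
        [np.full(Xi.shape[1], lam) for Xi, lam in zip(feature_spaces, lambdas)]
    )
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ Y)
```

Driving a non-predictive feature space's lambda towards infinity shrinks its block of weights towards zero, which is the feature-space selection behaviour described above.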
Article
Full-text available
We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models [1], [2] to image-to-image translation, and performs super-resolution through a stochastic iterative denoising process. Output images are initialized with pure Gaussian noise and iteratively refined using a U-Net architecture that is trained on denoising at various noise levels, conditioned on a low-resolution input image. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8× face super-resolution task on CelebA-HQ for which SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GAN baselines do not exceed a fool rate of 34%. We evaluate SR3 on a 4× super-resolution task on ImageNet, where SR3 outperforms baselines in human evaluation and classification accuracy of a ResNet-50 classifier trained on high-resolution images. We further show the effectiveness of SR3 in cascaded image generation, where a generative model is chained with super-resolution models to synthesize high-resolution images with competitive FID scores on the class-conditional 256×256 ImageNet generation challenge.
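SR3-style super-resolution follows the standard conditional DDPM recipe: start from Gaussian noise and repeatedly denoise while conditioning on the low-resolution input. The loop below is a simplified DDPM-style sampler in that spirit; the denoiser interface, noise schedule, and conditioning details are all assumptions rather than the paper's exact sampler.

```python
import torch

@torch.no_grad()
def conditional_refinement(denoiser, low_res, alphas_cumprod, image_shape):
    """Schematic iterative refinement loop (simplified DDPM sampling).

    denoiser(x_t, low_res, t) is assumed to predict the noise at step t;
    alphas_cumprod is a 1-D tensor holding the cumulative noise schedule.
    """
    x = torch.randn(image_shape)  # start from pure Gaussian noise
    T = len(alphas_cumprod)
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        alpha_t = a_bar / a_bar_prev

        eps = denoiser(x, low_res, t)  # noise prediction, conditioned on the low-res image
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - (1 - alpha_t) / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(alpha_t)
        if t > 0:
            sigma_t = torch.sqrt((1 - a_bar_prev) / (1 - a_bar) * (1 - alpha_t))
            x = x + sigma_t * torch.randn_like(x)
    return x
```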
Article
Full-text available
Departing from traditional linguistic models, advances in deep learning have resulted in a new type of predictive (autoregressive) deep language models (DLMs). Using a self-supervised next-word prediction task, these models generate appropriate linguistic responses in a given context. In the current study, nine participants listened to a 30-min podcast while their brain responses were recorded using electrocorticography (ECoG). We provide empirical evidence that the human brain and autoregressive DLMs share three fundamental computational principles as they process the same natural narrative: (1) both are engaged in continuous next-word prediction before word onset; (2) both match their pre-onset predictions to the incoming word to calculate post-onset surprise; (3) both rely on contextual embeddings to represent words in natural contexts. Together, our findings suggest that autoregressive DLMs provide a new and biologically feasible computational framework for studying the neural basis of language.
Article
Full-text available
Deep learning algorithms trained to predict masked words from large amount of text have recently been shown to generate activations similar to those of the human brain. However, what drives this similarity remains currently unknown. Here, we systematically compare a variety of deep language models to identify the computational principles that lead them to generate brain-like representations of sentences. Specifically, we analyze the brain responses to 400 isolated sentences in a large cohort of 102 subjects, each recorded for two hours with functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG). We then test where and when each of these algorithms maps onto the brain responses. Finally, we estimate how the architecture, training, and performance of these models independently account for the generation of brain-like representations. Our analyses reveal two main findings. First, the similarity between the algorithms and the brain primarily depends on their ability to predict words from context. Second, this similarity reveals the rise and maintenance of perceptual, lexical, and compositional representations within each cortical region. Overall, this study shows that modern language algorithms partially converge towards brain-like solutions, and thus delineates a promising path to unravel the foundations of natural language processing. Charlotte Caucheteux and Jean-Rémi King examine the ability of transformer neural networks trained on word prediction tasks to fit representations in the human brain measured with fMRI and MEG. Their results provide further insight into the workings of transformer language models and their relevance to brain responses.
Article
Reconstructing natural images and decoding their semantic category from fMRI brain recordings is challenging. Acquiring sufficient pairs of images and their corresponding fMRI responses, which span the huge space of natural images, is prohibitive. We present a novel self-supervised approach that goes well beyond the scarce paired data, for achieving both: (i) state-of-the art fMRI-to-image reconstruction, and (ii) first-ever large-scale semantic classification from fMRI responses. By imposing cycle consistency between a pair of deep neural networks (from image-to-fMRI & from fMRI-to-image), we train our image reconstruction network on a large number of “unpaired” natural images (images without fMRI recordings) from many novel semantic categories. This enables to adapt our reconstruction network to a very rich semantic coverage without requiring any explicit semantic supervision. Specifically, we find that combining our self-supervised training with high-level perceptual losses, gives rise to new reconstruction & classification capabilities. In particular, this perceptual training enables to classify well fMRIs of never-before-seen semantic classes, without requiring any class labels during training. This gives rise to: (i) Unprecedented image-reconstruction from fMRI of never-before-seen images (evaluated by image metrics and human testing), and (ii) Large-scale semantic classification of categories that were never-before-seen during network training. Such large-scale (1000-way) semantic classification from fMRI recordings has never been demonstrated before. Finally, we provide evidence for the biological consistency of our learned model.