DreamCatcher: Revealing the Language of the Brain with fMRI using GPT
Embedding
Subhrasankar Chatterjee
Indian Institute of Technology, Kharagpur
Kharagpur-721302, West Bengal, India.
subhrasankarphd@iitkgp.ac.in
Debasis Samanta
Indian Institute of Technology, Kharagpur
Kharagpur-721302, West Bengal, India.
dsamanta@iitkgp.ac.in
Abstract
The human brain possesses remarkable abilities in visual
processing, including image recognition and scene summa-
rization. Efforts have been made to understand the cognitive capacities of the visual brain, but a comprehensive understanding of the underlying mechanisms remains elusive. Advancements in brain decoding techniques
have led to sophisticated approaches like fMRI-to-Image
reconstruction, which has implications for cognitive neu-
roscience and medical imaging. However, challenges per-
sist in fMRI-to-image reconstruction, such as incorporat-
ing global context and contextual information. In this arti-
cle, we propose fMRI captioning, where captions are gen-
erated based on fMRI data to gain insight into the neu-
ral correlates of visual perception. This research presents
DreamCatcher, a novel framework for fMRI captioning.
DreamCatcher consists of the Representation Space En-
coder (RSE) and the RevEmbedding Decoder, which trans-
form fMRI vectors into a latent space and generate cap-
tions, respectively. We evaluated the framework through
visualization, dataset training, and testing on subjects,
demonstrating strong performance. fMRI-based captioning
has diverse applications, including understanding neural
mechanisms, Human-Computer Interaction, and enhancing
learning and training processes.
1. Introduction
The human brain exhibits remarkable proficiency in visual processing, encompassing image recognition and scene summarization. Considerable effort has been devoted to elucidating the cognitive capacities inherent in the visual brain. Nevertheless, a comprehensive understanding of the fundamental mechanisms that underlie human visual processing remains elusive.
Figure 1. Illustrative example of current issues with fMRI-to-Image reconstruction. The first reconstruction example successfully captures the low-level features but misses the high-level features. The second reconstruction example is adequate at object-level replication but misses the context in which the objects are to be placed.
Functional magnetic resonance imaging (fMRI) has emerged as an invaluable instrument for scrutinizing the
neural activity of the human brain [12,13,22,26]. Over
time, fMRI-based brain decoding techniques have pro-
gressed from rudimentary fMRI classification approaches
[7,10,11,15] to the more sophisticated realm of fMRI-to-
Image reconstruction [20,23,25]. This evolutionary tra-
jectory holds significant implications for both the compre-
hension of neural mechanisms and the practical application
of such knowledge. Particularly, domains such as fMRI-
to-Image reconstruction have the potential to revolution-
ize fields, including cognitive neuroscience [14,16,28] and
even medical imaging [4,6,30]. These advancements paved
the way for proposing a more sophisticated approach to
brain decoding known as fMRI captioning (Please refer to Fig. 2). In this domain, captions are generated for input stimuli based on fMRI data, affording a deeper understanding of the neural correlates of visual perception.
Figure 2. fMRI-Captioning: the subject is presented with an image stimulus while fMRI neural responses are captured from the left and right hemispheres. Given an fMRI response, the task is to predict a caption describing the visual stimulus. Example output captions: "A small room with a computer and bookcase." and "A machine that holds a lot of donuts and cooks them."
Deep generative models, including Variational Autoen-
coders (VAEs), Generative Adversarial Networks (GANs),
and Latent Diffusion Models (LDMs), have significantly
advanced visual reconstruction. These models have been
widely used to reconstruct complete images by mapping
brain signals to latent variables. Successful applications
include face reconstruction [8,34] , single-object-centered
image reconstruction [32] , and complex scene reconstruc-
tion [1,17]. Previous studies have focused on datasets like
Generic Object Decoding and Deep Image Reconstruction
derived from ImageNet, and demonstrated improvements in
reconstruction quality using approaches such as deep gen-
erator networks [32], supervised and unsupervised train-
ing [2], BigBiGAN-based models [21], and dual VAE-GAN
models [31]. The Natural Scenes Dataset (NSD), curated
by Allen et al. (2022) [1], has emerged as a benchmark
for fMRI-based natural scene reconstruction, with studies
employing models such as StyleGAN2 [17], Stable Diffu-
sion [33], and improved IC-GAN frameworks [9] to recon-
struct images and estimate pose.
Despite advances in architectures for fMRI-to-image reconstruction, certain inherent challenges persist. Firstly, the reconstruction process is typically fragment-based, and most frameworks therefore struggle to capture the image's global context. Secondly, while fragment-based reconstructions can effectively capture low-level object features, they often fail to incorporate the contextual information of the image. As illustrated in Fig. 1, taken from the work of Gu et al. [9], the first part of the image demonstrates the reconstruction of an airplane, revealing the successful capture of low-level features but a failure to reproduce high-level features, resulting in fragmented reconstruction. Similarly, the human figures are adequately replicated in the second part of the image, but the contextual aspects remain absent. These substantial challenges are addressed through the application of fMRI-captioning techniques.
In this research paper, we introduce a novel framework
called DreamCatcher for fMRI captioning. The Dream-
Catcher framework comprises two key components: the
Representation Space Encoder (RSE) and the RevEmbed-
ding Decoder (Please refer to Fig. 3). The RSE is designed
as a standard neural network architecture, which takes pre-
processed fMRI vectors as input and projects them onto
an N-Dimensional Representation space. Our framework
adopts a 1536-D GPT Embedding [24] as the representa-
tion space, as it is well-suited for language-based transfor-
mations. The RSE acts as a transformation function [35]
that maps the fMRI input space to a latent space based on
a pre-trained Large Language Model (LLM).
Figure 3. Algorithmic Pipeline: The human subject was presented with visual stimuli while fMRI was being recorded. For each visual stimulus, captions were also produced by the subject. The DreamCatcher Framework has two components: the RSE takes in the fMRI neural response and generates a 1536-D GPT Embedding, and the RevEmbedding Decoder uses a one-to-many sequential model that takes the GPT Embedding and produces a caption. The loss is calculated between the original human caption and the predicted fMRI caption.
However, existing reverse embedding techniques often rely on approximations unsuitable for our purpose. Hence, the RevEmbedding Decoder, the second part of our model, is implemented as a One-to-many LSTM Decoder. It takes the GPT
Embedding as input and generates the desired captions. To
evaluate the performance of our framework, we conducted
three sets of experiments. Firstly, we visualized the gener-
ated representation space to assess its ability to capture ad-
equate representations. Secondly, we trained and tested the
entire framework using the Natural Scene Dataset [1] and
MS-COCO Dataset [18] to establish its feasibility. Since
fMRI captioning is still a nascent field, we could not com-
pare our framework directly with state-of-the-art models.
Finally, we evaluated the effectiveness of our framework by
testing it on two subjects, employing metrics such as METEOR, Sentence, and Perplexity. The results indicate that
our DreamCatcher framework exhibits strong performance
as an fMRI captioning model.
fMRI-based captioning finds application in understanding neural mechanisms and in domains like Human-Computer Interaction. Researchers can design more intuitive and re-
sponsive interfaces that adapt to users’ cognitive states by
discerning the brain’s response to diverse visual inputs. The
utilization of fMRI-based caption generation has the po-
tential to support learning and training processes, particu-
larly in educational settings. Researchers can discern pat-
terns associated with successful learning or engagement by
analyzing neural responses to visual stimuli during edu-
cational tasks. This amalgamation of neuroimaging tech-
niques and educational contexts holds promise for enhanc-
ing pedagogical practices through a nuanced understanding
of the brain’s cognitive responses to visual stimuli.
DreamCatcher effectively addresses the limitations in-
herent in fragment-based reconstructions by incorpo-
rating contextual information through an LSTM mod-
ule. This approach enables more comprehensive and
coherent reconstructions of visual stimuli, capturing not
only low-level object features but also high-level con-
textual aspects. Notably, the DreamCatcher framework
has the potential for versatility by being independent
of neural response modality, making it applicable to
other modalities such as EEG (Electroencephalogram) or ECoG (Electrocorticogram). This adaptability extends the
potential of DreamCatcher in real-time applications.
To summarize, the contributions of this research are as
follows:
• Introduction of fMRI Captioning: This study introduces the fMRI captioning domain as an alternative approach to traditional fMRI-to-Image Reconstruction for Neural Decoding.
• Proposal of the DreamCatcher Framework: The DreamCatcher framework is proposed as a feasibility test for fMRI captioning.
• Verification of GPT Embedding Space Applicability: This research validates the use of the GPT Embedding space as a brain representation space within the DreamCatcher framework. The effectiveness of this representation space is demonstrated through empirical evaluation, providing evidence for its utility in fMRI captioning tasks.
Figure 4. Block diagram demonstrating the embedding mechanism of GPT Embedding and the reverse embedding mechanism of our proposed framework. The GPT Embedding model uses a Transformer that takes in two texts x and y, for each instance i, and calculates the cosine similarity between them. Hence, training such a language model over a large corpus generates the GPT Embedding Space, which adequately captures the contextual relationship among texts.
2. Literature Survey
Deep Brain Reconstruction: Deep generative models
have been pivotal in advancing visual reconstruction, par-
ticularly in deep learning. Prominent models such as Vari-
ational Autoencoders (VAEs), Generative Adversarial Net-
works (GANs), and Latent Diffusion Models (LDMs) have
been extensively employed to reconstruct entire images.
The standard approach involves using pre-trained deep generative models and learning mappings from brain signals to the corresponding latent variables, from which images are reconstructed. This approach
has successfully reconstructed various images, including
faces, single-object-centered images, and complex scenes.
Previous studies in visual stimulus reconstruction have
primarily focused on the Generic Object Decoding and
Deep Image Reconstruction datasets, which are derived
from the ImageNet dataset and involve training and testing
images with varying fMRI repetitions. Notable research has
emerged from this line of investigation. The initial prob-
lem was addressed by Shen et al. (2019) through an op-
timization method utilizing a deep generator network and
fMRI-decoded CNN features [32]. They optimized image
pixel values to align with fMRI-decoded features. Beliy et
al. (2019) introduced supervised and unsupervised train-
ing approaches for fMRI-to-image reconstruction networks
to address the scarcity of labeled data, allowing training on
“unlabeled” data without fMRI or images [2]. Mozafari et
al. (2020) recognized the issue of unrecognizable objects
in reconstructed images due to an emphasis on pixel-level
similarity. They proposed the BigBiGAN-based reconstruc-
tion model, which focused on preserving object recogni-
tion [21]. Ren et al. (2021) tackled limitations in fMRI data,
such as low signal-to-noise ratio and limited spatial resolu-
tion, with a Dual VAE-GAN model that learned visually
guided latent cognitive representations from fMRI signals
and reconstructed image stimuli [31]. Ozcelik et al. (2022)
addressed the challenge of simultaneously reconstructing
low-level and high-level image features. They introduced the Instance-Conditioned GAN model, which captured precise semantic and pose information [27]. Chen et al.
(2022) addressed the lack of fMRI-image pairs and effec-
tive biological guidance, leading to blurry and semantically
meaningless reconstructions [5]. Their solution involved
a sparse masked brain modeling approach and a double-
conditioned diffusion model for establishing a precise and
generalizable connection between brain activity and visual
stimuli.
In recent years, Allen et al. (2022) introduced the Nat-
ural Scenes Dataset (NSD) [1] as a benchmark for fMRI-
based natural scene reconstruction. Lin et al. (2022)
adapted the StyleGAN2 model using the Lafite framework
for text-to-image generation [17], while Takagi et al. (2022)
and Gu et al. (2022) utilized Stable Diffusion and an im-
proved IC-GAN framework for image reconstruction and
pose estimation, respectively [9,33].
3. Conceptual Background
Word Embedding: Word embedding is a fundamental
technique employed in natural language processing (NLP)
to represent words in a continuous and low-dimensional
vector space (Please refer to Fig. 4), referred to as the latent
space. Its purpose is to capture the semantic and syntactic
relationships among words based on their contextual usage
within a vast text corpus.
In contrast to the traditional approach of representing
words in NLP using one-hot encoding, where words are
represented as sparse binary vectors, word embedding over-
comes this limitation by mapping words into a dense vector
space. This transformation allows for the capture of mean-
ingful relationships between words, as the spatial proximity
of word vectors reflects the semantic similarity between the
corresponding words. This proximity arises from the obser-
vation that words appearing in similar contexts tend to have
similar vector representations.
Popular algorithms for generating word embeddings in-
clude Word2Vec [19], GloVe [29], and FastText [3], which
leverage large text datasets to learn word representations.
GPT, a prominent language model developed by OpenAI
[24], utilizes word embeddings as an integral component
of its architecture. GPT employs contextual word embed-
dings, also known as contextualized representations, which
capture the context-dependent meaning of words. Unlike
traditional word embeddings, which assign fixed vector rep-
resentations to each word, contextual word embeddings in
GPT consider the target word, its neighboring words, and
the overall sentence or document to generate word repre-
sentations. The word embedding mechanism in GPT is built
upon the Transformer architecture, a deep neural network
model designed explicitly for sequence-to-sequence tasks.
GPT employs a multi-layer Transformer encoder to process
input text and generate contextualized word representations.
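As a concrete illustration of this principle (purely illustrative; the toy corpus, the gensim library, and all hyperparameters below are assumptions, not part of the original study), a minimal sketch of training word embeddings and comparing them by cosine similarity might look as follows:

from gensim.models import Word2Vec

# Toy corpus: words sharing contexts ("cat"/"dog", "plane"/"jet") are expected
# to end up with nearby vectors once trained on a sufficiently large corpus.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "plane", "flew", "over", "the", "city"],
    ["a", "jet", "flew", "over", "the", "town"],
]

# Train 50-dimensional skip-gram embeddings on the toy corpus.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=200, seed=0)

# Cosine similarity between learned vectors; with a realistic corpus,
# contextually related word pairs score higher than unrelated ones.
print(model.wv.similarity("cat", "dog"))
print(model.wv.similarity("cat", "plane"))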
Reverse Embedding: Reverse word embedding, also
called word decoding or word reconstruction, is a crucial
process in converting a numerical representation, typically
in the form of a vector, back into its corresponding word or
textual representation (Please refer to Fig. 4). As the inverse operation of word embedding, it maps continuous vector representations within a high-dimensional space back onto their respective words or text [37].
Reverse word embedding plays a significant role in var-
ious applications, including language generation, machine
translation, and text summarization, where the generated or
translated text needs to be transformed back into its original
word form. By reconstructing words from their numerical
representations, reverse word embedding bridges the con-
tinuous vector space and the discrete word space [36].
The underlying principle of reverse word embedding
lies in comprehending how word embeddings are learned
and utilized. It relies on the associations and relationships
learned within the embedding space to reconstruct the orig-
inal words. This inversion process entails finding the word
closest to a given vector representation in the embedding
space. Different approaches can be employed, such as near-
est neighbor search or computing the cosine similarity be-
tween the vector and all the word embeddings within a pre-
trained embedding model.
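A minimal sketch of such a nearest-neighbour inversion is given below (the vocabulary, the random embedding table, and the 1536-D dimensionality are placeholders used only for illustration):

import numpy as np

# Hypothetical pre-trained embedding table: one row per vocabulary word.
vocab = ["room", "computer", "bookcase", "donut", "machine"]
emb_table = np.random.randn(len(vocab), 1536)   # placeholder vectors

def reverse_embed(vector, emb_table, vocab):
    # Return the vocabulary word whose embedding has the highest cosine
    # similarity to the query vector.
    v = vector / np.linalg.norm(vector)
    table = emb_table / np.linalg.norm(emb_table, axis=1, keepdims=True)
    sims = table @ v
    return vocab[int(np.argmax(sims))]

query = np.random.randn(1536)   # e.g., a predicted GPT Embedding
print(reverse_embed(query, emb_table, vocab))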
Figure 5. Illustrative example of the RSE model. The preprocessed fMRI vector is passed into a vanilla Neural Network, and the corresponding GPT Embeddings are used as output targets; the model is thus equivalent to a domain-adaptive function that projects the fMRI space onto the GPT Embedding Space. The RSE model is trained with a Mean Squared Error (MSE) loss to learn the GPT Embedding Space from the input fMRI space.
4. Proposed Framework
A detailed graphical illustration of the DreamCatcher Framework can be found in Fig. 3.
4.1. Representation Space Encoder
The first module of the DreamCatcher framework is
the Representation Space Encoder (Please refer to Fig. 5), which facilitates the conversion of preprocessed fMRI vectors into the GPT Embedding Space (GPTES). The fMRI vectors are obtained from the Natural Scenes Dataset [1], while the corresponding captions are obtained from the MS COCO dataset [18]. These captions are converted into GPT Embeddings via the Embedding API provided by OpenAI.
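The paper does not specify the exact embedding model or client version; assuming the 1536-dimensional text-embedding-ada-002 model and the legacy openai Python SDK, the caption-to-embedding step could be sketched as follows:

import openai

openai.api_key = "sk-..."   # placeholder API key

def embed_captions(captions):
    # Returns one 1536-D GPT Embedding per caption.
    response = openai.Embedding.create(
        model="text-embedding-ada-002",   # assumed model; produces 1536-D vectors
        input=captions,
    )
    return [item["embedding"] for item in response["data"]]

embeddings = embed_captions(["A small room with a computer and bookcase."])
print(len(embeddings[0]))   # 1536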
In contrast to traditional word embeddings, the GPTES
incorporates the contextual meaning of words. It is trained
using a contrastive objective on paired data. Using cosine similarity, the Transformer encoder G(·) calculates the appropriate distance between a given training pair (x_i, y_i). This similarity measure ensures the preservation of contextual meaning within the GPTES.

v_x = G(x_i)    (1)
v_y = G(y_i)    (2)

The Transformer encoder G(·) maps the given inputs, denoted as x and y, to their corresponding embeddings, namely v_x and v_y, respectively. The similarity between two inputs is quantitatively assessed by measuring the cosine similarity between their respective embeddings, v_x and v_y. The cosine similarity metric provides a measure of the directional similarity between the two vectors in the embedding space. It is the cosine of the angle between the vectors, which ranges from -1 (indicating complete dissimilarity) to 1 (representing perfect similarity).

sim(x, y) = (v_x · v_y) / (||v_x|| ||v_y||)    (3)
The notion of similarity in terms of word meanings and
contextual cues is a universally observed phenomenon. It
applies across various domains, encompassing different lan-
guages and image-based contextual similarities. Based on
this assumption, we propose the utilization of GPTES as
a potential candidate for the Brain Representation Space
through the concept of Domain Adaptability. It is impor-
tant to note that the primary objective of the DreamCatcher
framework is to generate meaningful captions from fMRI
data. The framework does not focus on developing a bi-
ologically plausible representation space or a biologically
interpretable model.
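A minimal PyTorch sketch of such a Representation Space Encoder is given below; the hidden-layer sizes and the number of input voxels are illustrative assumptions, as the paper does not report the exact architecture:

import torch
import torch.nn as nn

class RepresentationSpaceEncoder(nn.Module):
    # Vanilla feed-forward network mapping a preprocessed fMRI vector
    # to a 1536-D GPT Embedding, trained with an MSE loss.
    def __init__(self, n_voxels, emb_dim=1536):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_voxels, 4096), nn.ReLU(),
            nn.Linear(4096, 2048), nn.ReLU(),
            nn.Linear(2048, emb_dim),
        )

    def forward(self, fmri):
        return self.net(fmri)

# One training step on placeholder data.
encoder = RepresentationSpaceEncoder(n_voxels=15000)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
criterion = nn.MSELoss()

fmri_batch = torch.randn(8, 15000)   # preprocessed fMRI vectors (placeholder)
gpt_batch = torch.randn(8, 1536)     # GPT Embeddings of the paired captions

optimizer.zero_grad()
loss = criterion(encoder(fmri_batch), gpt_batch)
loss.backward()
optimizer.step()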
4.2. RevEmbedding Decoder
The second component of the DreamCatcher Framework
is the RevEmbedding Decoder, which is responsible for
converting GPT Embeddings into captions. However, due
to the absence of a Reverse Embedding API in OpenAI, ap-
proximations such as Nearest Neighbour are utilized, sig-
nificantly compromising the accuracy of the generated cap-
tions. Hence, the RevEmbedding Decoder model assumes
paramount importance in generating captions from the em-
beddings predicted by the Representation Space Encoder.
In the initial stage, the embeddings-caption pairs gen-
erated in the preceding step undergo preprocessing and are
stored to form a custom dataset. Subsequently, a Vocabulary
module is defined to facilitate the creation and mapping of
words to indices. This class also encompasses a function
that constructs the vocabulary based on the captions present
in the dataset. The function tokenizes the captions, calcu-
lates token frequencies, and filters out infrequent words.
Additionally, special tokens such as <pad>, <start>, <end>, and <unk> are incorporated into the vocabulary.
Finally, a sequential one-to-many Long Short Term Memory (LSTM) module is trained on the embedding-vocabulary pair to generate the target captions. It receives GPT Embeddings and caption sequences as input and trains as a one-to-many model. The resulting output corresponds to the predicted captions based on the fMRI input to the Representation Space Encoder.
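A minimal PyTorch sketch of this one-to-many decoder is given below; initializing the LSTM state from the GPT Embedding and the chosen layer sizes are assumptions, since the paper does not detail the exact configuration:

import torch
import torch.nn as nn

class RevEmbeddingDecoder(nn.Module):
    # One-to-many LSTM mapping a 1536-D GPT Embedding to a word sequence.
    def __init__(self, vocab_size, emb_dim=1536, word_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(emb_dim, hidden_dim)   # GPT Embedding -> h0
        self.init_c = nn.Linear(emb_dim, hidden_dim)   # GPT Embedding -> c0
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # softmax over the vocabulary

    def forward(self, gpt_emb, captions):
        # gpt_emb: (B, 1536); captions: (B, T) token indices (teacher forcing).
        h0 = self.init_h(gpt_emb).unsqueeze(0)
        c0 = self.init_c(gpt_emb).unsqueeze(0)
        hidden, _ = self.lstm(self.word_emb(captions), (h0, c0))
        return self.out(hidden)   # (B, T, vocab_size) logits

decoder = RevEmbeddingDecoder(vocab_size=5000)
logits = decoder(torch.randn(8, 1536), torch.randint(0, 5000, (8, 12)))
# Training would minimize cross-entropy between these logits and the
# next-word targets built from the <start> ... <end> padded captions.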
5. Experimental Result and Analysis
5.1. Dataset
Natural Scene Dataset: A comprehensive account of the Natural Scenes Dataset (NSD), its specifications, and data acquisition procedures can be found in the publication by Allen et al. in Nature Neuroscience (2022) [1]. The
NSD dataset encompasses functional Magnetic Resonance
Imaging (fMRI) measurements collected from 8 partici-
pants who were presented with a substantial number of dis-
tinct color natural scenes, ranging from 9,000 to 10,000 im-
ages (22,000 to 30,000 trials) across 30 to 40 scan sessions.
The fMRI scanning was conducted at 7T using whole-brain
gradient-echo Echo Planar Imaging (EPI) at a resolution of
1.8 mm and a repetition time of 1.6 seconds.
The images utilized in the NSD dataset were sourced
from the Microsoft Common Objects in Context (COCO)
database, square-cropped, and displayed at a size of 8.4° x
8.4°. Out of the total images, a specific set of 1,000 im-
ages was shared across all participants, while the remaining
images were mutually exclusive for each participant. The
images were presented for a duration of 3 seconds, with 1-
second gaps between successive images. During the scan-
ning sessions, the participants fixated centrally and engaged
in a continuous long-term recognition task related to the
presented images.
Pre-processing of the fMRI data involved performing
temporal interpolation to correct slice time differences and
spatial interpolation to account for head motion artifacts.
Subsequently, a general linear model was employed to es-
timate single-trial beta weights. The NSD dataset also en-
compasses cortical surface reconstructions generated using
FreeSurfer, with both volume- and surface-based versions
of the beta weights being created for further analysis and
interpretation.
MS COCO Dataset: The MS COCO (Microsoft Com-
mon Objects in Context) [18] dataset has emerged as a sig-
nificant benchmark for advancing the field of computer vi-
sion. It provides a diverse and comprehensive collection of
annotated images, enabling the development and evaluation
of a wide range of vision tasks, including object detection,
segmentation, and image captioning. MS COCO Captions
provide detailed textual descriptions for the images in the
MS COCO dataset. With over 330,000 images, each ac-
companied by multiple human-generated captions, this an-
notation aspect of MS COCO has revolutionized image cap-
tioning research. The captions capture the salient objects,
their attributes, and contextual information concisely and
descriptively. They serve as a valuable resource for train-
ing and evaluating image captioning models, pushing the
boundaries of image understanding and natural language
processing. By bridging the gap between visual and textual
domains, MS COCO captions have enabled advancements
in generating human-like image descriptions, opening up
new avenues for multimodal research.
Figure 6. Illustrative examples (a)–(d) of the original and predicted caption for each visual stimulus. The similarity in text can be observed directly by comparing the captions, which demonstrates the feasibility of fMRI captioning.
Table 1. Detailed results of the component analysis of each component of the DreamCatcher Framework. ↑ and ↓ indicate the desired direction for each metric.

Encoder-Decoder | GPT Embedding | Subject 1: Sentence↑ / Meteor↑ / Perplexity↓ | Subject 2: Sentence↑ / Meteor↑ / Perplexity↓
– | – | 0.253 / 0.179 / 3.486 | 0.241 / 0.157 / 3.845
✓ | – | 0.276 / 0.183 / 1.803 | 0.247 / 0.167 / 1.952
✓ | ✓ | 0.451 / 0.323 / 1.024 | 0.422 / 0.308 / 1.037
5.2. Feasibility Test of fMRI Captioning
The potential of generating textual descriptions directly
from fMRI data offers a promising avenue for understand-
ing brain activity and decoding mental representations. A
feasibility test was conducted using a dataset of fMRI
recordings obtained from 8 participants engaged in visual
stimulus tasks (Natural Scenes Dataset). The fMRI data
were preprocessed and subjected to feature extraction tech-
niques to capture relevant brain activity patterns. Prelimi-
nary results from the feasibility test showed promising per-
formance of the fMRI captioning model. The generated
captions exhibited reasonable coherence and semantic rele-
vance, capturing essential aspects of the visual stimuli. The
graphical illustration can be found in Fig. 6.
Figure 7. t-SNE plot for the GPT Embedding Space (left) and the input fMRI space (right). It can be visually observed that the GPT Embedding Space provides a category-wise segregation and is therefore a better latent representation than the input fMRI space. The segregation also demonstrates that the Embedding Space adequately captures the relationship among the fMRI class labels.
5.3. Efficacy of DreamCatcher Framework
Since fMRI captioning is still a nascent field, directly
comparing our framework with state-of-the-art models was
not feasible. Nevertheless, we conducted a comprehensive
evaluation to assess the effectiveness of our framework. The
evaluation of our proposed framework involved testing our
model on two subjects and employing several metrics, in-
cluding METEOR, Sentence, and Perplexity. The results
of our evaluation demonstrated promising performance and
the potential of our framework in generating accurate and
coherent captions from fMRI data. The METEOR met-
ric, which measures the quality of generated captions by
comparing them to reference captions, indicated favorable
scores across all subjects. The Sentence metric, which eval-
uates the syntactic and semantic similarity between gener-
ated and reference captions, also showed encouraging re-
sults. The detailed results for our model can be found in Tab. 1.
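For reference, a hedged sketch of how these metrics can be computed is shown below (assuming NLTK >= 3.7 with its wordnet data downloaded; the caption pair and cross-entropy value are placeholders, not results from the paper):

import math
from nltk.translate.meteor_score import meteor_score

reference = "a small room with a computer and bookcase".split()
hypothesis = "a room with a computer and a bookshelf".split()

# METEOR compares a generated caption against one or more tokenized references.
print(meteor_score([reference], hypothesis))

# Perplexity follows from the decoder's mean per-token cross-entropy (in nats);
# the value below is a placeholder chosen only to illustrate the relationship.
mean_cross_entropy = 0.024
print(math.exp(mean_cross_entropy))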
5.4. Verification of GPT Embedding Space Applicability
To evaluate the ability of our framework to capture ad-
equate representations, we conducted a visualization of the
generated representation space using PCA and t-SNE. This
analysis aimed to provide insights into the distribution and
clustering of the representations, indicating the framework’s
capacity to capture and differentiate between various fea-
tures and attributes effectively.
The visualization of the generated representation space
revealed encouraging results. The representations exhib-
ited clear separation and distinct clusters, suggesting that
our framework successfully captured the relevant informa-
tion from the input fMRI data. The graphical illustration
can be found in Fig. 7.
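A minimal sketch of this visualization using scikit-learn and matplotlib is shown below; the perplexity setting, the sample counts, and the colouring by stimulus category are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: (n_samples, 1536) predicted GPT Embeddings; labels: stimulus categories.
X = np.random.randn(500, 1536)            # placeholder embeddings
labels = np.random.randint(0, 8, 500)     # placeholder category labels

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of the predicted GPT Embedding Space")
plt.show()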
6. Conclusion
The human brain's exceptional proficiency in visual processing, including image recognition and scene summarization, has been the subject of extensive research. However, despite significant efforts, a comprehensive understanding of the fundamental mechanisms underlying human visual processing remains elusive. Functional magnetic resonance imaging (fMRI) has emerged as a valuable tool for investigating the neural activity of the human brain, leading to advancements in brain decoding techniques from basic classification approaches to more sophisticated fMRI-to-Image reconstruction methods. The
contributions of this research include the introduction of
fMRI captioning as an alternative approach to fMRI-to-
Image Reconstruction, the proposal and feasibility testing
of the DreamCatcher framework, and the validation of the
GPT Embedding space as a suitable brain representation
space for fMRI captioning tasks.
In summary, the advancements in fMRI-based brain de-
coding techniques, particularly in fMRI captioning, provide
valuable insights into the neural mechanisms underlying hu-
man visual processing. The DreamCatcher framework rep-
resents a significant step forward in capturing the rich cog-
nitive processes involved in visual perception, and its ver-
satility makes it adaptable to other modalities. Continued
research in this area holds great promise for expanding our
understanding of the human brain and improving various
applications, ranging from cognitive neuroscience to edu-
cational practices and human-computer interaction.
References
[1] Emily Allen, Ghislain St-Yves, Yihan Wu, Jesse Breedlove,
Jacob Prince, Logan Dowdle, Matthias Nau, Brad Caron,
Franco Pestilli, Ian Charest, J. Hutchinson, Thomas Nase-
laris, and Kendrick Kay. A massive 7t fmri dataset to bridge
cognitive neuroscience and artificial intelligence. Nature
Neuroscience, 25, 01 2022. 2,3,4,5,6
[2] Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini,
Tal Golan, and Michal Irani. From voxels to pixels and back:
Self-supervision in natural-image reconstruction from fmri,
07 2019. 2,4
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and
Tomas Mikolov. Enriching word vectors with subword infor-
mation. Transactions of the Association for Computational
Linguistics, 5, 07 2016. 5
[4] Mercy Bore, Xiqin Liu, Xianyang Gan, Lan Wang, Ting
Xu, Stefania Ferraro, Liyuan Li, Bo Zhou, Jie Zhang, Deniz
Vatansever, Benjamin Klugah-Brown, and Benjamin Becker.
Distinct neurofunctional alterations during motivational and
hedonic processing of natural and monetary rewards in de-
pression - a neuroimaging meta-analysis, 12 2022. 1
[5] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Yue, and Juan
Zhou. Seeing beyond the brain: Conditional diffusion model
with sparse masked modeling for vision decoding, 11 2022.
4
[6] Huikai Chua, Andrew Caines, and Helen Yannakoudakis. A
unified framework for cross-domain and cross-task learning
of mental health conditions. In Proceedings of the Second
Workshop on NLP for Positive Impact (NLP4PI), pages 1–
14, Abu Dhabi, United Arab Emirates (Hybrid), Dec. 2022.
Association for Computational Linguistics. 1
[7] David Cox and R. Savoy. Functional magnetic resonance
imaging (fmri) ”brain reading”: Detecting and classifying
distributed patterns of fmri activity in human visual cortex.
NeuroImage, 19:261–70, 07 2003. 1
[8] Thirza Dado, Yağmur Güçlütürk, Luca Ambrogioni, Gabriëlle Ras, Sander Bosch, Marcel Gerven, and Umut Güçlü. Hyperrealistic neural decoding for reconstructing faces from fmri activations via the gan latent space. Scientific Reports, 12, 01 2022. 2
[9] Zijin Gu, Keith Jamison, Amy Kuceyeski, and Mert
Sabuncu. Decoding natural image stimuli from fmri data
with a surface-based convolutional network, 12 2022. 2,4
[10] James Haxby, Maria Gobbini, Maura Furey, Alumit Ishai,
Jennifer Schouten, and Pietro Pietrini. Distributed and over-
lapping representations of faces and objects in ventral tempo-
ral cortex. Science (New York, N.Y.), 293:2425–30, 10 2001.
1
[11] James Haxby, Jyothi Swaroop Guntupalli, Andrew Connolly,
Yaroslav Halchenko, Bryan Conroy, Maria Gobbini, Michael
Hanke, and Peter Ramadge. A common, high-dimensional
model of the representational space in human ventral tempo-
ral cortex. Neuron, 72:404–16, 10 2011. 1
[12] John-Dylan Haynes and Geraint Rees. Decoding mental
states from brain activity in human. Nature reviews. Neu-
roscience, 7:523–34, 08 2006. 1
[13] Yukiyasu Kamitani and Frank Tong. Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8:679–685, 06 2005. 1
[14] Kohitij Kar, Simon Kornblith, and Evelina Fedorenko. Inter-
pretability of artificial neural network models in artificial in-
telligence versus neuroscience. Nature Machine Intelligence,
4, 12 2022. 1
[15] Kendrick Kay, Thomas Naselaris, Ryan Prenger, and Jack
Gallant. Identifying natural images from human brain activ-
ity. Nature, 452:352–5, 04 2008. 1
[16] Nikolaus Kriegeskorte and Pamela Douglas. Cognitive com-
putational neuroscience, 07 2018. 1
[17] Sikun Lin, Thomas Sprague, and Ambuj Singh. Mind reader:
Reconstructing complex images from brain activities, 09
2022. 2,4
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Zitnick. Microsoft coco: Common objects in context. volume 8693, 04 2014. 3,5,6
[19] Tomas Mikolov, Kai Chen, G.s Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector space.
Proceedings of Workshop at ICLR, 2013, 01 2013. 5
[20] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-
aki Sato, Yusuke Morito, Hiroki Tanabe, Norihiro Sadato,
and Yukiyasu Kamitani. Visual image reconstruction from
human brain activity using a combination of multiscale local
image decoders. Neuron, 60:915–29, 01 2009. 1
[21] Milad Mozafari, Leila Reddy, and Rufin VanRullen. Recon-
structing natural scenes from fmri patterns using bigbigan,
01 2020. 2,4
[22] Thomas Naselaris, Kendrick Kay, Shinji Nishimoto, and
Jack Gallant. Encoding and decoding in fmri. NeuroImage,
56:400–10, 05 2011. 1
[23] Thomas Naselaris, Ryan Prenger, Kendrick Kay, Michael
Oliver, and Jack Gallant. Bayesian reconstruction of natu-
ral images from human brain activity. Neuron, 63:902–15,
09 2009. 1
[24] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse
Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Kim,
Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris
Power, Tyna Nekoul, Girish Sastry, Gretchen Krueger, David
Schnurr, Felipe Such, Kenny Hsu, and Lilian Weng. Text and
code embeddings by contrastive pre-training, 01 2022. 2,5
[25] Shinji Nishimoto, An Vu, Thomas Naselaris, Yuval Ben-
jamini, B. Yu, and Jack Gallant. Reconstructing visual expe-
riences from brain activity evoked by natural movies. Cur-
rent biology : CB, 21:1641–6, 09 2011. 1
[26] Kenneth Norman, Sean Polyn, Greg Detre, and James
Haxby. Beyond mind-reading: Multi-voxel pattern analy-
sis of fmri data. Trends in cognitive sciences, 10:424–30, 10
2006. 1
[27] Furkan Ozcelik, Bhavin Choksi, Milad Mozafari, Leila
Reddy, and Rufin VanRullen. Reconstruction of perceived
images from fmri patterns and semantic brain exploration us-
ing instance-conditioned gans, 02 2022. 4
[28] Sebastian Palacio, Joachim Folz, Jörn Hees, Federico Raue, Damian Borth, and Andreas Dengel. What do deep networks like to see? 03 2018. 1
[29] Jeffrey Pennington, Richard Socher, and Christopher Man-
ning. Glove: Global vectors for word representation. vol-
ume 14, pages 1532–1543, 01 2014. 5
[30] Roshini Randeniya, Jason B. Mattingley, and Marta I. Gar-
rido. Increased context adjustment is associated with audi-
tory sensitivities but not with autistic traits. Autism Research,
15(8):1457–1468, 2022. 1
[31] Ziqi Ren, Jie Li, Xuetong Xue, Xin Li, Fan Yang, Zhicheng
Jiao, and Xinbo Gao. Reconstructing seen image from brain
activity by visually-guided cognitive representation and ad-
versarial learning. NeuroImage, 228:117602, 03 2021. 2,
4
[32] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and
Yukiyasu Kamitani. Deep image reconstruction from human
brain activity. PLOS Computational Biology, 15:e1006633,
01 2019. 2,4
[33] Yu Takagi and Shinji Nishimoto. High-resolution image re-
construction with latent diffusion models from human brain
activity, 11 2022. 2,4
[34] Rufin VanRullen and Leila Reddy. Reconstructing faces
from fmri patterns using deep generative neural networks.
Communications Biology, 2, 05 2019. 2
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia
Polosukhin. Attention is all you need. 06 2017. 2
[36] Sudheendra Vijayanarasimhan, Jon Shlens, Rajat Monga,
and Jay Yagnik. Deep networks with large output spaces.
12 2014. 5
[37] Barret Zoph, Ashish Vaswani, Jonathan May, and Kevin
Knight. Simple, fast noise-contrastive estimation for large
rnn vocabularies. pages 1217–1222, 01 2016. 5