Article https://doi.org/10.1038/s41467-022-35654-y
Neural mechanisms underlying the hierarchical construction of perceived aesthetic value

Kiyohito Iigaya1,2,3, Sanghyun Yi1, Iman A. Wahle1, Sandy Tanwisuth1, Logan Cross1,4 & John P. O'Doherty1

1Division of Humanities and Social Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA. 2Department of Psychiatry, Columbia University Irving Medical Center, New York, NY 10032, USA. 3Center for Theoretical Neuroscience and Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027, USA. 4Department of Computer Science, Stanford University, Stanford, CA, USA. e-mail: ki2151@columbia.edu; jdoherty@caltech.edu

Received: 10 February 2021. Accepted: 15 December 2022.
Little is known about how the brain computes the perceived aesthetic value of complex stimuli such as visual art. Here, we used computational methods in combination with functional neuroimaging to provide evidence that the aesthetic value of a visual stimulus is computed in a hierarchical manner via a weighted integration over both low- and high-level stimulus features contained in early and late visual cortex, extending into parietal and lateral prefrontal cortices. Feature representations in parietal and lateral prefrontal cortex may in turn be utilized to produce an overall aesthetic value in the medial prefrontal cortex. Such brain-wide computations are not only consistent with a feature-based mechanism for value construction, but also resemble computations performed by a deep convolutional neural network. Our findings thus shed light on the existence of a general neurocomputational mechanism for rapidly and flexibly producing value judgements across an array of complex novel stimuli and situations.
How it is that humans are capable of making aesthetic judgments has long been a focus of enquiry in psychology, more recently gaining a foothold in neuroscience with the emerging field of neuroaesthetics1–10. Yet in spite of the long tradition of studying value judgments, we still have a very limited understanding of how people form aesthetic value, let alone of the neural mechanisms underlying this enigmatic process. So far, neuroscience studies of aesthetic judgments have been largely limited to identifying brain regions showing increased activity to stimuli with higher compared to lower aesthetic value (e.g.,11,12), leaving open the question of how the brain computes aesthetic value from visual stimuli in the first place. To fill this gap, we approach this problem from a computational neuroscience perspective, leveraging computational methods to gain insight into the neural computations underlying aesthetic value construction.
Considerable progress has been made toward understanding how the brain represents the value of stimuli in the world. Value signals have been found throughout the brain, but most prominently in the medial prefrontal cortex (mPFC) and adjacent orbitofrontal cortex. Activity has been found in this region tracking the experienced value of reward outcomes, as well as during anticipation of future rewards11,13–25. The mPFC, especially its ventral aspects, has been found to correlate with the experienced pleasantness of gustatory, olfactory, music, and visual stimuli including faces, but also visual art12,26–31. While much is known about how the brain represents value, much less is known about how those value signals come to be generated by the brain in the first place.
A typical approach to this question in the literature to date is to assume that stimuli acquire value through associative learning, in which the value of a particular stimulus is modulated by being associated with other stimuli with extant (perhaps even innate) value. Seminal work has identified key neural computations responsible for implementing this type of reward-based associative learning in the brain32–34. However, the current valuation of an object is not solely dependent on its prior associative history. Even novel stimuli, never before seen, can be assigned a value judgment35. Moreover, the current value of a stimulus depends on one's internal motivational state, as well as the context in which the stimulus is presented. Consequently, the value of an object may be computed on-line in a flexible manner that goes beyond simple associative history.
Previous work in neuroeconomics has hinted at the notion that value can be actively constructed by taking into account different underlying features or attributes of a stimulus. For instance, a t-shirt might have visual and semantic components36, a food item might vary in its healthfulness and taste or in its nutritive content37,38, and an odor is composed of underlying odor molecules39. Furthermore, potential outcomes in many standard economic decision-making problems can be described in terms of the magnitude and probability of those outcomes40,41. Each of these individual features can be weighted so that they are taken into account when making an overall value determination.
Building upon these ideas, we recently proposed that the value of a stimulus, including a piece of art, is actively constructed in a two-step process: first breaking down a stimulus into its constituent features, and then recombining these features in a weighted fashion to compute an overall subjective value judgment42,43. In a recent study, we demonstrated that this same feature-based value construction process can be used to understand how humans value works of art as well as a variety of online photographic images42. Using a combination of computer vision tools and machine learning, we showed that it is possible to predict an individual's subjective aesthetic valuation of a work of art or photography by segmenting a visual scene into its underlying visual features and then combining those features together in a weighted fashion.
While this prior work42 provides empirical support for the applicability of the value construction process to understanding aesthetic valuation, nothing is yet known about whether this approach is a plausible description of what actually occurs in the brain, a crucial step for validating this model as a biological mechanism.
Establishing how the brain might solve the feature integration process for art is uniquely challenging because of the complexity and diversity of visual art. Even in paintings alone, there is an overwhelmingly broad range of objects, themes, and styles used across artworks. The brain's value computation mechanism therefore needs to generalize across all of these diverse stimuli in order to compute their value reliably. However, it is not known how the brain can transform heterogeneous, high-dimensional input into a simple output of an aesthetic judgment.
Here we address these challenges by combining computational modelling with neuroimaging data. Following our prior behavioral evidence, we propose that the brain performs aesthetic value computations for visual art by extracting and integrating visual and abstract features of a visual stimulus. In our linear feature summation (LFS) model42, the input is first decomposed into various visual features characterizing the color or shape of the whole painting and of its segments. These features are then transformed into abstract high-level features that also affect value judgement (e.g., how concrete or abstract the painting is). This feature space enables a robust value judgment to be formed even for visual stimuli never before seen, through a simple linear regression over the features. We also recently reported that the features predicting value judgment in the LFS model naturally emerge in a generic deep convolutional neural network (DCNN) model, suggesting a close relationship between these two models42. Here we test whether these computational models actually approximate what is going on in the brain. By doing so, we attempt to link an explicit, interpretable feature-based value computation and a generic DCNN model to actual neural computations. Because our model of value construction is agnostic about the type of object being valued, our proposed mechanism has the potential to account not only for aesthetic value computation but also for value judgments across stimulus domains beyond aesthetics for art.
Results
Linear feature summation (LFS) model predicts human valuation of visual art
We conducted an fMRI study in which we aimed to link two complementary computational models (LFS and DCNN) to neural data as a test of how feature-based value construction might be realized in the brain. Rather than collecting noisy data from a large number of participants with very short scanning times to perform group averaging, here we engaged in deep fMRI scanning of a smaller group of individuals (n = 6), each of whom completed 1000 trials of our art rating task (ART) over four days of scanning. This allowed us to test for the representation of the features in each individual participant with sufficient fidelity to perform reliable single-subject inference. This well-justified approach essentially treats the individual participant as the replication unit, rather than relying on group-averaged data from participants in different studies44. This has been a dominant and highly successful approach in most non-human animal studies (e.g.,14,18,20,22,24,33,45,46), as well as in two subfields of human psychology and neuroscience: psychophysics and visual neuroscience, respectively (e.g.,47,48).

On each trial, participants were presented with an image of a painting on a computer screen and asked to report how much they liked it on a scale of 0 (not at all) to 3 (very much) (Fig. 1a). Each of the participants rated all of the paintings without repetition (1000 different paintings). The stimulus set consisted of paintings from a broad range of art genres (Fig. 1b)42.
We recently showed that a simple linear feature summation (LFS) model can predict subjective valuations for visual art, both for paintings and for photographs drawn from a broad range of scenery, objects, and advertisements42. The idea is that the subjective value of an individual painting can be constructed by integrating across features commonly shared across all paintings. For this, each image was decomposed into its fundamental visual and emotional features. These feature values are then integrated linearly, with each participant being assigned a unique set of feature weights from which the model constructs a subjective preference (Fig. 1c, d). This model embodies the notion that subjective values are computed in a common feature space, whereby overall subjective value is computed as a weighted linear sum over feature content (Fig. 1e, f).

The LFS model extracts various low-level visual features from an input image using a combination of computer vision methods (e.g.,49). This approach computes numerical scores for different aspects of visual content in the image, such as the average hue and brightness of image segments, as well as of the entirety of the image itself, as identified by machine learning techniques, e.g., Graph-Cuts50 (details of this approach are described in the Methods section; see also42).
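To illustrate the flavor of this feature extraction, the following is a minimal sketch computing a few whole-image color statistics; the function name and feature list here are ours for illustration, and the paper's actual pipeline additionally scores machine-derived segments (e.g., from Graph-Cuts).

```python
# Illustrative sketch only: whole-image color statistics of the kind the LFS
# model uses as low-level features. The paper's pipeline also computes such
# statistics over machine-derived image segments, omitted here.
import numpy as np
from PIL import Image

def low_level_features(path: str) -> dict:
    img = Image.open(path).convert("HSV")
    hue, sat, val = (np.asarray(band, dtype=float) for band in img.split())
    return {
        "mean_hue": hue.mean(),
        "mean_saturation": sat.mean(),
        "mean_brightness": val.mean(),
        "brightness_contrast": val.std(),  # simple proxy for contrast
    }
```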
The LFS model also includes more abstract, or "high-level", features that can contribute to valuation. Based on previous literature51,52, we introduced three features: whether the image is abstract or concrete53, dynamic or still, hot or cold. These three features were introduced in ref. 52 by taking principal components of features originally introduced in ref. 51. We also included a fourth feature: whether the image evinces a positive or negative emotional valence42. Note that image valence is not the same as valuation, because even negative-valence images can elicit a positive valuation (e.g., Edvard Munch's The Scream); moreover, we have previously shown that valence is not among the features that best account for valuation42. These high-level features were annotated across images by participants with familiarity and experience in art (n = 13), as described in our previous study42. We then took the average score over these experts' ratings as the input into the model representing the content of each high-level attribute feature for each image. We have previously shown that it is possible to reconstruct a significant portion, if not all, of the variance explained by high-level features using combinations of low-level features42, supporting the possibility that low-level features can be used to construct high-level features.
The final output of the LFS model is a linear combination of low- and high-level features. We assumed that the weights over the features are fixed for each individual across stimuli, so that we can derive generalizable conclusions about the features used to generate valuation across images. In our behavioral fitting, we treated low-level and high-level features equally as features of a linear regression model, in order to determine the overall predictive power of our LFS model.
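As a minimal sketch of this computation, assuming a feature matrix X of per-image low- and high-level scores and one participant's ratings y (the fitting details below are illustrative, not the paper's exact procedure):

```python
# Minimal sketch of the LFS model: subjective value as a weighted linear sum
# over shared features, with per-participant weights fit by cross-validated
# lasso. X: (n_images, n_features); y: (n_images,) liking ratings in 0-3.
import numpy as np
from sklearn.linear_model import LassoCV

def fit_participant(X: np.ndarray, y: np.ndarray) -> LassoCV:
    return LassoCV(cv=12).fit(X, y)  # fold count illustrative

def lfs_value(model: LassoCV, features: np.ndarray) -> float:
    # value = intercept + weights . features
    return float(model.intercept_ + features @ model.coef_)
```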
In our recent behavioral study, we identified a minimal feature set that predicts subjective ratings across participants using a group-level lasso regression42. Here, we applied the same analysis, except that we distinguished low- and high-level features for the purpose of the fMRI analysis. Since our interests hinge on the representational relationship between low- and high-level features in the brain, we first identified a set of low-level features that can predict behavioral liking ratings across participants, further augmenting this to a richer feature set that includes the four human-annotated features. By doing so, we aimed to identify brain signals that are uniquely correlated with low-level and high-level features (i.e., partial correlations between features and fMRI signals).
Before turning to the fMRI data, we first aimed to replicate the behavioral results reported in our previous study42 in these fMRI participants. Indeed, using the LFS model with the shared feature set, we confirmed that the model could predict subjective art ratings across participants, replicating our previous behavioral findings (Fig. 1g; see Supplementary Figs. 1 and 2 for the estimated weights for each participant and their correlations).
A deep convolutional neural network (DCNN) model also predicts human liking ratings for visual art
An alternative approach to predicting human aesthetic valuation of visual images is to use a generic deep neural network model that takes the visual images as input and ordinarily generates outputs related to object recognition. Here, we utilized a standard DCNN (VGG 16)54 that had been pre-trained for object recognition on ImageNet55, and adapted it to instead output aesthetic ratings by training it on human aesthetic ratings. This approach means that we do not need to identify or label specific features that lead to aesthetic ratings; instead we can let the network automatically detect relevant attributes of an image and use those to predict aesthetic ratings.
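A sketch of this adaptation, under the assumption of a PyTorch/torchvision setup (the paper's exact training configuration may differ):

```python
# Hypothetical sketch: adapt an ImageNet-pretrained VGG16 to regress liking
# ratings by swapping the 1000-way classifier head for a single output.
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier[6] = nn.Linear(4096, 1)  # rating head replaces object classes

# Train with a regression loss on (image, rating) pairs; the loss choice is
# illustrative rather than the paper's exact recipe.
loss_fn = nn.MSELoss()
```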
Fig. 1 | Neuroimaging experiments and the model of value construction. a Neuroimaging experiments. We administered our task (ART: art rating task) to human participants in an fMRI experiment. Each participant completed 20 scan sessions spread over four separate days (1000 trials in total with no repetition of the same stimuli). On each trial, a participant was presented with a visual art stimulus (paintings) for 3 s. The art stimuli were the same as in our previous behavioral study42. After the stimulus presentation, a participant was presented with a set of possible ratings (0, 1, 2, 3), where they had to choose one option within 3 s, followed by brief feedback with their selected rating (0.5 s). The positions of the numbers were randomized across trials, and the order of presented stimuli was randomized across participants. b Example stimuli. The images were taken from four categories from Wikiart.org: Cubism, Impressionism, Abstract art and Color Fields, and supplemented with art stimuli previously used52. c The idea of value construction. An input is projected into a feature space, in which the subjective value judgment is performed. Importantly, the feature space is shared across stimuli, enabling this mechanism to generalize across a range of stimuli, including novel ones. d Schematic of the LFS model42. A visual stimulus (e.g., artwork) is decomposed into various low-level visual features (e.g., mean hue, mean contrast), as well as high-level features (e.g., concreteness, dynamics). We hypothesized that in the brain high-level features are constructed from low-level features, and that subjective value is constructed from a linear combination of all low- and high-level features. e How features can help construct subjective value. In this example, preference was separated by the concreteness feature. Reproduced from42. f In this example, the value over the concreteness axis was the same for four images; but another feature, in this case the brightness contrast, could separate preferences over art. Reproduced from42. g The LFS model successfully predicts participants' liking ratings for the art stimuli. The model was fit to each participant (cross-validated). Statistical significance was determined by a permutation test (one-sided). Three stars indicate p < 0.001. Due to copyright considerations, some paintings presented here are not identical to those used in our studies. Credit: Jean Metzinger, Portrait of Albert Gleizes (public domain; RISD Museum).

Though the nature of the computation that a DCNN performs is usually very difficult to interpret, we have recently found that this type of
DCNN model can produce results that are strongly related to the LFS model42. In particular, we found that the LFS model features are represented in the DCNN model, even though we did not train the DCNN on any features explicitly (Fig. 2a). By performing a decoding analysis on each layer of the DCNN, we found that the low-level features show decreased decoding accuracy with increased depth of the layer, while the high-level features are more decodable in deeper layers. This suggests that the DCNN may utilize features similar to those that we introduced in the LFS model, and the fact that features are represented hierarchically in a DCNN model that is blind to specific features suggests that these features might emerge spontaneously via a natural process of visual and cognitive development through interacting with natural stimuli56.
Here we found that the DCNN model can predict subjective liking ratings of art across all fMRI participants (Fig. 2b), once again replicating our previous finding in a different dataset42. Predictive accuracy across participants was found to be similar to that of the LFS model, though the DCNN model could potentially perform better with even more training.
So far, we have confirmed the validity of our LFS model and the use of a DCNN model to predict the human behavioral ratings reported in our new fMRI experiments, replicating our previous behavioral study42. Now that we have validated our behavioral predictions, we next turn to the brain data to address how the brain implements the value construction process.
The subjective value of art is represented in the medial prefrontal cortex (mPFC)
We first tested for brain regions correlating with the subjective liking ratings of each individual stimulus at the time of stimulus onset. We expected to find evidence for subjective value signals in the medial prefrontal cortex (mPFC), given this is the main area found to correlate with value judgments for many different stimuli in an extensive prior literature, including for visual art (e.g.,11,12,14,17,23,31,57). Consistent with our hypothesis, we found that voxels in the mPFC are positively correlated with subjective value across participants (Fig. 3; see Supplementary Fig. 3 for the timecourse of the BOLD signals in the mPFC cluster). Consistent with previous studies (e.g.,38,58–63), other regions also correlated with liking value (Supplementary Figs. 5 and 6).
These subjective value signals could reflect other psychological processes such as attention. We therefore performed a control analysis with the same GLM plus additional regressors that can act as proxies for the effects of attention and the memorability of stimuli, operationalized by reaction times, squared reaction times and the deviation from the mean rating64. We found that the subjective value signals in all participants that we report in Fig. 3c survived this control analysis (Supplementary Fig. 7).
Visual stream shows hierarchical, graded representations of low-level and high-level features
As illustrated in Fig. 1d, and reflecting our hypothesis regarding the encoding of low- vs. high-level features across layers of the DCNN, we hypothesized that the brain would decompose visual input similarly, with early visual regions first representing low-level features, and with downstream regions representing high-level features. Specifically, we analyzed visual cortical regions in the ventral and dorsal visual stream65 to test the degree to which low-level and high-level features are encoded in a graded, hierarchical manner. In pursuit of this, we constructed a GLM that included the shared features time-locked to stimulus onset. We identified voxels that are significantly modulated by at least one low-level feature by performing an F-test over the low-level feature beta estimates, repeating the same analysis with high-level features. We then compared the proportion of voxels that were significantly correlated with low-level features vs. high-level features in each region of interest in both the ventral and dorsal visual streams. This method allowed us to compare results across regions while controlling for different signal-to-noise ratios in the BOLD signal across different brain regions66. Regions of interest were independently identified by means of a detailed probabilistic visual topographical
map65. Consistent with our hypothesis, our findings suggest that low- and high-level features relevant for aesthetic valuation are indeed represented in the visual stream in a graded, hierarchical manner. Namely, the relative encoding of high-level features with respect to low-level features dramatically increases across the visual ventral stream (Fig. 4a).

Fig. 2 | The deep convolutional neural network (DCNN) model naturally encodes low-level and high-level features and predicts participants' choice behavior. a Schematic of the deep convolutional neural network (DCNN) model and the results of the decoding analysis42. The DCNN model was first trained on ImageNet object classifications, and then on the average ratings of art stimuli. We computed correlations between each of the LFS model features and activity patterns in each of the hidden layers of the DCNN model. We found that some low-level visual features exhibit significantly decreasing predictive accuracy over hidden layers (e.g., the mean hue and the mean saturation). We also found that a few computationally demanding low-level features showed the opposite trend (see the main text). We further found that some high-level visual features exhibit significantly increasing predictive accuracy over hidden layers (e.g., concreteness and dynamics). Results reproduced from42. b The DCNN model could successfully predict human participants' liking ratings significantly greater than chance across all participants. Statistical significance (p < 0.001, indicated by three stars) was determined by a permutation test (one-sided). Credit: Jean Metzinger, Portrait of Albert Gleizes (public domain; RISD Museum).

We found a similar hierarchical organization in the
dorsolateral visual stream (Fig. 4b), albeit less clearly demarcated than in the ventral case. We also confirmed in a supplementary analysis that assigning feature levels (high or low) according to our DCNN analysis, i.e., by using the slopes of our decoding results42, did not change the results of our fMRI analyses qualitatively and does not affect our conclusions (see Supplementary Fig. 8).
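To make the voxel-wise test concrete, a hypothetical sketch of the group F-test is given below; variable shapes are illustrative, and a full fMRI GLM would additionally include HRF convolution and nuisance regressors.

```python
# Hypothetical sketch: nested-model F-test asking, per voxel, whether one
# feature group (e.g., all low-level regressors) jointly explains variance.
import numpy as np
from scipy import stats

def frac_significant(Y, X, group_cols, alpha=1e-3):
    """Y: (time, voxels) BOLD; X: (time, regressors) design matrix;
    group_cols: column indices of the tested feature group."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    rss_full = ((Y - X @ beta) ** 2).sum(axis=0)
    X_red = np.delete(X, group_cols, axis=1)        # reduced model
    beta_red = np.linalg.lstsq(X_red, Y, rcond=None)[0]
    rss_red = ((Y - X_red @ beta_red) ** 2).sum(axis=0)
    df1, df2 = len(group_cols), n - k
    F = ((rss_red - rss_full) / df1) / (rss_full / df2)
    p = stats.f.sf(F, df1, df2)                     # one-sided F-test
    return float((p < alpha).mean())                # proportion of significant voxels
```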
We also performed an additional encoding analysis using cross-validation at each voxel of each participant67. Specifically, we performed a lasso regression at each voxel with the low- and high-level features that we considered in our original analyses. Hyperparameters were optimized in 12-fold cross-validation at each voxel across stimuli. As a robustness check, we determined whether our GLM results could be reproduced using the lasso regression analysis. We analyzed how low-level feature weights and high-level feature weights changed across ROIs. For this, we computed the sum of squares of the low-level feature weights and the sum of squares of the high-level feature weights at each voxel. Because these weight estimates include those that can be obtained by chance, we also computed the same quantities by performing the lasso regression with shuffled stimulus labels (labels were shuffled at every regression). The null distribution of feature magnitudes (the sum of squares) was estimated for low-level features and high-level features at each ROI. For each voxel, we asked whether the estimated low-level and high-level feature weights are significantly larger than what is expected from noise, by comparing the magnitude of the weights against the null distribution (p < 0.001). We then examined how the encoding of low-level vs. high-level features varied across ROIs, as we did in our original GLM analysis. As seen in Supplementary Fig. 9, the original GLM analysis results were largely reproduced in the lasso regression. Namely, low-level features are more prominently encoded in early visual regions, while high-level features are more prominently encoded in higher visual regions. In this additional analysis, such effects were clearly seen in five out of six participants, while one participant (P1) showed less clear early vs. late region-specific differentiation with regard to low- vs. high-level feature representation. We also note that the model's predictive accuracy in visual regions was lower for this participant (P1) than for the rest of the participants (Supplementary Fig. 10).
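A hypothetical sketch of this permutation control is given below; the summed-squared-weight statistic follows the description above, while shapes and the permutation count are illustrative.

```python
# Hypothetical sketch: per-voxel lasso with shuffled-label null distributions
# for the summed squared weights of one feature group.
import numpy as np
from sklearn.linear_model import LassoCV

def weight_energy_and_null(X, y, group_cols, n_perm=100, seed=0):
    """X: (n_stimuli, n_features); y: (n_stimuli,) voxel responses."""
    rng = np.random.default_rng(seed)
    observed = np.sum(LassoCV(cv=12).fit(X, y).coef_[group_cols] ** 2)
    null = np.array([
        np.sum(LassoCV(cv=12).fit(X, rng.permutation(y)).coef_[group_cols] ** 2)
        for _ in range(n_perm)  # labels shuffled at every regression
    ])
    return observed, null  # compare observed against null (e.g., p < 0.001)
```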
Non-linear feature representations
We found that the features of the LFS model are represented across brain regions and contribute to value computation. However, it is possible that nonlinear combinations of these features are also represented in the brain and that these may contribute to value computation. To explore this possibility, we constructed a new set of nonlinear features by multiplying pairs of the LFS model's features (interaction terms). We grouped these new features into three groups: interactions between pairs of low-level features (low-level × low-level), interactions between pairs of low-level and high-level features (low-level × high-level), and interactions between pairs of high-level features (high-level × high-level). To control the dimensionality of the new feature groups, we performed principal component analysis within each of the three groups of nonlinear features, and took the first five PCs to match the number of the high-level features specified in our original LFS model. We performed a LASSO regression analysis with these new features and the original features.
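The construction can be sketched as follows (hypothetical helper; the grouping and PC counts follow the description above, everything else is illustrative):

```python
# Hypothetical sketch: pairwise interaction features within and between the
# low- and high-level groups, reduced to the first five PCs per group.
import numpy as np
from itertools import combinations, product
from sklearn.decomposition import PCA

def interaction_pcs(low, high, n_pcs=5):
    """low: (n_images, n_low); high: (n_images, n_high) feature matrices."""
    ll = np.column_stack([low[:, i] * low[:, j]
                          for i, j in combinations(range(low.shape[1]), 2)])
    lh = np.column_stack([low[:, i] * high[:, j]
                          for i, j in product(range(low.shape[1]),
                                              range(high.shape[1]))])
    hh = np.column_stack([high[:, i] * high[:, j]
                          for i, j in combinations(range(high.shape[1]), 2)])
    groups = {"low_x_low": ll, "low_x_high": lh, "high_x_high": hh}
    return {name: PCA(n_components=n_pcs).fit_transform(m)
            for name, m in groups.items()}
```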
We found that in most participants, nonlinear features created from pairs of high-level features produced significant correlations with neural activity across multiple regions, while also showing similar evidence for a hierarchical organization from early to higher-order regions, as found for the linear high-level features (Fig. 5, Supplementary Fig. 11). Though comparisons between separately optimized lasso regressions should be interpreted cautiously, the mean correlations of the model with both linear and nonlinear features across ROIs showed a slight improvement in predictive accuracy compared to the original LFS model with only linear features (Supplementary Fig. 10), while the DCNN model features out-performed both the original LFS model and the LFS model plus nonlinear features.
Indeed, nonlinear features created from pairs of high-level features contribute significantly more to behavioral choice predictions than do other nonlinear features not built solely from high-level features (Supplementary Fig. 12). The first principal component of the high-level × high-level features captured the behavior of three participants (3, 5, 6) well, while the other participants showed somewhat different weight profiles. However, we found that these newly added features only modestly improved the model's behavioral predictions (Supplementary Fig. 13).
[Fig. 3 panels: whole-brain t-score maps for participants P1–P6, shown at coronal slices (y = 68, 63, 46, 57, 51, 44).]
Fig. 3 | Subjective value (i.e., liking rating). Subjective value for art stimuli at the time of stimulus onset was found in the medial prefrontal cortex in all six fMRI participants (one-sided t-test; adjusted for multiple comparisons: whole-brain cFWE p < 0.05 with height threshold at p < 0.001).
DCNN model representations
We then tested whether activity patterns in these regions resemble the
computations performed by the hidden layers of the DCNN model. We
extracted the rst three principal components from each layer of the
DCNN, and included each as regressors in a GLM. Indeed, we found
evidence that both the ventral and dorsal visual stream exhibits a
similar hierarchical organization to that of the DCNN, such that lower
visual areas correlated better with activity in the early hidden layers of
the DCNN, while higher-order visualareas (in both visualstreams) tend
to correlate better with activity in deeper hidden layers of the DCNN
(Fig. 4c, d).
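A hypothetical sketch of how such layer-wise regressors can be obtained from VGG16 is given below (layer bookkeeping simplified; the fully connected layers included in the paper's 15-layer analysis are omitted for brevity).

```python
# Hypothetical sketch: top-3 PCs of each convolutional layer's activations
# across all stimuli, for use as GLM regressors.
import torch
from torchvision import models
from sklearn.decomposition import PCA

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def layer_pcs(images: torch.Tensor, n_pcs: int = 3) -> dict:
    """images: (n_stimuli, 3, 224, 224), already preprocessed."""
    pcs, x = {}, images
    with torch.no_grad():
        for i, layer in enumerate(vgg.features):
            x = layer(x)
            if isinstance(layer, torch.nn.Conv2d):
                acts = x.flatten(start_dim=1).numpy()  # (n_stimuli, units)
                pcs[f"conv{i}"] = PCA(n_components=n_pcs).fit_transform(acts)
    return pcs
```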
We also performed additional analyses with LASSO regression using the DCNN features. To test whether we could reproduce the DCNN results originally obtained with the GLM approach (as shown in Fig. 4), we first performed LASSO regression with the same 45 features from all hidden layers. Hyperparameters were optimized by 12-fold cross-validation. The estimated weights were compared against the null distribution of each ROI, constructed from the same analysis with shuffled stimulus labels. We then also performed the same analysis but with a larger set of features (150 features). In Supplementary Figs. 14 and 15, we show how the weights on features from different layers varied across different ROIs in the visual stream. We computed the sum of squared weights of the hidden layer groups (layers 1–4, 5–9, 10–13, 14–15). Again, in order to discard weight estimates that can be obtained by chance, we computed a null distribution by repeating the same analysis with shuffled labels and took the weight estimates that are significantly larger than the null distribution (at p < 0.001) in each ROI. We again found that LASSO regression with within-subject cross-validation reproduced our original GLM analysis results.
Fig. 4 | fMRI signals in visual cortical regions show similarity to our LFS model and DCNN model. a Encoding of low- and high-level features in the visual ventral-temporal stream in a graded hierarchical manner. In general, the relative encoding of high-level features with respect to low-level features increases dramatically across the ventral-temporal stream. The maximum probabilistic map65 is shown color-coded on the structural MR image at the top to illustrate the anatomical location of each ROI. The proportions of voxels that significantly correlated with low-level features (blue; one-sided F-test p < 0.001) against high-level features (red; one-sided F-test p < 0.001) are shown for each ROI. See the Methods section for detail. b Encoding of low- and high-level features in the dorsolateral visual stream. The anatomical location of each ROI65 is color-coded on the structural MR image. c Encoding of DCNN features (hidden layers' activation patterns) in the ventral-temporal stream. The top three principal components (PCs) from each layer of the DCNN were used as features in this analysis. In general, early regions more heavily encode representations found in early layers of the DCNN, while higher-order regions encode representations found in deeper CNN layers. The proportions of voxels that significantly correlated with PCs of convolutional layers 1–4 (light blue), convolutional layers 5–9 (blue), convolutional layers 10–13 (purple), and fully connected layers 14–15 (pink) are shown for each ROI. Significance was set at p < 0.001 by one-sided F-test. d Encoding of DCNN features in the dorsolateral visual stream. Credit: Jean Metzinger, Portrait of Albert Gleizes (public domain; RISD Museum).
As a further control analysis, we asked whether similar results could be obtained from a DCNN model with random, untrained weights68. We repeated the same LASSO regression analysis as in our analysis with the trained DCNN model. We found that such a model does not reproduce the finding of a hierarchical representation of layers that we found across the visual stream and other cortical areas in the analysis with trained DCNN weights (Supplementary Figs. 16 and 17).
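For reference, the untrained control differs from the sketch above only in its weights (a hypothetical one-liner):

```python
# Hypothetical sketch of the control model: identical VGG16 architecture with
# random (untrained) weights, passed through the same layer-PC analysis.
from torchvision import models

vgg_untrained = models.vgg16(weights=None).eval()
```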
PPC and PFC show mixed coding of low- and high-level features
We next probed these representations in downstream regions of association cortex69,70. We performed the same analysis with the same GLM as before in regions of interest that included the posterior parietal cortex (PPC), lateral prefrontal cortex (lPFC) and medial prefrontal cortex (mPFC). We found that both the LFS model features and the DCNN layers were represented in these regions in a mixed manner71,72. We found no clear evidence for a progression of the hierarchical organization that we had observed in the visual cortex; instead, each of these regions appeared to represent both low- and high-level features to a similar degree (Fig. 6a). Activity in these regions also correlated with the hidden layers of the DCNN model (Fig. 6b). We obtained similar results using a LASSO regression analysis with cross-validation based on either the LFS model features (Supplementary Fig. 18) or the DCNN features (Supplementary Figs. 19 and 20). These findings suggest that, as we will see, these regions may play a primary role in the feature integration required for subjective value computations.
Features encoded in PPC and lPFC are strongly coupled to the subjective value of visual art in mPFC
Having established that both the engineered LFS model features and the emergent DCNN model features are hierarchically represented in the brain, we asked if and how these features are ultimately integrated to compute the subjective value of visual art. First, we analyzed how aesthetic value is represented across cortical regions alongside the model features by adding the participants' subjective ratings to the GLM. We found that subjective values are, in general, more strongly represented in the PPC as well as in the lateral and medial PFC than in early and late visual areas (Fig. 7a and Supplementary Fig. 21). Furthermore, value signals appeared to become more prominent in medial prefrontal cortex compared to the lateral parietal and prefrontal regions (consistent with a large prior literature, e.g.,11,14,17,23,31,57,73). This pattern was not altered when we controlled for reaction times and the distance of individual ratings from the mean ratings, proxy measures for the degree of attention paid to each image (Supplementary Fig. 22). In a further validation of our earlier feature encoding analyses, we found that the pattern of hierarchical feature representation in visual regions was unaltered by the inclusion of ratings in the GLM (Supplementary Fig. 23). We note that even when using the DCNN model to classify features as either high or low, as opposed to relying on the a priori assignment from the LFS model, the results of our fMRI analyses did not change qualitatively and our conclusions are unaffected (Supplementary Fig. 8).
These results suggest that rich feature representations in the PPC and lateral PFC could potentially be leveraged to construct subjective values in mPFC. However, it is also possible that features represented in visual areas are directly used to construct subjective value in mPFC. To test this, we examined which of the voxels representing the LFS model features across the brain are coupled with voxels that represent subjective value in mPFC at the time when participants make decisions about the stimuli. A strong coupling would support the possibility that such feature representations are integrated at the time of decision-making in order to support a subjective value computation.
To test for this, we first performed a psychophysiological interaction (PPI) analysis, examining which voxels are coupled with regions that represent subjective value when participants made decisions (Fig. 7b and Supplementary Fig. 24). We stress that this is not a trivial signal correlation, as in our PPI analysis all the value and feature signals are regressed out; the coupling is therefore due to noise correlations between voxels. We then asked how much the feature-encoding voxels overlap with these PPI voxels. Specifically, we tested for the fraction of feature-encoding voxels that are also correlated with the PPI regressor in each ROI. Finding overlap between feature-encoding voxels and PPI connectivity effects would be consistent with a role for these feature-encoding representations in value construction. We found that the overlap was most prominent in the PPC and lPFC, while there was virtually no overlap in the visual areas at all (Fig. 7c), consistent with the idea that features in the PPC and lPFC, rather than in visual areas, are involved in constructing subjective value representations in mPFC. A more detailed decomposition of the PFC ROI from the same analysis shows the contribution of individual subregions of lateral and medial PFC (Supplementary Fig. 25).
[Fig. 5 panels: proportion of voxels per ROI encoding each feature group (low level, high level, low × low, high × low, high × high), shown for participants P1–P6 across ventral-temporal ROIs (V1v, V2v, V3v, hV4, VO1, VO2, PHC1, PHC2) and dorsolateral ROIs (V1d, V2d, V3d, V3a, V3b, LO1, LO2, hMT, MST).]
Fig. 5 | Encoding of nonlinear feature representations. We performed an encoding analysis of low-level, high-level, and interaction-term features (low × low, high × high, low × high), using lasso regression with cross-validation within subject. The results for ROIs in the ventral-temporal and dorsolateral visual streams are shown.
We also performed a control analysis to test the specificity of the coupling to an experimental epoch, by constructing a similar PPI regressor locked to the epoch of inter-trial intervals (ITIs). This analysis showed a dramatically altered coupling that did not involve the same PPC and PFC regions (Supplementary Fig. 26). These findings indicate that the coupling of PPC and lPFC with mPFC value representations occurs specifically at the time that subjective value computations are being performed, suggesting that these regions play an integrative role over feature representations at the time of valuation. We note, however, that all of our analyses are based on correlations, which do not provide information about the direction of the coupling.
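The logic of the PPI regressor can be sketched as follows (highly simplified and hypothetical; a standard PPI additionally deconvolves the seed BOLD signal to the neural level before forming the interaction):

```python
# Hypothetical sketch of a PPI interaction regressor: seed activity multiplied
# by the stimulus-on psychological variable, re-convolved with the HRF.
import numpy as np

def ppi_regressor(seed_neural: np.ndarray, stim_on: np.ndarray,
                  hrf: np.ndarray) -> np.ndarray:
    """seed_neural: (T,) seed signal; stim_on: (T,) 1 during stimulus, else 0."""
    interaction = seed_neural * stim_on
    return np.convolve(interaction, hrf)[: len(seed_neural)]
```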
Fig. 6 | Parietal and prefrontal cortex encode features in a mixed manner. a Encoding of low- and high-level features from the LFS model in posterior parietal cortex (PPC), lateral prefrontal cortex (lPFC) and medial prefrontal cortex (mPFC). The ROIs used in this analysis are indicated by the colors shown in a structural MR image at the top. b Encoding of the DCNN features (activation patterns in the hidden layers) in PPC and PFC. The same analysis method as in Fig. 4 was used. Credit: Jean Metzinger, Portrait of Albert Gleizes (public domain; RISD Museum).

Discussion
It is an open question how the human brain computes the value of complex stimuli such as visual art1,3,6,74. Here, we addressed this
question by applying two different computational models to neuroimaging data, demonstrating how the brain transforms visual stimuli into subjective value, all the way from the primary visual cortex to parietal and prefrontal cortices. The linear feature summation (LFS) model directly formulates our hypothesis, by extracting interpretable stimulus features and integrating over the features to construct subjective value. This linear regression model is related to a standard deep convolutional neural network (DCNN) trained on object recognition, because we found that the LFS model features are represented in hidden layers of the DCNN in a hierarchical manner. Here we found that both of these models predict participants' activity across the brain, from the visual cortex to prefrontal cortex. Though our correlation-based analyses do not address the directionality of information processing across regions, our results shed light on a possible mechanism by which the brain could transform a complex visual stimulus into a simple value that informs decisions, using a rich feature space shared across stimuli.
Focusing first on the visual system, we found that low-level features that predict visual art preferences are represented more robustly in early visual cortical areas, while high-level features that predict preferences are increasingly represented in higher-order visual areas. These results support a hierarchical representation of the features required for valuation of visual imagery, and further support a model whereby lower-level features extracted by early visual regions are integrated to produce higher-level features in the higher visual system75. While the notion of hierarchical representations in the visual system is well established in the domain of object recognition76–79, our results substantially extend these findings by showing that features relevant to a very different behavioral task, the formation of value judgments, are also represented robustly in a similar hierarchical fashion.

Fig. 7 | Features are integrated from PPC and lateral PFC to medial PFC when constructing the subjective value of visual art. a Encoding of low- and high-level features (green) and liking ratings (red) across brain regions. Note that the ROIs for the visual areas are now grouped as V1-2-3 (V1, V2 and V3) and V-high (visual areas higher than V3). See the Methods section for detail. b The schematics of the functional coupling analysis to test how feature representations are coupled with subjective value. We identified regions that encode features (green) by performing a one-sided F-test (p < 0.05 whole-brain cFWE with the height threshold p < 0.001). We also performed a psychophysiological interaction (PPI) analysis (orange: p < 0.001 uncorrected) to determine the regions that are coupled to the seed regions in mPFC that encode subjective value (i.e., liking rating) during stimulus presentation (red: seed; see Supplementary Fig. 24). We then tested for the proportion of overlap between the voxels identified in these analyses in a given ROI. c The results of the functional coupling analysis show that features represented in the PPC and lPFC are coupled with the region in mPFC encoding subjective value. This result dramatically contrasts with a control analysis focusing on the ITI instead of stimulus presentations (Supplementary Fig. 26). Credit: Jean Metzinger, Portrait of Albert Gleizes (public domain; RISD Museum).
We then showed the process through which feature representations are mapped onto a single subjective value dimension in a network of brain regions including the posterior parietal cortex (PPC) and the lateral and medial prefrontal cortices (lPFC and mPFC). While previous studies have hinted at the use of such a feature-based framework in the prefrontal cortex (PFC), especially in orbitofrontal cortex (OFC), in those previous studies the features were more explicit properties of a stimulus (e.g., the movement and the color of dots45,80,81, or items that are suited to a functional decomposition such as food odor39 or nutritive components of food38; see also refs. 36,37). Here we show that features relevant for computing the subjective value of visual stimuli are widely represented in lPFC and PPC, whereas subjective value signals are more robustly represented in parietal and frontal regions, with the strongest representation in mPFC.
Further, we showed that the PFC and PPC regions encoding low- and high-level features enhanced their coupling with the mPFC region encoding subjective value at the time of image presentation. While further experiments are needed to infer the directionality of the connectivity effects, our findings are compatible with a framework in which low- and high-level feature representations in lPFC and PPC are utilized to construct value representations in mPFC, as we hypothesized in the LFS model.
Going beyond our original LFS model, we also found that in most participants, nonlinear features created from pairs of the high-level features specified in the original model produced significant correlations with neural activity across multiple regions, while largely showing similar evidence for a hierarchical organization from early to higher-order regions, as found for the linear high-level features. These findings indicate that the brain encodes a much richer set of features than the set of low-level and high-level features specified in our original LFS model. It will be interesting to see if the nonlinear features that we introduced here, especially the ones constructed from pairs of high-level features, can also be used to support behavioral judgments beyond the simple value judgments studied here, such as object recognition and other more complex judgements53. We also note that there are other ways to construct nonlinear features. Further studies with richer sets of features, e.g., other forms of interactions, may improve behavioral and neural predictions.
While previous studies have suggested similarities between representations of units in DCNN models for object recognition and neural activity in the visual cortex (e.g.,82–85), here we show that the DCNN model can also be useful for informing how visual features are utilized for value computation across a broader expanse of the brain. Specifically, we found evidence to support the hierarchical construction of subjective value, where the early layers of the DCNN correlate with early areas of the visual system, and the deeper layers of the DCNN correlate with higher areas of the visual system. All of the DCNN layers' information was equally represented in the PPC and PFC.

These findings are consistent with the suggestion that the hierarchical features which emerge in the visual system are projected into the PPC and PFC to form a rich feature space for constructing subjective value. Further studies using neural network models with recurrent connections46 may illuminate more detail, such as the temporal dynamics, of value construction in such a feature space across brain regions.
Accumulating evidence suggests that value signals can be found widely across the brain, including even in sensory regions (e.g.,38,58–63), posing the question of what differential contribution different brain regions make if value representations are so ubiquitous. While we also saw multiple brain regions that appeared to correlate with value signals during aesthetic valuation, our results suggest an alternative account for the widespread prevalence of value signals: some components of the value signals, especially in sensory cortex, might reflect features that are ultimately used to construct value at later stages of information processing, rather than value itself. Because neural correlates of features have not been probed previously, our results suggest that it may be possible to reinterpret at least some apparent value representations as reflecting the encoding of precursor features instead of value per se. In the present case, even after taking feature representations into account, value signals were still detectable in the medial prefrontal cortex and elsewhere, supporting the notion that some brain regions are especially involved in value coding more than others. In future work it may be possible to dissociate value even more clearly from its sensory precursors by manipulating the context in which stimuli are presented, such that features remain invariant across contexts while the value changes. In doing so, further studies can illuminate finer dissociations between features and value signals43.
One open question is how the brain has come to be equipped with a feature-based value construction architecture. We recently showed that a DCNN model trained solely on object recognition tasks represents the LFS model's low- and high-level features in its hidden layers in a hierarchical manner, suggesting the possibility that such features could naturally emerge over development42. While the similarity between the DCNN and the LFS model correlations with fMRI responses in adult participants provides a promising link between these models and the brain, further investigations applying these models to studies with children or other species have the potential to inform understanding of the origin of feature-based value construction across development and across species.
Following the typical approach utilized in non-human primate and other animal neurophysiology, as well as in human visual neuroimaging, we performed in-depth scanning (20 sessions) in a relatively small number of participants (six) in order to address our neural hypotheses. Because we were able to obtain a sufficient amount of fMRI data in individual participants, we were able to reliably perform single-subject inference in each participant and evaluate the results across participants side-by-side. This approach contrasts with a classic group-based neuroimaging study in which results are obtained from the group average of many participants, where each participant performs short sessions, thus providing data with low signal to noise. One advantage of our approach over the group averaging approach is that we can treat each participant as a replication unit, meaning that we can obtain multiple replications44 from one study instead of just one group result. If every participant shows similar patterns, then it is unlikely that those results are spurious, and much more likely that they reflect a true property of human brain function. We indeed found that all participants similarly performed our hypothesized feature-based value construction across the brain. Another advantage of our methodological
approach concerns possible heterogeneity across participants. Not all
brains are the same, and there is known to be considerable variation in
the location and morphology of different brain areas across
individuals86. Thus, it is unlikely that all brains actually represent the
same variable at the same MNI coordinates. The individual subject-
based approach to fMRI analyses used here takes individual neuroa-
natomical variation into account, allowing for generalization that goes
beyond a spatially smoothed average that does not represent any real
brain. We note that one important limitation of this in-depth fMRI
method is that it is not ideal for studying and characterizing differ-
ences across individuals. To gain a comprehensive account of such
variability across individuals, it would be necessary to collect data from
a much larger cohort of participants. As it is not feasible to scale the in-
depth approach to such large cohorts due to experimenter time and
resource constraints, such individual difference studies would neces-
sarily require adopting more standard group-level scanning approa-
ches and analyses.
While we found that results from the visual cortex were largely
consistent across participants, the proportion of features represented
in PPC and PFC, as well as the features that were used, were quite
different across participants. Understanding such individual differ-
ences will be important in future work. For instance, there is evidence
that art experts tend to evaluate art differently from people with no
artistic training87,88. It would be interesting to study if feature repre-
sentations may differ between experts and non-experts, while probing
whether the computational motif that we found here (hierarchical
visual feature representation in visual areas, value construction in PPC
and PFC) might be conserved across different levels of expertise. We should also note that the model's predictive accuracy for liking ratings varied across participants. It is likely that some participants used features that our model did not consider, such as personal experience associated with the stimuli. Brain regions such as the hippocampus may potentially be involved in such additional feature computations. Further, behavior and fMRI signals can be inherently noisy, in that there will be a portion of the data that cannot be predicted (i.e., a noise ceiling). Characterizing the contribution of these noise components will require further experiments with repeated measurements of decisions about the same stimuli.
Taken together, these findings are consistent with the existence of a large-scale processing hierarchy in the brain that extends from early visual cortex to medial prefrontal cortex, whereby visual inputs are transformed into various features through the visual stream. These features are then projected to PPC and lPFC, and subsequently integrated into a subjective value judgment in mPFC. Crucially, the flexibility afforded by such a feature-based mechanism of value construction ensures that value judgments can be formed even for stimuli that have never before been seen, or in circumstances where the goal of valuation varies (e.g., selecting a piece of art as a gift). Therefore, our study proposes a brain-wide computational mechanism that is not limited to aesthetics, but can be generalized to value construction for a wide range of visual and other sensory stimuli.
Methods
Participants
All participants provided informed consent for their participation in
the study, which was approved by the Caltech IRB.
Six volunteers (female: 6; age 18–24 yr: 4; 25–34 yr: 1; 35–44 yr: 1; 4 White, 2 Asian) were recruited into our fMRI study. One participant had completed a master's degree or higher, four participants had earned a college degree as their highest level, and one participant had a high-school degree as their highest degree. None of the participants possessed an art
degree. All of the participants reported that they visit art museums less
than once a month.
In addition, thirteen art-experienced participants (reported in our previous behavioral paper42) (female: 6; age 18–24 yr: 3; 25–34 yr: 9; 35–44 yr: 1) were invited to evaluate the high-level feature values (outside the scanner). These feature-annotation participants were primarily
recruited from the ArtCenter College of Design community.
Stimuli
The same stimuli as our recent behavioral study42 were used in the
current fMRI study. Painting stimuli were taken from the visual art
encyclopedia www.wikiart.org. Using a script that randomly selects
images in a given category of art, we downloaded 206 or 207 images
from four categories of art (825 in total). The categories were Abstract Art, Impressionism, Color Fields, and Cubism. We randomly downloaded images with each tag using our custom code in order to
avoid subjective bias. We supplemented this database with an addi-
tional 176 paintings that were used in a previous study52. For the fMRI
study reported here, one image was excluded from the full set of 1001
images to have an equal number of trials per run (50 images/run × 20
runs = 1000 images).
fMRI task
On each trial, participants were presented with an image of the artwork
on the computer screen for three seconds. Participants were then
presented with a scale of 0, 1, 2, 3 on which they had to indicate how much they liked the artwork. The location of each numerical score was randomized across trials. Participants had to press a button on a button box, held with both hands, to indicate their rating within three seconds, where each of the four buttons corresponded to a particular location on the screen from left to right. Participants were instructed to press the left (right) two buttons with their left (right) thumb. After a brief feedback period showing the chosen rating (0.5 s), a center cross was shown during the inter-trial interval (jittered between 2 and 9 s). Each run consisted of 50 trials. Participants were invited to the study over four days to complete twenty runs, completing on average five runs per day.
fMRI data acquisition
fMRI data were acquired on a Siemens Prisma 3T scanner at the Caltech Brain Imaging Center (Pasadena, CA). With a 32-channel radio-frequency coil, a multi-band echo-planar imaging (EPI) sequence was employed with the following parameters: 72 axial slices (whole-brain), A-P phase encoding, 30-degree slice tilt with respect to the AC-PC line, echo time (TE) of 30 ms, multi-band acceleration of 4, repetition time (TR) of 1.12 s, 54-degree flip angle, 2 mm isotropic resolution, echo spacing of 0.56 ms, 192 mm × 192 mm field of view, in-plane acceleration factor 2, multi-band slice acceleration factor 4.
Positive and negative polarity EPI-based field maps were collected before each run with parameters very similar to those of the functional sequence described above (same acquisition box, number of slices, resolution, echo spacing, bandwidth and EPI factor), single band, TE of 50 ms, TR of 5.13 s, 90-degree flip angle.
T1-weighted and T2-weighted structural images were also
acquired once for each participant with 0.9 mm isotropic resolution.
The T1's parameters were: repetition time (TR), 2.4 s; echo time (TE), 0.00232 s; inversion time (TI), 0.8 s; flip angle, 10 degrees; in-plane acceleration factor 2. The T2's parameters were: TR, 3.2 s; TE, 0.564 s; flip angle, 120 degrees; in-plane acceleration factor 2.
fMRI data processing
Results included in this manuscript come from preprocessing per-
formed using fMRIPrep 1.3.2 (ref. 89; RRID:SCR_016216), which is based
on Nipype 1.1.9 (ref. 90; RRID:SCR_002502).
Anatomical data preprocessing. The T1-weighted (T1w) image was corrected for intensity non-uniformity (INU) with N4BiasFieldCorrection91, distributed with ANTs 2.2.0 [92, RRID:SCR_004757], and used as T1w-reference throughout the workflow. The T1w-reference was then skull-stripped with a Nipype implementation of the antsBrainExtraction.sh workflow (from ANTs), using OASIS30ANTs as target template. Spatial normalization to the ICBM 152 Nonlinear Asymmetrical template version 2009c was performed through nonlinear registration with antsRegistration (ANTs 2.2.0), using brain-extracted versions of both T1w volume and template. Brain tissue segmentation of cerebrospinal fluid (CSF), white-matter (WM) and gray-matter (GM) was performed on the brain-extracted T1w using fast.
Functional data preprocessing. For each of the 20 BOLD runs found
per subject (across all tasks and sessions), the following preprocessing
was performed. First, a reference volume and its skull-stripped version
were generated using a custom methodology of fMRIPrep. A
deformation field to correct for susceptibility distortions was esti-
mated based on two echo-planar imaging (EPI) references with
opposing phase-encoding directions, using 3dQwarp (AFNI
20160207). Based on the estimated susceptibility distortion, an
unwarped BOLD reference was calculated for a more accurate co-
registration with the anatomical reference.
The BOLD reference was then co-registered to the T1w reference using flirt with the boundary-based registration cost-function. Co-registration was configured with nine degrees of freedom to account for distortions remaining in the BOLD reference. Head-motion parameters with respect to the BOLD reference (transformation matrices, and six corresponding rotation and translation parameters) were estimated before any spatiotemporal filtering using mcflirt. The BOLD time-series (including slice-timing correction when applied) were resampled onto their original, native space by applying a single, composite transform to correct for head-motion and susceptibility distortions. These resampled BOLD time series will be referred to as preprocessed BOLD in original space, or just preprocessed BOLD. The BOLD time series were also resampled to MNI152NLin2009cAsym standard space, generating a preprocessed BOLD run in MNI152NLin2009cAsym space. Several confounding time series were calculated based on the preprocessed BOLD: framewise displacement (FD), DVARS and three region-wise global signals. FD and DVARS are calculated for each functional run, both using their implementations in Nipype.
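For illustration, framewise displacement in this sense can be computed from the six rigid-body motion parameters as in the minimal sketch below; the parameter ordering, the file name, and the 50 mm head-radius convention for converting rotations to displacements are our own assumptions (fMRIPrep's actual implementation lives in Nipype).

```python
import numpy as np

# Assumed layout: (n_volumes, 6) = 3 translations (mm), then 3 rotations (radians)
motion = np.loadtxt('motion_params.txt')  # hypothetical file name

HEAD_RADIUS_MM = 50.0                     # conventional head radius
deltas = np.abs(np.diff(motion, axis=0))  # volume-to-volume parameter changes
deltas[:, 3:] *= HEAD_RADIUS_MM           # rotations -> arc length in mm

fd = np.zeros(motion.shape[0])
fd[1:] = deltas.sum(axis=1)               # FD per volume; first volume set to 0
```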
The three global signals are extracted within the CSF, the WM, and the whole-brain masks. In addition, a set of physiological regressors were extracted to allow for component-based noise correction. Principal components are estimated after high-pass filtering the preprocessed BOLD time series (using a discrete cosine filter with a 128 s cut-off) for the two CompCor variants: temporal (tCompCor) and anatomical (aCompCor). Six tCompCor components are then calculated from the top 5% variable voxels within a mask covering the subcortical regions. This subcortical mask is obtained by heavily eroding the brain mask, which ensures it does not include cortical GM regions. For aCompCor, six components are calculated within the intersection of the aforementioned mask and the union of CSF and WM masks calculated in T1w space, after their projection to the native space of each functional run (using the inverse BOLD-to-T1w transformation).
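As an illustration of the CompCor step described above (and not the fMRIPrep implementation itself), a minimal aCompCor-style sketch might extract principal components from noise-mask voxels after high-pass filtering; the array shapes and the simple DCT-based filter below are assumptions made for clarity.

```python
import numpy as np
from scipy.fftpack import dct, idct

def highpass_dct(ts, tr, cutoff=128.0):
    """Remove fluctuations slower than 1/cutoff Hz with a discrete cosine basis."""
    n = ts.shape[0]
    coeffs = dct(ts, axis=0, norm='ortho')
    # Number of DCT coefficients representing periods longer than `cutoff` seconds
    k = int(np.floor(2 * n * tr / cutoff)) + 1
    coeffs[:k] = 0.0
    return idct(coeffs, axis=0, norm='ortho')

def compcor(bold, noise_mask, tr, n_components=6):
    """bold: (time, x, y, z) array; noise_mask: boolean (x, y, z) array (e.g., CSF/WM)."""
    ts = bold[:, noise_mask]                     # (time, voxels) within the noise mask
    ts = highpass_dct(ts, tr)                    # high-pass filter (128 s cutoff)
    ts = (ts - ts.mean(0)) / (ts.std(0) + 1e-8)  # standardize each voxel
    u, _, _ = np.linalg.svd(ts, full_matrices=False)
    return u[:, :n_components]                   # confound regressors (time, components)
```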
The head-motion estimates calculated in the correction step were also placed within the corresponding confounds file. All resamplings can be performed with a single interpolation step by composing all the pertinent transformations (i.e., head-motion transform matrices, susceptibility distortion correction when available, and co-registrations to anatomical and template spaces). Gridded (volumetric) resamplings were performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimize the smoothing effects of other kernels.
Computational models
The computational methods and behavioral modeling reported in this manuscript overlap with those reported in our recent article focusing exclusively on behavior42. For completeness, we reproduce some of the descriptions of these methods as first described in ref. 42.
Linear feature summation model (LFS model)
We hypothesized that subjective preferences for visual stimuli are constructed by the influence of visual and emotional features of the stimuli. At its simplest, we assumed that the subjective value of the i-th stimulus, $v_i$, is computed by a weighted sum of feature values $f_{i,j}$:

$$v_i = \sum_{j=0}^{n_f} w_j f_{i,j} \qquad (1)$$

where $w_j$ is the weight of the j-th feature, $f_{i,j}$ is the value of the j-th feature for stimulus $i$, and $n_f$ is the number of features. The 0-th feature is a constant, $f_{i,0} = 1$ for all $i$.
Importantly, $w_j$ is not a function of a particular stimulus but is shared across all visual stimuli, reflecting the taste of a participant. The same taste (the $w_j$'s) can also be shared across different participants, as we showed in our behavioral analysis. The features $f_{i,j}$ were computed from the visual stimuli; we used the same feature values to predict liking ratings across participants. We used the simple linear model Eq. (1) to predict liking ratings in our behavioral analysis (see below for how we determined features and weights).
As we schematically showed in Fig. 1, we hypothesized that the input stimulus is first broken down into low-level features and then transformed into high-level features, and indeed we found that a significant portion of the variance in high-level features can be predicted by a set of low-level features. This hierarchical structure of the LFS model was further tested in our DCNN and fMRI analyses.
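To make Eq. (1) concrete, the following is a minimal sketch of fitting the LFS model with cross-validated regularized linear regression; the file names and the choice of scikit-learn's RidgeCV are illustrative assumptions rather than the exact fitting procedure used here.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# features: (n_stimuli, n_f) matrix of low- and high-level feature values,
# shared across participants; ratings: one participant's liking ratings (0-3).
features = np.load('features.npy')        # hypothetical file names
ratings = np.load('ratings_subj01.npy')

# Linear feature summation v_i = sum_j w_j * f_ij; the fitted intercept plays
# the role of the constant 0-th feature (f_i0 = 1).
model = RidgeCV(alphas=np.logspace(-3, 3, 13))
scores = cross_val_score(model, features, ratings, cv=10, scoring='r2')
print(f'mean cross-validated R^2: {scores.mean():.3f}')

model.fit(features, ratings)
weights = model.coef_                     # w_j: one participant's "taste"
```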
Features
Because we did not know a priori what features would best describe
human aesthetic values for visual art, we constructed a large feature
set using previously published methods from computer vision aug-
mented with additional features that we ourselves identified using
additional existing machine learning methods.
Visual low-level features introduced in ref. 49. We employed 40
visual features introduced in ref. 49. We do not repeat descriptions of the features here, but briefly, the feature set consists of 12 global features that are computed from the entire image, covering color distributions, brightness effects, blurring effects, and edge detection, and 28 local features that are computed for separate segments of the image (the first, second, and third largest segments). Most features are computed straightforwardly in either HSL (hue, saturation, lightness) or HSV (hue, saturation, value) space (e.g., average hue value).
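As an example of such a feature, the average hue of an image in HSV space can be computed as in the sketch below; the use of OpenCV and the file name are illustrative assumptions, and we average hue on the unit circle since it is an angular quantity.

```python
import cv2
import numpy as np

img = cv2.imread('painting.jpg')             # hypothetical input image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)   # OpenCV stores 8-bit hue in [0, 180)

# Circular mean of hue: a plain arithmetic mean misbehaves at the wrap-around.
hue_rad = hsv[:, :, 0].astype(float) * 2.0 * np.pi / 180.0
mean_hue_deg = np.degrees(np.angle(np.mean(np.exp(1j * hue_rad)))) % 360.0

mean_saturation = hsv[:, :, 1].mean() / 255.0  # normalized to [0, 1]
mean_value = hsv[:, :, 2].mean() / 255.0       # V (value) channel
```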
One feature that deserves description is the blurring effect. Following refs. 49,93, we assumed that the image I was generated from a hypothetical sharp image with a Gaussian smoothing filter with an unknown variance σ. Assuming that the frequency distribution of the hypothetical image is approximately the same as that of the blurred, actual image, the parameter σ represents the degree to which the image was blurred. The σ was estimated by taking the Fourier transform of the original image and finding the highest frequency whose power is greater than a certain threshold:
$$f_{\mathrm{blur}} = \max\left(k_x, k_y\right) \propto \frac{1}{\sigma} \qquad (2)$$

where $k_x = 2(x - n_x/2)/n_x$ and $k_y = 2(y - n_y/2)/n_y$, with $(x, y)$ and $(n_x, n_y)$ being the coordinates of the pixel and the total number of pixel values, respectively. The above max was taken within the components whose power is larger than four49.
The segmentation for this feature set was computed by a technique called kernel GraphCut50,94. Following ref. 49, we generated a total of at least six segments for each image using a C++ and Matlab package
for kernel graph cut segmentation94. The regularization parameter that
weighs the cost of cut against smoothness was adjusted for each image
in order to obtain about six segments. See refs. 49,94 for the full
description of this method and examples.
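We cannot reproduce the kernel GraphCut package here, so the sketch below substitutes scikit-image's Felzenszwalb segmentation, whose scale parameter plays an analogous smoothness role, purely to illustrate the per-image parameter adjustment toward roughly six segments; the target count, search schedule, and file name are assumptions.

```python
import numpy as np
from skimage import io
from skimage.segmentation import felzenszwalb  # illustrative stand-in for kernel GraphCut

def segment_about_n(img, target=6, scale=100.0, max_iter=20):
    """Adjust the smoothness-like `scale` until ~target segments are produced."""
    labels = felzenszwalb(img, scale=scale)
    for _ in range(max_iter):
        n_seg = len(np.unique(labels))
        if n_seg == target:
            break
        # Larger scale -> smoother result with fewer segments, and vice versa.
        scale = scale * 1.5 if n_seg > target else scale / 1.5
        labels = felzenszwalb(img, scale=scale)
    return labels

labels = segment_about_n(io.imread('painting.jpg'))  # hypothetical file name
```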
Of these 40 features, we included all of them in our initial feature
set except for local features for the third-largest segment, which were
highly correlated with features for the first and second-largest seg-
ments and were thus deemed unlikely to add unique variance to the
feature prediction stage.