Shared spatiotemporal category representations in biological and artificial deep neural networks

Article in PLoS Computational Biology 14(7):e1006327 · July 2018
DOI: 10.1371/journal.pcbi.1006327
Visual scene category representations emerge very rapidly, yet the computational transformations that enable such invariant categorizations remain elusive. Deep convolutional neural networks (CNNs) perform visual categorization at near human-level accuracy using a feedforward architecture, providing neuroscientists with the opportunity to assess one successful series of representational transformations that enable categorization in silico. The goal of the current study is to assess the extent to which sequential scene category representations built by a CNN map onto those built in the human brain as assessed by high-density, time-resolved event-related potentials (ERPs). We found correspondence both over time and across the scalp: earlier (0–200 ms) ERP activity was best explained by early CNN layers at all electrodes. Although later activity at most electrode sites corresponded to earlier CNN layers, activity in right occipito-temporal electrodes was best explained by the later, fully-connected layers of the CNN around 225 ms post-stimulus, along with similar patterns in frontal electrodes. Taken together, these results suggest that scene category representations develop through a dynamic interplay between early activity over occipital electrodes and later activity over temporal and frontal electrodes.
Michelle R. Greene¹*, Bruce C. Hansen²
¹ Neuroscience Program, Bates College, Lewiston, ME, United States of America
² Department of Psychological and Brain Sciences, Neuroscience Program, Colgate University, Hamilton, NY, United States of America
Author summary
We categorize visual scenes rapidly and effortlessly, but still have little insight into the neural processing stages that enable this feat. In a parallel development, deep convolutional neural networks (CNNs) have been developed that perform visual categorization with human-like accuracy. We hypothesized that the stages of processing in a CNN may parallel the stages of processing in the human brain. We found that this is indeed the case, with early brain signals best explained by early stages of the CNN and later brain signals explained by later CNN layers. We also found that category-specific information seems to first emerge in sensory cortex and is then rapidly fed up to frontal areas. The similarities between biological brains and artificial neural networks provide neuroscientists with the opportunity to better understand the process of categorization by studying the artificial networks.
PLOS Computational Biology | July 24, 2018 1 / 17
Citation: Greene MR, Hansen BC (2018) Shared spatiotemporal category representations in biological and artificial deep neural networks. PLoS Comput Biol 14(7): e1006327.
Editor: Samuel J. Gershman, Harvard University,
Received: January 2, 2018
Accepted: June 26, 2018
Published: July 24, 2018
Copyright: © 2018 Greene, Hansen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All data files may be found at the Open Science Framework: http://osf.
Funding: This work was funded by National Science Foundation (1736274) to MRG and BCH (, and James S McDonnell Foundation (220020439) to BCH (https://www. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Categorization, the act of grouping like with like, is a hallmark of human intelligence. Scene
categorization in particular is incredibly rapid [1,2] and may even be automatic [3]. Despite
the importance of this problem, little is known about the temporal dynamics of neural activity
that give rise to categorization. A common conceptual framework in the visual neuroscience
community considers representations in each visual area as a geometric space, with individual
images represented as points within this space [4,5]. According to this account, early visual processing can be described as a tangled geometric surface in which individual categories cannot be easily separated; over the course of processing, the ventral visual stream disentangles these manifolds, allowing for categories to be distinguished [6,7]. Although this view
has provided a useful descriptive framework, we still know little about how this disentangling
occurs because it has been difficult to examine the representations at each stage of processing.
In a parallel development, work in computer vision has resulted in the creation of deep con-
volutional neural networks (CNNs) whose categorization abilities rival those of human observ-
ers [8]. Although CNNs do not explicitly seek to model the human visual system, their
architectures are inspired by the structure of the human visual system [9]: like the human
visual system, they are arranged hierarchically in discrete representational layers, and they
apply both linear and nonlinear operations across the whole of visual space. As CNNs are
explicitly trained for the purpose of visual categorization, and as they achieve near human-
level performance at this task, they provide neuroscientists with an unprecedented opportunity
to interrogate the types of intermediate level representations that are built en route to categori-
zation. Indeed, there is a growing literature detailing the similarities between aspects of CNNs
and activity in biological brains [10–17].
Of particular interest to the current study is the correspondence between the stages of pro-
cessing within a CNN and the temporal order of processing in the human brain, as assessed
with M/EEG. Studies have demonstrated that the brain activity elicited by individual images
can be predicted by the CNN [12,18] and that upper layers of the CNN can predict semanti-
cally-relevant spatial properties such as overall scene volume [19]. However, these studies pool
across all sensors at a given time point, discarding potentially informative scalp patterns.
While this capitalizes on the fine temporal scale of M/EEG, a complete understanding of the
neural dynamics of scene understanding requires the characterization of information flow
across the cortex. Additionally, key questions remain open, including understanding the devel-
opment of scene category membership and how intermediate stages of visual representation
allow for these complex abstractions to take place.
Therefore, the goal of the current study is to assess the extent to which sequential represen-
tations in each layer of a pre-trained deep convolutional neural network predict the sequential
representations built by the human brain using high-density time-resolved event-related
potentials (ERPs). Previewing our results, we show a spatiotemporal correspondence between
sequential CNN layers and the order of processing in the human visual system. Specifically,
earlier layers of the CNN correspond better to early time points in human visual processing
and to electrodes over occipital and left temporal scalp regions. Later layers of the CNN corre-
spond to later information and best match electrodes in the frontal half of the scalp, as well as
over the right occipitotemporal cortex. The correspondence between computer models and
human vision provides neuroscientists with the unique opportunity to probe intermediate-
level representations in silico, allowing for a more complete understanding of the neural com-
putations generating visual categorization.
Results
We recorded high-density 256-channel EEG while 13 human participants viewed 2,250 photo-
graphs from 30 scene categories for 750 ms each while engaged in a three-alternative forced
choice (3AFC) task. An additional 15 participants viewed the same images with the same proce-
dure for 500 ms to serve as a full internal replication. Our approach is graphically illustrated in
Fig 1. To compare scene category representations in a pre-trained (Places-205 with 8-layer
“AlexNet” architecture [20]) deep convolutional neural network (CNN) to those of human
observers, we used the representational similarity analysis (RSA) framework [5]. This allowed us
to directly compare models and neuroelectric signals by abstracting both into a similarity space.
Fig 1. The image set consisted of 75 images from each of 30 scene categories. Example images are shown on the left. Activations for each of the eight layers of a pre-trained deep convolutional neural network (CNN) were extracted for each of the 2250 images. We averaged across exemplars to create 30x30 representational dissimilarity matrices (RDMs) for each layer of the CNN. For each participant, and for each of the 256 electrodes, we ran a 40 ms sliding window beginning 100 ms before stimulus presentation and running through the 750 ms during which the image was on screen. At each time point, a 30x30 RDM was created from the average voltage values at the given electrode and compared to each layer of the CNN using correlation. Note: although full matrices are shown, only the lower triangle was used in analysis.
CNN decoding results
In order to understand the potential contributions of each CNN layer to category-specific neural responses, we assessed the extent to which each CNN layer contained decodable category information. All layers contained above-chance information for decoding the 30 scene categories (see Fig 2). The classification accuracy ranged from 40.8% in the first layer (Conv1) to 90.3% correct in the first fully-connected layer (FC6). Somewhat unexpectedly, this is higher accuracy than we observed in the top layer (FC8: 84.6%). It is also noteworthy that this level of classification is higher than the 69–70% top-1 classification accuracy reported in [20]. This is likely because some of the images in our dataset were from the training set for this CNN. In sum, we can expect to see category-specific information in each of the eight layers of the CNN, with maximum category information coming from the fully connected layers, peaking in the sixth layer.
Fig 2. Category decoding accuracy for each layer of the CNN. Error bars reflect 95% confidence intervals. Horizontal line indicates chance level.
Time-resolved encoding results
To get an overall picture of human/CNN correspondence, we examined the extent to which the set of all CNN layers has explanatory power for visual ERPs. Our data-driven clustering method identified five groups of electrodes. We averaged ERPs within each cluster and then examined the variance explained by all layers of the CNN within that cluster’s representational dissimilarity matrices (RDMs) over time. For each cluster, we examined the onset of significant explained variance, the maximum explained variance, and the latency of maximum explained variance. We observed no statistically significant differences between the clusters for onset (F(4,60)<1), maximum explained variance (F(4,60) = 1.11, p = 0.36), or latency at maximum (F(4,60)<1). Thus, we show the average of all 256 electrodes in Fig 3. We found that the CNN could predict neuroelectric activity starting at 54 ms after scene presentation (55 ms for internal replication set), and it achieved maximal predictive power 93 ms after stimulus onset (75 ms for internal replication set).
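The "variance explained by all layers" measure can be illustrated as a multiple regression of one neural RDM on the eight CNN-layer RDMs. The sketch below is only an illustration under stated assumptions: it uses ordinary least squares with an intercept on synthetic RDMs, and the names `lower_triangle`, `variance_explained`, and `random_rdm` are hypothetical helpers, not the authors' code. The exact regression procedure used in the study may differ.

```python
import numpy as np

def lower_triangle(rdm):
    """Return the below-diagonal entries of a square RDM as a vector."""
    rows, cols = np.tril_indices(rdm.shape[0], k=-1)
    return rdm[rows, cols]

def variance_explained(neural_rdm, layer_rdms):
    """R^2 from regressing a neural RDM on several model RDMs (assumed OLS)."""
    y = lower_triangle(neural_rdm)
    X = np.column_stack([lower_triangle(m) for m in layer_rdms])
    X = np.column_stack([np.ones(len(y)), X])  # intercept term
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def random_rdm(rng, n=30):
    """Synthetic symmetric matrix with a zero diagonal (stand-in for real data)."""
    a = rng.random((n, n))
    rdm = (a + a.T) / 2
    np.fill_diagonal(rdm, 0.0)
    return rdm

rng = np.random.default_rng(0)
layer_rdms = [random_rdm(rng) for _ in range(8)]  # eight hypothetical layer RDMs
# Synthetic "neural" RDM: a mixture of two layers plus a little noise
neural_rdm = 0.6 * layer_rdms[0] + 0.4 * layer_rdms[1] + 0.05 * random_rdm(rng)
r2 = variance_explained(neural_rdm, layer_rdms)
```

In the study this quantity would be computed per electrode (or cluster) and per time window; here a single synthetic RDM stands in for one such window.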
To assess how much of the explainable ERP variance was captured by the CNN, we estimated the noise ceiling using a method suggested by [21]. We found the maximum explainable variance to be 48–54% (internal replication set: 45–48%). Therefore, the average maximum r² of 0.10 across all electrodes does not account for all variability in this dataset. As CNNs are exclusively feedforward models, the remaining variability to be explained may reflect the role of feedback in human scene categorization [22,23].
Fig 3. Variance explained over time for all eight CNN layers together for the average of all 256 electrodes. Black line indicates mean with shaded gray region reflecting 95% confidence interval. Top line indicates statistical significance.
In order to gain additional insight into the relative successes and failures of the CNN as a model for visual scene categorization, we examined the regression residuals between 50 and 250 ms after stimulus onset. We averaged the residuals across participants, resulting in a 30x30 matrix. We then averaged over superordinate category (indoor, urban outdoor, and natural outdoor) in order to visualize broad patterns of CNN/neuroelectric difference. As seen in Fig 4, we observed negative residuals among and between the two outdoor superordinate categories. This pattern indicates that the CNN predicted more similarity between these categories than was observed neurally. In other words, the human representations of outdoor categories are more distinct than the CNN predicts. By contrast, the residuals between indoor and urban categories were positive, suggesting that the CNN had a finer-grained representation of these category differences than the human brain. The residuals for the replication set demonstrated the same broad pattern (see Supporting Material).
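Collapsing the 30x30 residual matrix into superordinate blocks can be sketched as follows. This is a minimal illustration, not the authors' code: the category ordering (10 indoor, then 10 urban, then 10 natural), the exclusion of the diagonal, and the function name `superordinate_means` are all assumptions, and the residual values are synthetic.

```python
import numpy as np

def superordinate_means(residuals, labels):
    """Average a category-by-category residual matrix within superordinate blocks."""
    labels = np.asarray(labels)
    groups = list(dict.fromkeys(labels.tolist()))        # unique labels, original order
    off_diag = ~np.eye(residuals.shape[0], dtype=bool)   # ignore self-comparisons
    out = np.zeros((len(groups), len(groups)))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            block = (labels == gi)[:, None] & (labels == gj)[None, :] & off_diag
            out[i, j] = residuals[block].mean()
    return groups, out

# Hypothetical ordering of the 30 categories
labels = ["indoor"] * 10 + ["urban"] * 10 + ["natural"] * 10

# Synthetic residuals: positive between indoor and urban, negative between
# the two outdoor groups, zero elsewhere
residuals = np.zeros((30, 30))
residuals[0:10, 10:20] = residuals[10:20, 0:10] = 1.0
residuals[10:20, 20:30] = residuals[20:30, 10:20] = -1.0

groups, block_means = superordinate_means(residuals, labels)
```

The resulting 3x3 matrix is the kind of summary plotted in Fig 4, where the sign of each cell indicates over- or under-prediction of neural similarity.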
Fig 4. Residuals as a function of superordinate-level scene category for all eight layers of the CNN between 50 and 250 ms post-stimulus onset. Negative values indicate over-prediction of neural similarity, while positive values indicate under-prediction of neural similarity.
In order to examine the correspondence between the processing stages of the CNN and the representational stages in human visual processing, we next examined the variance explained by each individual CNN layer. Averaged across all 256 electrodes, we observed a correspondence between the onset of explained variability and CNN layer, with earlier layers explaining ERP variability earlier than later layers (r = 0.38, p<0.001 for the main set and r = 0.40, p<0.001 for the replication set, see Fig 5). Additionally, we observed a negative relationship between CNN layer and explained variability, with earlier layers explaining more ERP variability than later layers. This effect was pronounced in the first 100 ms post-stimulus (r = -0.51, p<0.0001 for the main set and r = -0.49, p<0.0001 for the replication set), and was also observed between 101 and 200 ms (r = -0.26, p<0.005 for the main set and r = -0.23, p<0.05 for the replication set). A one-way ANOVA found a significant difference in the onset of significant explained variability as a function of CNN layer (F(7,96) = 5.46, p<0.0001 for the main dataset and F(7,112) = 4.81, p<0.001 for the replication set). Additionally, a one-way ANOVA revealed significant differences in the maximum explained variability across CNN layers (F(7,96) = 15.7, p<0.0001 for the main dataset and F(7,112) = 10.1, p<0.0001 for the replication set).
Fig 5. Main plot: Variance explained in each of the eight layers of the CNN taken alone. Each waveform represents the average of all 256 electrodes. Lowest layers are in darker colors, and horizontal lines represent statistical significance. Insets, clockwise from top: (1) Onset of statistically significant encoding as a function of CNN layer; (2) maximum variance explained between 101 and 200 ms as a function of CNN layer; (3) maximum variance explained between 0 and 100 ms as a function of CNN layer; (4) maximum variance explained between 101 and 200 ms as a function of CNN layer. Taken together, we can see that early CNN layers explain more variance in ERP patterns, and do so earlier than later CNN layers. All error bars reflect +/- 1 SEM.
A chief advantage of conducting the encoding analyses at each electrode, rather than collapsing across all sensors as has been done in previous work [12,18], is that we are able to examine the spatiotemporal patterns of explained variability rather than just the temporal patterns. Using data-driven electrode clustering (see Methods), we identified five groups of spatially contiguous electrodes with differing voltage patterns. The clustering took microvolt patterns in the topographic maps at each time point as input and tested for spatially contiguous groups of electrodes that had similar microvolt patterns. We tested these against a chance level defined by a baseline topographic map. In each group, we examined the maximum explained variability for each layer of the CNN. As shown in Fig 6, we observed a striking dissociation. Central occipital electrodes were better explained by the earlier layers of the CNN at all time points (Spearman’s rank-order correlation between layer and maximum explained variance: rho = -0.64, t(12) = -13.5, p<0.0001 for main set; rho = -0.68, t(14) = -13.3, p<0.0001 for replication set). A similar trend was seen in left occipitotemporal electrodes for time points before 200 ms in the main dataset, with a trend in the replication dataset (80–110 ms: rho = -0.53, t(12) = -6.31, p<0.0001 for main set; rho = -0.78, t(14) = -10.3, p<0.0001 for replication set; 120–200 ms: rho = -0.54, t(12) = -6.4, p<0.0001 for main dataset, rho = -0.38, t(14) = 1.2, p = 0.13 for replication set).
By contrast, the right occipitotemporal cluster was best explained by early layers in early time bins (80–110 ms: rho = -0.57, t(12) = -8.37, p<0.0001 for the main dataset and rho = -0.88, t(14) = 10.5, p<0.0001 for the replication dataset), then by mid-to-late-level layers between 120 and 200 ms (peak in layer 4 for main set, layer 5 for replication set; rho = 0.44, t(12) = 1.29, p = 0.11 for main; rho = -0.08, t(14)<1 for replication) and between 200 and 250 ms post-stimulus onset (maximum in layer 6 for main set, maximum in layer 3 for replication set; rho = 0.17, t(12)<1). As evident in Fig 6, the later time bins did not have a significant rank-order correlation because of non-monotonic patterns between CNN layer and explained variability. Post-hoc t-tests (Benjamini-Hochberg corrected for multiple comparisons) revealed that for 120–200 ms, the numerical peak in layer 4 differed significantly from both the earlier layer 1 (t(12) = 4.78, p<0.001) and the later layer 7 (t(12) = 3.66, p<0.001) and layer 8 (t(12) = 4.08, p<0.001). For 200–300 ms, the numerical peak in layer 6 was significantly different from layer 1 (t(12) = 5.17, p<0.0001) and layer 8 (t(12) = 3.3, p<0.005). Statistics for the replication set can be found in the Supporting Materials.
Fig 6. Encoding time course for each of the five identified clusters. For each plot, main graph shows variability explained over time. Bar charts show maximum explained variability for each of eight CNN layers in the post-stimulus time window indicated above. (A) Frontal cluster. (B) Central cluster. (C) Left occipitotemporal cluster. (D) Central occipital cluster. (E) Right occipitotemporal cluster.
Similarly, while the frontal cluster best reflected lower layers of the CNN in the early 120–200 ms time bin (rho = -0.62, t(12) = -4.98, p<0.0005; see Supporting Materials for replication set), it best reflected information from the later layers, particularly the sixth CNN layer (FC6), in the 200–300 ms time bin (rho = 0.48, t(12)<1). Post-hoc t-tests (corrected) revealed significantly more variance explained by layer 3 compared with layer 1 (t(12) = 2.29, p<0.05), and more in layer 6 compared with layer 7 (t(12) = 1.92, p<0.05) or layer 8 (t(12) = 1.87, p<0.05). Similarly, the replication dataset also had a local maximum of explained variance at the sixth CNN layer (see Supporting Materials for more information). Interestingly, the sixth layer had the maximum decodable category information (see Fig 2), suggesting that these signals reflect category-specific information. When comparing the latencies of maximum explained variability in the right occipitotemporal and frontal clusters, we observed that the right occipitotemporal cluster peaked about 10 ms earlier than the frontal cluster (227 versus 236 ms). However, this difference was not statistically reliable (t(12)<1).
Methods
Ethics statement
Human subjects research was approved by the Colgate University Institutional Review Board.
Participants
Fourteen observers (6 female, 13 right-handed) participated in the study. One participant’s data contained fewer than half valid trials following artifact rejection and were removed from all analyses. An additional 15 participants (9 female, 12 right-handed) were recruited as part of an internal replication study (specific results from those participants are reported as Supporting Material). The age of all participants ranged from 18 to 22 (mean age = 19.1 for main experiment, and 19.4 for replication study). All participants had normal (or corrected-to-normal) vision as determined by standard ETDRS acuity charts. All participants gave Institutional Review Board-approved written informed consent before participating, and were compensated for their time.
Stimuli
The stimuli consisted of 2250 color images of real-world photographs from 30 different scene
categories (75 exemplars in each category) taken from the SUN database [24]. Categories were
chosen to include 10 indoor categories, 10 urban categories, and 10 natural landscape catego-
ries, and were selected from the larger set of 908 categories from the SUN database on the basis
of making maximally different RDMs when examining three different types of visual informa-
tion (layer 7 features from a CNN, a bag-of-words object model, and a model of a scene’s func-
tions, see [25] for more details on category selection). When possible, images were taken from
the SUN database. In cases where this database did not have 75 images, we sampled from the
internet (copyright-free images). Six of the 30 categories (bamboo forest, bar, butte, skyscraper,
stadium, and volcano) were represented in the Places-205 database that comprised the training
set for the CNN. Although the SUN and Places databases were designed to be complementary
with few overlapping images [20], it is possible that some images in our set were included in
the training set for the Places CNN. Care was taken to omit images with salient faces in them.
All images had a resolution of 512 x 512 pixels (which subtended 20.8˚ of visual angle) and
were processed to possess the same root-mean-square (RMS) contrast (luminance and color)
as well as mean luminance. All images were fit with a circular linear edge-ramped window to
obscure the square frame of the images, thereby uniformly distributing contrast changes
around the circular edge of the stimulus [26,27].
Apparatus
All stimuli were presented on a 23.6” VIEWPixx/EEG scanning LED-backlight LCD monitor with 1 ms black-to-white pixel response time. Maximum luminance output of the display was 100 cd/m², with a frame rate of 120 Hz and resolution of 1920 x 1080 pixels. Single pixels subtended 0.0406˚ of visual angle (i.e. 2.43 arc min.) as viewed from 32 cm. Head position was maintained with an Applied Science Laboratories (ASL) chin rest.
Experimental procedure
Participants engaged in a 3 alternative forced-choice (3AFC) categorization task with each of
the 2250 images. As it was not feasible for participants to view all images in one sitting, all
images were randomly split into two sets, keeping roughly equal numbers of images within
each category. Each image set was presented within a different ~50-minute recording session,
run on separate days. The image set was counterbalanced across participants, and image order
within each set was randomized. Participants viewed the windowed scenes against a mean
luminance background under darkened conditions (i.e. the only light source in the testing
chamber was the monitor). All trials began with a 500 ms fixation followed by a variable dura-
tion (500–750 ms) blank mean luminance screen to enable any fixation-driven activity to dissi-
pate. Next, a scene image was presented for 750 ms followed by a variable 100–250 ms blank
mean luminance screen, followed by a response screen consisting of the image’s category
name and the names of two distractor categories presented laterally in random order (distrac-
tor category labels were randomly sampled from the set of 29 and therefore varied on a trial-
by-trial basis). Observers selected their choice by using a mouse to click on the correct category
name. Performance feedback was not given.
EEG recording and processing
Continuous EEGs were recorded in a Faraday chamber using EGI’s Geodesic EEG acquisition
system (GES 400) with Geodesic Hydrocel sensor nets consisting of a dense array of 256
channels (electrolytic sponges). The on-line reference was at the vertex (Cz), and the impedances were maintained below 50 kΩ (EGI amplifiers are high-impedance amplifiers; this value is optimized for this system). All EEG signals were amplified and sampled at 1000 Hz. The digitized EEG waveforms were band-pass filtered offline from 0.1 Hz to 45 Hz to remove the DC offset and eliminate 60 Hz line noise.
All continuous EEGs were divided into 850 ms epochs (100 ms before stimulus onset and
the 750 ms of stimulus-driven data). Trials that contained eye movements or eye blinks during
data epochs were excluded from analysis. Further, all epochs were subjected to algorithmic
artifact rejection of voltage exceeding ±100 μV. These trial rejection routines resulted in no
more than 10% of the trials being rejected on a participant-by-participant basis. Each epoch
was then re-referenced offline to the net average, and baseline corrected to the last 100 ms of
the luminance blank interval that preceded the image. Grand average event-related potentials
(ERPs) were assembled by averaging all re-referenced and baseline corrected epochs across
participants. Topographic plots were generated for all experimental conditions using EEGLAB
[28] version 13.4.4b in MATLAB (ver. R2016a, The MathWorks, MA).
For all analyses, we improved the signal-to-noise ratio of the single trial data by using a
bootstrapping approach to build sub-averages across trials for each trial (e.g., [19]). Specifi-
cally, for each trial within a given scene category, we randomly selected 20% of the trials within
that category and averaged those to yield a sub-averaged ERP for that trial. This was repeated
until all valid trials within each category were built. This process was repeated separately for
each participant. This approach is desirable as we are primarily interested in category-level
neuroelectric signals that are time-locked to the stimulus.
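The sub-averaging step can be sketched as below. This is a hedged illustration on synthetic data: whether the 20% subset is drawn with or without replacement, and whether it may include the trial being rebuilt, is not stated in the text, so the choices here (without replacement, any trial in the category eligible) are assumptions. The function name `sub_average` is hypothetical.

```python
import numpy as np

def sub_average(trials, fraction=0.2, rng=None):
    """For each trial, average a random subset of trials from the same category.

    trials: array of shape (n_trials, n_timepoints) for one category
            at one electrode.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_trials = trials.shape[0]
    n_pick = max(1, int(round(fraction * n_trials)))
    out = np.empty_like(trials, dtype=float)
    for i in range(n_trials):
        # Assumption: sample without replacement from all trials in the category
        picked = rng.choice(n_trials, size=n_pick, replace=False)
        out[i] = trials[picked].mean(axis=0)
    return out

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 3 * np.pi, 850))           # synthetic time-locked ERP
raw_trials = signal + rng.normal(scale=5.0, size=(75, 850))  # 75 noisy single trials
sub_averaged = sub_average(raw_trials, fraction=0.2, rng=rng)
```

Because each output trial is the mean of about 15 noisy trials, trial-to-trial variability shrinks while the time-locked component is preserved, which is the stated motivation for the procedure.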
Deep convolutional neural network (CNN)
In order to assess the representations available in a deep convolutional neural network at each
processing stage, we extracted the activations in each of eight layers in a pre-trained network.
Specifically, we used a CNN based on the AlexNet architecture [29] that was pre-trained on
the Places database [20] and implemented in Caffe [30]. The first five layers of this neural net-
work are convolutional, and the last three are fully connected. The convolutional layers have
three operations: convolution, pooling, and a rectified linear (ReLu) nonlinearity. For these
layers, we extracted features after the convolution step. This CNN was chosen because it is
optimized for 205-category scene classification, and because the eight-layer architecture,
loosely inspired by biological principles, is most frequently used when comparing CNNs and
brain activity [12,15,16]. For each layer, we averaged across images within a category, creating
30-category by N-feature matrices.
In order to assess the amount of category-related information available in each layer of the
CNN, we performed a decoding procedure on the feature vectors from each CNN layer using
a linear multi-class support vector machine (SVM) implemented as LIBSVM in Matlab [31].
Decoding accuracies for each layer were calculated using 5-fold cross validation.
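The cross-validation logic of the decoding procedure can be illustrated without the SVM itself. The sketch below deliberately swaps in a nearest-centroid classifier for the linear SVM (LIBSVM) used in the study, so that it depends only on NumPy; it shows the 5-fold train/test split, not the authors' classifier. The features are synthetic, the folds are not stratified (an assumption), and the function name `cross_validated_accuracy` is hypothetical.

```python
import numpy as np

def cross_validated_accuracy(features, labels, n_folds=5, seed=0):
    """k-fold cross-validated accuracy of a nearest-centroid classifier."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    folds = np.array_split(order, n_folds)
    classes = np.unique(labels)
    n_correct = 0
    for test_idx in folds:
        train_mask = np.ones(len(labels), dtype=bool)
        train_mask[test_idx] = False
        # Class centroids are estimated from the training images only
        centroids = np.stack([
            features[train_mask & (labels == c)].mean(axis=0) for c in classes
        ])
        # Squared Euclidean distance from each test image to each centroid
        diffs = features[test_idx][:, None, :] - centroids[None, :, :]
        predicted = classes[np.argmin((diffs ** 2).sum(axis=2), axis=1)]
        n_correct += int(np.sum(predicted == labels[test_idx]))
    return n_correct / len(labels)

rng = np.random.default_rng(1)
labels = np.repeat(np.arange(30), 75)        # 30 categories x 75 exemplars
category_means = rng.normal(size=(30, 20))   # synthetic category structure
features = category_means[labels] + rng.normal(size=(labels.size, 20))
accuracy = cross_validated_accuracy(features, labels)
chance = 1 / 30
```

With 30 categories, chance performance is 1/30 (about 3.3%), which is the reference level against which per-layer decoding accuracy would be compared.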
For all analyses, statistical testing was done via permutation testing. Specifically, we fully
exchanged row and column labels for the RDMs (1000 permutation samples per participant)
to create an empirical chance distribution. To correct for multiple comparisons, we used clus-
ter extent with a threshold of p<0.05, Bonferroni-corrected for multiple comparisons (similar
to [12]).
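The label-exchange permutation test can be sketched as follows. This is a simplified illustration on synthetic RDMs: it uses Pearson correlation on the lower triangles and a one-sided p-value, and it omits the cluster-extent correction described above. The names `lower_triangle` and `permutation_p_value` are hypothetical.

```python
import numpy as np

def lower_triangle(rdm):
    """Return the below-diagonal entries of a square RDM as a vector."""
    rows, cols = np.tril_indices(rdm.shape[0], k=-1)
    return rdm[rows, cols]

def permutation_p_value(neural_rdm, model_rdm, n_permutations=1000, seed=0):
    """One-sided p-value for an RDM correlation under shuffled category labels."""
    rng = np.random.default_rng(seed)
    model_vec = lower_triangle(model_rdm)
    observed = np.corrcoef(lower_triangle(neural_rdm), model_vec)[0, 1]
    null = np.empty(n_permutations)
    for k in range(n_permutations):
        perm = rng.permutation(neural_rdm.shape[0])
        # The same permutation is applied to rows and columns, so the
        # shuffled matrix is still a valid (relabelled) RDM
        shuffled = neural_rdm[np.ix_(perm, perm)]
        null[k] = np.corrcoef(lower_triangle(shuffled), model_vec)[0, 1]
    p = (np.sum(null >= observed) + 1) / (n_permutations + 1)
    return observed, p

rng = np.random.default_rng(0)
a = rng.random((30, 30))
model_rdm = (a + a.T) / 2                      # synthetic model RDM
noise = rng.normal(size=(30, 30))
neural_rdm = model_rdm + 0.1 * (noise + noise.T) / 2  # synthetic neural RDM
observed_r, p_value = permutation_p_value(neural_rdm, model_rdm)
```

Permuting rows and columns together preserves the internal structure of the RDM while breaking its alignment with the model, which is what makes the shuffled correlations a suitable empirical chance distribution.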
Time-resolved encoding analysis
A forward pass of activations through the CNN was collected for each image and each layer,
and reshaped into a feature vector. Feature vectors were averaged across category to create
eight 30-category by N-feature matrices. From these feature matrices, we created 30-category
by 30-category correlation matrices. Representational dissimilarity matrices (RDMs) were cre-
ated by transforming the correlation matrices into distance matrices using the metric of
1-Spearman rho [12,14]. Each RDM is a symmetrical matrix with an undefined diagonal.
Therefore, in all subsequent analyses, we will use only the lower triangle of each RDM to repre-
sent patterns of category similarity.
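The construction of a layer RDM described above can be sketched in a few lines. This is a minimal illustration with synthetic features standing in for one CNN layer; the simple ranking used for Spearman's rho does not average ties (reasonable only for continuous-valued features), and the names `rank_rows` and `category_rdm` are hypothetical.

```python
import numpy as np

def rank_rows(x):
    """Rank the values within each row (ties are not averaged)."""
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float) + 1.0

def category_rdm(features, labels):
    """Category-by-category RDM using 1 - Spearman rho.

    features: (n_images, n_features) activations from one layer
    labels:   (n_images,) category index of each image
    """
    classes = np.unique(labels)
    # Average feature vectors across exemplars within each category
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    rho = np.corrcoef(rank_rows(means))  # Pearson on ranks equals Spearman
    rdm = 1.0 - rho
    np.fill_diagonal(rdm, np.nan)        # the diagonal is left undefined
    return rdm

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(30), 75)           # 30 categories x 75 exemplars
features = rng.normal(size=(labels.size, 64))   # synthetic stand-in for one layer
rdm = category_rdm(features, labels)
lower = rdm[np.tril_indices(30, k=-1)]          # 435 unique category pairs
```

Only the 435 below-diagonal entries carry unique information, which is why the lower triangle alone is used in the subsequent analyses.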
To create neural RDMs, for each participant and for each electrode, we extracted ERP
signals within a 40 ms sliding window beginning 100 ms before image presentation and
extending across the entire 750 ms image duration. For each 40 ms window, we created a 30 x 30
correlation matrix from the voltage values at that electrode, averaged over category. A 30 x 30
RDM was then created using the same 1-minus-correlation distance metric described above. The
window size was truncated at the end of each trial so as not to extend beyond image presentation.
Thus, RDMs were created from each of 256 electrodes for each time point in the 850 ms epoch.
The upper and lower bounds of the noise ceiling for the data were computed as recommended in [21].
It is worth noting that while previous MEG RDM studies have performed time-resolved analyses
in a time point by time point manner, taking the value at each sensor at each time point as a
feature (e.g. [12]), our approach allows for the understanding of feature correspondence at
each electrode, also enabling spatiotemporal analysis.
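The sliding-window procedure for a single electrode can be sketched as below. Several details are assumptions made for illustration: one sample per millisecond, rank-based (Spearman) correlation within each window, and a minimum of three samples for a truncated window to be used. The function name `sliding_window_rdms` and the random `erp` array are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def sliding_window_rdms(erp, window=40, min_samples=3):
    """Compute one lower-triangle RDM per window start at a single electrode.

    erp: (n_categories, n_samples) category-averaged voltages
    Returns an (n_samples, n_pairs) array; rows for windows shorter than
    min_samples (an assumed cutoff) are left as NaN.
    """
    n_categories, n_samples = erp.shape
    rows, cols = np.tril_indices(n_categories, k=-1)
    rdms = np.full((n_samples, rows.size), np.nan)
    for start in range(n_samples):
        # Slicing past the end truncates the window at the end of the epoch.
        segment = erp[:, start:start + window]
        if segment.shape[1] < min_samples:
            continue
        ranks = rankdata(segment, axis=1)
        rdms[start] = (1.0 - np.corrcoef(ranks))[rows, cols]
    return rdms


# Synthetic electrode: 30 categories, 850 samples (-100 ms to 750 ms at 1 kHz).
rng = np.random.default_rng(0)
erp = rng.normal(size=(30, 850))
rdms = sliding_window_rdms(erp)
```

Repeating this for every electrode and participant would give an RDM time course per electrode, which is what allows the analysis to be both temporal and spatial.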
We computed a noise ceiling for the results following the method detailed in [14,21].
Briefly, the noise ceiling contains both lower- and upper-bound estimates of the group-average
correlation with an RDM that would be predicted by an unknown true model. The upper
bound consisted of the average correlation of each single participant to the group mean. As
this includes all participants, it is overfit and therefore an overestimation of the true model’s
fit. By contrast, the lower bound used a leave-one-participant-out approach, computing each
single-participant RDM’s correlation with the average RDM of the other participants. This
avoids overfitting, but underestimates the true model’s average correlation due to the limited
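The two noise-ceiling bounds can be sketched as follows. This is a simplified illustration that assumes Spearman correlation and plain averaging of RDMs across participants; the toolbox in [21] applies additional normalization that is not reproduced here. The function name `noise_ceiling` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def noise_ceiling(subject_rdms):
    """Return (lower, upper) noise-ceiling estimates.

    subject_rdms: (n_subjects, n_pairs) lower-triangle RDM per participant
    """
    group_mean = subject_rdms.mean(axis=0)
    upper, lower = [], []
    for s in range(subject_rdms.shape[0]):
        # Upper bound: the group mean includes this participant (overfit).
        rho_all, _ = spearmanr(subject_rdms[s], group_mean)
        # Lower bound: the mean is computed from the other participants only.
        others_mean = np.delete(subject_rdms, s, axis=0).mean(axis=0)
        rho_loo, _ = spearmanr(subject_rdms[s], others_mean)
        upper.append(rho_all)
        lower.append(rho_loo)
    return float(np.mean(lower)), float(np.mean(upper))


# Synthetic example: 10 participants sharing one underlying RDM plus noise.
rng = np.random.default_rng(0)
true_rdm = rng.normal(size=435)
subject_rdms = true_rdm + rng.normal(size=(10, 435))
lower, upper = noise_ceiling(subject_rdms)
```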
Electrode clustering
In order to examine spatial relations in encoding patterns across electrodes, we adopted a
data-driven approach based on random field theory. We modified an algorithm originally
developed by Chauvin and colleagues to assess the statistical significance of pixel clusters in
classification images [32]. This allowed us to identify spatially contiguous electrode clusters based on
voltage differences, while remaining agnostic to any encoding differences with the CNN. Spe-
cifically, we submitted participant-averaged and z-transformed voltage difference topographic
maps to the algorithm time point by time point. The spatial clustering algorithm then grouped
electrodes that contained normalized voltage differences that significantly deviated from
baseline noise (p < .001). As this spatial grouping procedure is based on assessing the statistically
significant peaks and troughs of the voltage patterns at a given point in time, its advantage is
that we did not need to set the number of clusters in advance. We selected clusters that per-
sisted for more than 20 ms, resulting in five clusters: an early central occipital cluster (100–125
ms), an early central cluster (80–105 ms), two bilateral occipitotemporal clusters (70–500 ms),
and a large frontal cluster (135–500 ms). We assessed the relative encoding strength of each of
the eight CNN layers within each of these five clusters in all analyses.
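The persistence criterion alone can be sketched as below. The sketch does not implement the random field theory clustering itself. It assumes that a candidate cluster has already been reduced to a boolean time series marking the time points at which it was significant, sampled once per millisecond, and it keeps only runs longer than 20 samples. The function name `persistent_runs` and the example values are hypothetical.

```python
import numpy as np


def persistent_runs(significant, min_duration=20):
    """Return (start, stop) index pairs of True runs longer than min_duration.

    significant: 1-D boolean sequence, one value per time point
    stop is exclusive, so the run length is stop - start samples.
    """
    padded = np.concatenate(([False], np.asarray(significant, dtype=bool), [False]))
    # Indices where the series switches between False and True.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]
    return [(int(a), int(b)) for a, b in zip(starts, stops) if b - a > min_duration]


# Illustrative series: one 25-sample run and one 10-sample run.
significant = np.zeros(850, dtype=bool)
significant[180:205] = True
significant[300:310] = True
runs = persistent_runs(significant)
```

Only the 25-sample run survives the filter; the 10-sample run is discarded as too brief.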
Discussion
In this work, we demonstrated that there is a substantial resemblance between the sequential
scene category representations from each layer of a pre-trained deep convolutional neural
network (CNN), and those in human ERP activity while observers are engaged in categoriza-
tion. Early layers of the CNN best predicted early ERP activity, while later layers best predicted
later activity. Furthermore, the total variability captured by the CNN was slightly less than half
of the variability given by the noise ceiling of the data.
We also observed a spatial correspondence between CNN layers and ERP variability.
While electrodes over central and left occipitotemporal cortex were robustly predicted
early by early CNN layers, electrodes over right occipitotemporal cortex had sequential repre-
sentations that resembled the sequential representations of the CNN. Specifically, activity in
these electrodes was best predicted by the second and third CNN layers before 100 ms post-
stimulus, by the fourth layer between 120–200 ms, and by the sixth layer after 200 ms. A simi-
lar striking dissociation was observed over the frontal electrode cluster: early CNN layers best
predicted early activity, but the more conceptual, fully-connected layers captured activity at
the frontal electrodes about 100 ms later. Taken together, these results suggest a deeper homol-
ogy between the human visual system and the deep neural network than the few biological
principles that inspired the architecture [29]. This homology provides the unique intellectual
opportunity to probe the mid- and late stages of visual representation that have thus far been
difficult to ascertain.
Because the CNN is an exclusively feedforward model, comparing the representations in
the CNN and the human brain allows us to speculate on the extent to which human category
representations are processed in a feedforward manner (such as [33]). We observed that earlier
layers of the CNN explained more ERP variance compared with later layers of the CNN. This
may be evidence that earlier visual processing is predominantly feedforward, while later visual
processing requires feedback, consistent with other relatively early accounts of top-down
feedback [22,34–36]. However, we also observed that the neural variability explained by the CNN
only reached about half of the maximum possible explained variability given by the noise ceil-
ing. This may suggest that feedback or recurrent processing plays a significant role in categori-
zation [22,34,37]. Of course, it is possible that a different feedforward model, or a different
weighting of features within this model may explain more variability in human ERP patterns.
However, recent literature suggests that performance differences between different deep CNN
architectures are smaller than the differences between CNNs and human observers, or between
CNNs and previous models [14,15,25,38,39]. Future work will examine both recurrent and
feedforward architectures to disentangle these possibilities.
In examining the residuals of the model fits, we found that the CNN over-estimated the
similarity of natural landscape images while simultaneously under-estimating the similarity
between different types of manufactured environments (indoor versus urban). This suggests a
coarser-grained representation for natural landscape images in the CNN compared to human
observers. This pattern may reflect the fact that only 29% of images and 28% of scene catego-
ries in Places-205 are natural landscape environments [20]. Having more training data for the
CNN may have resulted in the ability to form finer-grained representations for manufactured
Our results are largely in agreement with previous MEG studies that examined the corre-
spondence between CNNs and time-resolved neural signals [12,18]. These studies examined
whole-brain signals in a time point by time point manner, losing any spatial response pattern.
By contrast, by using a sliding window on each electrode, we were able to retain the patterns
across the scalp while still examining time-resolved data. Given the known representational
differences between MEG and EEG [40], this work provides unique but complementary
information about visual category representations. Specifically, [40] found that, compared to EEG,
MEG signals carried decodable information earlier and were driven more by early visual cortex. This
suggests that the time points reported here might constitute an upper bound on information
A second difference between this work and previous work is that we examined category-specific
information instead of image-level information. As the act of recognition is generally an act of
categorization [41], and because the CNN we used was pre-trained specifically to make cate-
gory-level distinctions [42,43], we argue that this is the most natural comparison for human
and artificial neural networks. Accordingly, we assessed the amount of category-level informa-
tion available in each layer of the CNN. While it is unsurprising that significant category infor-
mation exists in all eight layers, or that the amount of information increases across layers, we
were surprised to observe that category information peaked in the sixth layer. Given that the
utility of layer depth is still controversial within the computer vision community [44,45], this
result may be of interest to this community as well. Furthermore, the finding that explained
variability at both right occipitotemporal and frontal electrodes also peaked in the sixth CNN layer
corroborates the view that a shallower artificial neural network might outperform deeper ones
on scene classification tasks.
Although CNNs differ significantly from biological networks, they are of interest to neuro-
scientists because they allow us to see a solution to a difficult information-processing problem
in a step-by-step manner. The extent to which hard problems such as visual recognition have
unique solutions is an open question. Thus, the growing number of similarities between
biological and artificial neural networks may indicate that artificial neural networks have homed in on the
same solution found by evolution and biology.
Supporting information
S1 Fig. Overall variance explained by all eight layers of the CNN for the supplementary
S2 Fig. Overall variance explained by all eight layers of the CNN as a function of electrode
S3 Fig. Variance explained by each layer of the CNN alone. As in Fig 5 in the main text,
CNN layers are ordered darkest (Conv 1) to lightest (FC8).
S4 Fig. Onset of significant explained variance as a function of CNN layer.
S5 Fig. Latency of maximum explained variability as a function of CNN layer for replica-
tion dataset.
S6 Fig. Residuals across superordinate category for replication set.
S7 Fig. Explained variability in frontal cluster between 200–300 ms after stimulus onset.
Each trace reflects an individual participant.
S1 Text. Text of supporting information, part 1.
S2 Text. Text of supporting information, part 2.
Author Contributions
Conceptualization: Michelle R. Greene, Bruce C. Hansen.
Data curation: Bruce C. Hansen.
Formal analysis: Michelle R. Greene.
Funding acquisition: Michelle R. Greene, Bruce C. Hansen.
Methodology: Michelle R. Greene.
Visualization: Michelle R. Greene.
Writing – original draft: Michelle R. Greene, Bruce C. Hansen.
Writing – review & editing: Michelle R. Greene, Bruce C. Hansen.
References
1. Greene MR, Oliva A. The Briefest of Glances: The Time Course of Natural Scene Understanding. Psy-
chol Sci. 2009 Apr; 20:464–72. PMID: 19399976
2. Potter MC, Wyble B, Hagmann CE, McCourt ES. Detecting meaning in RSVP at 13 ms per picture.
Atten Percept Psychophys. 2014;1–10.
3. Greene MR, Fei-Fei L. Visual categorization is automatic and obligatory: Evidence from Stroop-like par-
adigm. J Vis. 2014 Jan 16; 14(1):14–14. PMID: 24434626
4. Edelman S. Representation is representation of similarities. Behav Brain Sci. 1998 Aug; 21(4):449–67;
discussion 467–498. PMID: 10097019
5. Kriegeskorte N, Mur M, Bandettini P. Representational Similarity Analysis–Connecting the Branches of
Systems Neuroscience. Front Syst Neurosci. 2008; 2.
6. DiCarlo JJ, Cox DD. Untangling invariant object recognition. Trends Cogn Sci. 2007 Aug; 11(8):333–41. PMID: 17631409
7. Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999 Nov;
2(11):1019–25. PMID: 10526343
8. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recog-
nition Challenge. Int J Comput Vis. 2015 Apr 11;1–42.
9. Fukushima K. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural
Netw. 1988; 1(2):119–30.
10. Agrawal P, Stansbury D, Malik J, Gallant JL. Pixels to Voxels: Modeling Visual Representation in the
Human Brain. ArXiv14075104 Cs Q-Bio [Internet]. 2014 Jul 18 [cited 2016 Aug 2]; Available from: http://
11. Cadieu CF, Hong H, Yamins DLK, Pinto N, Ardila D, Solomon EA, et al. Deep Neural Networks Rival
the Representation of Primate IT Cortex for Core Visual Object Recognition. PLOS Comput Biol. 2014
Dec 18; 10(12):e1003963. PMID: 25521294
12. Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A. Comparison of deep neural networks to spatio-
temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Sci
Rep. 2016 Jun 10; 6:27755. PMID: 27282108
13. Güçlü U, Gerven MAJ van. Deep Neural Networks Reveal a Gradient in the Complexity of Neural
Representations across the Ventral Stream. J Neurosci. 2015 Jul 8; 35(27):10005–14.
1523/JNEUROSCI.5023-14.2015 PMID: 26157000
14. Khaligh-Razavi S-M, Kriegeskorte N. Deep Supervised, but Not Unsupervised, Models May Explain IT
Cortical Representation. PLOS Comput Biol. 2014 Nov 6; 10(11):e1003915.
journal.pcbi.1003915 PMID: 25375136
15. Kubilius J, Bracci S, Beeck HPO de. Deep Neural Networks as a Computational Model for Human
Shape Sensitivity. PLOS Comput Biol. 2016 Apr 28; 12(4):e1004896.
pcbi.1004896 PMID: 27124699
16. Yamins DL, Hong H, Cadieu C, DiCarlo JJ. Hierarchical Modular Optimization of Convolutional Net-
works Achieves Representations Similar to Macaque IT and Human Ventral Stream. In: Burges CJC,
Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Process-
ing Systems 26 [Internet]. Curran Associates, Inc.; 2013 [cited 2016 Aug 2]. p. 3093–3101. Available
17. Yamins DLK, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nat
Neurosci. 2016 Mar; 19(3):356–65. PMID: 26906502
18. Seeliger K, Fritsche M, Güçlü U, Schoenmakers S, Schoffelen J-M, Bosch SE, et al. Convolutional
neural network-based encoding and decoding of visual object recognition in space and time. NeuroImage.
2017 Jul 16;
19. Cichy RM, Khosla A, Pantazis D, Oliva A. Dynamics of scene representations in the human brain
revealed by magnetoencephalography and deep neural networks. NeuroImage. 2017 Jun; 153:346–58. PMID: 27039703
20. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A. Places: A 10 million Image Database for Scene
Recognition. IEEE Trans Pattern Anal Mach Intell. 2017;PP(99):1–1.
21. Nili H, Wingfield C, Walther A, Su L, Marslen-Wilson W, Kriegeskorte N. A Toolbox for Representational
Similarity Analysis. PLOS Comput Biol. 2014 Apr 17; 10(4):e1003553.
pcbi.1003553 PMID: 24743308
22. Bar M, Kassam KS, Ghuman AS, Boshyan J, Schmid AM, Dale AM, et al. Top-down facilitation of visual
recognition. Proc Natl Acad Sci U S A. 2006 Jan 10; 103(2):449–54.
0507062103 PMID: 16407167
23. Clarke A, Devereux BJ, Randall B, Tyler LK. Predicting the Time Course of Individual Objects with
MEG. Cereb Cortex. 2015 Oct 1; 25(10):3602–12. PMID:
24. Xiao J, Ehinger KA, Hays J, Torralba A, Oliva A. SUN Database: Exploring a Large Collection of Scene
Categories. Int J Comput Vis. 2014 Aug 13;1–20.
25. Groen II, Greene MR, Baldassano C, Fei-Fei L, Beck DM, Baker CI. Distinct contributions of functional
and deep neural network features to representational similarity of scenes in human brain and behavior.
eLife. 2018 Mar 7;7.
26. Hansen BC, Essock EA. A horizontal bias in human visual processing of orientation and its correspon-
dence to the structural components of natural scenes. J Vis. 2004 Dec 1; 4(12):5–5.
27. Hansen BC, Hess RF. Discrimination of amplitude spectrum slope in the fovea and parafovea and the
local amplitude distributions of natural scene imagery. J Vis. 2006 Jun 2; 6(7):3–3.
28. Delorme A, Makeig S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics
including independent component analysis. J Neurosci Methods. 2004 Mar 15; 134(1):9–21.
https://doi.org/10.1016/j.jneumeth.2003.10.009 PMID: 15102499
29. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Net-
works. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in Neural Information
Processing Systems 25 [Internet]. Curran Associates, Inc.; 2012 [cited 2014 Jul 1]. p. 1097–1105.
Available from:
30. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. Caffe: Convolutional Architecture
for Fast Feature Embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia
[Internet]. New York, NY, USA: ACM; 2014 [cited 2017 Jan 10]. p. 675–678. (MM ‘14). Available from:
31. Chang C-C, Lin C-J. LIBSVM: A Library for Support Vector Machines. ACM Trans Intell Syst Technol.
2011 May; 2(3):27:1–27:27.
32. Chauvin A, Worsley KJ, Schyns PG, Arguin M, Gosselin F. Accurate statistical tests for smooth classifi-
cation images. J Vis. 2005 Oct 5; 5(9):659–67. PMID: 16356076
33. Serre T, Oliva A, Poggio T. A feedforward architecture accounts for rapid categorization. Proc Natl
Acad Sci. 2007; 104(15):6424–9. PMID: 17404214
34. Goddard E, Carlson TA, Dermody N, Woolgar A. Representational dynamics of object recognition:
Feedforward and feedback information flows. NeuroImage. 2016 Mar; 128:385–97.
1016/j.neuroimage.2016.01.006 PMID: 26806290
35. Ahissar M, Hochstein S. The reverse hierarchy theory of visual perceptual learning. Trends Cogn Sci.
2004; 8(10):457–64. PMID: 15450510
36. Peyrin C, Michel CM, Schwartz S, Thut G, Seghier M, Landis T, et al. The Neural Substrates and Timing
of Top–Down Processes during Coarse-to-Fine Categorization of Visual Scenes: A Combined fMRI
and ERP Study. J Cogn Neurosci. 2010 Jan 4; 22(12):2768–80.
21424 PMID: 20044901
37. Hochstein S, Ahissar M. View from the top: Hierarchies and reverse hierarchies in the visual system.
Neuron. 2002; 36:791–804. PMID: 12467584
38. Geirhos R, Janssen DHJ, Schütt HH, Rauber J, Bethge M, Wichmann FA. Comparing deep neural
networks against humans: object recognition when the signal gets weaker. ArXiv170606969 Cs Q-Bio Stat
[Internet]. 2017 Jun 21 [cited 2017 Dec 31]; Available from:
39. Jozwik KM, Kriegeskorte N, Storrs KR, Mur M. Deep Convolutional Neural Networks Outperform Fea-
ture-Based But Not Categorical Models in Explaining Object Similarity Judgments. Front Psychol [Inter-
net]. 2017 [cited 2018 Mar 31];8. Available from:
40. Cichy RM, Pantazis D. Multivariate pattern analysis of MEG and EEG: A comparison of representational
structure in time and space. NeuroImage. 2017 Sep 1; 158(Supplement C):441–54.
41. Bruner JS. On perceptual readiness. Psychol Rev. 1957; 64(2):123–52. PMID: 13420288
42. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015 May 28; 521(7553):436–44.
10.1038/nature14539 PMID: 26017442
43. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A. Learning Deep Features for Scene Recognition using
Places Database. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, editors.
Advances in Neural Information Processing Systems 27 [Internet]. Curran Associates, Inc.; 2014 [cited
2015 Oct 13]. p. 487–495. Available from:
44. Ba J, Caruana R. Do Deep Nets Really Need to be Deep? In: Ghahramani Z, Welling M, Cortes C, Law-
rence ND, Weinberger KQ, editors. Advances in Neural Information Processing Systems 27 [Internet].
Curran Associates, Inc.; 2014. p. 2654–2662. Available from:
45. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition [Internet]. 2016 [cited 2017 Nov 15].
p. 770–8. Available from:
Shared category representations in brain and DNN
PLOS Computational Biology | July 24, 2018 17 / 17

Supplementary resources

  • Preprint
    Conscious perception is crucial for adaptive behaviour yet access to consciousness varies for different types of objects. The visual system comprises regions with widely distributed category information and exemplar-level representations that cluster according to category. Does this categorical organisation in the brain provide insight into object-specific access to consciousness? We address this question using the Attentional Blink (AB) approach with visual objects as targets. We find large differences across categories in the AB then employ activation patterns extracted from a deep convolutional neural network (DCNN) to reveal that these differences depend on mid- to high-level, rather than low-level, visual features. We further show that these visual features can be used to explain variance in performance across trials. Taken together, our results suggest that the specific organisation of the higher-tier visual system underlies important functions relevant for conscious perception of differing natural images.
  • Preprint
    Full-text available
    Our understanding of information processing by the mammalian visual system has come through a variety of techniques ranging from psychophysics and fMRI to single unit recording and EEG. Each technique provides unique insights into the processing framework of the early visual system. Here, we focus on the nature of the information that is carried by steady state visual evoked potentials (SSVEPs). To study the information provided by SSVEPs, we presented human participants with a population of natural scenes and measured the relative SSVEP response. Rather than focus on particular features of this signal, we focused on the full state-space of possible responses and investigated how the evoked responses are mapped onto this space. Our results show that it is possible to map the relatively high-dimensional signal carried by SSVEPs onto a 2-dimensional space with little loss. We also show that a simple biologically plausible model can account for a high proportion of the explainable variance (~73%) in that space. Finally, we describe a technique for measuring the mutual information that is available about images from SSVEPs. The techniques introduced here represent a new approach to understanding the nature of the information carried by SSVEPs. Crucially, this approach is general and can provide a means of comparing results across different neural recording methods. Altogether, our study sheds light on the encoding principles of early vision and provides a much needed reference point for understanding subsequent transformations of the early visual response space to deeper knowledge structures that link different visual environments.
  • Preprint
    Visual scene understanding often requires the processing of human-object interactions. Here we seek to explore if and how well Deep Neural Network (DNN) models capture features similar to the brain's representation of humans, objects, and their interactions. We investigate brain regions which process human-, object-, or interaction-specific information, and establish correspondences between them and DNN features. Our results suggest that we can infer the selectivity of these regions to particular visual stimuli using DNN representations. We also map features from the DNN to the regions, thus linking the DNN representations to those found in specific parts of the visual cortex. In particular, our results suggest that a typical DNN representation contains encoding of compositional information for human-object interactions which goes beyond a linear combination of the encodings for the two components, thus suggesting that DNNs may be able to model this important property of biological vision.
  • Article
    Full-text available
    Eye tracking has long been used to measure overt spatial attention, and computational models of spatial attention reliably predict eye movements to natural images. However, researchers lack techniques to noninvasively access spatial representations in the human brain that guide eye movements. Here, we use functional magnetic resonance imaging (fMRI) to predict eye movement patterns from reconstructed spatial representations evoked by natural scenes. First, we reconstruct fixation maps to directly predict eye movement patterns from fMRI activity. Next, we use a model-based decoding pipeline that aligns fMRI activity to deep convolutional neural network activity to reconstruct spatial priority maps and predict eye movements in a zero-shot fashion. We predict human eye movement patterns from fMRI responses to natural scenes, provide evidence that visual representations of scenes and objects map onto neural representations that predict eye movements, and find a novel three-way link between brain activity, deep neural network models, and behavior.
  • Preprint
    Full-text available
    We describe a framework of hybrid cognition by formulating a hybrid cognitive agent that performs hierarchical active inference across a human and a machine part. We suggest that, in addition to enhancing human cognitive functions with an intelligent and adaptive interface, integrated cognitive processing could accelerate emergent properties within artificial intelligence. To establish this, a machine learning part learns to integrate into human cognition by explaining away multi-modal sensory measurements from the environment and physiology simultaneously with the brain signal. With ongoing training, the amount of predictable brain signal increases. This lends the agent the ability to self-supervise on increasingly high levels of cognitive processing in order to further minimize surprise in predicting the brain signal. Furthermore, with increasing level of integration, the access to sensory information about environment and physiology is substituted with access to their representation in the brain. While integrating into a joint embodiment of human and machine, human action and perception are treated as the machine's own. The framework can be implemented with invasive as well as non-invasive sensors for environment, body and brain interfacing. Online and offline training with different machine learning approaches are thinkable. Building on previous research on shared representation learning, we suggest a first implementation leading towards hybrid active inference with non-invasive brain interfacing and state of the art probabilistic deep learning methods. We further discuss how implementation might have effect on the meta-cognitive abilities of the described agent and suggest that with adequate implementation the machine part can continue to execute and build upon the learned cognitive processes autonomously.
  • Article
    Full-text available
    Does the human mind resemble the machine-learning systems that mirror its performance? Convolutional neural networks (CNNs) have achieved human-level benchmarks in classifying novel images. These advances support technologies such as autonomous vehicles and machine diagnosis; but beyond this, they serve as candidate models for human vision itself. However, unlike humans, CNNs are “fooled” by adversarial examples—nonsense patterns that machines recognize as familiar objects, or seemingly irrelevant image perturbations that nevertheless alter the machine’s classification. Such bizarre behaviors challenge the promise of these new advances; but do human and machine judgments fundamentally diverge? Here, we show that human and machine classification of adversarial images are robustly related: In 8 experiments on 5 prominent and diverse adversarial imagesets, human subjects correctly anticipated the machine’s preferred label over relevant foils—even for images described as “totally unrecognizable to human eyes”. Human intuition may be a surprisingly reliable guide to machine (mis)classification—with consequences for minds and machines alike.
  • Article
    Full-text available
    Inherent correlations between visual and semantic features in real-world scenes make it difficult to determine how different scene properties contribute to neural representations. Here, we assessed the contributions of multiple properties to scene representation by partitioning the variance explained in human behavioral and brain measurements by three feature models whose inter-correlations were minimized a priori through stimulus preselection. Behavioral assessments of scene similarity reflected unique contributions from a functional feature model indicating potential actions in scenes as well as high-level visual features from a deep neural network (DNN). In contrast, similarity of cortical responses in scene-selective areas was uniquely explained by mid- and high-level DNN features only, while an object label model did not contribute uniquely to either domain. The striking dissociation between functional and DNN features in their contribution to behavioral and brain representations of scenes indicates that scene-selective cortex represents only a subset of behaviorally relevant scene information.
  • Article
    Full-text available
    Recent advances in Deep convolutional Neural Networks (DNNs) have enabled unprecedentedly accurate computational models of brain representations, and present an exciting opportunity to model diverse cognitive functions. State-of-the-art DNNs achieve human-level performance on object categorisation, but it is unclear how well they capture human behavior on complex cognitive tasks. Recent reports suggest that DNNs can explain significant variance in one such task, judging object similarity. Here, we extend these findings by replicating them for a rich set of object images, comparing performance across layers within two DNNs of different depths, and examining how the DNNs’ performance compares to that of non-computational “conceptual” models. Human observers performed similarity judgments for a set of 92 images of real-world objects. Representations of the same images were obtained in each of the layers of two DNNs of different depths (8-layer AlexNet and 16-layer VGG-16). To create conceptual models, other human observers generated visual-feature labels (e.g., “eye”) and category labels (e.g., “animal”) for the same image set. Feature labels were divided into parts, colors, textures and contours, while category labels were divided into subordinate, basic, and superordinate categories. We fitted models derived from the features, categories, and from each layer of each DNN to the similarity judgments, using representational similarity analysis to evaluate model performance. In both DNNs, similarity within the last layer explains most of the explainable variance in human similarity judgments. The last layer outperforms almost all feature-based models. Late and mid-level layers outperform some but not all feature-based models. Importantly, categorical models predict similarity judgments significantly better than any DNN layer. Our results provide further evidence for commonalities between DNNs and brain representations. 
Models derived from visual features other than object parts perform relatively poorly, perhaps because DNNs more comprehensively capture the colors, textures and contours which matter to human object perception. However, categorical models outperform DNNs, suggesting that further work may be needed to bring high-level semantic representations in DNNs closer to those extracted by humans. Modern DNNs explain similarity judgments remarkably well considering they were not trained on this task, and are promising models for many aspects of human cognition.
  • Conference Paper
    We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif- ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implemen- tation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry
  • Article
    Representations learned by deep convolutional neural networks (CNNs) for object recognition are a widely investigated model of the processing hierarchy in the human visual system. Using functional magnetic resonance imaging, CNN representations of visual stimuli have previously been shown to correspond to processing stages in the ventral and dorsal streams of the visual system. Whether this correspondence between models and brain signals also holds for activity acquired at high temporal resolution has been explored less exhaustively. Here, we addressed this question by combining CNN-based encoding models with magnetoencephalography (MEG). Human participants passively viewed 1,000 images of objects while MEG signals were acquired. We modelled their high temporal resolution source-reconstructed cortical activity with CNNs, and observed a feed-forward sweep across the visual hierarchy between 75 and 200 ms after stimulus onset. This spatiotemporal cascade was captured by the network layer representations, where the increasingly abstract stimulus representation in the hierarchical network model was reflected in different parts of the visual cortex, following the visual ventral stream. We further validated the accuracy of our encoding model by decoding stimulus identity in a left-out validation set of viewed objects, achieving state-of-the-art decoding accuracy.
  • Article
    Multivariate pattern analysis of magnetoencephalography (MEG) and electroencephalography (EEG) data can reveal the rapid neural dynamics underlying cognition. However, MEG and EEG have systematic differences in sampling neural activity. This poses the question of the degree to which such measurement differences consistently bias the results of multivariate analysis applied to MEG and EEG activation patterns. To investigate, we conducted a concurrent MEG/EEG study while participants viewed images of everyday objects. We applied multivariate classification analyses to MEG and EEG data, and compared the resulting time courses to each other, and to fMRI data for an independent evaluation in space. We found that both MEG and EEG revealed the millisecond spatio-temporal dynamics of visual processing with largely equivalent results. Beyond yielding convergent results, we found that MEG and EEG also captured partly unique aspects of visual representations. Those unique components emerged earlier in time for MEG than for EEG. Identifying the sources of those unique components with fMRI, we found the locus for both MEG and EEG in high-level visual cortex, and in addition for MEG in low-level visual cortex. Together, our results show that multivariate analyses of MEG and EEG data offer a convergent and complementary view on neural processing, and motivate the wider adoption of these methods in both MEG and EEG research.
  • Article
    The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art convolutional neural networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database, along with the Places-CNNs, offers a novel resource to guide future progress on scene recognition problems.
  • Article
    Human visual object recognition is typically rapid and seemingly effortless, as well as largely independent of viewpoint and object orientation. Until very recently, animate visual systems were the only ones capable of this remarkable computational feat. This has changed with the rise of a class of computer vision algorithms called deep neural networks (DNNs) that achieve human-level classification performance on object recognition tasks. Furthermore, a growing number of studies report similarities in the way DNNs and the human visual system process objects, suggesting that current DNNs may be good models of human visual object recognition. Yet there clearly exist important architectural and processing differences between state-of-the-art DNNs and the primate visual system. The potential behavioural consequences of these differences are not well understood. We aim to address this issue by comparing human and DNN generalisation abilities towards image degradations. We find the human visual system to be more robust to image manipulations like contrast reduction, additive noise or novel eidolon-distortions. In addition, we find progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker, indicating that there may still be marked differences in the way humans and current DNNs perform visual object recognition. We envision that our findings, as well as our carefully measured and freely available behavioural datasets, provide a new useful benchmark for the computer vision community to improve the robustness of DNNs and a motivation for neuroscientists to search for mechanisms in the brain that could facilitate this robustness.