Invariant visual representation by single neurons in
the human brain
R. Quian Quiroga¹,², L. Reddy¹, G. Kreiman³, C. Koch¹ & I. Fried²,⁴
It takes a fraction of a second to recognize a person or an object even when seen under strikingly different conditions. How such a robust, high-level representation is achieved by neurons in the human brain is still unclear (refs 1–6). In monkeys, neurons in the upper stages of the ventral visual pathway respond to complex images such as faces and objects and show some degree of invariance to metric properties such as the stimulus size, position and viewing angle (refs 2, 4, 7–12). We have previously shown that neurons in the human medial temporal lobe (MTL) fire selectively to images of faces, animals, objects or scenes (refs 13, 14). Here we report on a remarkable subset of MTL neurons that are selectively activated by strikingly different pictures of given individuals, landmarks or objects and in some cases even by letter strings with their names. These results suggest an invariant, sparse and explicit code, which might be important in the transformation of complex visual percepts into long-term and more abstract memories.
The subjects were eight patients with pharmacologically intractable epilepsy who had been implanted with depth electrodes to localize the focus of seizure onset. For each patient, the placement of the depth electrodes, in combination with micro-wires, was determined exclusively by clinical criteria (ref. 13). We analysed responses of neurons from the hippocampus, amygdala, entorhinal cortex and parahippocampal gyrus to images shown on a laptop computer in 21 recording sessions. Stimuli were different pictures of individuals, animals, objects and landmark buildings presented for 1 s in pseudo-random order, six times each. An unpublished observation in our previous recordings was the sometimes surprising degree of invariance inherent in the neurons' (that is, units') firing behaviour. For example, in one case, a unit responded only to three completely different images of the ex-president Bill Clinton. Another unit (from a different patient) responded only to images of The Beatles, another one to cartoons from The Simpsons television series and another one to pictures of the basketball player Michael Jordan. This suggested that neurons might encode an abstract representation of an individual. We here ask whether MTL neurons can represent high-level information in an abstract manner characterized by invariance to the metric characteristics of the images. By invariance we mean that a given unit is activated mainly, albeit not necessarily uniquely, by different pictures of a given individual, landmark or object.
To investigate further this abstract representation, we introduced
several modifications to optimize our recording and data processing
conditions (see Supplementary Information) and we designed a
paradigm to systematically search for and characterize such invariant
neurons. In a first recording session, usually done early in the
morning (screening session), a large number of images of famous
persons, landmark buildings, animals and objects were shown. This
set was complemented by images chosen after an interview with the
patient. The mean number of images in the screening session was
93.9 (range 71–114). The data were quickly analysed offline to
determine the stimuli that elicited responses in at least one unit
(see definition of response below). Subsequently, in later sessions
(testing sessions) between three and eight variants of all the stimuli
that had previously elicited a response were shown. If not enough
stimuli elicited significant responses in the screening session, we
chose those stimuli with the strongest responses. On average, 88.6
(range 70–110) different images showing distinct views of 14 indi-
viduals or objects (range 7–23) were used in the testing sessions.
Single views of random stimuli (for example, famous and non-
famous faces, houses, animals and so on) were also included. The total
number of stimuli was determined by the time available with the
patient (about 30 min on average). Because in our clinical set-up the
recording conditions can sometimes change within a few hours, we
always tried to perform the testing sessions shortly after the screening
sessions in order to maximize the probability of recording from the
same units. Unless explicitly stated otherwise, all the data reported in
this study are from the testing sessions. To hold their attention,
patients had to perform a simple task during all sessions (indicating
with a key press whether a human face was present in the image).
Performance was close to 100%.
We recorded from a total of 993 units (343 single units and 650
multi-units), with an average of 47.3 units per session (16.3 single
units and 31.0 multi-units). Of these, 132 (14%; 64 single units and
68 multi-units) showed a statistically significant response to at least
one picture. A response was considered significant if it was larger
than the mean plus 5 standard deviations (s.d.) of the baseline and
had at least two spikes in the post-stimulus time interval considered
(300–1,000 ms). All these responses were highly selective: for the
responsive units, an average of only 2.8% of the presented pictures
(range: 0.9–22.8%) showed significant activations according to this
criterion. This high selectivity was also present in the screening
sessions, where only 3.1% of the pictures shown elicited responses
(range: 0.9–18.0%). There was no significant difference between the
relative number of responsive pictures obtained in the screening and
testing sessions (t-test, P = 0.40). Responses started around 300 ms
after stimulus onset and had mainly three non-exclusive patterns of
activation (with about one-third of the cells having each type of
response): the response disappeared with stimulus offset, 1 s after
stimulus onset; it consisted of a rapid sequence of about 6 spikes
(s.d. = 5) between 300 and 600 ms after stimulus onset; or it was
prolonged and continued up to 1 s after stimulus offset. For this
study, we calculated the responses in a time window between 300 and
1,000 ms after stimulus onset. In a few cases we also observed cells
that responded selectively only after the image was removed from
view (that is, after 1 s). These are not further analysed here.
LETTERS
¹Computation and Neural Systems, California Institute of Technology, Pasadena, California 91125, USA. ²Division of Neurosurgery and Neuropsychiatric Institute, University of California, Los Angeles (UCLA), California 90095, USA. ³Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA. ⁴Functional Neurosurgery Unit, Tel-Aviv Medical Center and Sackler Faculty of Medicine, Tel-Aviv University, Tel-Aviv 69978, Israel. †Present address: Department of Engineering, University of Leicester, LE1 7RH, UK.

Vol 435|23 June 2005|doi:10.1038/nature03687
Figure 1a shows the responses of a single unit in the left posterior
hippocampus to a selection of 30 out of the 87 pictures presented to
the patient. None of the other pictures elicited a statistically signifi-
cant response. This unit fired to all pictures of the actress Jennifer
Aniston alone, but not (or only very weakly) to other famous and
non-famous faces, landmarks, animals or objects. Interestingly, the
unit did not respond to pictures of Jennifer Aniston together with the
actor Brad Pitt (but see Supplementary Fig. 2). Pictures of Jennifer
Aniston elicited an average of 4.85 spikes (s.d. = 3.59) between 300
and 600 ms after stimulus onset. Notably, this unit was nearly silent
during baseline (average of 0.02 spikes in a 700-ms pre-stimulus time
window) and during the presentation of most other pictures
(Fig. 1b). Figure 1b plots the median number of spikes (across trials)
in the 300–1,000-ms post-stimulus interval for all 87 pictures shown
to the patient. The histogram shows a marked differential response to
pictures of Jennifer Aniston (red bars).
Figure 1 | A single unit in the left posterior hippocampus activated exclusively by different views of the actress Jennifer Aniston. a, Responses to 30 of the 87 images are shown. There were no statistically significant responses to the other 57 pictures. For each picture, the corresponding raster plots (the order of trial number is from top to bottom) and post-stimulus time histograms are given. Vertical dashed lines indicate image onset and offset (1 s apart). Note that owing to insurmountable copyright problems, all original images were replaced in this and all subsequent figures by very similar ones (same subject, animal or building, similar pose, similar colour, line drawing, and so on). b, The median responses to all pictures. The image numbers correspond to those in a. The two horizontal lines show the mean baseline activity (0.02 spikes) and the mean plus 5 s.d. (0.82 spikes). Pictures of Jennifer Aniston are denoted by red bars. c, The associated ROC curve (red trace) testing the hypothesis that the cell responded in an invariant manner to all seven photographs of Jennifer Aniston (hits) but not to other images (including photographs of Jennifer Aniston and Brad Pitt together; false positives). The grey lines correspond to the same ROC analysis for 99 surrogate sets of 7 randomly chosen pictures (P < 0.01). The area under the red curve is 1.00.

Next, we quantified the degree of invariance using a receiver operating characteristic (ROC) framework (ref. 15). We considered as the hit rate (y axis) the relative number of responses to pictures of a specific individual, object, animal or landmark building, and as the false positive rate (x axis) the relative number of responses to other pictures. The ROC curve corresponds to the performance of a linear binary classifier for different values of a response threshold.
Decreasing the threshold increases the probability of hits but also of
false alarms. A cell responding to a large set of pictures of different
individuals will have a ROC curve close to the diagonal (with an area
under the curve of 0.5), whereas a cell that responds to all pictures of
an individual but not to others will have a convex ROC curve far from
the diagonal, with an area close to 1. In Fig. 1c we show the ROC
curve for all seven pictures of Jennifer Aniston (red trace, with an area
equal to 1). The grey lines show 99 ROC surrogate curves, testing
invariance to randomly selected groups of pictures (see Methods). As
expected, these curves are close to the diagonal, having an area of
about 0.5. None of the 99 surrogate curves had an area equal or larger
than the original ROC curve, implying that it is unlikely (P < 0.01)
that the responses to Jennifer Aniston were obtained by chance. A
responsive unit was defined to have an invariant representation if the
area under the ROC curve was larger than the area of the 99 surrogate
curves.
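The ROC-plus-surrogate procedure is simple enough to sketch in code. The following is an illustrative reconstruction, not the authors' analysis code; the function and variable names (`roc_area`, `invariance_test`, and so on) are our own, and the trapezoidal integration of the threshold sweep is one reasonable implementation choice.

```python
import random

def roc_area(responses, target_idx):
    """Area under the ROC curve obtained by sweeping a response threshold
    over the median spike counts.

    responses  -- median spike count (across trials) for each picture
    target_idx -- indices of the pictures of the individual under test
    """
    targets = set(target_idx)
    n_hit = len(targets)
    n_fa = len(responses) - n_hit
    # Sweep the threshold from high to low; at each threshold record the
    # hit rate and false-positive rate, then integrate by trapezoids.
    thresholds = sorted(set(responses), reverse=True)
    points = [(0.0, 0.0)]
    for th in thresholds:
        hits = sum(1 for i in targets if responses[i] >= th) / n_hit
        fas = sum(1 for i, r in enumerate(responses)
                  if i not in targets and r >= th) / n_fa
        points.append((fas, hits))
    points.append((1.0, 1.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def invariance_test(responses, target_idx, n_surrogates=99, seed=0):
    """Invariance criterion: the real ROC area must exceed the areas of
    all 99 surrogate sets of equally many randomly chosen pictures
    (that is, P < 0.01)."""
    rng = random.Random(seed)
    real = roc_area(responses, target_idx)
    n = len(target_idx)
    surrogates = [roc_area(responses, rng.sample(range(len(responses)), n))
                  for _ in range(n_surrogates)]
    return real, all(real > s for s in surrogates)
```

With the median spike counts of Fig. 1b, the seven Jennifer Aniston pictures would yield an area of 1.00, whereas surrogate sets of seven random pictures cluster near 0.5.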
Figure 2 shows another single unit located in the right anterior
hippocampus of a different patient. This unit was selectively acti-
vated by pictures of the actress Halle Berry as well as by a drawing of
her (but not by other drawings; for example, picture no. 87). This
unit was also activated by several pictures of Halle Berry dressed as
Catwoman, her character in a recent film, but not by other images of
Catwoman that were not her (data not shown). Notably, the unit was
selectively activated by the letter string ‘Halle Berry’. Such an
invariant pattern of activation goes beyond common visual features
of the different stimuli. As with the previous unit, the responses were
mainly localized between 300 and 600 ms after stimulus onset.
Figure 2 | A single unit in the right anterior hippocampus that responds to pictures of the actress Halle Berry (conventions as in Fig. 1). a–c, Strikingly, this cell also responds to a drawing of her, to herself dressed as Catwoman (a recent movie in which she played the lead role) and to the letter string ‘Halle Berry’ (picture no. 96). Such an invariant response cannot be attributed to common visual features of the stimuli. This unit also had a very low baseline firing rate (0.06 spikes). The area under the red curve in c is 0.99.
Figure 2c shows the ROC curve for the pictures of Halle Berry (red
trace) and for 99 surrogates (grey lines). The area under the ROC
curve was 0.99, larger than that of the surrogates.
Figure 3 illustrates a multi-unit in the left anterior hippocampus
responding to pictures of the Sydney Opera House and the Baha’i
Temple. Because the patient identified both landmark buildings as
the Sydney Opera House, all these pictures were considered as a single
landmark building for the ROC analysis. This unit also responded to
the letter string ‘Sydney Opera’ (pictures no. 2 and 8) but not to other
letter strings, such as ‘Eiffel Tower’ (picture no. 1). More examples of
invariant responses are shown in the Supplementary Figs 2–11.
Out of the 132 responsive units, 51 (38.6%; 30 single units and 21
multi-units) showed invariance to a particular individual (38 units
responding to Jennifer Aniston, Halle Berry, Julia Roberts, Kobe
Bryant, and so on), landmark building (6 units responding to the
Tower of Pisa, the Baha’i Temple and the Sydney Opera House),
animal (5 units responding to spiders, seals and horses) or object (2
units responding to specific food items), with P < 0.01 as defined
above by means of the surrogate tests. A one-way analysis of variance
(ANOVA) yielded similar results (see Methods). Eight of these units
(two single units and six multi-units) responded to two different
individuals (or to an individual and an object). Figure 4 presents the
distribution of the areas under the ROC curves for all 51 units that
showed an invariant representation to individuals or objects. The
areas ranged from 0.76 to 1.00, with a median of 0.94. These units
were located in the hippocampus (27 out of 60 responsive units;
45%), parahippocampal gyrus (11 out of 20 responsive units; 55%),
amygdala (8 out of 30 responsive units; 27%) and entorhinal cortex (5 out of 22 responsive units; 23%).

Figure 3 | A multi-unit in the left anterior hippocampus that responds to photographs of the Sydney Opera House and the Baha’i Temple (conventions as in Fig. 1). a–c, The patient identified all pictures of both of these buildings as the Sydney Opera, and we therefore considered them as a single landmark. This unit also responded to the presentation of the letter string ‘Sydney Opera’ (pictures no. 2 and 8), but not to other strings, such as ‘Eiffel Tower’ (picture no. 1). In contrast to the previous two figures, this unit had a higher baseline firing rate (2.64 spikes). The area under the red curve in c is 0.97.

There were no clear differences in
the latencies and firing patterns among the different areas. However,
more data are needed before making a conclusive claim about
systematic differences between the various structures of the MTL.
As shown in Figs 2 and 3, one of the most extreme cases of an
abstract representation is the one given by responses to pictures of a
particular individual (or object) and to the presentation of the
corresponding letter string with its name. In 18 of the 21 testing
sessions we also tested responses to letter strings with the names of
the individuals and objects. Eight of the 132 responsive units (6.1%)
showed a selective response to an individual and its name (with no
response to other names). Six of these were in the hippocampus, one
was in the entorhinal cortex and one was in the amygdala.
These neuronal responses cannot be attributed to any particular
movement artefact, because selective responses started around
300 ms after image onset, whereas key presses occurred at 1 s or
later, and neuronal responses were very selective. About one-third of
the responsive units had a response localized between 300 and
600 ms. This interval corresponds to the latency of event-related
responses correlated with the recognition of ‘oddball’ stimuli in scalp electroencephalogram, namely, the P300 (ref. 16). Some studies argue for a generation of the P300 in the hippocampal formation and amygdala (refs 17, 18), consistent with our findings.
What are the common features that activate these neurons? Given
the great diversity of distinct images of a single individual (pencil
sketches, caricatures, letter strings, coloured photographs with
different backgrounds) that these cells can selectively respond to, it
is unlikely that this degree of invariance can be explained by a simple
set of metric features common to these images. Indeed, our data are
compatible with an abstract representation of the identity of the
individual or object shown. The existence of such high-level visual
responses in medial temporal lobe structures, usually considered to
be involved in long-term memory formation and consolidation,
should not be surprising given the following: (1) the known anatomical connections between the higher stages of the visual hierarchy in the ventral pathway and the MTL (refs 19, 20); (2) the well-characterized reactivity of the cortical stages feeding into the MTL to the sight of faces, objects, or spatial scenes (as ascertained using functional magnetic resonance imaging (fMRI) in humans (refs 21, 22) and electrophysiology in monkeys (refs 2, 4, 7–11)); and (3) the observation that any visual percept that will be consciously remembered later on will have to be represented in the hippocampal system (refs 23–25). This is true even though patients with bilateral loss of parts of the MTL do not, in general, have a deficit in the perception of images (ref. 25). Neurons in the MTL might have a fundamental role in learning associations between abstract representations (ref. 26). Thus, our observed invariant responses probably arise from experiencing very different pictures, words or other visual stimuli in association with a given individual or object.
How neurons encode different percepts is one of the most intri-
guing questions in neuroscience. Two extreme hypotheses are
schemes based on the explicit representations by highly selective
(cardinal, gnostic or grandmother) neurons and schemes that rely on an implicit representation over a very broad and distributed population of neurons (refs 1–4, 6). In the latter case, recognition would require the
simultaneous activation of a large number of cells and therefore we
would expect each cell to respond to many pictures with similar basic
features. This is in contrast to the sparse firing we observe, because
most MTL cells do not respond to the great majority of images seen
by the patient. Furthermore, cells signal a particular individual or object in an explicit manner (ref. 27), in the sense that the presence of the individual can, in principle, be reliably decoded from a very small number of neurons. We do not mean to imply the existence of single neurons coding uniquely for discrete percepts, for several reasons: first, some of these units responded to pictures of more than one individual or object; second, given the limited duration of our recording sessions, we can only explore a tiny portion of stimulus space; and third, the fact that we can discover in this short time some images, such as photographs of Jennifer Aniston, that drive the cells suggests that each cell might represent more than one class of
images. Yet, this subset of MTL cells is selectively activated by
different views of individuals, landmarks, animals or objects. This
is quite distinct from a completely distributed population code and
suggests a sparse, explicit and invariant encoding of visual percepts in
MTL. Such an abstract representation, in contrast to the metric
representation in the early stages of the visual pathway, might be
important in the storage of long-term memories. Other factors,
including emotional responses towards some images, could conceivably influence the neuronal activity as well. The responses of these neurons are reminiscent of the behaviour of hippocampal place cells in rodents (ref. 28) that only fire if the animal moves through a particular spatial location, with the actual place field defined independently of sensory cues. Notably, place cells have been found recently in the human hippocampus as well (ref. 29). Both classes of neurons (place cells and the cells in the present study) have a very low baseline activity and respond in a highly selective manner. Future research might show that this similarity has functional implications, enabling mammals to encode behaviourally important features of the environment and to transition between them, either in physical space or in a more conceptual space (ref. 13).
METHODS
The data in the present study come from 21 sessions in 8 patients with pharmacologically intractable epilepsy (eight right handed; three male; 17–47 years old). Extensive non-invasive monitoring did not yield concordant data corresponding to a single resectable epileptogenic focus. Therefore, the patients were implanted with chronic depth electrodes for 7–10 days to determine the seizure focus for possible surgical resection (ref. 13). Here we report data from sites in the hippocampus, amygdala, entorhinal cortex and parahippocampal gyrus. All studies conformed to the guidelines of the Medical Institutional Review Board at UCLA. The electrode locations were based exclusively on clinical criteria and were verified by MRI or by computer tomography co-registered to preoperative MRI. Each electrode probe had a total of nine micro-wires at its end (ref. 13): eight active recording channels and one reference. The differential signal from the micro-wires was amplified using a 64-channel Neuralynx system, filtered between 1 and 9,000 Hz. We computed the power spectrum for every unit after spike sorting. Units that showed evidence of line noise were excluded from subsequent analysis (ref. 14). Signals were sampled at 28 kHz. Each recording session lasted about 30 min.
Figure 4 | Distribution of the area under the ROC curves for the 51 units (out of 132 responsive units) showing an invariant representation. Of these, 43 responded to a single individual or object and 8 to two individuals or objects. The dashed vertical line marks the median of the distribution (0.94).

Subjects lay in bed, facing a laptop computer. Each image covered about 1.5° and was presented at the centre of the screen six times for 1 s. The order of the pictures was randomized. Subjects had to respond, after image offset, according to whether the picture contained a human face or something else by pressing the ‘Y’ and ‘N’ keys, respectively. This simple task, on which performance was
virtually flawless, required them to attend to the pictures. After the experiments,
patients gave feedback on whether they recognized the images or not. Pictures
included famous and unknown individuals, animals, landmarks and objects. We
tried to maximize the differences between pictures of the individuals (for
example, different clothing, size, point of view, and so on). In 18 of the 21
sessions, we also presented letter strings with names of individuals or objects.
The data from the screening sessions were rapidly processed to identify
responsive units and images. All pictures that elicited a response in the screening
session were included in the later testing sessions. Three to eight different views
of seven to twenty-three different individuals or objects were used in the testing
sessions with a mean of 88.6 images per session (range 70–110). Spike detection
and sorting was applied to the continuous recordings using a novel clustering
algorithm (ref. 30; see Supplementary Information). The response to a picture was
defined as the median number of spikes across trials between 300 and 1,000 ms
after stimulus onset. Baseline activity was the average spike count for all pictures
between 1,000 and 300 ms before stimulus onset. A unit was considered
responsive if the activity to at least one picture fulfilled two criteria: (1) the
median number of spikes was larger than the average number of spikes for the
baseline plus 5 s.d.; and (2) the median number of spikes was at least two.
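The two-part responsiveness criterion can be written down directly. The sketch below is an illustrative reconstruction with our own function and variable names, assuming per-trial spike counts have already been extracted for the 300–1,000-ms post-stimulus and 1,000–300-ms pre-stimulus windows.

```python
from statistics import mean, median, stdev

def is_responsive(trial_counts_per_picture, baseline_counts,
                  n_sd=5, min_spikes=2):
    """A unit is responsive if, for at least one picture, the median
    spike count across trials (300-1,000 ms window) exceeds the mean
    baseline plus 5 s.d. AND is at least two spikes.

    trial_counts_per_picture -- one list of per-trial spike counts per picture
    baseline_counts          -- spike counts in the pre-stimulus window
    """
    threshold = mean(baseline_counts) + n_sd * stdev(baseline_counts)
    for counts in trial_counts_per_picture:
        m = median(counts)
        if m > threshold and m >= min_spikes:
            return True
    return False
```

The two-spike floor matters for units that are nearly silent at baseline (such as the unit of Fig. 1, with 0.02 spikes), where mean plus 5 s.d. alone would be a fraction of a single spike.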
The classification between single unit and multi-unit was done visually based
on the following: (1) the spike shape and its variance; (2) the ratio between the
spike peak value and the noise level; (3) the inter-spike interval distribution of
each cluster; and (4) the presence of a refractory period for the single units (that
is, less than 1% of spikes within less than 3 ms inter-spike interval).
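Criterion (4), the refractory-period check, amounts to asking what fraction of inter-spike intervals falls below 3 ms. A minimal sketch (the naming is ours), assuming spike times are given in milliseconds:

```python
def passes_refractory_check(spike_times_ms, refractory_ms=3.0,
                            max_fraction=0.01):
    """Single-unit criterion: less than 1% of inter-spike intervals
    shorter than 3 ms, since a real neuron cannot fire again within
    its refractory period; more violations suggest a multi-unit."""
    isis = [t1 - t0 for t0, t1 in zip(spike_times_ms, spike_times_ms[1:])]
    if not isis:
        return True  # too few spikes to assess
    violations = sum(1 for isi in isis if isi < refractory_ms)
    return violations / len(isis) < max_fraction
```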
Whenever a unit had a response to a given stimulus, we further analysed the
responses to other pictures of the same individual or object by a ROC analysis.
This tested whether cells responded selectively to pictures of a given individual.
The hit rate (y axis) was defined as the number of responses to the individual
divided by the total number of pictures of this individual. The false positive rate
(x axis) was defined as the number of responses to the other pictures divided by
the total number of other pictures. The ROC curve was obtained by gradually
lowering the threshold of the responses (the median number of spikes in Figs 1b,
2b and 3b). Starting with a very high threshold (no hits, no false positives, lower
left-hand corner in the ROC diagram), if a unit responds exclusively to an image
of a particular individual or object, the ROC curve will show a steep increase
when lowering the threshold (a hit rate of 1 and no false positives). If a unit
responds to a random selection of pictures, it will have a similar relative number
of hits and false positives and the ROC curve will fall along the diagonal. In the
first case, for a highly invariant unit, the area under the ROC curve will be close
to 1, whereas in the latter case it will be about 0.5. To evaluate the statistical
significance, we created 99 surrogate curves for each responsive unit, testing the
null hypothesis that the unit responded preferentially to n randomly chosen
pictures (with n being the number of pictures of the individual for which
invariance was tested). A unit was considered invariant to a certain individual or
object if the area under the ROC curve was larger than the area of all of the 99
surrogates (that is, with a confidence of P < 0.01). Alternatively, the ROC
analysis can be done with the single trial responses instead of the median
responses across trials. Here, responses to the trials corresponding to any picture
of the individual tested are considered as hits and responses to trials to other
pictures as false positives. This trial-by-trial analysis led to very similar results,
with 55 units of all 132 responsive units showing an invariant representation. A
one-way ANOVA also yielded similar results. In particular, we tested whether the
distribution of median firing rates for all responsive units showed a dependence
on the factor identity (that is, the individual, landmark or object shown). The
different views of each individual were the repeated measures. As with the ROC
analysis, an ANOVA test was performed on all responsive units. Overall, the
results were very similar to those obtained with the ROC analysis: of 132
responsive units, 49 had a significant effect for factor identity with P < 0.01,
compared to 51 units showing an invariant representation with the ROC
analysis. The ANOVA, however, does not demonstrate that the invariant
responses were very selective, whereas the ROC analysis explicitly tests the
presence of an invariant as well as sparse representation.
Images were obtained from Corbis and Photorazzi, with licensed rights to
reproduce them in this paper and in the Supplementary Information.
Received 1 December 2004; accepted 3 February 2005.
1. Barlow, H. Single units and sensation: a neuron doctrine for perception.
Perception 1, 371–-394 (1972).
2. Gross, C. G., Bender, D. B. & Rocha-Miranda, C. E. Visual receptive fields of
neurons in inferotemporal cortex of the monkey. Science 166, 1303–-1306
(1969).
3. Konorski, J. Integrative Activity of the Brain (Univ. Chicago Press, Chicago, 1967).
4. Logothetis, N. K. & Sheinberg, D. L. Visual object recognition. Annu. Rev.
Neurosci. 19, 577–-621 (1996).
5. Riesenhuber, M. & Poggio, T. Neural mechanisms of object recognition. Curr.
Opin. Neurobiol. 12, 162–-168 (2002).
6. Young, M. P. & Yamane, S. Sparse population coding of faces in the inferior
temporal cortex. Science 256, 1327–-1331 (1992).
7. Logothetis, N. K., Pauls, J. & Poggio, T. Shape representation in the inferior
temporal cortex of monkeys. Curr. Biol. 5, 552–-563 (1995).
8. Logothetis, N. K. & Pauls, J. Psychophysical and physiological evidence for
viewer-centered object representations in the primate. Cereb. Cortex 3, 270–288 (1995).
9. Perrett, D., Rolls, E. & Caan, W. Visual neurons responsive to faces in the monkey temporal cortex. Exp. Brain Res. 47, 329–342 (1982).
10. Schwartz, E. L., Desimone, R., Albright, T. D. & Gross, C. G. Shape recognition and inferior temporal neurons. Proc. Natl Acad. Sci. USA 80, 5776–5778 (1983).
11. Tanaka, K. Inferotemporal cortex and object vision. Annu. Rev. Neurosci. 19, 109–139 (1996).
12. Miyashita, Y. & Chang, H. S. Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature 331, 68–71 (1988).
13. Fried, I., MacDonald, K. A. & Wilson, C. Single neuron activity in human hippocampus and amygdala during recognition of faces and objects. Neuron 18, 753–765 (1997).
14. Kreiman, G., Koch, C. & Fried, I. Category-specific visual responses of single neurons in the human medial temporal lobe. Nature Neurosci. 3, 946–953 (2000).
15. Macmillan, N. A. & Creelman, C. D. Detection Theory: A User's Guide (Cambridge Univ. Press, New York, 1991).
16. Picton, T. The P300 wave of the human event-related potential. J. Clin. Neurophysiol. 9, 456–479 (1992).
17. Halgren, E., Marinkovic, K. & Chauvel, P. Generators of the late cognitive potentials in auditory and visual oddball tasks. Electroencephalogr. Clin. Neurophysiol. 106, 156–164 (1998).
18. McCarthy, G., Wood, C. C., Williamson, P. D. & Spencer, D. D. Task-dependent field potentials in human hippocampal formation. J. Neurosci. 9, 4253–4268 (1989).
19. Saleem, K. S. & Tanaka, K. Divergent projections from the anterior inferotemporal area TE to the perirhinal and entorhinal cortices in the macaque monkey. J. Neurosci. 16, 4757–4775 (1996).
20. Suzuki, W. A. Neuroanatomy of the monkey entorhinal, perirhinal and parahippocampal cortices: Organization of cortical inputs and interconnections with amygdala and striatum. Semin. Neurosci. 8, 3–12 (1996).
21. Kanwisher, N., McDermott, J. & Chun, M. M. The fusiform face area: A module in human extrastriate cortex specialized for face perception. J. Neurosci. 17, 4302–4311 (1997).
22. Haxby, J. V. et al. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425–2430 (2001).
23. Eichenbaum, H. A cortical-hippocampal system for declarative memory. Nature Rev. Neurosci. 1, 41–50 (2000).
24. Hampson, R. E., Pons, T. P., Stanford, T. R. & Deadwyler, S. A. Categorization in the monkey hippocampus: A possible mechanism for encoding information into memory. Proc. Natl Acad. Sci. USA 101, 3184–3189 (2004).
25. Squire, L. R., Stark, C. E. L. & Clark, R. E. The medial temporal lobe. Annu. Rev. Neurosci. 27, 279–306 (2004).
26. Miyashita, Y. Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335, 817–820 (1988).
27. Koch, C. The Quest for Consciousness: A Neurobiological Approach (Roberts, Englewood, Colorado, 2004).
28. Wilson, M. A. & McNaughton, B. L. Dynamics of the hippocampal ensemble code for space. Science 261, 1055–1058 (1993).
29. Ekstrom, A. D. et al. Cellular networks underlying human spatial navigation. Nature 425, 184–187 (2003).
30. Quian Quiroga, R., Nadasdy, Z. & Ben-Shaul, Y. Unsupervised spike detection and sorting with wavelets and super-paramagnetic clustering. Neural Comput. 16, 1661–1687 (2004).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements We thank all patients for their participation; P. Sinha for
drawing some faces; colleagues for providing pictures; I. Wainwright for
administrative assistance; and E. Behnke, T. Fields, E. Ho, E. Isham, A. Kraskov,
P. Steinmetz, I. Viskontas and C. Wilson for technical assistance. This work was
supported by grants from the NINDS, NIMH, NSF, DARPA, the Office of Naval
Research, the W.M. Keck Foundation Fund for Discovery in Basic Medical
Research, a Whiteman fellowship (to G.K.), the Gordon Moore Foundation, the
Sloan Foundation, and the Swartz Foundation for Computational Neuroscience.
Author Information Reprints and permissions information is available at
npg.nature.com/reprintsandpermissions. The authors declare no competing
financial interests. Correspondence and requests for materials should be
addressed to R.Q.Q. (rodri@vis.caltech.edu).
NATURE|Vol 435|23 June 2005 LETTERS