Article · PDF available

Speed of Processing in the Human Visual System

Authors:

Abstract

How long does it take for the human visual system to process a complex natural image? Subjectively, recognition of familiar objects and scenes appears to be virtually instantaneous, but measuring this processing time experimentally has proved difficult. Behavioural measures such as reaction times can be used, but these include not only visual processing but also the time required for response execution. However, event-related potentials (ERPs) can sometimes reveal signs of neural processing well before the motor output. Here we use a go/no-go categorization task in which subjects have to decide whether a previously unseen photograph, flashed on for just 20 ms, contains an animal. ERP analysis revealed a frontal negativity specific to no-go trials that develops roughly 150 ms after stimulus onset. We conclude that the visual processing needed to perform this highly demanding task can be achieved in under 150 ms.
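The key measurement here is the time at which the no-go and go ERPs begin to diverge. As a rough illustration of how such a divergence onset could be estimated from trial-averaged waveforms, the Python sketch below applies a simple baseline-based criterion to synthetic data; the array names, noise levels, and the "sustained threshold crossing" rule are illustrative assumptions, not the analysis reported in the paper.

```python
import numpy as np

t = np.arange(-200, 600)  # time axis in ms relative to stimulus onset (1 kHz sampling assumed)

def differential_onset(erp_go, erp_nogo, t, n_consecutive=15, n_sd=2.0):
    """Return the first time (ms) at which the no-go minus go difference
    stays beyond n_sd prestimulus-baseline SDs for n_consecutive samples."""
    diff = erp_nogo - erp_go
    threshold = n_sd * diff[t < 0].std()      # baseline variability of the difference wave
    above = np.abs(diff) > threshold
    run = 0
    for i in np.where(t >= 0)[0]:
        run = run + 1 if above[i] else 0
        if run >= n_consecutive:
            return t[i - n_consecutive + 1]   # onset of the sustained divergence
    return None

# Synthetic demo: trial-averaged frontal ERPs with a no-go negativity
# that ramps up around 150 ms after stimulus onset.
rng = np.random.default_rng(0)
erp_go = rng.normal(0, 4.0, (200, t.size)).mean(axis=0)      # 200 simulated go trials
nogo_effect = -2.0 / (1 + np.exp(-(t - 150) / 10))           # sigmoidal frontal negativity
erp_nogo = rng.normal(0, 4.0, (200, t.size)).mean(axis=0) + nogo_effect
print(differential_onset(erp_go, erp_nogo, t))               # an onset near 150 ms for this synthetic data
```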
... Therefore, let us first define the task of rapidly detecting an animal in a scene (see Figure 1). This task is routinely used in the study of biological vision in the laboratory. In ultrafast image categorization, the task is to report whether a briefly flashed image contains a class of object, such as an animal [3]. The presentation time can be on the order of 20 ms, and the response is, for example, pressing or not pressing a button. ...
... In a future application, we could extract feature maps from these low-level layers to better understand the features needed to perform this task. This would allow us to design a stimulus set for a psychological task such as that of Thorpe et al. [3]. Such a test could help establish whether these features are sufficient to categorize an animal in a flashed scene. ...
... Furthermore, the robustness of the categorization is comparable to that found in psychophysical data. In particular, we have shown quantitatively that the categorization performed by the re-trained networks can be robust to transformations such as rotations, reflections, or grayscale filtering, as is observed in humans [3,7]. ...
Article
Full-text available
Humans are able to categorize images very efficiently, in particular to detect the presence of an animal very quickly. Recently, deep learning algorithms based on convolutional neural networks (CNNs) have achieved higher than human accuracy for a wide range of visual categorization tasks. However, the tasks on which these artificial networks are typically trained and evaluated tend to be highly specialized and do not generalize well, e.g., accuracy drops after image rotation. In this respect, biological visual systems are more flexible and efficient than artificial systems for more general tasks, such as recognizing an animal. To further the comparison between biological and artificial neural networks, we re-trained the standard VGG 16 CNN on two independent tasks that are ecologically relevant to humans: detecting the presence of an animal or an artifact. We show that re-training the network achieves a human-like level of performance, comparable to that reported in psychophysical tasks. In addition, we show that the categorization is better when the outputs of the models are combined. Indeed, animals (e.g., lions) tend to be less present in photographs that contain artifacts (e.g., buildings). Furthermore, these re-trained models were able to reproduce some unexpected behavioral observations from human psychophysics, such as robustness to rotation (e.g., an upside-down or tilted image) or to a grayscale transformation. Finally, we quantified the number of CNN layers required to achieve such performance and showed that good accuracy for ultrafast image categorization can be achieved with only a few layers, challenging the belief that image recognition requires deep sequential analysis of visual objects. We hope to extend this framework to biomimetic deep neural architectures designed for ecological tasks, but also to guide future model-based psychophysical experiments that would deepen our understanding of biological vision.
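As a rough illustration of the kind of re-training described above, the sketch below adapts a pretrained VGG-16 to a binary animal / non-animal decision by freezing the convolutional features and replacing the final classification layer. It is a generic transfer-learning recipe in PyTorch with placeholder hyperparameters and data, not the exact training protocol used in the study.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load VGG-16 pretrained on ImageNet (torchvision >= 0.13) and freeze its convolutional features.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in vgg.features.parameters():
    p.requires_grad = False

# Replace the 1000-way ImageNet head with a binary animal / non-animal head.
vgg.classifier[6] = nn.Linear(4096, 2)

# Illustrative optimizer and loss; learning rate and batch format are placeholders.
optimizer = torch.optim.Adam(vgg.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimisation step: images is a (N, 3, 224, 224) batch of photographs,
    labels holds 0 (no animal) or 1 (animal)."""
    optimizer.zero_grad()
    loss = criterion(vgg(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```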
... To select relevant information, fast and correct perception and categorization are essential (Brosch et al., 2010). The human visual system can process visual stimuli in milliseconds (Thorpe et al., 1996). While extensive research has been conducted on visual perception, the duration necessary to perceive a stimulus with awareness, or to process it without awareness, is still controversial. ...
Article
Full-text available
To investigate subliminal priming effects, different durations for stimulus presentation are applied, ranging from 8 to 30 ms. This study aims to select an optimal presentation span which leads to subconscious processing. Forty healthy participants rated emotional faces (sad, neutral, or happy expression) presented for 8.3 ms, 16.7 ms, and 25 ms. Alongside subjective and objective stimulus awareness, task performance was estimated via hierarchical drift diffusion models. Participants reported stimulus awareness in 65 % of the 25 ms trials, in 36 % of the 16.7 ms trials, and in 2.5 % of the 8.3 ms trials. Emotion-dependent responses were reflected in decreased performance (drift rates, accuracy) during sad trials. The detection rate (probability of making a correct response) was 12.2 % during 8.3 ms trials and slightly above chance level (33.333 % for three response options) during 16.7 ms trials (36.8 %). The experiments suggest a presentation time of 16.7 ms as optimal for subconscious priming: an emotion-specific response was detected at 16.7 ms, while performance indicates subconscious processing.
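The drift diffusion framework mentioned above treats a decision as noisy evidence accumulation toward a boundary, so lower drift rates produce slower and less accurate responses. The sketch below simulates individual drift diffusion trials to make that mapping concrete; it is a plain simulation with illustrative parameters, not the hierarchical Bayesian fitting used in the study.

```python
import numpy as np

def simulate_ddm(drift, boundary=1.0, noise=1.0, dt=0.001, non_decision=0.3,
                 rng=np.random.default_rng(1)):
    """Simulate one drift diffusion trial: evidence starts at 0 and accumulates
    until it hits +boundary (correct) or -boundary (error).
    Returns (response time in s, correct)."""
    x, t = 0.0, 0.0
    while abs(x) < boundary:
        x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return t + non_decision, x > 0

# Lower drift rates (weaker evidence, e.g. the sad trials in the study)
# yield slower and less accurate simulated responses.
for v in (0.5, 2.0):
    trials = [simulate_ddm(v) for _ in range(500)]
    rts, correct = zip(*trials)
    print(f"drift={v}: accuracy={np.mean(correct):.2f}, mean RT={np.mean(rts):.2f} s")
```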
... In addition to Poisson coding, spike frequency-based coding also includes frequency coding based on spike count, spike density, and group activity [50]. First-spike time coding is a common method of precise spike-timing coding [51]. This coding method assumes that each neuron produces only one spike. ...
Article
Full-text available
Spiking neural networks (SNNs) are a new generation of artificial neural networks (ANNs) that are more analogous to the brain, and they have been widely studied in connection with neural computing and brain-like intelligence. An SNN is a sparse, event-driven model with hardware-friendly and energy-saving characteristics, which makes it well suited to hardware implementation and rapid information processing. SNNs are also a powerful means for deep learning (DL) to study brain-like computing. In this paper, the common SNN learning and training methods in the field of image classification are reviewed. In detail, we examine SNN algorithms based on synaptic plasticity, approximate backpropagation (BP), and ANN-to-SNN conversion. This paper comprehensively introduces and tracks the latest progress of SNNs; on this basis, we also analyze and discuss the challenges and opportunities they face. Finally, this paper looks ahead to the future development of SNNs in the aspects of biological mechanisms, network training and design, computing platforms, and interdisciplinary communication. This review can provide a reference for SNN research and promote its application in complex tasks.
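The contrast drawn above between Poisson (rate) coding and first-spike time coding can be made concrete with a small encoding example. The sketch below converts a single pixel intensity into a spike train under each scheme; the time window and maximum firing rate are illustrative choices, not parameters from any of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_encode(intensity, t_window=0.1, max_rate=200.0):
    """Rate coding: emit a Poisson spike train whose rate is proportional
    to the input intensity (0..1) over a t_window-second window."""
    rate = intensity * max_rate
    n_spikes = rng.poisson(rate * t_window)
    return np.sort(rng.uniform(0, t_window, n_spikes))

def first_spike_encode(intensity, t_window=0.1):
    """Latency coding: stronger inputs fire a single spike earlier;
    an intensity of 0 never fires within the window."""
    if intensity <= 0:
        return np.array([])
    return np.array([t_window * (1.0 - intensity)])

pixel = 0.8
print("Poisson spike times:", poisson_encode(pixel))
print("First-spike time:", first_spike_encode(pixel))
```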
... Another possibility is that recognizing and processing a single object is too fast to show any difference in our current design, in which the short encoding condition was 200 ms and only one object was shown at a time. As known from previous research, observers can process an object within extremely brief viewing durations: people can recognize most objects or scenes with an image viewing duration of just ~150 ms (e.g., Potter, 1976; Fei-Fei et al., 2007; Thorpe et al., 1996; Green & Oliva, 2009), and some recognition is possible with even shorter exposures (Potter et al., 2014). Since our paradigm used set size one, our manipulation of different encoding times might not have been sufficient to find reliable differences in the incidental memory benefit for meaningful objects. ...
Preprint
Full-text available
Prior research has shown that visual working memory capacity is enhanced for meaningful stimuli (i.e., real-world objects) compared to abstract shapes (i.e., colored circles). Furthermore, a simple feature that is part of a real-world object is better remembered than the same feature presented on an unrecognizable shape, suggesting that meaningful objects can serve as an effective scaffold in memory. Here, we hypothesized that the shape of meaningful objects would be better remembered incidentally than the shape of non-meaningful objects in a color memory task where identity itself is task-irrelevant. We used a surprise-trial paradigm in which participants performed a color memory task for several trials before being probed with a surprise trial that asked them about the shape of the last object they saw. Across three experiments, we found a memory advantage for recognizable shapes relative to scrambled and unrecognizable versions of these shapes (Exp. 1) that was robust across different encoding times (Exp. 2), and the addition of a verbal suppression task (Exp. 3). In contrast, when we asked about the location of objects in a surprise trial, we did not observe any difference between the two stimulus types (Exp. 4). These results show that identifying information about a meaningful object is encoded into working memory despite being task-irrelevant. This privilege for meaningful shape information does not exhibit a trade-off with location memory, suggesting that meaningful identity influences representations of visual working memory in higher-level visual regions without altering the use of spatial reference frames at the lower level.
Article
We describe an integrative model that encodes associations between related concepts in the human hippocampal formation, constituting the skeleton of episodic memories. The model, based on partially overlapping assemblies of "concept cells," contrasts markedly with the well-established notion of pattern separation, which relies on conjunctive, context-dependent single-neuron responses rather than the invariant, context-independent responses found in the human hippocampus. We argue that the model of partially overlapping assemblies is better suited to cope with memory capacity limitations; that the finding of different types of neurons and functions in this area reflects a flexible and temporary use of the extraordinary machinery of the hippocampus to deal with the task at hand; and that only information that is relevant and frequently revisited will consolidate into long-term hippocampal representations, using partially overlapping assemblies. Finally, we propose that concept cells are uniquely human and that they may constitute the neuronal underpinnings of cognitive abilities that are much further developed in humans compared to other species.
Preprint
Full-text available
Transforming sensory inputs into meaningful neural representations is critical to adaptive behaviour in everyday environments. While non-invasive neuroimaging methods are the de facto approach for investigating neural representations, they remain expensive, not widely available, time-consuming, and restrictive in terms of the experimental conditions and participant populations with which they can be used. Here we show that movement trajectories collected in online behavioural experiments can be used to measure the emergence and dynamics of neural representations with fine temporal resolution. By combining online computer mouse-tracking with publicly available neuroimaging (MEG and fMRI) data via Representational Similarity Analysis (RSA), we show that movement trajectories track the evolution of visual representations over time. We used a time-constrained face/object categorization task on a previously published set of images containing human faces, illusory faces, and objects to demonstrate that time-resolved representational structures derived from movement trajectories correlate with those derived from MEG, revealing the unfolding of category representations in comparable temporal detail (albeit delayed) to MEG. Furthermore, we show that movement-derived representational structures correlate with those derived from fMRI in most task-relevant brain areas (face- and object-selective areas in this proof of concept). Our results highlight the richness of movement trajectories and the power of the RSA framework to reveal and compare their information content, opening new avenues to better understand human perception.
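The core RSA step described above compares representational geometries across measurement modalities. The sketch below builds representational dissimilarity matrices (RDMs) from hypothetical trajectory and MEG data and correlates them with Spearman's rho, following standard RSA practice; the array shapes and variable names are placeholders rather than the preprint's actual pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(patterns):
    """Representational dissimilarity matrix (condensed form): pairwise
    correlation distance between condition patterns, where patterns has
    shape (n_conditions, n_features)."""
    return pdist(patterns, metric="correlation")

# Hypothetical data: 96 stimulus conditions (faces, illusory faces, objects).
rng = np.random.default_rng(0)
mouse_xy = rng.normal(size=(96, 120))   # e.g. flattened x/y trajectory samples per condition
meg_t = rng.normal(size=(96, 306))      # e.g. MEG sensor pattern at one time point

# Second-order comparison: how similar are the two representational geometries?
rho, p = spearmanr(rdm(mouse_xy), rdm(meg_t))
print(f"trajectory-MEG RSA correlation: rho={rho:.2f} (p={p:.3f})")
```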
Preprint
Full-text available
Our brain builds increasingly sophisticated representations along the ventral visual pathway to support object recognition, but how these representations unfold over time is poorly understood. Here we characterized time-varying representations of faces, places, and objects throughout the pathway using human intracranial electroencephalography. For ~100 ms after an initial feedforward sweep, representations at all stages evolved to be less driven by low-order features and more categorical. Low-level areas like V1 showed unexpected, late-emerging tolerance to image size, and late but not early responses of high-level occipitotemporal areas best matched their fMRI responses. Besides aligned, simultaneous representational changes, we found a trial-by-trial association between concurrent response patterns across stages. Our results suggest fast, multi-areal recurrent processing builds upon initial feedforward processing to generate more sophisticated object representations.
Article
Simply by opening our eyes, we can immediately recognize scenes and objects in the natural environment and enjoy the subjective experience of a world filled with a variety of shapes, colors, and textures. Vision research in the 20th century investigated the basic mechanisms of vision using artificial stimuli such as lines, gratings, and computer-generated objects, and proposed the theory that the brain reconstructs the three-dimensional world from two-dimensional retinal images in order to recognize scenes and objects. However, this theory could not explain the visual perception of complex scenes and objects in the real world, such as paths in a forest or a soft bouquet of flowers on a table. What information processing does the brain use to produce such a rich and realistic "visual world"? The present paper reviews recent advances in basic vision research that provide fresh insight into how the human brain, with its limited processing capacity, is able to perceive natural scenes and objects from retinal images.
Article
A graphical abstract (GA) is a pictorial presentation of a manuscript's key findings that helps readers and viewers understand its highlights. The adoption of GAs in a journal improves the manuscript's visibility and the journal's citations in subsequent years. Authors and editorial teams face various challenges in incorporating GAs into a journal. Infographics play a major role as a resource for creating an effective GA. GAs help promote and propagate scientific research among scholarly peers in the community, supporting inter-country collaborations through various social media.
Article
Full-text available
The principles of visual perception in video watching are crucial to ensuring that video content is accurately and effectively grasped by the audience. This article investigates the efficiency of human visual perception of video clips as a function of exposure duration. The study focused on the correlation between video shot duration and the subject's perception of visual content. The subjects' performance was captured as perceptual scores on the test videos, obtained by watching time-regulated clips and completing a questionnaire. The statistical results show that a three-second duration for each video shot is necessary for the audience to grasp the main visual information. The data also indicate gender differences in perceptual procedure and attention focus. The findings can help in manipulating clip length in video editing, both with AI tools and manually, to maintain perceptual efficiency within a limited duration. The method is notable for its structured experiment based on subjects' quantified performance, in contrast to less accountable AI-based methods.
Article
Full-text available
In this report we discuss a variety of psychophysical experiments that explore different aspects of the problem of object recognition and representation in human vision. In all experiments, subjects were presented with realistically rendered images of computer-generated 3D objects, with tight control over stimulus shape, surface properties, illumination, and viewpoint, as well as subjects' prior exposure to the stimulus objects. Contrary to the predictions of the paradigmatic theory of recognition, which holds that object representations are viewpoint invariant, performance in all experiments was consistently viewpoint dependent, was only partially aided by binocular stereo and other depth information, was specific to viewpoints that were familiar, and was systematically disrupted by rotation in depth more than by deforming the 2D images of the stimuli. The emerging concept of multiple-views representation supported by these results is consistent with recently advanced computational theories of recognition based on view interpolation. Moreover, in several simulated experiments employing the same stimuli used in experiments with human subjects, models based on multiple-views representations replicated many of the psychophysical results concerning the observed pattern of human performance.
Article
The effects on event-related potentials (ERPs) of within- and across-modality repetition of words and nonwords were investigated. In Experiment 1, subjects detected occasional animal names embedded in a series of words. All items were equally likely to be presented auditorily or visually. Some words were repetitions, either within- or across-modality, of words presented six items previously. Visual-visual repetition evoked a sustained positive shift, which onset around 250 msec and comprised two topographically and temporally distinct components. Auditory-visual repetition modulated only the later of these two components. For auditory ERPs, within- and across-modality repetition evoked effects with similar onset latencies. The within-modality effect was initially the larger, but only at posterior sites. In Experiment 2, critical items were auditory and visual nonwords, and target items were auditory words and visual pseudohomophones. Visual-visual nonword repetition effects onset around 450 msec and demonstrated a more anterior scalp distribution than those evoked by auditory-visual repetition. Visual-auditory repetition evoked only a small, late-onsetting effect, whereas auditory-auditory repetition evoked an effect that, at parietal sites only, was almost equivalent to that from the analogous condition of Experiment 1. These findings indicate that, as indexed by ERPs, repetition effects both within- and across-modality are influenced by lexical status. Possible parallels with the effects of word and nonword repetition on behavioral variables are discussed.
Article
When a sequence of pictures is presented at the rapid rate of 113 msec/picture, a viewer can detect a verbally specified target more than 60% of the time. In the present experiment, sequences of pictures were presented to 96 undergraduates at rates of 258, 172, and 114 msec/picture. A target was specified by name, superordinate category, or "negative" category (e.g., "the picture that is not of food"). Although the probability of detection decreased as cue specificity decreased, even in the most difficult condition (negative category cue at 114 msec/picture) 35% of the targets were detected. When the scores from the 3 detection tasks were compared with a control group's immediate recognition memory for the targets, immediate recognition memory was invariably lower than detection. Results are consistent with the hypothesis that rapidly presented pictures may be momentarily understood at the time of viewing and then quickly forgotten.
Article
Evoked potentials (EPs) were used to help identify the timing, location, and intensity of the information-processing stages applied to faces and words in humans. EP generators were localized using intracranial recordings in 33 patients with depth electrodes implanted in order to direct surgical treatment of drug-resistant epilepsy. While awaiting spontaneous seizure onset, the patients gave their fully informed consent to perform cognitive tasks. Depth recordings were obtained from 1198 sites in the occipital, temporal and parietal cortices, and in the limbic system (amygdala, hippocampal formation and posterior cingulate gyrus). Twenty-three patients received a declarative memory recognition task in which faces of previously unfamiliar young adults without verbalizable distinguishing features were exposed for 300 ms every 3 s; 25 patients received an analogous task using words. For component identification, some patients also received simple auditory (21 patients) or visual (12 patients) discrimination tasks. Eight successive EP stages preceding the behavioral response (at about 600 ms) could be distinguished by latency, and each of 14 anatomical structures was found to participate in 2–8 of these stages. The earliest response, an N75-P105, focal in the most medial and posterior of the leads implanted in the occipital lobe (lingual g), was probably generated in visual cortical areas 17 and 18. These components were not visible in response to words, presumably because words were presented foveally. A focal evoked alpha rhythm to both words and faces was also noted in the lingual g. This was followed by an N130-P180-N240 focal and polarity-inverting in the basal occipitotemporal cortex (fusiform g, probably areas 19 and 37). In most cases, the P180 was evoked only by faces, and not by words, letters or symbols. Although largest in the fusiform g this sequence of potentials (especially the N240) was also observed in the supramarginal g, posterior superior and middle temporal g, posterior cingulate g, and posterior hippocampal formation. The N130, but not later components of this complex, was observed in the anterior hippocampus and amygdala. Faces only also evoked longer-latency potentials up to 600 ms in the right fusiform g. Words only evoked a series of potentials beginning at 190 ms and extending to 600 ms in the fusiform g and near the angular g (especially left). Both words and faces evoked a N150-P200-PN260 in the lingual g, and posterior inferior and middle temporal g. A N310-N430-P630 sequence to words and faces was largest and polarity-inverted in the hippocampal formation and amygdala, but was also probably locally-generated in many sites including the lingual g, lateral occipitotemporal cortex, middle and superior temporal g, temporal pole, supramarginal g, and posterior cingulate g. The P660 had the same distribution as has been noted for the P3b to rare target simple auditory and visual stimuli in ‘oddball’ tasks, with inversions in the hippocampus. In several sites, the N310 and N430 were smaller to repeated faces, and the P630 was larger. Putative information-processing functions were tentatively assigned to successive EP components based upon their cognitive correlates, as well as the functions and connections of their generating structures. For the N75-P105, this putative function is simple feature detection in primary visual cortex (V1 and V2). 
The N130-P180-N240 may embody structural face encoding in posterobasal inferotemporal cortex (homologous to V4?), with the results being spread widely to inferotemporal, multimodal and paralimbic cortices. For words, similar visual-form encoding (in fusiform g) or visual-phonemic encoding (in angular g) may occur between 150 and 280 ms. During the N310, faces and words may be multiply encoded for form and identity (inferotemporal), emotional (amygdala), recent declarative mnestic (hippocampal formation), and semantic (supramarginal and superior temporal sulcal supramodal cortices) characteristics. These multiple characteristics may be contextually integrated across inferotemporal, supramodal association, and limbic cortices during the N430, with cognitive closure following in the P630. In sum, visual information arrives at area 17 by about 75 ms, and is structurally-encoded in occipito-temporal cortex during the next 110 ms. By 150–200 ms after stimulus onset, activation has spread to parietal, lateral temporal, and limbic cortices, all of which continue to participate with the more posterior areas for the next 500 ms of event-encoding. Thus, face and word processing is serial in the sense that it can be divided into successive temporal stages, but highly parallel in that (after the initial stages where visual primitives are extracted) multiple anatomical areas with distinct perceptual, mnestic and emotional functions are engaged simultaneously. Consequently, declarative memory and emotional encoding can participate in early stages of perceptual, as well as later stages of cognitive integration. Conversely, occipitotemporal cortex is involved both early in processing (immediately after V1), as well as later, in the N430. That is, most stages of face and word processing appear to take advantage of the rich ‘upstream’ and ‘downstream’ anatomical connections in the ventral visual processing stream to link the more strictly perceptual networks with semantic, emotional, and mnestic networks.
Article
When an object such as a chair is presented visually, or is represented by a line drawing, a spoken word, or a written word, the initial stages in the process leading to understanding are clearly different in each case. There is disagreement, however, about whether those early stages lead to a common abstract representation in memory, the idea of a chair [1-4], or to two separate representations, one verbal (common to spoken and written words), and the other image-like [5]. The first view claims that words and images are associated with ideas, but the underlying representation of an idea is abstract. According to the second view, the verbal representation alone is directly associated with abstract information about an object (for example, its superordinate category: furniture). Concrete perceptual information (for example, characteristic shape, colour or size) is associated with the imaginal representation. Translation from one representation to the other takes time, on the second view, which accounts for the observation that naming a line drawing takes longer than naming (reading aloud) a written word [6,7]. Here we confirm that naming a drawing of an object takes much longer than reading its name, but we show that deciding whether the object is in a given category such as 'furniture' takes slightly less time for a drawing than for a word, a result that seems to be inconsistent with the second view.
Article
Event-related potentials (ERPs) were recorded from one midline and three pairs of lateral electrodes while subjects determined whether pairs of sequentially presented pictures were semantically associated. The ERPs evoked by the second picture of each pair differed as a consequence of whether it was associated with its predecessor, such that ERPs to nonassociated pictures were more negative-going than those to associated items. These differences resulted from the modulation of two ERP components, one frontally distributed and centered on an N300 deflection, the other distributed more widely over the scalp and encompassing an N450 deflection. The modulation of N450 is interpreted as further evidence that the "N400" ERP component is sensitive to semantic relationships between nonverbal stimuli. The earlier N300 effects, which do not appear to occur when ERPs are evoked by semantically primed and unprimed words, could suggest that the semantic processing of pictorial stimuli involves neural systems different from those associated with the semantic processing of words.
Article
Event-related potentials (ERPs) were recorded from one midline and three pairs of lateral electrodes while subjects determined whether a pair of sequentially presented pictures had rhyming or nonrhyming names. During the 1.56-sec interval between the two pictures, the slow ERP wave recorded over the left hemisphere was more negative-going than that over the right, especially at frontal electrodes. The ERPs evoked by the second picture differed as a function of whether its name rhymed with its predecessor. This difference, taking the form of increased negativity in ERPs to nonrhyming items, had an earlier onset and a greater magnitude at right than at left hemisphere electrodes. This pattern of ERP asymmetries is qualitatively similar to that found when words are rhyme-matched. It is therefore concluded that such asymmetries do not depend on the employment of orthographic material and may reflect some aspect(s) of the phonological processing of visually presented material.
Article
Event-related brain potentials (ERPs) were recorded while subjects determined whether two sequentially presented famous faces depicted individuals belonging to the same or to different occupational categories. During the 1.56 sec interval between the onset of the faces, ERPs recorded from right hemisphere electrodes were more negative-going than those from electrodes over the left hemisphere. The ERPs evoked by the second face on each trial differed as a consequence of whether or not the person depicted belonged to the occupational category specified by the first face. This difference took the form of a bilaterally-distributed negative-going shift in the ERPs evoked by non-matching as opposed to matching faces. This negativity was maximal around 450 msec post-stimulus. The ERP asymmetries during the inter-stimulus interval are interpreted as evidence for the engagement of cognitive processes lateralized to the right hemisphere. The match/non-match differences are considered to reflect the modulation of an "N400" component similar to that evoked by words, and thus suggest that such components can be modulated by associative priming between non-linguistic stimuli.
Article
Event-related potentials (ERPs) were recorded from one midline and three pairs of lateral sites while subjects made same/different judgements on sequentially presented pairs of familiar or unfamiliar faces. During the interval between the first and second face, a slow wave was more negative-going over the right than the left hemisphere, particularly when the faces were familiar. Following the second face, two regions of the waveforms were more negative-going when this face did not match the identity of its predecessor. In the early region (less than 160 msec), this effect was confined to posterior electrode sites and familiar faces. In the later region (greater than 250 msec), the match/non-match effect was widespread across the scalp and was evident for both familiar and non-familiar faces, although in the latency range 350-450 msec (encompassing the "N400" component), it was greater in magnitude in the case of familiar stimuli. It is suggested that the slow wave asymmetries reflect the engagement of short-term memory mechanisms lateralized to the right hemisphere. The match/non-match differences are thought to reflect multiple processes, including the modulation of the "N400" component. The sensitivity of this component to the familiarity manipulation is consistent with the hypothesis that the amplitude of N400 reflects an item's compatibility with currently activated memory representations.
Article
This report describes the main features of a view-based model of object recognition. The model does not attempt to account for specific cortical structures; it tries to capture general properties to be expected in a biological architecture for object recognition. The basic module is a regularization network (RBF-like; see Poggio and Girosi, 1989; Poggio, 1990) in which each of the hidden units is broadly tuned to a specific view of the object to be recognized. The network output, which may be largely view independent, is first described in terms of some simple simulations. The following refinements and details of the basic module are then discussed: (1) some of the units may represent only components of views of the object--the optimal stimulus for the unit, its "center," is effectively a complex feature; (2) the units' properties are consistent with the usual description of cortical neurons as tuned to multidimensional optimal stimuli and may be realized in terms of plausible biophysical mechanisms; (3) in learning to recognize new objects, preexisting centers may be used and modified, but also new centers may be created incrementally so as to provide maximal view invariance; (4) modules are part of a hierarchical structure--the output of a network may be used as one of the inputs to another, in this way synthesizing increasingly complex features and templates; (5) in several recognition tasks, in particular at the basic level, a single center using view-invariant features may be sufficient.
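To make the basic module concrete, the sketch below implements a toy version of the view-based scheme described above: Gaussian RBF hidden units tuned to stored views of an object, pooled into a largely view-independent output. The feature vectors, tuning width, and pooling weights are illustrative assumptions, not parameters from the model itself.

```python
import numpy as np

class ViewBasedRBF:
    """Minimal view-based recognition module: each hidden unit is a Gaussian
    RBF centred on one stored view of the object; the output pools the
    units so recognition becomes largely view independent."""

    def __init__(self, stored_views, sigma=1.0):
        self.centers = np.asarray(stored_views)   # (n_views, n_features)
        self.sigma = sigma
        self.weights = np.ones(len(self.centers)) / len(self.centers)

    def hidden(self, view):
        d2 = np.sum((self.centers - view) ** 2, axis=1)
        return np.exp(-d2 / (2 * self.sigma ** 2))   # view-tuned activations

    def output(self, view):
        return self.weights @ self.hidden(view)      # pooled, roughly view-invariant response

# Toy example: three stored views of an object as 2-D feature vectors;
# a novel intermediate view still drives the pooled output strongly.
views = [[0.0, 1.0], [0.5, 0.9], [1.0, 0.8]]
net = ViewBasedRBF(views, sigma=0.6)
print(net.output(np.array([0.25, 0.95])))   # close to the stored views: high output
print(net.output(np.array([3.0, -2.0])))    # far from all stored views: near zero
```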