Article

Novel scene understanding, from gist to elaboration


... However, in many cases, the similarity is not the object identity: a variety of circular objects appear in one cluster, athletes or dogs on a grass background in another, and a variety of sea creatures on a blue background in another. What connects images within the same cluster is "the gist of the scene" [28,29,30,31,32], ...
Preprint
Full-text available
Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine a UNet trained for denoising on the ImageNet dataset, to better understand its internal representation and computation of the score. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. We develop a novel algorithm for stochastic reconstruction of images from this representation and demonstrate that it recovers a sample from a set of images defined by a target image representation. We then study the properties of the representation and demonstrate that Euclidean distances in the latent space correspond to distances between conditional densities induced by representations as well as semantic similarities in the image space. Applying a clustering algorithm in the representation space yields groups of images that share both fine details (e.g., specialized features, textured regions, small objects), as well as global structure, but are only partially aligned with object identities. Thus, we show for the first time that a network trained solely on denoising contains a rich and accessible sparse representation of images.
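As a rough illustration of the representation described above (a sketch, not the authors' code), the following Python fragment forms the vector of spatial channel averages from a UNet middle block and compares two images by Euclidean distance in that space; the tensor shapes and the random activations standing in for real network outputs are assumptions made for the example.

```python
# Hypothetical sketch: channel-average representation from UNet middle-block
# activations, and Euclidean distance between two such representations.
import torch

def channel_average_representation(middle_feats: torch.Tensor) -> torch.Tensor:
    """middle_feats: (batch, channels, height, width) activations from the
    UNet's middle block; returns a (batch, channels) vector per image given
    by the spatial average of each channel."""
    return middle_feats.mean(dim=(2, 3))

def representation_distance(rep_a: torch.Tensor, rep_b: torch.Tensor) -> torch.Tensor:
    """Euclidean distance between two representation vectors."""
    return torch.linalg.vector_norm(rep_a - rep_b, dim=-1)

# Random activations stand in for real middle-block outputs of two images.
feats = torch.randn(2, 512, 8, 8)
reps = channel_average_representation(feats)
print(representation_distance(reps[0], reps[1]))
```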
Article
Experiments on visually grounded, definite reference production often manipulate simple visual scenes in the form of grids filled with objects, for example, to test how speakers are affected by the number of objects that are visible. Regarding the latter, it was found that speech onset times increase along with domain size, at least when speakers refer to nonsalient target objects that do not pop out of the visual domain. This finding suggests that even in the case of many distractors, speakers perform object‐by‐object scans of the visual scene. The current study investigates whether this systematic processing strategy can be explained by the simplified nature of the scenes that were used, and if different strategies can be identified for photo‐realistic visual scenes. In doing so, we conducted a preregistered experiment that manipulated domain size and saturation; replicated the measures of speech onset times; and recorded eye movements to measure speakers’ viewing strategies more directly. Using controlled photo‐realistic scenes, we find (1) that speech onset times increase linearly as more distractors are present; (2) that larger domains elicit relatively fewer fixation switches back and forth between the target and its distractors, mainly before speech onset; and (3) that speakers fixate the target relatively less often in larger domains, mainly after speech onset. We conclude that careful object‐by‐object scans remain the dominant strategy in our photo‐realistic scenes, to a limited extent combined with low‐level saliency mechanisms. A relevant direction for future research would be to employ less controlled photo‐realistic stimuli that do allow for interpretation based on context.
Article
Full-text available
Used 3 converging procedures to determine whether pictures presented in a rapid sequence at rates comparable to eye fixations are understood and then quickly forgotten. In 2 experiments, with 96 and 16 college students, respectively, sequences of 16 color photographs were presented at rates of 113, 167, 250, or 333 msec/picture. In 1 group, Ss were given an immediate test of recognition memory for the pictures and in other groups they searched for a target picture. Even when the target had only been specified by a title (e.g., a boat), detection of a target was strikingly superior to recognition memory. Detection was slightly but significantly better for pictured than named targets. In Exp III, with 8 college students, pictures were presented for 50, 70, 90, or 120 msec preceded and followed by a visual mask; at 120 msec recognition memory was as accurate as detection had been. Results, taken together with those of M. C. Potter and E. I. Levy for slower rates of sequential presentation, suggest that on the average a scene is understood and so becomes immune to ordinary visual masking within about 100 msec but requires about 300 msec of further processing before the memory representation is resistant to conceptual masking from a following picture. Possible functions of a short-term conceptual memory (e.g., the control of eye fixations) are discussed. (25 ref)
Article
Full-text available
When we study the human ability to attend, what exactly do we seek to understand? It is not clear what the answer might be to this question. There is still so much to know, while acknowledging the tremendous progress of past decades of research. It is as if each new study adds a tile to the mosaic that, when viewed from a distance, we hope will reveal the big picture of attention. However, there is no map as to how each tile might be placed nor any guide as to what the overall picture might be. It is like digging up bits of mosaic tile at an ancient archeological site with no key as to where to look and then not only having to decide which picture it belongs to but also where exactly in that puzzle it should be placed. I argue that, although the unearthing of puzzle pieces is very important, so is their placement, but this seems much less emphasized. We have mostly unearthed a treasure trove of puzzle pieces but they are all waiting for cleaning and reassembly. It is an activity that is scientifically far riskier, but with great risk comes a greater reward. Here, I will look into two areas of broad agreement, specifically regarding visual attention, and dig deeper into their more nuanced meanings, in the hope of sketching a starting point for the guide to the attention mosaic. The goal is to situate visual attention as a purely computational problem and not as a data explanation task; it may become easier to place the puzzle pieces once you understand why they exist in the first place.
Article
Full-text available
Understanding consciousness is a major frontier in the natural sciences. However, given the nuanced and ambiguous sets of conditions regarding how and when consciousness appears to manifest, it is also one of the most elusive topics for investigation. In this context, we argue that research in empirical aesthetics—specifically on the experience of art—holds strong potential for this research area. We suggest that empirical aesthetics of art provides a more exhaustive description of conscious perception than standard laboratory studies or investigations of the less artificial, more ecological perceptual conditions that dominate this research, leading to novel and better suited designs for natural science research on consciousness. Specifically, we discuss whether empirical aesthetics of art could be used for a more adequate picture of an observer’s attributions in the context of conscious perception. We point out that attributions in the course of conscious perception to (distal) objects versus to media (proximal objects) as origins of the contents of consciousness are typically swift and automatic. However, unconventional or novel object-media relations used in art can bring these attributions to the foreground of the observer’s conscious reflection. This is the reason that art may be ideally suited to study human attributions in conscious perception compared to protocols dedicated only to the most common and conventional perceptual abilities observed under standard laboratory or “natural”/ecological conditions alone. We also conclude that art provides an enormous stock of such unconventional and novel object-media relations, allowing more systematic falsification of tentative conclusions about conscious perception versus research protocols covering more conventional (ecological) perception only. We end with an outline of how this research could be carried out in general.
Article
Full-text available
Past research suggests recognizing scene gist, a viewer’s holistic semantic representation of a scene acquired within a single eye fixation, involves purely feed-forward mechanisms. We investigated if expectations can influence scene categorization. To do this, we embedded target scenes in more ecologically valid, first-person viewpoint, image sequences, along spatiotemporally connected routes (e.g., an office to a parking lot). We manipulated the sequences’ spatiotemporal coherence by presenting them either coherently or in random order. Participants identified the category of 1 target scene in a 10-scene rapid serial visual presentation sequence. Categorization accuracy was greater for targets in coherent sequences. Accuracy was also greater for targets with more visually similar primes. In Experiment 2, we investigated whether targets in coherent sequences were more predictable and whether predictable images were identified more accurately in Experiment 1 when accounting for the effect of prime-to-target visual similarity. To do this, we removed targets and had participants predict the category of the missing scene. Images were more accurately predicted in coherent sequences, and both image predictability and prime-to-target visual similarity independently contributed to performance in Experiment 1. To test whether prediction-based facilitation effects were solely due to response bias, participants performed a two-alternative forced-choice task in which they indicated whether the target was an intact or a phase-randomized scene. Critically, predictability of the target category was irrelevant to this task. Nevertheless, results showed sensitivity, but not response bias, was greater for targets in coherent sequences. Predictions made prior to viewing a scene facilitate scene gist recognition.
Article
Full-text available
Decades of reading research have led to sophisticated accounts of single-word recognition and, in parallel, accounts of eye-movement control in text reading. Although these two endeavors have strongly advanced the field, their relative independence has precluded an integrated account of the reading process. To bridge the gap, we here present a computational model of reading, OB1-reader, which integrates insights from both literatures. Key features of OB1 are as follows: (1) parallel processing of multiple words, modulated by an attentional window of adaptable size; (2) coding of input through a layer of open bigram nodes that represent pairs of letters and their relative position; (3) activation of word representations based on constituent bigram activity, competition with other word representations and contextual predictability; (4) mapping of activated words onto a spatiotopic sentence-level representation to keep track of word order; and (5) saccade planning, with the saccade goal being dependent on the length and activation of surrounding word units, and the saccade onset being influenced by word recognition. A comparison of simulation results with experimental data shows that the model provides a fruitful and parsimonious theoretical framework for understanding reading behavior.
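To make the open-bigram coding in feature (2) concrete, here is a minimal Python sketch of the general idea (ordered letter pairs activated by a word); whether and how a maximum letter gap is imposed varies across open-bigram models, so the max_gap parameter is an assumption of this illustration rather than a claim about OB1-reader's exact scheme.

```python
# Minimal illustration of open-bigram coding: a word activates nodes for
# ordered letter pairs. The max_gap limit is an assumption of this sketch.
from itertools import combinations
from typing import Optional

def open_bigrams(word: str, max_gap: Optional[int] = None) -> set:
    """Return the ordered letter pairs (open bigrams) contained in `word`.
    If max_gap is given, keep only pairs with at most that many intervening
    letters."""
    pairs = set()
    for i, j in combinations(range(len(word)), 2):
        if max_gap is None or (j - i - 1) <= max_gap:
            pairs.add(word[i] + word[j])
    return pairs

print(open_bigrams("word"))             # {'wo', 'wr', 'wd', 'or', 'od', 'rd'}
print(open_bigrams("word", max_gap=1))  # drops 'wd', which spans two letters
```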
Article
Full-text available
Traditionally, recognizing the objects within a scene has been treated as a prerequisite to recognizing the scene itself. However, research now suggests that the ability to rapidly recognize visual scenes could be supported by global properties of the scene itself rather than the objects within the scene. Here, we argue for a particular instantiation of this view: That scenes are recognized by treating them as a global texture and processing the pattern of orientations and spatial frequencies across different areas of the scene without recognizing any objects. To test this model, we asked whether there is a link between how proficient individuals are at rapid scene perception and how proficiently they represent simple spatial patterns of orientation information (global ensemble texture). We find a significant and selective correlation between these tasks, suggesting a link between scene perception and spatial ensemble tasks but not nonspatial summary statistics. In a second and third experiment, we additionally show that global ensemble texture information is not only associated with scene recognition, but that preserving only global ensemble texture information from scenes is sufficient to support rapid scene perception; however, preserving the same information is not sufficient for object recognition. Thus, global ensemble texture alone is sufficient to allow activation of scene representations but not object representations. Together, these results provide evidence for a view of scene recognition based on global ensemble texture rather than a view based purely on objects or on nonspatially localized global properties.
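As a loose illustration of what a "global ensemble texture"-style descriptor might look like (this is not the authors' stimulus-generation or analysis procedure), the sketch below summarizes an image as a coarse grid of gradient-orientation histograms, preserving the spatial pattern of orientation energy while discarding object-level detail; the grid size and bin count are arbitrary choices.

```python
# Illustrative descriptor: a coarse grid of magnitude-weighted orientation
# histograms, capturing the spatial layout of orientation energy only.
import numpy as np

def orientation_grid(image: np.ndarray, grid: int = 4, bins: int = 8) -> np.ndarray:
    """image: 2-D grayscale array. Returns a (grid, grid, bins) array with one
    orientation histogram per spatial cell."""
    gy, gx = np.gradient(image.astype(float))
    angle = np.mod(np.arctan2(gy, gx), np.pi)   # orientation in [0, pi)
    mag = np.hypot(gx, gy)
    h, w = image.shape
    out = np.zeros((grid, grid, bins))
    for i in range(grid):
        for j in range(grid):
            rows = slice(i * h // grid, (i + 1) * h // grid)
            cols = slice(j * w // grid, (j + 1) * w // grid)
            out[i, j], _ = np.histogram(angle[rows, cols], bins=bins,
                                        range=(0, np.pi),
                                        weights=mag[rows, cols])
    return out

descriptor = orientation_grid(np.random.rand(128, 128))
print(descriptor.shape)   # (4, 4, 8): spatial layout of orientation energy
```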
Article
Full-text available
What determines what we see? In contrast to the traditional “modular” understanding of perception, according to which visual processing is encapsulated from higher-level cognition, a tidal wave of recent research alleges that states such as beliefs, desires, emotions, motivations, intentions, and linguistic representations exert direct top-down influences on what we see. There is a growing consensus that such effects are ubiquitous, and that the distinction between perception and cognition may itself be unsustainable. We argue otherwise: none of these hundreds of studies — either individually or collectively — provide compelling evidence for true top-down effects on perception, or “cognitive penetrability”. In particular, and despite their variety, we suggest that these studies all fall prey to only a handful of pitfalls. And whereas abstract theoretical challenges have failed to resolve this debate in the past, our presentation of these pitfalls is empirically anchored: in each case, we show not only how certain studies could be susceptible to the pitfall (in principle), but how several alleged top-down effects actually are explained by the pitfall (in practice). Moreover, these pitfalls are perfectly general, with each applying to dozens of other top-down effects. We conclude by extracting the lessons provided by these pitfalls into a checklist that future work could use to convincingly demonstrate top-down effects on visual perception. The discovery of substantive top-down effects of cognition on perception would revolutionize our understanding of how the mind is organized; but without addressing these pitfalls, no such empirical report will license such exciting conclusions.
Article
Full-text available
Object-substitution masking (OSM) is a unique paradigm for the examination of object updating processes. However, existing models of OSM are underspecified with respect to the impact of object updating on the quality of target representations. Using two paradigms of OSM combined with a mixture model analysis we examine the impact of post-perceptual processes on a target's representational quality within conscious awareness. We conclude that object updating processes responsible for OSM cause degradation in the precision of object representations. These findings contribute to a growing body of research advocating for the application of mixture model analysis to the study of how cognitive processes impact the quality (i.e., precision) of object representations.
Article
Full-text available
Although we are able to rapidly understand novel scene images, little is known about the mechanisms that support this ability. Theories of optimal coding assert that prior visual experience can be used to ease the computational burden of visual processing. A consequence of this idea is that more probable visual inputs should be facilitated relative to more unlikely stimuli. In three experiments, we compared the perceptions of highly improbable real-world scenes (e.g., an underwater press conference) with common images matched for visual and semantic features. Although the two groups of images could not be distinguished by their low-level visual features, we found profound deficits related to the improbable images: Observers wrote poorer descriptions of these images (Exp. 1), had difficulties classifying the images as unusual (Exp. 2), and even had lower sensitivity to detect these images in noise than to detect their more probable counterparts (Exp. 3). Taken together, these results place a limit on our abilities for rapid scene perception and suggest that perception is facilitated by prior visual experience.
Article
Full-text available
How does scene complexity influence the detection of expected and appropriate objects within the scene? Traffic research has indicated that vulnerable road users (VRUs: pedestrians, bicyclists, and motorcyclists) are sometimes not perceived, despite being expected. Models of scene perception emphasize competition for limited neural resources in early perception, predicting that an object can be missed during quick glances because other objects win the competition to be individuated and consciously perceived. We used pictures of traffic scenes and manipulated complexity by inserting or removing vehicles near a to-be-detected VRU (crowding). The observers' sole task was to detect a VRU in the laterally presented pictures. Strong bias effects occurred, especially when the VRU was crowded by other nearby vehicles: Observers failed to detect the VRU (high miss rates), while making relatively few false alarm errors. Miss rates were as high as 65% for pedestrians. The results indicated that scene context can interfere with the perception of expected objects when scene complexity is high. Because urbanization has greatly increased scene complexity, these results have important implications for public safety.
Article
Full-text available
The role of target typicality in a categorical visual search task was investigated by cueing observers with a target name, followed by a five-item target present/absent search array in which the target images were rated in a pretest to be high, medium, or low in typicality with respect to the basic-level target cue. Contrary to previous work, we found that search guidance was better for high-typicality targets compared to low-typicality targets, as measured by both the proportion of immediate target fixations and the time to fixate the target. Consistent with previous work, we also found an effect of typicality on target verification times, the time between target fixation and the search judgment; as target typicality decreased, verification times increased. To model these typicality effects, we trained Support Vector Machine (SVM) classifiers on the target categories, and tested these on the corresponding specific targets used in the search task. This analysis revealed significant differences in classifier confidence between the high-, medium-, and low-typicality groups, paralleling the behavioral results. Collectively, these findings suggest that target typicality broadly affects both search guidance and verification, and that differences in typicality can be predicted by distance from an SVM classification boundary.
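The modeling step can be illustrated with a short sketch: train a linear SVM on labeled category exemplars and read off each specific target's signed distance from the decision boundary as a classifier-confidence score. The feature vectors below are random placeholders, not the image features used in the study.

```python
# Placeholder sketch: SVM classifier confidence as distance from the boundary.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Fake 10-dimensional features for two target categories.
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)), rng.normal(1.5, 1.0, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0).fit(X, y)

# Hypothetical category-0 targets of decreasing typicality (drifting toward
# the other category, hence toward the decision boundary).
targets = np.vstack([rng.normal(0.0, 1.0, 10),
                     rng.normal(0.5, 1.0, 10),
                     rng.normal(1.0, 1.0, 10)])
print(clf.decision_function(targets))  # larger magnitude = higher confidence
```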
Article
Full-text available
The conclusion that scene knowledge interacts with object perception depends on evidence that object detection is facilitated by consistent scene context. Experiment 1 replicated the I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz (1982) object-detection paradigm. Detection performance was higher for semantically consistent versus inconsistent objects. However, when the paradigm was modified to control for response bias (Experiments 2 and 3) or when response bias was eliminated by means of a forced-choice procedure (Experiment 4), no such advantage obtained. When an additional source of biasing information was eliminated by presenting the object label after the scene (Experiments 3 and 4), there was either no effect of consistency (Experiment 4) or an inconsistent object advantage (Experiment 3). These results suggest that object perception is not facilitated by consistent scene context. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Full-text available
Quantitative predictions are made from a model for word recognition. The model has as its central feature a set of "logogens," devices which accept information relevant to a particular word response irrespective of the source of this information. When more than a threshold amount of information has accumulated in any logogen, that particular response becomes available for responding. The model is tested against data available on (1) the effect of word frequency on recognition, (2) the effect of limiting the number of response alternatives, (3) the interaction of stimulus and context, and (4) the interaction of successive presentations of stimuli. Implications of the underlying model are largely upheld. Other possible models for word recognition are discussed as are the implications of the logogen model for theories of memory. (30 ref.) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
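A toy sketch of the threshold idea may help: each word's logogen sums whatever evidence reaches it, from the stimulus or from context, and the response becomes available once the total crosses that logogen's threshold. The words, evidence values, and thresholds below are invented for illustration; treating higher word frequency as a lower threshold follows the spirit of the model but is parameterized arbitrarily here.

```python
# Toy sketch of the logogen idea: evidence from any source accumulates in each
# word's logogen; responses become available above threshold. Numbers are
# illustrative only.
thresholds = {"doctor": 5.0, "docket": 7.0}   # frequent words ~ lower thresholds

def available_responses(stimulus_evidence, context_evidence, thresholds):
    """Return words whose accumulated evidence meets or exceeds their threshold."""
    out = []
    for word, thresh in thresholds.items():
        total = stimulus_evidence.get(word, 0.0) + context_evidence.get(word, 0.0)
        if total >= thresh:
            out.append(word)
    return out

# A degraded stimulus alone is insufficient, but adding contextual evidence
# pushes the frequent word over its threshold:
stim = {"doctor": 3.0, "docket": 3.0}
ctx = {"doctor": 2.5}                  # e.g., a medical sentence context
print(available_responses(stim, ctx, thresholds))   # ['doctor']
```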
Article
Full-text available
Viewers can rapidly extract a holistic semantic representation of a real-world scene within a single eye fixation, an ability called recognizing the gist of a scene, and operationally defined here as recognizing an image's basic-level scene category. However, it is unknown how scene gist recognition unfolds over both time and space: within a fixation and across the visual field. Thus, in 3 experiments, the current study investigated the spatiotemporal dynamics of basic-level scene categorization from central vision to peripheral vision over the time course of the critical first fixation on a novel scene. The method used a window/scotoma paradigm in which images were briefly presented and processing times were varied using visual masking. The results of Experiments 1 and 2 showed that during the first 100 ms of processing, there was an advantage for processing the scene category from central vision, with the relative contributions of peripheral vision increasing thereafter. Experiment 3 tested whether this pattern could be explained by spatiotemporal changes in selective attention. The results showed that manipulating the probability of information being presented centrally or peripherally selectively maintained or eliminated the early central vision advantage. Across the 3 experiments, the results are consistent with a zoom-out hypothesis, in which, during the first fixation on a scene, gist extraction extends from central vision to peripheral vision as covert attention expands outward. (PsycINFO Database Record (c) 2013 APA, all rights reserved).
Article
Full-text available
This article seeks to establish a rapprochement between explicitly Bayesian models of contextual effects in perception and neural network models of such effects, particularly the connectionist interactive activation (IA) model of perception. The article is in part an historical review and in part a tutorial, reviewing the probabilistic Bayesian approach to understanding perception and how it may be shaped by context, and also reviewing ideas about how such probabilistic computations may be carried out in neural networks, focusing on the role of context in interactive neural networks, in which both bottom-up and top-down signals affect the interpretation of sensory inputs. It is pointed out that connectionist units that use the logistic or softmax activation functions can exactly compute Bayesian posterior probabilities when the bias terms and connection weights affecting such units are set to the logarithms of appropriate probabilistic quantities. Bayesian concepts such as the prior, likelihood, (joint and marginal) posterior, probability matching and maximizing, and calculating vs. sampling from the posterior are all reviewed and linked to neural network computations. Probabilistic and neural network models are explicitly linked to the concept of a probabilistic generative model that describes the relationship between the underlying target of perception (e.g., the word intended by a speaker or other source of sensory stimuli) and the sensory input that reaches the perceiver for use in inferring the underlying target. It is shown how a new version of the IA model called the multinomial interactive activation (MIA) model can sample correctly from the joint posterior of a proposed generative model for perception of letters in words, indicating that interactive processing is fully consistent with principled probabilistic computation. Ways in which these computations might be realized in real neural systems are also considered.
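The claim that logistic or softmax units can compute Bayesian posteriors exactly is easy to verify numerically: if each alternative's net input is the log of its prior plus the log of its likelihood, the softmax output equals the posterior from Bayes' rule. The priors and likelihoods below are arbitrary illustrative numbers.

```python
# Numerical check: softmax over log(prior) + log(likelihood) equals the
# Bayesian posterior. The probability values are arbitrary.
import numpy as np

priors = np.array([0.7, 0.2, 0.1])          # P(hypothesis)
likelihoods = np.array([0.05, 0.4, 0.3])    # P(sensory input | hypothesis)

# Direct Bayes rule:
posterior = priors * likelihoods
posterior /= posterior.sum()

# Softmax over net inputs equal to the logs of the same quantities:
net_input = np.log(priors) + np.log(likelihoods)
softmax = np.exp(net_input) / np.exp(net_input).sum()

print(posterior)
print(softmax)   # identical to the Bayesian posterior (up to floating point)
```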
Article
Full-text available
The general problem of visual search can be shown to be computationally intractable in a formal, complexity-theoretic sense, yet visual search is extensively involved in everyday perception, and biological systems manage to perform it remarkably well. Complexity level analysis may resolve this contradiction. Visual search can be reshaped into tractability through approximations and by optimizing the resources devoted to visual processing. Architectural constraints can be derived using the minimum cost principle to rule out a large class of potential solutions. The evidence speaks strongly against bottom-up approaches to vision. In particular, the constraints suggest an attentional mechanism that exploits knowledge of the specific problem being solved. This analysis of visual search performance in terms of attentional influences on visual information processing and complexity satisfaction allows a large body of neurophysiological and psychological evidence to be tied together.
Article
Full-text available
A set of visual search experiments tested the proposal that focused attention is needed to detect change. Displays were arrays of rectangles, with the target being the item that continually changed its orientation or contrast polarity. Five aspects of performance were examined: linearity of response, processing time, capacity, selectivity, and memory trace. Detection of change was found to be a self-terminating process requiring a time that increased linearly with the number of items in the display. Capacity for orientation was found to be about five items, a value comparable to estimates of attentional capacity. Observers were able to filter out both static and dynamic variations in irrelevant properties. Analysis also indicated a memory for previously attended locations. These results support the hypothesis that the process needed to detect change is much the same as the attentional process needed to detect complex static patterns. Interestingly, the features of orientation and polarity were found to be handled in somewhat different ways. Taken together, these results not only provide evidence that focused attention is needed to see change, but also show that change detection itself can provide new insights into the nature of attentional processing.
Article
Full-text available
One of the more powerful impressions created by vision is that of a coherent, richly-detailed world where everything is present simultaneously. Indeed, this impression is so compelling that we tend to ascribe these properties not only to the external world, but to our internal representations as well. But results from several recent experiments argue against this latter ascription. For example, changes in images of real-world scenes often go unnoticed when made during a saccade, flicker, blink, or movie cut. This "change blindness" provides strong evidence against the idea that our brains contain a picture-like representation of the scene that is everywhere detailed and coherent. How then do we represent a scene? It is argued here that focused attention provides spatiotemporal coherence for the stable representation of one object at a time. It is then argued that the allocation of attention can be coordinated to create a "virtual representation". In such a scheme, a stable object representation is formed whenever needed, making it appear to higher levels as if all objects in the scene are represented in detail simultaneously.
Article
Full-text available
Visual word identification requires readers to code the identity and order of the letters in a word and match this code against previously learned codes. Current models of this lexical matching process posit context-specific letter codes in which letter representations are tied to either specific serial positions or specific local contexts (e.g., letter clusters). The spatial coding model described here adopts a different approach to letter position coding and lexical matching based on context-independent letter representations. In this model, letter position is coded dynamically, with a scheme called spatial coding. Lexical matching is achieved via a method called superposition matching, in which input codes and learned codes are matched on the basis of the relative positions of their common letters. Simulations of the model illustrate its ability to explain a broad range of results from the masked form priming literature, as well as to capture benchmark findings from the unprimed lexical decision task.
Article
Full-text available
Visual masking, throughout its history, has been used as an investigative tool in exploring the temporal dynamics of visual perception, beginning with retinal processes and ending in cortical processes concerned with the conscious registration of stimuli. However, visual masking also has been a phenomenon deemed worthy of study in its own right. Most of the recent uses of visual masking have focused on the study of central processes, particularly those involved in feature, object and scene representations, in attentional control mechanisms, and in phenomenal awareness. In recent years our understanding of the phenomenon and cortical mechanisms of visual masking also has benefited from several brain imaging techniques and from a number of sophisticated and neurophysiologically plausible neural network models. Key issues and problems are discussed with the aim of guiding future empirical and theoretical research.
Article
Full-text available
Previous research measuring visual short-term memory (VSTM) suggests that the capacity for representing the layout of objects is fairly high. In four experiments, we further explored the capacity of VSTM for layout of objects, using the change detection method. In Experiment 1, participants retained most of the elements in displays of 4 to 8 elements. In Experiments 2 and 3, with up to 20 elements, participants retained many of them, reaching a capacity of 13.4 stimulus elements. In Experiment 4, participants retained much of a complex naturalistic scene. In most cases, increasing display size caused only modest reductions in performance, consistent with the idea of configural, variable-resolution grouping. The results indicate that participants can retain a substantial amount of scene layout information (objects and locations) in short-term memory. We propose that this is a case of remote visual understanding, where observers' ability to integrate information from a scene is paramount.
Article
Full-text available
How much can be seen in a single brief exposure? This is an important problem because our normal mode of seeing greatly resembles a sequence of brief exposures. In this report, the following experiments were conducted to study quantitatively the information that becomes available to an observer following a brief exposure. Lettered stimuli were chosen because these contain a relatively large amount of information per item and because these are the kind of stimuli that have been used by most previous investigators. The first two experiments are essentially control experiments; they attempt to confirm that immediate-memory for letters is independent of the parameters of stimulation, that it is an individual characteristic. In the third experiment the number of letters available immediately after the extinction of the stimulus is determined by means of a sampling (partial report) procedure described. The fourth experiment explores decay of available information with time. The fifth experiment examines some exposure parameters. In the sixth experiment a technique which fails to demonstrate a large amount of available information is investigated. The seventh experiment deals with the role of the historically important variable: order of report. It was found that each observer was able to report only a limited number of symbols correctly. For exposure durations from 15 to 500 msec, the average was slightly over four letters; stimuli having four or fewer letters were reported correctly nearly 100% of the time. It is also concluded that the high accuracy of partial report observed in the experiments does not depend on the order of report or on the position of letters on the stimulus, but rather it is shown to depend on the ability of the observer to read a visual image that persists for a fraction of a second after the stimulus has been turned off.
Article
Full-text available
Three experiments examined the time course of layout priming with photographic scenes varying in complexity (number of objects). Primes were presented for varying durations (800-50 ms) before a target scene with 2 spatial probes; observers indicated whether the left or right probe was closer to viewpoint. Reaction time was the main measure. Scene primes provided maximum benefits with 200 ms or less prime duration, indicating that scene priming is rapid enough to influence everyday distance perception. The time course of prime processing was similar for simple and complex scene primes and for upright and inverted primes, suggesting that the prime representation was intermediate level in nature.
Article
Full-text available
Visual science is currently a highly active domain, with much progress being made in fields such as colour vision, stereo vision, perception of brightness and contrast, visual illusions, etc. But the "real" mystery of visual perception remains comparatively unfathomed, or at least relegated to philosophical status: Why it is that we can see so well with what is apparently such a badly constructed visual apparatus? In this paper I will discuss several defects of vision and the classical theories of how they are overcome. I will criticize these theories and suggest an alternative approach, in which the outside world is considered as a kind of external memory store which can be accessed instantaneously by casting one's eyes (or one's attention) to some location. The feeling of the presence and extreme richness of the visual world is, under this view, a kind of illusion, created by the immediate availability of the information in this external store.
Article
Full-text available
In a typical perceptual identification task, a word is presented for a few milliseconds and masked; then subjects are asked to report the word. It has been found that an earlier presentation of the test word will improve identification of the test word by as much as 30%. In addition, this facilitation has been shown to be preserved under amnesia. In this article we examine a fundamental question: Is the facilitation the result of bias toward the earlier presented item, an improvement in perceptual sensitivity, or both? The experiments presented here use a forced choice procedure to show that prior presentation of an item biases the subject to choose that item but does not improve discriminability. This result is obtained when the distractor items are visually similar to the target items. When distractors are dissimilar, earlier presentation affects neither bias nor discriminability. Two models of word identification are examined in light of the bias effects, and implications for understanding savings in amnesia are also examined.
Article
Full-text available
In contrast to expectation-based, predictive views of discourse comprehension, a model is developed in which the initial processing is strictly bottom-up. Word meanings are activated, propositions are formed, and inferences and elaborations are produced without regard to the discourse context. However, a network of interrelated items is created in this manner, which can be integrated into a coherent structure through a spreading activation process. Data concerning the time course of word identification in a discourse context are examined. A simulation of arithmetic word-problem understanding provides a plausible account for some well-known phenomena in this area.
Article
Full-text available
The question of what makes a concept coherent (what makes its members form a comprehensible class) has received a variety of answers. In this article we review accounts based on similarity, feature correlations, and various theories of categorization. We find that each theory provides an inadequate account of conceptual coherence (or no account at all) because none provides enough constraints on possible concepts. We propose that concepts are coherent to the extent that they fit people's background knowledge or naive theories about the world. These theories help to relate the concepts in a domain and to structure the attributes that are internal to a concept. Evidence of the influence of theories on various conceptual tasks is presented, and the possible importance of theories in cognitive development is discussed.
Article
Full-text available
Three areas of high-level scene perception research are reviewed. The first concerns the role of eye movements in scene perception, focusing on the influence of ongoing cognitive processing on the position and duration of fixations in a scene. The second concerns the nature of the scene representation that is retained across a saccade and other brief time intervals during ongoing scene perception. Finally, we review research on the relationship between scene and object identification, focusing particularly on whether the meaning of a scene influences the identification of constituent objects.
Article
The world contains not only objects and features (red apples, glass bowls, wooden tables), but also relations holding between them (apples contained in bowls, bowls supported by tables). Representations of these relations are often developmentally precocious and linguistically privileged; but how does the mind extract them in the first place? Although relations themselves cast no light onto our eyes, a growing body of work suggests that even very sophisticated relations display key signatures of automatic visual processing. Across physical, eventive, and social domains, relations such as support, fit, cause, chase, and even socially interact are extracted rapidly, are impossible to ignore, and influence other perceptual processes. Sophisticated and structured relations are not only judged and understood, but also seen — revealing surprisingly rich content in visual perception itself.
Article
We live in a rich, three dimensional world with complex arrangements of meaningful objects. For decades, however, theories of visual attention and perception have been based on findings generated from lines and color patches. While these theories have been indispensable for our field, the time has come to move on from this rather impoverished view of the world and (at least try to) get closer to the real thing. After all, our visual environment consists of objects that we not only look at, but constantly interact with. Having incorporated the meaning and structure of scenes, i.e. its “grammar”, then allows us to easily understand objects and scenes we have never encountered before. Studying this grammar provides us with the fascinating opportunity to gain new insights into the complex workings of attention, perception, and cognition. In this review, I will discuss how the meaning and the complex, yet predictive structure of real-world scenes influence attention allocation, search, and object identification.
Chapter
People make eye movements while interacting with objects, and these behaviors are rich with information about how visual goals are represented in the brain and used to prioritize sequential motor behavior. Here we adopt a real-world perspective and define goal-directed attention control as the guidance (or biasing) of gaze to target-object goals that have uncertain visual appearance. Specifically, we review models of goal-directed attention control that have attempted to predict the behavioral fixations made in the search for target-category goals in images. We will show how modeling perspectives on this question changed over the decades. Using the year 2020 as a reference, we will critically review the recent past of the categorical search modeling literature (~ 2000–2010), the literature defining our present (~ 2010–2020), and speculate about the future of search models and the directions that the literature may turn in the next decade (~ 2020–2030).
Article
Mainstream theories of visual short-term memory (VSTM) posit that VSTM consists of a single, limited capacity store (e.g., Luck & Vogel, 2013). Recently, however, some researchers (Sligte, Scholte & Lamme, 2008; van Moorselaar et al., 2015; Vandenbroucke et al., 2015) have proposed that VSTM consists of two separate components, a limited capacity (3–4 item) durable store and a fragile, high-capacity (5–7 item) store. To assess the structure of VSTM, these authors used a change detection task that required participants to compare two arrays (a memory array and a test array) separated by a brief temporal interval. Critically, these researchers compared performance under conditions when a cue was shown prior to the test array (retro-cue), or after the test array was shown (post-cue). They reported that participants could recall at least twice as many items in the retro-cue condition as in the post-cue condition and interpreted this as evidence for the existence of an initial stage of VSTM with a much higher capacity than was previously thought. The view that VSTM may consist of two distinct stores challenges decades of evidence and theory on the structure of VSTM. In the current study, we directly examined the architecture of VSTM using state-trace analysis, a direct method for assessing the dimensionality of psychological constructs. We replicated the benefit of presenting retro- versus post-cues on memory. However, the results of the state-trace analysis were consistent with a single store model of VSTM. We conclude that the improvement in performance in the retro-cue condition reflects increased attentional processing of the probed item rather than a distinct memory store.
Article
Although our subjective impression is of a richly detailed visual world, numerous empirical results suggest that the amount of visual information observers can perceive and remember at any given moment is limited. How can our subjective impressions be reconciled with these objective observations? Here, we answer this question by arguing that, although we see more than the handful of objects claimed by prominent models of visual attention and working memory, we still see far less than we think we do. Taken together, we argue that these considerations resolve the apparent conflict between our subjective impressions and empirical data on visual capacity, while also illuminating the nature of the representations underlying perceptual experience. Numerous empirical results highlight the limits of visual perception, attention, and working memory. However, it intuitively feels as though we have a rich perceptual experience, leading many to claim that conscious perception overflows these limited cognitive mechanisms. A relatively new field of study (visual ensembles and summary statistics) provides empirical support for the notion that perception is not limited and that observers have access to information across the entire visual world. Ensemble statistics, and scene processing in general, also appear to be supported by neural structures that are distinct from those supporting object perception. These distinct mechanisms can work partially in parallel, providing observers with a broad perceptual experience. Moreover, new demonstrations show that perception is not as rich as is intuitively believed. Thus, ensemble statistics appear to capture the entirety of perceptual experience.
Article
This article sets out to examine the role of symbolic and sensorimotor representations in discourse comprehension. It starts out with a review of the literature on situation models, showing how mental representations are constrained by linguistic and situational factors. These ideas are then extended to more explicitly include sensorimotor representations. Following Zwaan and Madden (2005), the author argues that sensorimotor and symbolic representations mutually constrain each other in discourse comprehension. These ideas are then developed further to propose two roles for abstract concepts in discourse comprehension. It is argued that they serve as pointers in memory, used (1) cataphorically to integrate upcoming information into a sensorimotor simulation, or (2) anaphorically to integrate previously presented information into a sensorimotor simulation. In either case, the sensorimotor representation is a specific instantiation of the abstract concept.
Article
This experiment demonstrates the influence of the prior presentation of visual scenes on the identification of briefly presented drawings of real-world objects. Different pairings of objects and scenes were used to produce three main contextual conditions: appropriate, inappropriate, and no context. Correct responses and confusions with visually similar objects depended strongly on both the contextual condition and the particular target object presented. The probability of being correct was highest in the appropriate context condition and lowest in the inappropriate context condition. Confidence ratings of responses were a function of the perceptual similarity between the stimulus object and the named object; they were not strongly affected by contextual conditions. Morton's (1970) "logogen" model provided a good quantitative fit to the response probability data.
Article
Visual working memory capacity is of great interest because it is strongly correlated with overall cognitive ability, can be understood at the level of neural circuits, and is easily measured. Recent studies have shown that capacity influences tasks ranging from saccade targeting to analogical reasoning. A debate has arisen over whether capacity is constrained by a limited number of discrete representations or by an infinitely divisible resource, but the empirical evidence and neural network models currently favor a discrete item limit. Capacity differs markedly across individuals and groups, and recent research indicates that some of these differences reflect true differences in storage capacity whereas others reflect variations in the ability to use memory capacity efficiently.
Article
In studies in which unrelated photographs are presented in RSVP, viewers can readily detect a picture when given a brief descriptive title such as picnic or two men talking, at rates of presentation up to about 10 pictures/s, even though they have never seen that picture before and an infinite number of different pictures could fit the description (Intraub, 1981; Potter, 1976) (Figure 1). Evidently viewers can extract the conceptual gist of a picture rapidly, retrieving relevant conceptual information about objects and their background from long-term memory (e.g., Davenport & Potter, 2004). Having spotted the target picture, viewers can continue to attend to it and consolidate it into working memory; after the sequence they can describe the picnic scene, for example. Yet viewers forget most pictures presented at that rate almost immediately, when they are not looking for a particular target, as shown in Figure 1 and Demo 1. The rate must be slowed to about 2 pictures/s for viewers to recognize as many as half the pictures as familiar, shortly after the sequence. However, even at a rate of presentation of 6 pictures/s viewers are usually able to remember most of the pictures if tested for recognition within a second of the end of the sequence (Potter, Staub, Rado, & O'Connor, 2002). That is, they will usually remember the first picture tested, if testing begins immediately; performance drops off rapidly over the first few seconds (Figure 2). Importantly, one sees a similar fall-off in performance when the test is in the form of picture titles, showing that the gist of most pictures was initially represented but then forgotten (Potter, Staub, & O'Connor, 2004). Thus, gist can be extracted rapidly, but may be quickly forgotten without further processing.
Article
Although at any instant we experience a rich, detailed visual world, we do not use such visual details to form a stable representation across views. Over the past five years, researchers have focused increasingly on 'change blindness' (the inability to detect changes to an object or scene) as a means to examine the nature of our representations. Experiments using a diverse range of methods and displays have produced strikingly similar results: unless a change to a visual scene produces a localizable change or transient at a specific position on the retina, generally, people will not detect it. We review theory and research motivating work on change blindness and discuss recent evidence that people are blind to changes occurring in photographs, in motion pictures and even in real-world interactions. These findings suggest that relatively little visual information is preserved from one view to the next, and question a fundamental assumption that has underlain perception research for centuries: namely, that we need to store a detailed visual representation in the mind/brain from one view to the next.
Article
Eye movements are now widely used to investigate cognitive processes during reading, scene perception, and visual search. In this article, research on the following topics is reviewed with respect to reading: (a) the perceptual span (or span of effective vision), (b) preview benefit, (c) eye movement control, and (d) models of eye movements. Related issues with respect to eye movements during scene perception and visual search are also reviewed. It is argued that research on eye movements during reading has been somewhat advanced over research on eye movements in scene perception and visual search and that some of the paradigms developed to study reading should be more widely adopted in the study of scene perception and visual search. Research dealing with "real-world" tasks and research utilizing the visual-world paradigm are also briefly discussed.
Article
What information is available from a brief glance at a novel scene? Although previous efforts to answer this question have focused on scene categorization or object detection, real-world scenes contain a wealth of information whose perceptual availability has yet to be explored. We compared image exposure thresholds in several tasks involving basic-level categorization or global-property classification. All thresholds were remarkably short: Observers achieved 75%-correct performance with presentations ranging from 19 to 67 ms, reaching maximum performance at about 100 ms. Global-property categorization was performed with significantly less presentation time than basic-level categorization, which suggests that there exists a time during early visual processing when a scene may be classified as, for example, a large space or navigable, but not yet as a mountain or lake. Comparing the relative availability of visual information reveals bottlenecks in the accumulation of meaning. Understanding these bottlenecks provides critical insight into the computations underlying rapid visual understanding.
Article
Empirical results from both reading and speech perception indicate that stimulus and context information have independent influences on perceptual recognition. Massaro (1989) argued that these data are inconsistent with an interactive activation and competition (IAC) model (McClelland & Rumelhart, 1981), and consistent with the fuzzy logical model of perception (FLMP) (Massaro, 1979; 1989). McClelland (1991) then modified the interactive activation model to be stochastic rather than deterministic and to use a best one wins (BOW) decision rule, allowing it to predict independent influences of stimulus and context. When tested against real data, however, the network proposed by McClelland and extended by us gives a poorer description of actual empirical results than the FLMP. To account for the dynamics of information processing, the SIAC model, an interactive model based on the Boltzmann machine, and the FLMP are formulated to make quantitative predictions of performance as a function of processing time. It is shown that the dynamic FLMP provides a better description of the time course of perceptual processing than does interactive activation. The SIAC and Boltzmann models have difficulty predicting 1) context effects given little processing time and 2) a strong stimulus influence given substantial processing time. Finally, we demonstrate that the FLMP predicts that context can improve the accuracy of performance, in addition to providing a bias to respond with the alternative supported by context. In summary, there is now both empirical and theoretical evidence in favor of the FLMP over SIAC models of pattern recognition. We therefore argue that interactive activation is both less consistent with empirical results and not necessary to describe the joint influence of stimulus and context in language perception.
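For readers unfamiliar with the FLMP, a minimal sketch of its combination rule, as it is commonly described, is given below: the support each independent source (stimulus, context) lends an alternative is combined multiplicatively and then normalized across alternatives (the relative goodness rule). The support values are invented for illustration, and the sketch is not a full implementation of the dynamic FLMP discussed above.

```python
# Sketch of the multiplicative-combination-plus-normalization idea commonly
# attributed to the FLMP. Support values are illustrative only.
def flmp_probability(stimulus_support, context_support):
    """stimulus_support / context_support: dicts mapping each alternative to a
    truth value in [0, 1]; returns predicted identification probabilities."""
    goodness = {alt: stimulus_support[alt] * context_support[alt]
                for alt in stimulus_support}
    total = sum(goodness.values())
    return {alt: g / total for alt, g in goodness.items()}

# An ambiguous stimulus ("b" vs "p") in a context that favors "b":
stim = {"b": 0.6, "p": 0.4}
ctx = {"b": 0.8, "p": 0.2}
print(flmp_probability(stim, ctx))   # context shifts the response toward "b"
```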
Article
A new hypothesis about the role of focused attention is proposed. The feature-integration theory of attention suggests that attention must be directed serially to each stimulus in a display whenever conjunctions of more than one separable feature are needed to characterize or distinguish the possible objects presented. A number of predictions were tested in a variety of paradigms including visual search, texture segregation, identification and localization, and using both separable dimensions (shape and color) and local elements or parts of figures (lines, curves, etc. in letters) as the features to be integrated into complex wholes. The results were in general consistent with the hypothesis. They offer a new set of criteria for distinguishing separable from integral features and a new rationale for predicting which tasks will show attention limits and which will not.
Article
Sets of similar objects are common occurrences--a crowd of people, a bunch of bananas, a copse of trees, a shelf of books, a line of cars. Each item in the set may be distinct, highly visible, and discriminable. But when we look away from the set, what information do we have? The current article starts to address this question by introducing the idea of a set representation. This idea was tested using two new paradigms: mean discrimination and member identification. Three experiments using sets of different-sized spots showed that observers know a set's mean quite accurately but know little about the individual items, except their range. Taken together, these results suggest that the visual system represents the overall statistical, and not individual, properties of sets.
Article
When a visual scene, containing many discrete objects, is presented to our retinae, only a subset of these objects will be explicitly represented in visual awareness. The number of objects accessing short-term visual memory might be even smaller. Finally, it is not known to what extent "ignored" objects (those that do not enter visual awareness) will be processed--or recognized. By combining free recall, forced-choice recognition and visual priming paradigms for the same natural visual scenes and subjects, we were able to estimate these numbers, and provide insights as to the fate of objects that are not explicitly recognized in a single fixation. When presented for 250 ms with a scene containing 10 distinct objects, human observers can remember up to 4 objects with full confidence, and between 2 and 3 more when forced to guess. Importantly, the objects that the subjects consistently failed to report elicited a significant negative priming effect when presented in a subsequent task, suggesting that their identity was represented in high-level cortical areas of the visual system, before the corresponding neural activity was suppressed during attentional selection. These results shed light on neural mechanisms of attentional competition, and representational capacity at different levels of the human visual system.