
# A Model of Saliency-based Visual Attention for Rapid Scene Analysis

Authors: Laurent Itti, Christof Koch, and Ernst Niebur

## Abstract and Figures

A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.

Index terms: Visual attention, scene analysis, feature extraction, target detection, visual search.

I. Introduction. Primates have a remarkable ability to interpret complex scenes in real time, despite the limited speed of the neuronal hardware available for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing [1], most likely to reduce the complexity of scene analysis [2]. This selection appears to be implemented in the ...
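The pipeline the abstract describes — multiscale feature extraction, center-surround differences, and across-scale combination into a saliency map — can be sketched for a single intensity channel as follows. This is a minimal illustration under assumed parameters (the scale pairs, smoothing sigma, and interpolation choices are not taken from the paper), not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def intensity_saliency(image, center_scales=(2, 3), deltas=(3, 4)):
    """Toy center-surround intensity saliency: build a Gaussian pyramid,
    take absolute across-scale differences between "center" and coarser
    "surround" levels, and sum the difference maps at a common scale."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(max(center_scales) + max(deltas)):
        smoothed = gaussian_filter(pyramid[-1], sigma=1.0)
        pyramid.append(smoothed[::2, ::2])  # downsample by 2

    base_shape = pyramid[center_scales[0]].shape
    saliency = np.zeros(base_shape)
    for c in center_scales:
        for d in deltas:
            center = pyramid[c]
            surround = pyramid[c + d]
            # upsample the surround level to the center's resolution
            factors = np.array(center.shape) / np.array(surround.shape)
            surround_up = zoom(surround, factors, order=1)
            diff = np.abs(center - surround_up)
            # rescale the difference map to the common output resolution
            diff = zoom(diff, np.array(base_shape) / np.array(diff.shape), order=1)
            saliency += diff
    return saliency
```

A bright patch on a dark background then yields a stronger response near the patch than in the uniform surround.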
[Figure: General architecture of the model. The input image is decomposed by linear filtering into intensity, color, and orientation channels. Center-surround differences and normalization yield the feature maps (6 intensity maps, 12 color maps, 24 orientation maps); across-scale combination and normalization produce three conspicuity maps, whose linear combination forms the saliency map. A winner-take-all network selects the attended location, and inhibition of return allows the focus of attention to shift.]

[Figure: The map normalization operator N(.) illustrated on an intensity map and an orientation map (arbitrary units).]

[Figure: Example run on a stimulus image, from input image through the intensity (I), orientation (O), and color (C) channels to the saliency map (SM). The focus of attention (FOA) output visits successive salient locations at 92, 145, 206, and 260 ms of simulated time.]

[Figure: Noise robustness. Number of false detections as a function of noise density d (0 to 1) for white-color and multicolored noise, with 1x1 and 5x5 noise patches; example stimuli shown at d = 0.1 and d = 0.5 (5x5 patches).]
... Saliency detection algorithms based on contrast priors generate a saliency map from the color and brightness differences between the salient object and the background. Local-contrast-based algorithms define the surroundings of a single pixel or sub-area as a local region and then calculate the contrast between pixels or sub-areas and their local regions to obtain the saliency map [25][26][27]. [25] proposed a bottom-up saliency model that uses a Gaussian image pyramid to extract primary features, including color, brightness, and orientation; the saliency map is generated by fusing multiscale image features through center-surround operators. An image pyramid is a set of images of the same scene at different resolutions. ...
... Consider a Gaussian pyramid P = {F_0, F_1, ..., F_n}, where F_0 is the original image and F_i (i ∈ [1, n]) is the i-th pyramid level, generated by Gaussian smoothing and downsampling [28]. A high-level feature map has low resolution and reflects the overall structure of the image, while a low-level feature map has high resolution and reflects its fine detail, as shown for F_i in Fig. 2. [25] generates a nine-scale Gaussian pyramid and fuses the feature information across the pyramid levels to generate the saliency map. This repeated decomposition of the image causes salient regions to be only vaguely located and results in clutter between salient and background regions. ...
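The pyramid construction described above can be sketched as follows; this is a minimal illustration in which the smoothing sigma and the downsampling factor of 2 are the conventional (assumed) choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels):
    """Build P = [F0, F1, ..., Fn]: each level is the previous one
    smoothed with a Gaussian filter and downsampled by a factor of 2,
    so higher levels have lower resolution and coarser content."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        smoothed = gaussian_filter(pyramid[-1], sigma=1.0)
        pyramid.append(smoothed[::2, ::2])
    return pyramid
```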
Article
Full-text available
Effective segmentation of skin scars is an important part of judicial scar identification. The current evaluation procedure mainly relies on visual examination, and its accuracy needs to be improved. This paper proposes an unsupervised scar segmentation method based on saliency detection, which can correctly extract scars under complex scar conditions. A Gaussian pyramid feature map is extracted for feature extraction, and a clustering algorithm fuses contrast features and spatial features to establish the skin scar saliency map. Finally, the skin scar region is segmented through post-processing. Experimental results on private datasets provided by a local forensic department show that, compared with seven unsupervised methods, the precision of the proposed method is 18.4% higher than that of the second-best method, which supports the optimization and improvement of skin scar segmentation.
... Saliency detection aims to recognize the most significant regions or objects in a scene and plays an important role in many computer vision applications, such as automatic image cropping, image segmentation, and object recognition. A pioneering study of saliency detection was proposed by Itti et al. [19], which produced saliency maps by calculating feature contrast across color, orientation, and intensity information. Subsequently, the saliency detection task gradually divided into two branches: one dedicated to location-based visual saliency detection, the other to object-level saliency detection. ...
... We compare the proposed saliency detection model with eight state-of-the-art methods in terms of six widely used metrics on the test set of the SIS dataset. As shown in Table 1, Itti et al.'s method [19] and GBVS [17] in the first group are two traditional 2D image saliency detection models that use only hand-crafted features. The remaining methods in the first group (MSI-Net [3], GazeGAN [7], and CASNet [12]) are three learning-based 2D image saliency detection approaches. ...
... Moreover, we also compare the saliency maps obtained by the proposed model with those of other state-of-the-art models in Fig. 6. From this figure, we can see that the traditional 2D and 3D saliency detection methods (Itti et al.'s method [19], GBVS [17], and Fang et al.'s method [14]) mistake the background for the salient regions, leading to poor performance. Comparing the proposed model with other learning-based 2D and 3D saliency detection models, we find that the saliency maps obtained by our model are closer to the ground truth. ...
Article
Full-text available
In this paper, we propose a stereoscopic image thumbnail generation method guided by stereoscopic image saliency. Specifically, we utilize an uncertainty-weighted fusion mechanism to combine spatial saliency information with saliency driven by depth cues, generating a dense stereoscopic saliency fixation map. The dense fixation map is then converted into a salient object map through a saliency optimization module, which provides object-level saliency cues for the thumbnail generation task. Guided by the salient object map, a cropping window cuts out the most salient region to generate the stereoscopic thumbnail, so that the disparity distribution of the original image is well preserved and structured objects are not sharply deformed in the subsequent warping operation. Finally, the warping operation adjusts the aspect ratio of the stereoscopic thumbnail to the target size. Qualitative and quantitative results demonstrate that the proposed method outperforms state-of-the-art benchmarks on public datasets.
... Saliency is commonly thought of as a visual cue, but in effect it is a multiscale metric that measures the significance of an object through its contrast with its surroundings. Itti et al. [27] present one of the earliest implementations of saliency computation. They use the center-surround operation to generate feature maps of saliency measures at multiple scales. ...
... In this paper, we have used five scales σ_i ∈ {σ, 3σ/2, 2σ, 5σ/2, 3σ}, where σ is the number of interval frames indicating the time span and varies for different types of motion. To combine saliency curves at different scales, we apply a nonlinear suppression operator that promotes saliency curves with a small number of high peaks while suppressing saliency curves with a large number of similar peaks [27]. First, all the saliency curves are normalized to the range [0, M] to ensure that they contribute equally to the result. ...
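The suppression step described above can be sketched as a 1-D version of the map normalization operator N(.) of [27]: normalize to [0, M], then weight by the squared difference between the global maximum and the mean of the other local maxima. The peak-detection details below are illustrative assumptions.

```python
import numpy as np

def nonlinear_suppression(curve, M=1.0):
    """Normalize a saliency curve to [0, M], then multiply it by
    (M - m_bar)**2, where m_bar is the mean of the local maxima other
    than the global one: a curve with one dominant peak is promoted,
    a curve with many comparable peaks is suppressed."""
    c = np.asarray(curve, dtype=float)
    span = c.max() - c.min()
    c = M * (c - c.min()) / (span + 1e-12)
    interior = c[1:-1]
    # local maxima: interior samples strictly above both neighbors
    peaks = interior[(interior > c[:-2]) & (interior > c[2:])]
    if peaks.size > 1:
        m_bar = np.sort(peaks)[:-1].mean()  # drop one instance of the global max
    else:
        m_bar = 0.0
    return c * (M - m_bar) ** 2
```

A curve with a single isolated peak keeps its amplitude, while a curve with several equally high peaks is driven toward zero.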
Article
Full-text available
Keyframes are a summary representation of motion capture data, which provide the basis for compression, retrieval, overview and reuse of motion capture data. In this paper, a new approach is proposed to extract keyframes from motion capture data. This approach uses the angle of rotation of limbs and the distance between joints as the feature representation of human movement and calculates limb saliency based on the multiscale saliency of each motion feature. Then the weighted sum of limb saliency is defined as pose saliency, and the frames corresponding to the local maxima on the pose saliency curve are extracted as the initial keyframes. Finally, guided by the initial keyframes, the optimal keyframes are extracted based on the reconstruction error optimization algorithm. Experiments demonstrate that this approach can effectively extract the keyframes with high visual perceptual quality and low reconstruction error, and better meet the needs of real-time analysis and compression of motion capture data.
... Traditional SOD methods [11,46,5,42,29,43] mainly rely on hand-crafted low-level features [11,5,42] and heuristic cues [46,29,43]. With the development of deep learning, more and more researchers have explored the integration of CNNs and SOD. ...
Preprint
Most existing salient object detection (SOD) models are difficult to apply due to the complex and huge model structures. Although some lightweight models are proposed, the accuracy is barely satisfactory. In this paper, we design a novel semantics-guided contextual fusion network (SCFNet) that focuses on the interactive fusion of multi-level features for accurate and efficient salient object detection. Furthermore, we apply knowledge distillation to SOD task and provide a sizeable dataset KD-SOD80K. In detail, we transfer the rich knowledge from a seasoned teacher to the untrained SCFNet through unlabeled images, enabling SCFNet to learn a strong generalization ability to detect salient objects more accurately. The knowledge distillation based SCFNet (KDSCFNet) achieves comparable accuracy to the state-of-the-art heavyweight methods with less than 1M parameters and 174 FPS real-time detection speed. Extensive experiments demonstrate the robustness and effectiveness of the proposed distillation method and SOD framework. Code and data: https://github.com/zhangjinCV/KD-SCFNet.
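A response-based distillation objective of the kind described — the student regresses the teacher's soft saliency maps on unlabeled images — can be sketched as follows. The function name and the per-pixel binary cross-entropy form are illustrative assumptions, not the loss actually used by KD-SCFNet.

```python
import numpy as np

def distillation_loss(student_map, teacher_map, eps=1e-7):
    """Per-pixel binary cross-entropy between the student's predicted
    saliency map and the teacher's soft saliency map (both in [0, 1]).
    On unlabeled images the teacher output replaces the ground truth."""
    s = np.clip(student_map, eps, 1.0 - eps)   # avoid log(0)
    t = np.clip(teacher_map, 0.0, 1.0)
    return float(np.mean(-(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))))
```

The loss is minimized when the student reproduces the teacher's map, which is the sense in which the teacher's "rich knowledge" is transferred.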
... For example, Berga et al. argued that visual saliency is mostly driven by low-level features [31]. Itti established a classic visual attention model based on low-level features [32]. Berman et al. found that low-level features, including contrast, lines, and color in the image, were associated with ratings of naturalness [33]. ...
Article
Full-text available
Objective assessment of image quality seeks to predict image quality without human perception. Given that the ultimate goal of a blind/no-reference image quality assessment (BIQA) algorithm is to provide a score consistent with the subject’s prediction, it makes sense to design an algorithm that resembles human behavior. Recently, a large number of image features have been introduced to image quality assessment. However, only a few of these features are generated by using the computational mechanisms of the visual cortex. In this paper, we propose bioinspired algorithms to extract image features for BIQA by simulating the visual cortex. We extract spatial features like texture and energy from images by mimicking the retinal circuit. We extract spatial-frequency features from images by imitating the simple cell of the primary visual cortex. And we extract color features from images by employing the color opponent mechanism of the biological vision system. Then, by using the statistical features derived from these physiologically plausible features, we train a support vector regression model to predict image quality. The experimental results show that the proposed algorithm is more consistent with subjective evaluations than the comparison algorithms in predicting image quality.
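The color-opponent mechanism mentioned above is commonly computed as simple channel differences. A minimal sketch follows; the exact opponent definitions (the coefficients below) vary between papers and are assumptions here.

```python
import numpy as np

def opponent_channels(rgb):
    """Red-green and blue-yellow opponent channels plus intensity,
    in the style of biologically inspired vision models."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    rg = r - g                    # red vs. green opponency
    by = b - (r + g) / 2.0        # blue vs. yellow (yellow = mean of R and G)
    intensity = (r + g + b) / 3.0
    return rg, by, intensity
```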
Chapter
Modeling smoke dispersion from wildland fires is a complex problem. Heat and emissions are released from a fire front as well as from post-frontal combustion, and both are continuously evolving in space and time, providing an emission source that is unlike the industrial sources for which most dispersion models were originally designed. Convective motions driven by the fire’s heat release strongly couple the fire to the atmosphere, influencing the development and dynamics of the smoke plume. This chapter examines how fire events are described in the smoke modeling process and explores new research tools that may offer potential improvements to these descriptions and can reduce uncertainty in smoke model inputs. Remote sensing will help transition these research tools to operations by providing a safe and reliable means of measuring the fire environment at the space and time scales relevant to fire behavior.
Article
In the field of saliency detection, salient instance segmentation is a novel and challenging task that has received widespread attention. Due to the limited scale of the available dataset and the high cost of mask annotations, a substantial quantity of supervision sources is urgently required to train a high-performing salient instance model. To this end, we train a novel salient instance segmentation model with weak supervision that makes full use of existing salient object detection datasets. In this paper, we present a cyclic global context salient instance segmentation network (CGCNet) supervised by a combination of salient regions and bounding boxes from ready-made salient object detection datasets. To locate salient instances more accurately, a global feature refining layer is designed to expand the features from the region of interest (ROI) to the global field of a scene. Moreover, a label updating scheme is embedded in the proposed framework to iteratively refine the weak labels. Extensive experimental results demonstrate that CGCNet, trained with weak labels only, is competitive with existing fully supervised state-of-the-art methods.
Article
The current study investigated how infants (6–24 months), children (2–12 years), and adults differ in how visual cues—visual saliency and centering—guide their attention to faces in videos. We report a secondary analysis of Kadooka and Franchak (2020), in which observers’ eye movements were recorded during viewing of television clips containing a variety of faces. For every face on every video frame, we calculated its visual saliency (based on both static and dynamic image features) and calculated how close the face was to the center of the image. Results revealed that participants of every age looked more often at each face when it was more salient compared to less salient. In contrast, centering did not increase the likelihood that infants looked at a given face, but in later childhood and adulthood, centering became a stronger cue for face looking. A control analysis determined that the age‐related change in centering was specific to face looking; participants of all ages were more likely to look at the center of the image, and this center bias did not change with age. The implications for using videos in educational and diagnostic contexts are discussed.
Article
Healthy operation of the tail rope is crucial to the stable and safe operation of a friction hoisting system; failure of the tail rope threatens both property and personnel. In this study, a fault diagnosis algorithm based on deep learning is proposed for the tail rope. Specifically, we add a spatial attention mechanism in the feature extraction stage to assign different weights to different image regions, which guides the model to focus on the more important ones. A class-balance cross-entropy loss is introduced to alleviate the imbalanced data distribution found in actual conditions, enhancing the robustness of the algorithm and its transferability to practical applications. Experimental studies validate the algorithm: its accuracy on the constructed dataset is 99.4819%, an improvement of 10% and 7% over hand-crafted features (scale-invariant feature transform with a support vector machine and with a random forest, respectively). Results show that the proposed algorithm can meet the requirements of high accuracy and generalization in practical engineering applications.
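A class-balanced cross-entropy of the general kind described can be sketched as follows, using the "effective number of samples" weighting of Cui et al.; the paper's exact weighting is not given here, so this form is an assumption.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights via the effective number of samples,
    (1 - beta**n) / (1 - beta): rare classes get larger weights."""
    counts = np.asarray(counts, dtype=float)
    effective = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective
    return weights * len(counts) / weights.sum()  # normalize to mean 1

def balanced_cross_entropy(probs, labels, weights, eps=1e-12):
    """Weighted negative log-likelihood over predicted class
    probabilities (one row per sample, one column per class)."""
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(np.mean(weights[labels] * -np.log(p)))
```

With counts of 900 and 100 samples, for instance, the minority class receives the larger weight, counteracting the imbalance in the loss.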
Article
Full-text available
An important component of routine visual behavior is the ability to find one item in a visual world filled with other, distracting items. This ability to perform visual search has been the subject of a large body of research in the past 15 years. This paper reviews the visual search literature and presents a model of human search behavior. Built upon the work of Neisser, Treisman, Julesz, and others, the model distinguishes between a preattentive, massively parallel stage that processes information about basic visual features (color, motion, various depth cues, etc.) across large portions of the visual field and a subsequent limited-capacity stage that performs other, more complex operations (e.g., face recognition, reading, object identification) over a limited portion of the visual field. The spatial deployment of the limited-capacity process is under attentional control. The heart of the guided search model is the idea that attentional deployment of limited resources is guided by the output of the earlier parallel processes. Guided Search 2.0 (GS2) is a revision of the model in which virtually all aspects have been made more explicit and/or revised in light of new data. The paper is organized into four parts: Part 1 presents the model and the details of its computer simulation. Part 2 reviews the visual search literature on preattentive processing of basic features and shows how the GS2 simulation reproduces those results. Part 3 reviews the literature on the attentional deployment of limited-capacity processes in conjunction and serial searches and shows how the simulation handles those conditions. Finally, Part 4 deals with shortcomings of the model and unresolved issues.
Article
Attention mechanisms extract regions of interest from image data to reduce the amount of information to be analyzed by time-consuming processes such as image transmission, robot navigation, and object recognition. Two such mechanisms are described. The first is an alerting system that extracts moving objects in a sequence using multiresolution representations. The second detects regions in still images that are likely to contain objects of interest. Two types of cues are used and integrated to compute the measure of interest. The first, bottom-up, cues result from the decomposition of the input image into a number of feature and conspicuity maps. The second type of cue is top-down, obtained from a priori knowledge about target objects represented through invariant models. Results are reported for both the alerting and attention mechanisms on cluttered and noisy scenes.
Book
Neural network research often builds on the fiction that neurons are simple linear threshold units, completely neglecting the highly dynamic and complex nature of synapses, dendrites, and voltage-dependent ionic currents. Biophysics of Computation: Information Processing in Single Neurons challenges this notion, using richly detailed experimental and theoretical findings from cellular biophysics to explain the repertoire of computational functions available to single neurons. The author shows how individual nerve cells can multiply, integrate, or delay synaptic inputs and how information can be encoded in the voltage across the membrane, in the intracellular calcium concentration, or in the timing of individual spikes. Key topics covered include the linear cable equation; cable theory as applied to passive dendritic trees and dendritic spines; chemical and electrical synapses and how to treat them from a computational point of view; nonlinear interactions of synaptic input in passive and active dendritic trees; the Hodgkin-Huxley model of action potential generation and propagation; phase space analysis; linking stochastic ionic channels to membrane-dependent currents; calcium and potassium currents and their role in information processing; the role of diffusion, buffering and binding of calcium, and other messenger systems in information processing and storage; short- and long-term models of synaptic plasticity; simplified models of single cells; stochastic aspects of neuronal firing; the nature of the neuronal code; and unconventional models of sub-cellular computation. Biophysics of Computation: Information Processing in Single Neurons serves as an ideal text for advanced undergraduate and graduate courses in cellular biophysics, computational neuroscience, and neural networks, and will appeal to students and professionals in neuroscience, electrical and computer engineering, and physics.
Article
A model for aspects of visual attention based on the concept of selective tuning is presented. It provides for a solution to the problems of selection in an image, information routing through the visual processing hierarchy and task-specific attentional bias. The central thesis is that attention acts to optimize the search procedure inherent in a solution to vision. It does so by selectively tuning the visual processing network which is accomplished by a top-down hierarchy of winner-take-all processes embedded within the visual processing pyramid. Comparisons to other major computational models of attention and to the relevant neurobiology are included in detail throughout the paper. The model has been implemented; several examples of its performance are shown. This model is a hypothesis for primate visual attention, but it also outperforms existing computational solutions for attention in machine vision and is highly appropriate to solving the problem in a robot vision system.
Article
Reliable vision-based control of an autonomous vehicle requires the ability to focus attention on the important features in an input scene. Previous work with an autonomous lane following system, ALVINN (Pomerleau, 1993), has yielded good results in uncluttered conditions. This paper presents an artificial neural network based learning approach for handling difficult scenes which will confuse the ALVINN system. This work presents a mechanism for achieving task-specific focus of attention by exploiting temporal coherence. A saliency map, which is based upon a computed expectation of the contents of the inputs in the next time step, indicates which regions of the input retina are important for performing the task. The saliency map can be used to accentuate the features which are important for the task, and de-emphasize those which are not.