Multisensory 3D Saliency for Artificial Attention Systems
September 2015
Conference: Third Workshop on Recognition and Action for Scene Understanding (REACTS 2015), 16th International Conference on Computer Analysis of Images and Patterns (CAIP)
In this paper we present a proof of concept for a novel solution consisting of a short-term 3D memory for artificial attention systems, loosely inspired by perceptual processes believed to be implemented in the human brain. Our solution supports the implementation of multisensory perception and stimulus-driven processes of attention. For this purpose, it provides (1) knowledge persistence with temporal coherence, tackling potentially salient regions outside the field of view via a panoramic, log-spherical inference grid; (2) prediction, by using estimates of local 3D velocity to anticipate the effect of scene dynamics; (3) spatial correspondence between volumetric cells potentially occupied by proto-objects and their corresponding multisensory saliency scores. Visual and auditory signals are processed to extract features that are then filtered by a proto-object segmentation module that employs colour and depth as discriminatory traits. Apart from the commonly used colour and intensity contrast, we consider colour bias, the presence of faces, scene dynamics and loud auditory sources as features. By combining conspicuity maps derived from these features we obtain a 2D saliency map, which is then processed using the probability of occupancy in the scene to construct the final 3D saliency map as an additional layer of the Bayesian Volumetric Map (BVM) inference grid.
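The following minimal sketch (Python, not the authors' code) illustrates the final fusion step described above: a 2D saliency map is back-projected onto the directions of a log-spherical occupancy grid and weighted by each cell's probability of occupancy to yield a 3D saliency layer. The grid resolution, the direction-to-pixel mapping in `cell_to_pixel` and all values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the published implementation) of the fusion step:
# a 2D saliency map is combined with the occupancy probabilities of a
# log-spherical grid to obtain a 3D saliency layer.

N_RHO, N_THETA, N_PHI = 32, 64, 32          # log-distance, azimuth, elevation bins

def cell_to_pixel(i_theta, i_phi, width=640, height=480):
    """Toy mapping from azimuth/elevation bins to image pixels.
    A real system would use the calibrated camera model instead."""
    u = int((i_theta + 0.5) / N_THETA * width)
    v = int((i_phi + 0.5) / N_PHI * height)
    return u, v

def saliency_layer(saliency_2d, occupancy):
    """Combine a 2D saliency map (H x W, values in [0, 1]) with the occupancy
    probabilities of the log-spherical grid (N_RHO x N_THETA x N_PHI)."""
    sal_3d = np.zeros_like(occupancy)
    for i_theta in range(N_THETA):
        for i_phi in range(N_PHI):
            u, v = cell_to_pixel(i_theta, i_phi,
                                 saliency_2d.shape[1], saliency_2d.shape[0])
            s = saliency_2d[v, u]
            # Every log-distance cell along this direction receives the 2D
            # saliency weighted by how likely the cell is to be occupied.
            sal_3d[:, i_theta, i_phi] = s * occupancy[:, i_theta, i_phi]
    return sal_3d

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sal_2d = rng.random((480, 640))
    occ = rng.random((N_RHO, N_THETA, N_PHI))
    print(saliency_layer(sal_2d, occ).shape)   # (32, 64, 32)
```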
... This requires implementing a representational skill, which toddlers achieve in the early stages of development [12], and which involves discovering the object that the interlocutor is referring to. One of the possible approaches discussed in the literature is to provide an internal representation of the environment that works as a short-term memory and stores important information for attention [4,8], and then to use the gaze cue to modulate the potentially attended objects. ...
... Regarding attention, the information provided by the different sensors must be subjected to a spatial correspondence [3]. For instance, a sound source should be related to its potential location of origin in order to make the corresponding attentional action possible [8]. In fact, in overt attention, where the scene is partially observed and changes depending on the actions (e.g., head movements), if we do not enable the machine to have an internal representation of the environment with temporal registration, its actions will become merely reactive for each location and angle configuration. ...
... In this work we will use the 3D log-spherical representation proposed in [4] to encode the perceived environment. The method remains valid when the saliency of the scene is modelled [8]. ...
Human gaze is one of the most important cues for social robotics due to its embedded intention information. Discovering the location or the object that an interlocutor is staring at gives the machine some insight to perform the correct attentional behaviour. This work presents a fast voxel traversal algorithm for estimating the potential locations that a human is gazing at. Given a 3D occupancy map in log-spherical coordinates and the gaze vector, we evaluate the regions that are relevant for attention by computing the set of intersected voxels between an arbitrary gaze ray in 3D space and a bounded section of the log-spherical grid. The first intersected voxel is computed in closed form and the rest are obtained by binary search, guaranteeing no repetitions in the intersected set. The proposed method is motivated and validated within a human-robot interaction application: gaze tracing for artificial attention systems.
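As a rough illustration of the idea, the sketch below collects the log-spherical cells crossed by a gaze ray by densely sampling the ray and mapping each sample to a cell index. This is a simplified stand-in for the paper's algorithm, which computes the first intersected voxel in closed form and the rest by binary search; the grid bounds, resolutions and helper names (`point_to_cell`, `gaze_traversal`) are assumptions.

```python
import numpy as np

# Simplified sketch: collect the log-spherical cells crossed by a gaze ray
# by sampling it densely and keeping unique cell indices (less efficient
# than the closed-form / binary-search scheme, but easy to follow).

RHO_MIN, RHO_MAX = 0.1, 10.0                 # metres (illustrative bounds)
N_RHO, N_THETA, N_PHI = 32, 64, 32

def point_to_cell(p):
    """Map a 3D point (robot-centred coordinates) to log-spherical indices,
    or None if it falls outside the bounded section of the grid."""
    rho = np.linalg.norm(p)
    if not (RHO_MIN <= rho < RHO_MAX):
        return None
    theta = np.arctan2(p[1], p[0])           # azimuth in (-pi, pi]
    phi = np.arcsin(p[2] / rho)              # elevation in [-pi/2, pi/2]
    i_rho = int(np.log(rho / RHO_MIN) / np.log(RHO_MAX / RHO_MIN) * N_RHO)
    i_theta = int((theta + np.pi) / (2 * np.pi) * N_THETA) % N_THETA
    i_phi = min(int((phi + np.pi / 2) / np.pi * N_PHI), N_PHI - 1)
    return i_rho, i_theta, i_phi

def gaze_traversal(origin, direction, step=0.02, max_dist=12.0):
    """Return the ordered, repetition-free list of cells crossed by the ray
    origin + t * direction (the gaze vector of the interlocutor)."""
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    cells, seen = [], set()
    for t in np.arange(0.0, max_dist, step):
        cell = point_to_cell(np.asarray(origin, dtype=float) + t * direction)
        if cell is not None and cell not in seen:
            seen.add(cell)
            cells.append(cell)
    return cells

if __name__ == "__main__":
    # Gaze of a person standing 2 m in front of the robot, looking slightly down.
    print(len(gaze_traversal(origin=[2.0, 0.0, 1.5], direction=[-1.0, 0.3, -0.4])))
```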
... The robot, binding visual (saliency map with motion) and proprioceptive (accelerometers) sensory contingencies, detects whether a pixel belongs to itself or not. This layer combines probabilistic inference grids with attentional maps [40]. ...
... On the one hand, the image stream provided by the camera is processed by a visual attention system [21], which contributes two main outputs: the saliency map with the proto-object relevance encoded in a 2D image, and a list of attended proto-objects in the working memory. The spatial saliency is computed using several conspicuity maps that represent different features of the proto-objects [40]: color and intensity contrast, optical flow and color bias. These feature maps (2D images) are combined by weighted average using a fixed attentional set (weights), which, when contextual information is available, is used for top-down modulation. ...
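A hedged sketch of this combination step: conspicuity maps are merged by a weighted average whose weights act as the attentional set, and top-down modulation simply changes those weights. The feature names and weight values below are illustrative, not the published ones.

```python
import numpy as np

# Sketch of conspicuity-map combination with an attentional set (weights).

def combine_conspicuity(maps, attentional_set=None):
    """maps: dict of feature name -> 2D array, all with the same shape.
    attentional_set: dict of feature name -> weight; uniform if omitted.
    Returns a normalised 2D saliency map."""
    names = sorted(maps)
    if attentional_set is None:
        attentional_set = {n: 1.0 for n in names}      # bottom-up default
    total = sum(attentional_set[n] for n in names)
    saliency = sum(attentional_set[n] * maps[n] for n in names) / total
    span = saliency.max() - saliency.min()
    return (saliency - saliency.min()) / span if span > 0 else saliency

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats = {k: rng.random((120, 160))
             for k in ("colour_contrast", "intensity_contrast",
                       "optical_flow", "colour_bias")}
    # Top-down modulation: contextual knowledge boosts motion, for instance.
    top_down = {"colour_contrast": 1.0, "intensity_contrast": 1.0,
                "optical_flow": 2.0, "colour_bias": 0.5}
    print(combine_conspicuity(feats, top_down).shape)   # (120, 160)
```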
We address self-perception in robots as the key to world understanding and causality interpretation. We present a self-perception mechanism that enables a humanoid robot to understand certain sensory changes caused by naive actions during interaction with objects. Visual, proprioceptive and tactile cues are combined via artificial attention and probabilistic reasoning to permit the robot to discern between inbody and outbody sources in the scene. With that support, and exploiting inter-modal sensory contingencies, the robot can infer simple concepts such as discovering potential "usable" objects. Theoretically and through experimentation with a real humanoid robot, we show how self-perception is a backdrop ability for higher-order cognitive skills. Moreover, we present a novel model for self-detection which does not need to track the body parts. Furthermore, results show that the proposed approach successfully discovers objects in the reaching space, improving scene understanding by discriminating real objects from visual artefacts.
... For instance, computing where the human is looking, where the robot should look, or which object should be grasped. Furthermore, multi-sensory and 3D saliency computation has also been investigated (Lanillos et al., 2015b). Finally, more complex attention behaviors, particularly designed for social robotics and based on human non-verbal communication, such as joint attention, have also been addressed. ...
Computational models of visual attention in artificial intelligence and robotics have been inspired by the concept of a saliency map. These models account for the mutual information between the (current) visual information and its estimated causes. However, they fail to consider the circular causality between perception and action. In other words, they do not consider where to sample next, given current beliefs. Here, we reclaim salience as an active inference process that relies on two basic principles: uncertainty minimization and rhythmic scheduling. For this, we make a distinction between attention and salience. Briefly, we associate attention with precision control, i.e., the confidence with which beliefs can be updated given sampled sensory data, and salience with uncertainty minimization that underwrites the selection of future sensory data. Using this, we propose a new account of attention based on rhythmic precision-modulation and discuss its potential in robotics, providing numerical experiments that showcase its advantages for state and noise estimation, system identification and action selection for informative path planning.
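A minimal numerical illustration (not the authors' model) of attention as precision control: the belief about a hidden state is revised by prediction errors weighted by sensory precision, so "attending" (high precision) lets observations dominate, while low precision leaves the prior belief largely unrevised. The generative model, learning rate and precision values are assumptions.

```python
import numpy as np

# Toy example: precision-weighted belief update on a noisy constant signal.

def precision_weighted_update(mu, observation, pi_sensory, pi_prior,
                              mu_prior=0.0, lr=0.1):
    """One gradient step on a Gaussian free-energy term: the prediction error
    on the observation is weighted by sensory precision, the deviation from
    the prior mean by prior precision."""
    return mu + lr * (pi_sensory * (observation - mu)
                      + pi_prior * (mu_prior - mu))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    true_state = 1.0
    for attended in (True, False):
        belief = 0.0
        pi = 4.0 if attended else 0.2          # attention modulates precision
        for _ in range(50):
            y = true_state + rng.normal(scale=0.5)
            belief = precision_weighted_update(belief, y, pi, pi_prior=1.0)
        # High precision pulls the belief towards the data; low precision
        # keeps it close to the prior mean (0.0).
        print(f"attended={attended}: belief ~= {belief:.2f}")
```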
... For instance, computing where the human is looking, where the robot should look, or which object should be grasped. Furthermore, multi-sensory and 3D saliency computation has also been investigated [Lanillos et al., 2015b]. Finally, more complex attention behaviours, particularly designed for social robotics and based on human non-verbal communication, such as joint attention, have also been addressed. ...
Computational models of visual attention in artificial intelligence and robotics have been inspired by the concept of a saliency map. These models account for the mutual information between the (current) visual information and its estimated causes. However, they fail to consider the circular causality between perception and action. In other words, they do not consider where to sample next, given current beliefs. Here, we reclaim salience as an active inference process that relies on two basic principles: uncertainty minimisation and rhythmic scheduling. For this, we make a distinction between attention and salience. Briefly, we associate attention with precision control, i.e., the confidence with which beliefs can be updated given sampled sensory data, and salience with uncertainty minimisation that underwrites the selection of future sensory data. Using this, we propose a new account of attention based on rhythmic precision-modulation and discuss its potential in robotics, providing numerical experiments that showcase advantages of precision-modulation for state and noise estimation, system identification and action selection for informative path planning.
... Results showed that the robot was able to discern between inbody and outbody sources without using markers or simplified segmentation. Figure 1 shows the proto-object saliency system [4] used as visual input and the computed probability of the image regions belonging to the robot arm. Body perception was formalized as an inference problem while the robot was interacting with the world. ...
Artificial self-perception is the machine's ability to perceive its own body, i.e., the mastery of the modal and intermodal contingencies of performing an action with a specific sensor/actuator body configuration. In other words, the spatio-temporal patterns that relate its sensors (e.g. visual, proprioceptive, tactile, etc.), its actions and its body latent variables are responsible for the distinction between its own body and the rest of the world. This paper describes some of the latest approaches for modelling artificial body self-perception: from Bayesian estimation to deep learning. Results show the potential of these model-free unsupervised or semi-supervised crossmodal/intermodal learning approaches. However, there are still challenges that should be overcome before we achieve artificial multisensory body perception.
... 3D Attention. Much work has been done on 3D human attention [Sugano et al., 2014; Jeni and Cohn, 2016; Funes-Mora and Odobez, 2016; Mansouryar et al., 2016; Chen et al., 2008; Lanillos et al., 2015]. Funes-Mora and Odobez [Funes-Mora and Odobez, 2016] estimated gaze directions based on head poses and eye images. ...
This paper addresses the problem of inferring 3D human attention in RGB-D videos at scene scale. 3D human attention describes where a human is looking in 3D scenes. We propose a probabilistic method to jointly model attention, intentions, and their interactions. Latent intentions guide human attention which conversely reveals the intention features. This mutual interaction makes attention inference a joint optimization with latent intentions. An EM-based approach is adopted to learn the latent intentions and model parameters. Given an RGB-D video with 3D human skeletons, a joint-state dynamic programming algorithm is utilized to jointly infer the latent intentions, the 3D attention directions, and the attention voxels in scene point clouds. Experiments on a new 3D human attention dataset prove the strength of our method.
... The robot, binding visual (saliency map with motion) and proprioceptive (accelerometers) cues, detects whether a pixel belongs to itself or not. This layer combines probabilistic inference grids with attentional maps [14]. To avoid tracking the robot's parts, first-order dynamics (velocities) are learnt online. ...
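The sketch below illustrates, under strong simplifying assumptions, the sensory-contingency idea in this excerpt: pixels whose motion-saliency signal co-varies over time with the robot's proprioceptive velocity receive a higher probability of belonging to the body. The correlation measure and the synthetic data are illustrative, not the published model.

```python
import numpy as np

# Toy self-detection: correlate per-pixel motion saliency with a
# proprioceptive speed signal to estimate which pixels are "inbody".

def inbody_probability(motion_saliency, joint_speed, eps=1e-9):
    """motion_saliency: T x H x W stack of per-pixel motion saliency values.
    joint_speed: length-T proprioceptive speed signal (e.g. from encoders or
    accelerometers). Returns an H x W map in [0, 1] derived from the temporal
    correlation of each pixel with the proprioceptive signal."""
    T = motion_saliency.shape[0]
    v = (joint_speed - joint_speed.mean()) / (joint_speed.std() + eps)
    m = motion_saliency - motion_saliency.mean(axis=0)
    m /= (motion_saliency.std(axis=0) + eps)
    corr = np.tensordot(v, m, axes=(0, 0)) / T          # Pearson-like correlation
    return np.clip((corr + 1.0) / 2.0, 0.0, 1.0)        # map [-1, 1] -> [0, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    T, H, W = 60, 48, 64
    speed = np.abs(np.sin(np.linspace(0, 6, T))) + 0.05 * rng.random(T)
    frames = 0.1 * rng.random((T, H, W))
    frames[:, 10:20, 10:20] += speed[:, None, None]      # "arm" region moves with the joints
    p = inbody_probability(frames, speed)
    print(p[15, 15] > p[40, 50])                         # arm pixel > background pixel -> True
```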
We address self-perception and object discovery by integrating multimodal tactile, proprioceptive and visual cues. Considering sensory signals as the only way to obtain relevant information about the environment, we enable a humanoid robot to infer potentially usable objects by relating visual self-detection with tactile cues. Hierarchical Bayesian models are combined with signal processing and proto-object artificial attention to tackle the problem. Results show that the robot is able to: (1) discern between inbody and outbody sources without using markers or simplified segmentation; (2) accurately discover objects in the reaching space; and (3) discriminate real objects from visual artefacts, aiding scene understanding. Furthermore, this approach reveals the need for several layers of abstraction for achieving agency and causality, due to the inherent ambiguity of the sensory cues.
... Applications in education are envisioned, for example helping engineering students to understand conventions in technical drawing. Other applications in computer vision would also be interesting; for example, when computing 3D attentional saliency of proto-objects [42], a Q3D description could help to store a short-term memory narrative of the evolution of the proto-objects in these artificial attention systems. ...
Given the amount and variety of saliency models, knowledge of their pros and cons, of the applications they are best suited for, or of which scenes are most challenging for each of them would be very useful for progress in the field. This assessment can be made based on the link between algorithms and public datasets. On the one hand, performance scores of algorithms can be used to cluster video samples according to the pattern of difficulties they pose to models. On the other hand, cluster labels can be combined with video annotations to select discriminant attributes for each cluster. In this work we seek this link and try to describe each cluster of videos in a few words.
General dynamic scenes involve multiple rigid and flexible objects, with relative and common motion, camera induced or not. The complexity of the motion events, together with their strong spatio-temporal correlations, makes the estimation of dynamic visual saliency a major computational challenge. In this work, we propose a computational model of saliency based on the assumption that perceptually relevant information is carried by high-order statistical structures. Through whitening, we completely remove the second-order information (correlations and variances) of the data, gaining access to the relevant information. The proposed approach is an analytically tractable and computationally simple framework which we call Dynamic Adaptive Whitening Saliency (AWS-D). For model assessment, the provided saliency maps were used to predict the fixations of human observers over six public video datasets, and also to reproduce human behavior under certain psychophysical experiments (dynamic pop-out). The results demonstrate that AWS-D outperforms state-of-the-art dynamic saliency models, and suggest that the model might contain the basis to understand the key mechanisms of visual saliency. Experimental evaluation was performed using an extension to video of the well-known methodology for static images, together with a bootstrap permutation test (random label hypothesis), which yields additional information about the temporal evolution of the metrics' statistical significance.
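For reference, here is a small sketch of the whitening operation the model builds on: removing variances and correlations (second-order structure) from a set of feature samples, here via plain ZCA whitening. This is only the generic transform, not the adaptive, hierarchical pipeline of AWS-D.

```python
import numpy as np

# ZCA whitening: after the transform, the sample covariance is ~ identity,
# so only higher-order statistical structure remains.

def zca_whiten(X, eps=1e-5):
    """X: N x D matrix of feature samples. Returns whitened samples whose
    covariance is (approximately) the identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    # Correlated 3D samples produced by mixing independent uniforms.
    X = rng.random((1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
    Xw = zca_whiten(X)
    print(np.round(np.cov(Xw, rowvar=False), 2))   # ~ identity matrix
```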
Although our sensory experience is mostly multisensory in nature, research on working memory representations has focused mainly on examining the senses in isolation. Results from the multisensory processing literature make it clear that the senses interact in a more intimate manner than previously assumed. These interactions raise questions regarding the manner in which multisensory information is maintained in working memory. We discuss the current status of research on multisensory processing and the implications of these findings for our theoretical understanding of working memory. To do so, we focus on reviewing working memory research conducted from a multisensory perspective, and discuss the relation between working memory, attention, and multisensory processing in the context of the predictive coding framework. We argue that a multisensory approach to the study of working memory is indispensable to achieve a realistic understanding of how working memory processes maintain and manipulate information.
The color red is known to influence psychological functioning, having both negative (e.g., blood, fire, danger) and positive (e.g., sex, food) connotations. The aim of our study was to assess the attentional capture by red-colored images, and to explore the modulatory role of the emotional valence in this process, as postulated by Elliot and Maier (2012) color-in-context theory. Participants completed a dot-probe task with each cue comprising two images of equal valence and arousal, one containing a prominent red object and the other an object of different coloration. Reaction times were measured, as well as the event-related lateralizations of the EEG. Modulation of the lateralized components revealed that the color red captured and later held the attention in both positive and negative conditions, but not in a neutral condition. An overt motor response to the target stimulus was affected mainly by attention lingering over the visual field where the red cue had been flashed. However, a weak influence of the valence could still be detected in reaction times. Therefore, red seems to guide attention, specifically in emotionally-valenced circumstances, indicating that an emotional context can alter color’s impact both on attention and motor behavior.
The human ability of unconsciously attending to social signals, together with other even more primitive automatic attentional processes, has been argued in the literature to play an important part in social interaction. In this paper, we will argue that the evaluation of the influence of these unconscious perceptual processes in social interaction with robots has been addressed in previous research in many cases in an ad hoc fashion, while, on the contrary, it should be tackled systematically, bridging more conventional measures from robotics with criteria stemming from ideas used in human studies in psychology, neuroscience and the social sciences. We will start by establishing an experimental canvas that will limit complexity to a sustainable level, while still fostering adaptive behaviour and variability in interaction. We will then present a brief assessment of the criteria used in the HRI literature to evaluate success in this particular type of experiment, followed by a suggestion for adapting other criteria used in human studies, which has only been sporadically and non-systematically performed in HRI research, in most cases more as an expression of future intent. We will conclude by proposing a methodology for this evaluation, to be applied in the project "Coordinated Attention for Social Interaction with Robots" sponsored by the Portuguese Foundation for Science and Technology (FCT).
Artificial vision systems cannot process all the information that they receive from the world in real time because doing so is highly expensive and inefficient in terms of computational cost. Inspired by biological perception systems, artificial attention models aim to select only the relevant part of the scene. In human vision, it is also well established that these units of attention are not merely spatial but closely related to perceptual objects (proto-objects). This implies a strong bidirectional relationship between segmentation and attention processes. While the segmentation process is responsible for extracting the proto-objects from the scene, attention can guide segmentation, giving rise to the concept of foveal attention. When the focus of attention is deployed from one visual unit to another, the rest of the scene is perceived, but at a lower resolution than the focused object. The result is a multi-resolution visual perception in which the fovea, a dimple on the central retina, provides the highest resolution vision. In this paper, a bottom-up foveal attention model is presented. In this model the input image is a foveal image represented using a Cartesian Foveal Geometry (CFG), which encodes the field of view of the sensor as a fovea (placed at the focus of attention) surrounded by a set of concentric rings with decreasing resolution. Multi-resolution perceptual segmentation is then performed by building a foveal polygon using the Bounded Irregular Pyramid (BIP). Bottom-up attention is enclosed in the same structure, allowing the fovea to be set over the most salient image proto-object. Saliency is computed as a linear combination of multiple low-level features such as color and intensity contrast, symmetry, orientation and roundness. Results obtained from natural images show that the combination of hierarchical foveal segmentation and saliency estimation performs well in terms of accuracy and speed.
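A toy sketch of the multi-resolution idea behind the Cartesian Foveal Geometry: full resolution is kept around the focus of attention, while concentric rings are represented increasingly coarsely, here by simple block averaging rather than the Bounded Irregular Pyramid used in the paper. Ring widths and block sizes are arbitrary assumptions.

```python
import numpy as np

# Toy foveation: ring i around the focus of attention is block-averaged
# with block size levels[i] (1 = full resolution at the fovea).

def block_average(img, k):
    """Downsample-then-upsample a 2D image by averaging k x k blocks."""
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    small = img[:h, :w].reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    out = img.astype(float).copy()
    out[:h, :w] = np.kron(small, np.ones((k, k)))
    return out

def foveate(img, cy, cx, ring_width=40, levels=(1, 2, 4, 8)):
    """Compose a foveal image: level i resolution applies to ring i around
    the focus of attention (cy, cx)."""
    yy, xx = np.mgrid[:img.shape[0], :img.shape[1]]
    ring = np.minimum(np.maximum(np.abs(yy - cy), np.abs(xx - cx)) // ring_width,
                      len(levels) - 1)
    coarse = [block_average(img, k) for k in levels]
    out = np.zeros(img.shape, dtype=float)
    for i, c in enumerate(coarse):
        out[ring == i] = c[ring == i]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    image = rng.random((240, 320))
    print(foveate(image, cy=120, cx=160).shape)   # (240, 320)
```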
This review intends to provide an overview of the state of the art in the modeling and implementation of automatic attentional mechanisms for socially interactive robots. Humans assess and exhibit intentionality by resorting to multisensory processes that are deeply rooted within low-level automatic attention-related mechanisms of the brain. For robots to engage with humans properly, they should also be equipped with similar capabilities. Joint attention, the precursor of many fundamental types of social interactions, has been an important focus of research in the past decade and a half, therefore providing the perfect backdrop for assessing the current status of state-of-the-art automatic attentional-based solutions. Consequently, we propose to review the influence of these mechanisms in the context of social interaction in cutting-edge research work on joint attention. This will be achieved by summarizing the contributions already made in these matters in robotic cognitive systems research, by identifying the main scientific issues to be addressed by these contributions and analyzing how successful they have been in this respect, and by consequently drawing conclusions that may suggest a roadmap for future successful research efforts.
State-of-the-art bottom-up saliency models often assign high saliency values at or near high-contrast edges, whereas people tend to look within the regions delineated by those edges, namely the objects. To resolve this inconsistency, in this work we estimate saliency at the level of coherent image regions. According to object-based attention theory, the human brain groups similar pixels into coherent regions, which are called proto-objects. The saliency of these proto-objects is estimated and combined. As usual, attention is given to the most salient image regions. In this paper we employ state-of-the-art computer vision techniques to implement a proto-object-based model for visual attention. In particular, a hierarchical image segmentation algorithm is used to extract proto-objects. The two most powerful ways to estimate saliency, rarity-based and contrast-based saliency, are generalized to assess saliency at the proto-object level. The rarity-based saliency assesses whether the proto-object contains rare or outstanding details. The contrast-based saliency estimates how much the proto-object differs from its surroundings. However, not all image regions with high contrast to their surroundings attract human attention. We take this into account by distinguishing between external and internal contrast-based saliency. Whereas the external contrast-based saliency estimates the difference between the proto-object and the rest of the image, the internal contrast-based saliency estimates the complexity of the proto-object itself. We evaluate the performance of the proposed method and its components on two challenging eye-fixation datasets (Judd, Ehinger, Durand, & Torralba, 2009; Subramanian, Katti, Sebe, Kankanhalli, & Chua, 2010). The results show the importance of rarity-based and of both external and internal contrast-based saliency in fixation prediction. Moreover, the comparison with state-of-the-art computational models for visual saliency demonstrates the advantage of proto-objects as units of analysis.
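A compact sketch of the external/internal contrast distinction on segmented proto-objects, using per-region mean and standard deviation of colour as stand-in features; the actual descriptors and the segmentation algorithm of the paper are more elaborate.

```python
import numpy as np

# External contrast: difference between a proto-object and the rest of the
# image. Internal contrast: heterogeneity within the proto-object itself.

def contrast_saliency(image, labels):
    """image: H x W x 3 array; labels: H x W integer proto-object ids.
    Returns dicts of external and internal contrast-based saliency per id."""
    external, internal = {}, {}
    for pid in np.unique(labels):
        inside = image[labels == pid]                  # pixels of this proto-object
        outside = image[labels != pid]                 # rest of the image
        external[pid] = float(np.linalg.norm(inside.mean(axis=0)
                                             - outside.mean(axis=0)))
        internal[pid] = float(inside.std(axis=0).mean())
    return external, internal

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    img = rng.random((100, 100, 3)) * 0.1              # dark background
    seg = np.zeros((100, 100), dtype=int)
    seg[20:50, 20:50] = 1
    seg[60:90, 60:90] = 2
    img[seg == 1] += 0.8                               # bright, uniform proto-object
    img[seg == 2] = rng.random((900, 3))               # textured, mid-grey proto-object
    ext, intr = contrast_saliency(img, seg)
    # The bright patch has higher external contrast; the textured patch
    # has higher internal contrast.
    print(ext[1] > ext[2], intr[2] > intr[1])          # True True
```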
The idea of two separate attention networks in the human brain for the voluntary deployment of attention and the reorientation to unexpected events, respectively, has inspired an enormous amount of research over the past years. In this review, we will reconcile these theoretical ideas on the dorsal and ventral attentional system with recent empirical findings from human neuroimaging experiments and studies in stroke patients. We will highlight how novel methods, such as the analysis of effective connectivity or the combination of neurostimulation with functional magnetic resonance imaging, have contributed to our understanding of the functionality and interaction of the two systems. We conclude that neither of the two networks controls attentional processes in isolation and that the flexible interaction between both systems enables the dynamic control of attention in relation to top-down goals and bottom-up sensory stimulation. We discuss which brain regions potentially govern this interaction according to current task demands.
The ability to share attention with another individual is essential for intuitive interaction. Two relatively simple but important prerequisites for this, saliency detection and attention manipulation by the robot, are identified in the first part of the paper. By creating a saliency-based attentional model combined with a robot ego-sphere and by adopting attention manipulation skills, the robot can engage in an interaction with a human and start an interaction game including objects as a first step towards joint attention.
We set up an interaction experiment in which participants could physically interact with a humanoid robot equipped with mechanisms for saliency detection and attention manipulation. We tested our implementation in four combinations of activated parts of the attention system, which resulted in four different behaviours.
Our aim was to identify those physical and behavioural characteristics that need to be emphasised when implementing attentive mechanisms in robots, and to measure the user experience when interacting with a robot equipped with attentive mechanisms.
We adopted two techniques for evaluating saliency detection and attention manipulation mechanisms in human-robot interaction: user experience as measured by qualitative and quantitative questions in questionnaires and proxemics estimated from recorded videos of the interactions.
The robot’s level of interactiveness has been found to be positively correlated with user experience factors like excitement and robot factors like lifelikeness and intelligence, suggesting that robots must give as much feedback as possible in order to increase the intuitiveness of the interaction, even when performing only attentive behaviours. This was also confirmed by the proxemics analysis: participants reacted more frenetically when the interaction was perceived as less satisfying. Improving the robot’s feedback capability could increase user satisfaction and decrease the probability of unexpected or incomprehensible user movements. Finally, multi-modal interaction (through arm and head movements) increased the level of interactiveness perceived by participants. A positive correlation has been found between the elegance of robot movements and user satisfaction.
In this text, we present a Bayesian framework for active multimodal perception of 3D structure and motion. The design of this framework finds its inspiration in the role of the dorsal perceptual pathway of the human brain. Its composing models build upon a common egocentric spatial configuration that is naturally fitting for the integration of readings from multiple sensors using a Bayesian approach. In the process, we will contribute efficient and robust probabilistic solutions for cyclopean geometry-based stereovision and auditory perception based only on binaural cues, modelled using a consistent formalisation that allows their hierarchical use as building blocks for the multimodal sensor fusion framework. We will explicitly or implicitly address the most important challenges of sensor fusion using this framework, for vision, audition and vestibular sensing. Moreover, interaction and navigation require maximal awareness of spatial surroundings, which in turn is obtained through active attentional and behavioural exploration of the environment. The computational models described in this text will support the construction of a simultaneously flexible and powerful robotic implementation of multimodal active perception to be used in real-world applications, such as human-machine interaction or mobile robot navigation.