Conference Paper

Joint visual attention modeling for naturally interacting robotic agents

Abstract

This paper elaborates on mechanisms for establishing visual joint attention in the design of robotic agents that learn through natural interfaces, following a developmental trajectory not unlike that of infants. We first describe the evolution of cognitive skills in infants and then the adaptation of cognitive development patterns to robotic design. A comprehensive overview of cognitively inspired robotic design schemes pertaining to joint attention over the last decade is presented, with particular emphasis on practical implementation issues. A novel cognitively inspired joint attention fixation mechanism is defined for robotic agents.
... Yücel, Salah, Meriçli, and Meriçli and colleagues [83,84] presented a cognitively-inspired, virtually task-independent gaze-following/fixation and object segmentation mechanism for robotic joint attention. In their first approach [83], the head pose of the caregiver was determined by the proposed system and gaze direction estimated from it. At the same time, the depth of the object along the direction of gaze would be inferred from head orientation. ...
Article
Full-text available
This review intends to provide an overview of the state of the art in the modeling and implementation of automatic attentional mechanisms for socially interactive robots. Humans assess and exhibit intentionality by resorting to multisensory processes that are deeply rooted within low-level automatic attention-related mechanisms of the brain. For robots to engage with humans properly, they should also be equipped with similar capabilities. Joint attention, the precursor of many fundamental types of social interaction, has been an important focus of research in the past decade and a half, therefore providing the perfect backdrop for assessing the current status of state-of-the-art automatic attention-based solutions. Consequently, we propose to review the influence of these mechanisms in the context of social interaction in cutting-edge research work on joint attention. This will be achieved by summarizing the contributions already made in these matters in robotic cognitive systems research, by identifying the main scientific issues to be addressed by these contributions and analyzing how successful they have been in this respect, and by consequently drawing conclusions that may suggest a roadmap for future successful research efforts.
... In recent years, sophisticated algorithms have been developed in which gaze direction guides the identification of objects of interest [235,236,237,238,239]. For instance, in the context of robot learning, Yucel et al. [237] use a joint attention algorithm that works as follows: the instructor's face is first detected using Haar-like features. ...
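The face-detection front end mentioned in this snippet (Haar-like features) could be sketched roughly as follows with OpenCV's stock cascade; the specific cascade file, detection parameters, and the largest-face heuristic are illustrative assumptions, not details from the cited work.

```python
# Hypothetical sketch of a Haar-cascade face-detection front end.
import cv2

def detect_instructor_face(frame_bgr):
    """Return the largest detected face as (x, y, w, h), or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=5, minSize=(60, 60))
    if len(faces) == 0:
        return None
    # Keep the largest face, assuming the instructor is closest to the camera.
    return max(faces, key=lambda f: f[2] * f[3])

# Usage: crop the returned region and hand it to a head-pose tracker.
# face = detect_instructor_face(cv2.imread("frame.png"))
```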
Article
Full-text available
Today, one of the major challenges that autonomous vehicles face is the ability to drive in urban environments. Such a task requires communication between autonomous vehicles and other road users in order to resolve various traffic ambiguities. The interaction between road users is a form of negotiation in which the parties involved have to share their attention regarding a common objective or goal (e.g. crossing an intersection), and coordinate their actions in order to accomplish it. In this literature review we aim to address the interaction problem between pedestrians and drivers (or vehicles) from a joint attention point of view. More specifically, we will discuss the theoretical background behind joint attention, its application to traffic interaction, and practical approaches to implementing joint attention for autonomous vehicles.
... The system can now control the robot to pick up the cup on the table, and not the one on the mantelpiece. Yücel and Salah (2009) proposed a method for establishing joint attention between a human and a robot. A more advanced application that requires the robot to establish and maintain a conversation is turn taking during a conversation with its human partner (Kendon, 1967). ...
Article
This chapter presents an overview of a typical scenario of Ambient Assisted Living (AAL) in which a robot navigates to a person for conveying information. Indoor robot navigation is a challenging task due to the complexity of real-home environments and the need for online learning abilities to adjust to dynamic conditions. A comparison between systems with different sensor typologies shows that vision-based systems promise to provide good performance and a wide scope of usage at reasonable cost. Moreover, vision-based systems can perform different tasks simultaneously by applying different algorithms to the input data stream, thus enhancing the flexibility of the system. The authors introduce the state of the art of several computer vision methods for realizing indoor robotic navigation to a person and human-robot interaction. A case study has been conducted in which a robot, which is part of an AAL system, navigates to a person and interacts with her. The authors evaluate this test case and give an outlook on the potential of learning robot vision in ambient homes.
... From our point of view, this module should be learned through interaction. The model developed by (Yucel et al., 2009) implements a relatively effective approach, consisting of integrating robust image-processing algorithms such as head pose estimation and gaze direction estimation. Other authors, such as (Marin-Urias et al., 2009; Marin-Urias et al., 2008; Sisbot et al., 2007), have focused on important shared-attention capabilities known as "mental rotation" and "perspective taking". ...
Article
Full-text available
My thesis focuses on emotional interaction in autonomous robotics. The robot must be able to act and react in a natural environment and cope with unpredictable perturbations. It is necessary that the robot acquire behavioral autonomy, that is, the ability to learn and adapt online. In particular, we propose to study which mechanisms must be introduced so that the robot can perceive objects in the environment and, in addition, share them with an experimenter. The problem is to teach the robot to prefer certain objects and avoid others. A solution can be found in psychology, in social referencing. This ability allows a value to be associated with an object through emotional interaction with a human partner. In this context, our problem is how a robot can autonomously learn to recognize the facial expressions of a human partner and then use them to give an emotional valence to objects and allow their discrimination. We focus on understanding how emotional interaction with a partner can bootstrap behaviors of increasing complexity such as social referencing. Our idea is that social referencing, as well as the recognition of facial expressions, can emerge from a sensorimotor architecture. We support the idea that social referencing may be initiated by a simple cascade of sensorimotor architectures which are not dedicated to social interactions. My thesis addresses several topics that have a common denominator: social interaction. We first propose an architecture that learns to recognize facial expressions through an imitation game between an expressive robotic head and an experimenter. The robotic head begins by learning five prototypical facial expressions. We then propose an architecture that can reproduce facial expressions and their different levels of intensity; the robotic head can reproduce more advanced expressions, for instance joy mixed with anger. We also show that face detection can emerge from this emotional interaction, thanks to an implicit rhythm that is created between the human partner and the robot. Finally, we propose a sensorimotor model able to achieve social referencing. Three situations have been tested: 1) a robotic arm learns to catch or avoid objects according to the emotional interaction with the human partner; 2) a mobile robot learns to reach or avoid certain areas of its environment; 3) an expressive head orients its gaze in the same direction as the human and, in addition, associates emotional values to objects according to the facial expressions of the experimenter. We show that a developmental sequence can emerge from emotional interaction and that social referencing can be explained at a sensorimotor level without needing to use a theory-of-mind model.
Chapter
This chapter proposes a taxonomy to classify the real-life applications which can benefit from the use of attention models. There are numerous applications, and we try here to remain as exhaustive as possible to provide a picture of all the applications of saliency models, but also to detect where future developments might be of interest. The applications are grouped into three categories. The first one uses the detection of the most important regions in an image and contains applications such as video surveillance, audio surveillance, defect detection, pathology detection, expressive and social gestures, computer graphics and quality metrics. The second category uses saliency maps to detect the regions which are the least interesting in an image. Here one can find applications like texture metrics, compression, retargeting, summarization, watermarking and attention-based ad insertion. Finally, a third category uses the most interesting areas in an image with further processing, like comparisons between those areas. In this category one can find image registration and landmarks, object recognition, action guidance in robotics or avatars, website optimization, image memorability, best viewpoint, symmetries and automatic focus on images.
Article
Computer vision is essential to developing a social robotic system capable of interacting with humans: it is responsible for extracting and representing the information around the robot. Furthermore, a learning mechanism to correctly select an action to be executed in the environment, a pro-active mechanism to engage in an interaction, and a voice mechanism are indispensable for a social robot. Together, these mechanisms allow a robot to emulate human behaviors such as shared attention. This chapter presents a robotic architecture composed of such mechanisms, making possible interactions between a robotic head and a caregiver through shared-attention learning with identification of some objects.
Conference Paper
Currently, work in robotics is expanding from industrial robots to robots employed in the living environment. For robots to be accepted into the real world, they must be capable of behaving the way humans do with other humans. This paper focuses on designing a robotic framework to interact with multiple humans in a natural and social way. To evaluate the robotic framework, we conducted an experiment on an important social function in any conversation: initiating interaction with the target human in a multiparty setting. Results show that the proposed robotic system succeeds in initiating an interaction process in four viewing situations.
Conference Paper
It is a major challenge in HRI to design a robotic agent that is able to direct its partner's attention from his/her existing attentional focus towards an intended direction. For this purpose, the agent may first turn its gaze to him/her in order to set up eye contact. However, such a turning action may not in itself be sufficient to establish eye contact with the partner in all cases, especially when the agent and its partner are not facing each other or the partner is intensely engaged in a task. This paper focuses on designing a robotic framework to shift the target human's attentional focus, out of multiple humans, toward the robot's intended direction. For this purpose, we propose a conceptual framework with three phases: capturing attention, making eye contact, and shifting attention. We conducted an experiment to validate our model in HRI scenarios in which two participants interacted in a session at a time, one as a target and the other as a non-target. Experimental results with twenty participants show the effectiveness of the proposed framework.
Article
Controlling someone's attention can be defined as shifting his/her attention from the existing direction to another. To shift someone's attention, gaining attention and meeting gaze are the two most important prerequisites. If a robot would like to communicate with a particular person, it should turn its gaze to him/her for eye contact. However, it is not an easy task for the robot to make eye contact, because such a turning action alone may not be effective in all situations, especially when the robot and the human are not facing each other or the human is intensely attending to his/her task. Therefore, the robot should perform some actions so that it can attract the target person and make him/her respond to the robot to meet gaze. In this paper, we present a robot that can attract a target person's attention by moving its head, make eye contact by showing gaze awareness through blinking its eyes, and direct his/her attention by repeating its eye and head turns from the person to the target object. Experiments using 20 human participants confirm the effectiveness of the robot's actions in controlling human attention.
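A minimal sketch of how the three-phase attention-control loop described in these two abstracts (capture attention, make eye contact, shift attention) might be organized is given below. The `robot` interface (turn_head_to, blink, is_looking_at_robot, is_looking_at) and the fixed number of head/eye turns are hypothetical assumptions; neither abstract specifies an API.

```python
# Toy state machine for attention capture -> eye contact -> attention shifting.
from enum import Enum, auto
import time

class Phase(Enum):
    CAPTURE = auto()
    EYE_CONTACT = auto()
    SHIFT = auto()

def control_attention(robot, person, target_object, turns=3, max_steps=100):
    phase = Phase.CAPTURE
    for _ in range(max_steps):
        if phase is Phase.CAPTURE:
            robot.turn_head_to(person)            # head movement attracts attention
            if robot.is_looking_at_robot(person):
                phase = Phase.EYE_CONTACT
        elif phase is Phase.EYE_CONTACT:
            robot.blink()                         # blinking signals gaze awareness
            phase = Phase.SHIFT
        else:                                     # Phase.SHIFT
            for _ in range(turns):                # repeated head/eye turns toward the target
                robot.turn_head_to(target_object)
                time.sleep(1.0)
                robot.turn_head_to(person)
                time.sleep(1.0)
            return robot.is_looking_at(person, target_object)
    return False
```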
Article
Full-text available
To designate this new field, we use the term epigenesis introduced in the field of psychology by Jean Piaget, the great 20th century developmentalist. The term was used to refer to such development, determined primarily by the interaction between the organism and the environment, rather than by genes. However, we believe that Piaget's emphasis on the importance of sensorimotor interaction needs to be complemented with what is just as (and perhaps more) important for development: social interaction, as emphasized by Lev Vygotsky, another important figure of 20th century psychology. In the emerging field of Epigenetic Robotics, the interests of psychologists and roboticists meet. The former are in a position to provide the detailed empirical findings and theoretical generalizations that can guide the implementations of robotic systems capable of cognitive (including behavioral and social) development. Conversely, these implementations can help clarify, evaluate, and even develop psychological theories, which due to the complexity of the interactional processes involved have hitherto remained somewhat speculative. With this in mind, we invited the submission of papers to the First International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, sponsored by the Communications Research Laboratory, Japan, and held in Lund, Sweden on September 17-18, 2001. Our vision was that this workshop would serve as a forum for sharing and discussing theoretical frameworks, methodologies, results and problems in this new interdisciplinary field. We were most pleased to receive many original contributions. After reviewing, we could accept for presentation the 16 papers published in this volume together with those of three invited speakers: the animal psychologist Irene Pepperberg, the child psychologist Chris Sinha, and the 'robot psychologist' Tom Ziemke. In the following paragraphs, we briefly present these 19 papers, relating them to the theme of the workshop.
Article
Full-text available
Head pose and eye location estimation are two closely related problems that serve similar application areas. In recent years, these problems have been studied individually in numerous works in the literature. Previous research shows that cylindrical head models and isophote-based schemes provide satisfactory precision in head pose and eye location estimation, respectively. However, the eye locator alone cannot accurately locate the eyes in the presence of extreme head poses; head pose cues may therefore be suited to enhance the accuracy of eye localization under severe head poses. In this paper, a hybrid scheme is proposed in which the transformation matrix obtained from the head pose is used to normalize the eye regions and, in turn, the transformation matrix generated by the found eye location is used to correct the pose estimation procedure. The scheme is designed to (1) enhance the accuracy of eye location estimation in low-resolution videos, (2) extend the operating range of the eye locator, and (3) improve the accuracy and re-initialization capabilities of the pose tracker. From the experimental results it can be derived that the proposed unified scheme improves the accuracy of eye location estimation by 16% to 23%. Further, it considerably extends the operating range by more than 15°, overcoming the problems introduced by extreme head poses. Finally, the accuracy of the head pose tracker is improved by 12% to 24%.
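One way to read the mutual-correction idea in this abstract is as a loop in which the pose transform rectifies the eye regions before eye localization, and the located eyes are fed back to stabilize the pose tracker. The sketch below is an assumption-laden illustration: `pose_tracker` and `eye_locator` are placeholders for the cylindrical-model tracker and isophote-based locator, and the crude affine frontalization stands in for the paper's transformation matrices.

```python
# Rough sketch of mutual correction between a pose tracker and an eye locator.
import numpy as np
import cv2

def hybrid_pose_and_eyes(frame, pose_tracker, eye_locator):
    # 1. Head pose from the (placeholder) cylindrical-model tracker: rotation + translation.
    R, t = pose_tracker.update(frame)

    # 2. Use the pose to warp the image toward a frontal view so the eye locator
    #    operates closer to its comfortable range (crude affine from the rotation).
    A = np.hstack([np.asarray(R[:2, :2], dtype=np.float32),
                   np.zeros((2, 1), dtype=np.float32)])
    frontalized = cv2.warpAffine(frame, A, (frame.shape[1], frame.shape[0]))
    eyes = eye_locator.locate(frontalized)

    # 3. Feed the eye positions back as landmarks to correct / re-initialize
    #    the pose tracker when it drifts.
    pose_tracker.correct_with_landmarks(eyes)
    return (R, t), eyes
```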
Article
Full-text available
Imitation is a powerful capability of infants, relevant for bootstrapping many cognitive capabilities like communication, language and learning under supervision. In infants, this skill relies on establishing a joint attentional link with the teaching party. In this work we propose a method for establishing joint attention between an experimenter and an embodied agent. The agent first estimates the head pose of the experimenter, based on tracking with a cylindrical head model. Then two separate neural network regressors are used to interpolate the gaze direction and the target object depth from the computed head pose estimates. A bottom-up feature-based saliency model is used to select and attend to objects in a restricted visual field indicated by the gaze direction. We demonstrate our system on a number of recordings where the experimenter selects and attends to an object among several alternatives. Our results suggest that rapid gaze estimation can be achieved for establishing joint attention in interaction-driven robot training, which is a very promising testbed for hypotheses of cognitive development and the genesis of visual communication.
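Assuming off-the-shelf components, the pipeline in this abstract (head pose, gaze/depth regressors, saliency within a restricted visual field) might be prototyped roughly as below. The scikit-learn regressors, the angle-to-pixel mapping, and the fixed-width gaze band are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a head-pose -> gaze/depth -> restricted-saliency joint attention pipeline.
import numpy as np
from sklearn.neural_network import MLPRegressor

# Two separate regressors, as in the abstract: head pose -> gaze direction,
# and head pose -> target object depth. Training data would come from
# recorded interaction sessions.
gaze_net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
depth_net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)

def train(pose_samples, gaze_angles, depths):
    gaze_net.fit(pose_samples, gaze_angles)
    depth_net.fit(pose_samples, depths)

def attend(pose, saliency_map, img_width):
    """Return the attended pixel and a coarse depth, restricted by the gaze cone."""
    gaze = float(gaze_net.predict([pose])[0])    # e.g. horizontal gaze angle (rad)
    depth = float(depth_net.predict([pose])[0])  # coarse distance to the target
    # Map the gaze angle to an image column (assumed linear mapping) and keep
    # only a band of the saliency map around that column.
    center = int(np.clip(0.5 + gaze / np.pi, 0.0, 1.0) * (img_width - 1))
    lo, hi = max(0, center - 40), min(img_width, center + 40)
    band = saliency_map[:, lo:hi]
    y, x = np.unravel_index(np.argmax(band), band.shape)
    return (y, lo + x), depth
```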
Article
Full-text available
The ability to imitate others enables human infants to acquire various social and cognitive capabilities. Joint attention is regarded as a behavior that can be derived from imitation. In this paper, the developmental relationship between imitation and joint attention, and the role of motion information in that development, are investigated from the viewpoint of cognitive developmental robotics. The developmental model supposes that an infant-like robot first has the experience of visually tracking a human face, based on its ability to preferentially look at salient visual stimuli. These experiences allow the robot to acquire the ability to imitate head movement by finding an equivalence between the human's head movement and the robot's own when tracking a human who is turning his/her head. Then, the robot changes its gaze from tracking the human face to looking at an object at which the human has also looked, based on its abilities to imitate a head turn and to gaze at a salient object. Through these experiences, the robot comes to learn joint attention behavior based on the contingency between head movement and object appearance. The movement information that the robot perceives plays an important role in facilitating the development of imitation and joint attention because it provides an easily understandable sensorimotor relationship. The developmental model is examined in learning experiments focusing on evaluating the role of movement in joint attention. Experimental results show that the acquired sensorimotor coordination for joint attention involves the equivalence between the human's head movement and the robot's, which can be a basis for head movement imitation.
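The contingency-based account above could be caricatured with a toy sensorimotor mapping: while the robot tracks the caregiver's turning head it records paired movements, fits a simple equivalence, and then reuses it to reproduce the head turn. The class interface and the linear form of the mapping are assumptions for illustration, not the paper's learning model.

```python
# Toy contingency-style learning of a head-movement equivalence.
import numpy as np

class ContingencyJointAttention:
    def __init__(self):
        self.pairs = []   # (human_head_delta, robot_head_delta) collected while tracking
        self.gain = 1.0   # learned sensorimotor equivalence

    def record_tracking_step(self, human_head_delta, robot_head_delta):
        """Collected while the robot visually tracks the caregiver's turning face."""
        self.pairs.append((human_head_delta, robot_head_delta))

    def learn_equivalence(self):
        """Least-squares fit of the robot's motion as a function of the observed motion."""
        x, y = (np.array(v, dtype=float) for v in zip(*self.pairs))
        self.gain = float(np.dot(x, y) / np.dot(x, x))

    def joint_attention_turn(self, observed_head_delta):
        """Imitate the caregiver's head turn; in the full model, a salient object
        appearing in the new view (the contingency) would reinforce this behavior."""
        return self.gain * observed_head_delta
```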
Article
Full-text available
If we are to build human-like robots that can interact naturally with people, our robots must know not only about the properties of objects but also the properties of animate agents in the world. One of the fundamental social skills for humans is the attribution of beliefs, goals, and desires to other people. This set of skills has often been called a theory of mind. This paper presents the theories of Leslie (1994) and Baron-Cohen (1995) on the development of theory of mind in human children and discusses the potential application of both of these theories to building robots with similar capabilities. Initial implementation details and basic skills (such as finding faces and eyes and distinguishing animate from inanimate stimuli) are introduced. I further speculate on the usefulness of a robotic implementation in evaluating and comparing these two models.
Article
The role of movement in triggering early imitative responses is examined in this study. The sample consisted of 36 newborns (median age = 4 days). 16 were presented with 2 dynamic models (tongue protrusion and hand opening-closing), 12 were presented with the static form of these same models, and the remaining 8 constituted a control group. Only infants in the first condition reproduced the models at significant levels. However, infants in the static condition fixated the experimenter longer than those in the dynamic one. The results are discussed in terms of neurophysiological findings concerning the control of neonatal behaviors and early perceptual capacities.
Article
Results of the 1987-1988 anthropometric survey of Army personnel are presented in this report in the form of summary statistics, percentile data and frequency distributions. These anthropometric data are presented for a subset of personnel (1774 men and 2208 women) sampled to match the proportions of age categories and racial/ethnic groups found in the active duty Army of June 1988. Dimensions given in this report include 132 standard measurements made in the course of the survey, 60 derived dimensions calculated largely by adding and subtracting standard measurement data, and 48 head and face dimensions reported in traditional linear terms but collected by means of an automated headboard designed to obtain three-dimensional data. Descriptions of the measurements, procedures and techniques used in this survey are also provided, including explanations of the complex sampling plan, computer editing procedures, and strategies for minimizing observer error. Tabular material in the appendices is designed to help users understand various practical applications of the dimensional data, and to identify comparable data obtained in previous anthropometric surveys.
Article
R. W. White (1959) proposed that certain motives, such as curiosity, autonomy, and play (called intrinsic motives, or IMs), have common characteristics that distinguish them from drives. The evidence that mastery is common to IMs is anecdotal, not scientific. The assertion that "intrinsic enjoyment" is common to IMs exaggerates the significance of pleasure in human motivation and expresses the hedonistic fallacy of confusing consequence for cause. Nothing has been shown scientifically to be common to IMs that differentiates them from drives. An empirically testable theory of 16 basic desires is put forth based on psychometric research and subsequent behavior validation. The desires are largely unrelated to each other and may have different evolutionary histories. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
In this paper, we propose a model of embodied communications focusing on body movements. Moreover, we explore the validity of the model through psychological experiments on human-human and human-robot communications involving giving/receiving route directions. The proposed model emphasizes that, in order to achieve smooth communications, it is important for a relationship to emerge from a mutually entrained gesture and for a joint viewpoint to be obtained by this relationship. The experiments investigated the correlations between body movements and utterance understanding in order to confirm the importance of the two points described above. We use robots so that we can control parameters in experiments and discuss the issues related to the interaction between humans and artifacts. Results supported the validity of the proposed model: in the case of human-human communications, subjects could communicate smoothly when the relationship emerged from the mutually entrained gesture and the joint viewpoint was obtained; in the case of human-robot communications, subjects could understand the robot's utterances under the same conditions but not when the robot's gestures were restricted.