(a) Training head pose appearance range. Pan and tilt angles range from -90° to 90° and from -60° to 60°, respectively, in 15° steps. (b) and (c) Tracking features: texture features from Gaussian and Gabor filters (b) and skin color binary mask (c).

Source publication
Article
Full-text available
We address the problem of recognizing the visual focus of attention (VFOA) of meeting participants based on their head pose. To this end, the head pose observations are modeled using a Gaussian mixture model (GMM) or a hidden Markov model (HMM) whose hidden states correspond to the VFOA. The novelties of this paper are threefold. First, contrary to...

Contexts in source publication

Context 1
... Space: The state space contains both continuous and discrete variables. More precisely, the state is defined as X = (S, θ, l), where S represents the head location and size, and θ represents the in-plane head rotation. The variable l labels an element of the discretized set of possible out-of-plane head poses (see Fig. ...
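As a rough illustration of this mixed continuous/discrete state, the following sketch shows one way such a tracker state could be represented; the field names and example values are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class HeadState:
    """Hypothetical container for the tracking state X = (S, theta, l)."""
    x: float         # head center, horizontal image coordinate (part of S)
    y: float         # head center, vertical image coordinate (part of S)
    scale: float     # head size (part of S)
    theta: float     # in-plane head rotation angle
    pose_label: int  # index l into the discretized set of out-of-plane poses

# Example: a head at (160, 120), unit scale, no in-plane rotation,
# assigned to the fifth discretized out-of-plane pose.
state = HeadState(x=160.0, y=120.0, scale=1.0, theta=0.0, pose_label=4)
```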
Context 2
... is important to note that the pose-dependent appearance models were not learned using the same people or head images gathered in the same meeting-room environment. We used the Prima-Pointing database [34], which contains 15 individuals recorded over 93 different poses (see Fig. 3(a)). However, when learning appearance models over whole head patches, as done in [18], we experienced tracking failures with 2 out of the 16 people of the IHPD database (see Section III) who had hair appearances not represented in the Prima-Pointing dataset (e.g. one of those two people was bald). As a remedy, we trained the appearance ...
Context 3
... VFOA with a Gaussian Mixture Model (GMM): Let s_t ∈ F denote the VFOA state and z_t the head pointing direction of a person at a given time instant t. The head pointing direction is defined by the head pan (α) and tilt (β) angles, i.e. z_t = (α_t, β_t), since the head roll (γ) by definition has no effect on the head direction (see Fig. 3(a)). Estimating the visual focus can be posed in a probabilistic framework as finding the VFOA state maximizing the a posteriori ...
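To make the MAP decision concrete, here is a minimal sketch that assigns a pan/tilt observation z_t to the focus target maximizing p(z_t | s_t = f) p(s_t = f), with one Gaussian per VFOA target; the target names, means, covariances, and priors are invented placeholders, not parameters from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical per-target Gaussians over (pan, tilt) in degrees, with priors.
vfoa_targets = {
    "person_left":  {"mean": [-45.0,   0.0], "cov": np.diag([80.0, 40.0]), "prior": 0.3},
    "person_right": {"mean": [ 45.0,   0.0], "cov": np.diag([80.0, 40.0]), "prior": 0.3},
    "table":        {"mean": [  0.0, -30.0], "cov": np.diag([60.0, 60.0]), "prior": 0.4},
}

def map_vfoa(z_t):
    """Return the VFOA label maximizing the unnormalized posterior p(z_t | f) p(f)."""
    scores = {
        f: p["prior"] * multivariate_normal.pdf(z_t, mean=p["mean"], cov=p["cov"])
        for f, p in vfoa_targets.items()
    }
    return max(scores, key=scores.get)

print(map_vfoa([-40.0, 5.0]))  # -> "person_left" under these placeholder parameters
```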

Citations

... This has been studied for a variety of applications, including visual surveillance [36], driver attention [37], the visual focus of attention [38], and robotics. Appearance-based approaches compare a fresh head image to a set of head posture templates to determine which perspective is the most similar. ...
Full-text available
Article
Background/Purpose: Quantification of consumer interest is an interesting, innovative, and promising trend in marketing research. For example, one approach is for a salesperson to observe consumer behaviour during the shopping phase and then infer the customer's interest. However, the salesperson needs unique skills because every person may interpret behaviour in a different manner. The purpose of this research is to track client interest based on head pose positioning and facial expression recognition. Objective: We are going to develop a quantifiable system for measuring customer interest. This system recognizes the relevant facial expression and processes current client photos without saving them for later processing. Design/Methodology/Approach: The work describes a deep learning-based system for observing customer actions, focusing on interest identification. The suggested approach determines client attention by estimating head posture. The system monitors facial expressions and reports customer interest. The Viola–Jones algorithm is utilized to crop the facial image. Findings/Results: The proposed method identifies frontal face postures, then segments the facial components that are critical for facial expression identification and creates an iconized face image. Finally, the obtained values of the resulting image are merged with the original one to analyze facial emotions. Conclusion: This method combines local part-based features with holistic facial information. The obtained results demonstrate the potential of the proposed architecture, as it is efficient and works in real time. Paper Type: Conceptual Research.
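The abstract above names the Viola–Jones detector as the face-cropping step; a minimal OpenCV sketch of that step might look as follows. It covers only the cropping, not the rest of the cited pipeline, and assumes the standard Haar cascade shipped with OpenCV.

```python
import cv2

# Standard frontal-face Haar cascade distributed with OpenCV (Viola-Jones detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame_bgr):
    """Return the first detected face region of a BGR frame, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return frame_bgr[y:y + h, x:x + w]
```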
... In other words, it can be said that the visual focus of attention (VFoA) of a person is closely related to his/her gaze direction and thus the head pose. Similarly, in the study Electronics 2022, 11, 1500 5 of 23 conducted by [29], head pose was used in determining the visual focus of attention of the participants in the meetings. ...
Full-text available
Article
With COVID-19, formal education was interrupted in all countries and the importance of distance learning has increased. It is possible to teach any lesson with various communication tools, but it is difficult to know how well the lesson reaches the students. In this study, the aim is to monitor students in a classroom or in front of a computer with a camera in real time, recognizing their faces and head poses and scoring their distraction, in order to detect student engagement based on their head poses and Eye Aspect Ratios. Distraction was determined by associating the students' attention with looking at the teacher or the camera in the right direction. The success of the face recognition and head pose estimation was tested using the UPNA Head Pose Database and, as a result of the conducted tests, the most successful result in face recognition was obtained with the Local Binary Patterns method with a 98.95% recognition rate. In classifying student engagement as Engaged or Not Engaged, a support vector machine gave results with 72.4% accuracy. The developed system will be used to recognize and monitor students in the classroom or in front of the computer, and to determine the course flow autonomously.
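The Eye Aspect Ratio used in this abstract is commonly computed from six eye landmarks as EAR = (||p2 − p6|| + ||p3 − p5||) / (2 ||p1 − p4||); a sketch of that standard formula follows (the six-point landmark convention is the usual one, not necessarily the exact implementation of the cited system).

```python
import numpy as np

def eye_aspect_ratio(eye_landmarks):
    """eye_landmarks: array of shape (6, 2) holding the six eye points p1..p6."""
    p1, p2, p3, p4, p5, p6 = np.asarray(eye_landmarks, dtype=float)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# An EAR that stays below a threshold (roughly 0.2) for several frames is the
# usual cue that the eyes are closed, i.e. the student is likely not attending.
```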
... Visual surveillance [56], [57]; driver attention [58], [59]; the visual focus of attention [15], [60]; and robotics [61] have all been investigated using head posture estimation. Appearance-based, model-based, manifold-embedding, and nonlinear-regression techniques are used to build head posture prediction systems [62]. Appearance-based approaches compare a new head picture to a set of head posture templates to determine which viewpoint is the most related; their drawback is that they can only predict discrete posture locations [63]. ...
... For example, a pedestrian tracker is used in [83] to infer head pose labels while automatically generating ground-truth data. Ba et al. [84] use filtering techniques for person recognition as the initial step of a tracking system. An algorithm proposed in [85] tracks video features to estimate the head pose. ...
Article
Head pose is an important cue in computer vision when using facial information. Over the last three decades, methods for head pose estimation have received increasing attention due to their application in several image analysis tasks. Although many techniques have been developed over the years to address this issue, head pose estimation remains an open research topic, particularly in unconstrained environments. In this paper, we present a comprehensive survey of methods under both constrained and unconstrained conditions, focusing on the literature from the last decade. This work illustrates advantages and disadvantages of existing algorithms, starting from seminal contributions to head pose estimation and ending with the more recent approaches that adopt deep learning frameworks. Several performance comparisons are provided. This paper also states promising directions for future research on the topic.
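The appearance-based approaches mentioned in the citation contexts above compare a new head image against a set of pose-labelled templates; a toy nearest-template sketch of that idea follows (purely illustrative, not any cited system's implementation).

```python
import numpy as np

def estimate_discrete_pose(head_image, templates):
    """templates maps a (pan, tilt) tuple in degrees to a template image of the
    same shape as head_image; returns the pose whose template has the smallest
    sum of squared differences to the input."""
    head = np.asarray(head_image, dtype=float)
    return min(
        templates,
        key=lambda pose: np.sum((head - np.asarray(templates[pose], dtype=float)) ** 2),
    )
```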
... Visual Focus of Attention (VFA) prediction aims at identifying where people in an image are looking within the image space (Ba and Odobez 2008). Recasens et al. (2015) proposed a two-stream CNN to find the position where people in the image are looking. ...
Article
Mutual gaze detection, i.e., predicting whether or not two people are looking at each other, plays an important role in understanding human interactions. In this work, we focus on the task of image-based mutual gaze detection, and propose a simple and effective approach to boost the performance by using an auxiliary 3D gaze estimation task during the training phase. We achieve the performance boost without additional labeling cost by training the 3D gaze estimation branch using pseudo 3D gaze labels deduced from mutual gaze labels. By sharing the head image encoder between the 3D gaze estimation and the mutual gaze detection branches, we achieve better head features than learned by training the mutual gaze detection branch alone. Experimental results on three image datasets show that the proposed approach improves the detection performance significantly without additional annotations. This work also introduces a new image dataset that consists of 33.1K pairs of humans annotated with mutual gaze labels in 29.2K images.
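A schematic PyTorch sketch of the shared-encoder idea described in this abstract, where one head-image encoder feeds both an auxiliary 3D-gaze branch and a mutual-gaze branch; the layer sizes and the encoder itself are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class SharedEncoderGazeNet(nn.Module):
    """Toy two-branch network: gaze supervision shapes the shared head features."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(          # placeholder head-image encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU())
        self.gaze_head = nn.Linear(feat_dim, 3)        # auxiliary 3D gaze vector
        self.mutual_head = nn.Linear(2 * feat_dim, 1)  # mutual-gaze logit per pair

    def forward(self, head_a, head_b):
        fa, fb = self.encoder(head_a), self.encoder(head_b)
        mutual_logit = self.mutual_head(torch.cat([fa, fb], dim=1))
        return mutual_logit, self.gaze_head(fa), self.gaze_head(fb)
```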
... Many systems therefore rely on head pose tracking as a proxy for gaze direction, which is a simpler and more robust approach, but which cannot capture quick glances or track more precise gaze targets. Despite this, studies have found head pose to be a fairly reliable indicator of visual focus of attention in multi-party interaction, given that the targets are clearly separated (Katzenmaier et al., 2004;Stiefelhagen et al., 2002;Ba and Odobez, 2009;Johansson et al., 2013). ...
Full-text available
Article
The taking of turns is a fundamental aspect of dialogue. Since it is difficult to speak and listen at the same time, the participants need to coordinate who is currently speaking and when the next person can start to speak. Humans are very good at this coordination, and typically achieve fluent turn-taking with very small gaps and little overlap. Conversational systems (including voice assistants and social robots), on the other hand, typically have problems with frequent interruptions and long response delays, which has called for a substantial body of research on how to improve turn-taking in conversational systems. In this review article, we provide an overview of this research and give directions for future research. First, we provide a theoretical background of the linguistic research tradition on turn-taking and some of the fundamental concepts in theories of turn-taking. We also provide an extensive review of multi-modal cues (including verbal cues, prosody, breathing, gaze and gestures) that have been found to facilitate the coordination of turn-taking in human-human interaction, and which can be utilised for turn-taking in conversational systems. After this, we review work that has been done on modelling turn-taking, including end-of-turn detection, handling of user interruptions, generation of turn-taking cues, and multi-party human-robot interaction. Finally, we identify key areas where more research is needed to achieve fluent turn-taking in spoken interaction between man and machine.
... Real-time analysis of surveillance cameras has become an exhausting task due to human limitations. The primary human limitation is the Visual Focus of Attention (VFOA) [2]. The human gaze can only concentrate on one specific point at a time. ...
Full-text available
Article
Crime generates significant losses, both human and economic. Every year, billions of dollars are lost due to attacks, crimes, and scams. Surveillance video camera networks generate vast amounts of data, and the surveillance staff cannot process all the information in real-time. Human sight has critical limitations. Among those limitations, visual focus is one of the most critical when dealing with surveillance. For example, in a surveillance room, a crime can occur in a different screen segment or on a distinct monitor, and the surveillance staff may overlook it. Our proposal focuses on shoplifting crimes by analyzing situations that an average person will consider as typical conditions, but may eventually lead to a crime. While other approaches identify the crime itself, we instead model suspicious behavior—the one that may occur before the build-up phase of a crime—by detecting precise segments of a video with a high probability of containing a shoplifting crime. By doing so, we provide the staff with more opportunities to act and prevent crime. We implemented a 3DCNN model as a video feature extractor and tested its performance on a dataset composed of daily action and shoplifting samples. The results are encouraging as the model correctly classifies suspicious behavior in most of the scenarios where it was tested. For example, when classifying suspicious behavior, the best model generated in this work obtains precision and recall values of 0.8571 and 1 in one of the test scenarios, respectively.
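The cited system extracts clip-level features with a 3D CNN; a minimal sketch of what such a spatio-temporal feature extractor with a binary suspicious-behavior output could look like is given below (layer counts and shapes are illustrative only, not the published model).

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Toy 3D CNN: input is a clip of shape (batch, 3, frames, height, width),
    output is a single suspicious-behavior logit per clip."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.classifier = nn.Linear(32, 1)

    def forward(self, clip):
        return self.classifier(self.features(clip))

logit = Tiny3DCNN()(torch.randn(1, 3, 16, 112, 112))  # one 16-frame RGB clip
```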
... Visual focus of attention. One classical approach for determining the VFoA is [5], where the authors model the dynamics of a meeting group in a probabilistic way, inferring where the participants are looking. An improved version of this work is presented in [4], where context information is used to aid in solving the task. ...
... Visual Focus of Attention (VFA) prediction aims at identifying where people in an image are looking within the image space (Ba and Odobez 2008). Recasens et al. (2015) proposed a two-stream CNN to find the position where people in the image are looking. ...
Preprint
Mutual gaze detection, i.e., predicting whether or not two people are looking at each other, plays an important role in understanding human interactions. In this work, we focus on the task of image-based mutual gaze detection, and propose a simple and effective approach to boost the performance by using an auxiliary 3D gaze estimation task during training. We achieve the performance boost without additional labeling cost by training the 3D gaze estimation branch using pseudo 3D gaze labels deduced from mutual gaze labels. By sharing the head image encoder between the 3D gaze estimation and the mutual gaze detection branches, we achieve better head features than learned by training the mutual gaze detection branch alone. Experimental results on three image datasets show that the proposed approach improves the detection performance significantly without additional annotations. This work also introduces a new image dataset that consists of 33.1K pairs of humans annotated with mutual gaze labels in 29.2K images.
... Instead of developing a complete analysis of the image, as in Ba and Odobez (2008), where the authors study how to link head pose with the visual focus of attention, modeling the pose observations with a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM), the use of known face landmarks is exploited. A Perspective-n-Point (PnP) algorithm associates 2D points of the DLIB 68 model (Kazemi and Sullivan, 2014) with 3D points in a corresponding model. ...
Full-text available
Article
When there is an interaction between a robot and a person, gaze control is very important for face-to-face communication. However, when a robot interacts with several people, neurorobotics plays an important role in determining the person to look at and those to pay attention to among the others. There are several factors which can influence the decision: who is speaking, who he/she is speaking to, where people are looking, whether a user wants to attract attention, etc. This article presents a novel method to decide who to pay attention to when a robot interacts with several people. The proposed method is based on a competitive network that receives different stimuli (look, speech, pose, hoarding the conversation, habituation, etc.) that compete with each other to decide who to pay attention to. The dynamic nature of this neural network allows a smooth transition in the focus of attention upon a significant change in stimuli. A conversation is created between different participants, replicating human behavior in the robot. The method deals with the problem of several interlocutors appearing and disappearing from the visual field of the robot. A robotic head has been designed and built, and a virtual agent projected on the robot's face display has been integrated with the gaze control. Different experiments have been carried out with that robotic head integrated into a ROS architecture model. The work presents the analysis of the method, how the system has been integrated with the robotic head, and the experiments and results obtained.
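A crude sketch of the stimulus-competition idea described in this abstract: each interlocutor accumulates a weighted, leaky sum of stimulus evidence and the robot attends to the current winner; the stimulus names, weights, and dynamics are invented for illustration only.

```python
# Made-up stimulus weights; a negative weight models habituation.
STIMULUS_WEIGHTS = {"speaking": 1.0, "looking_at_robot": 0.6,
                    "waving": 0.8, "habituation": -0.3}

def update_attention(activations, stimuli, leak=0.8):
    """Leaky accumulation of stimulus evidence per person; returns the winner.

    activations: dict person -> current activation level (updated in place)
    stimuli:     dict person -> dict of stimulus name -> strength (0/1 or float)
    """
    for person, observed in stimuli.items():
        drive = sum(STIMULUS_WEIGHTS.get(name, 0.0) * v for name, v in observed.items())
        activations[person] = leak * activations.get(person, 0.0) + (1 - leak) * drive
    return max(activations, key=activations.get)

acts = {}
winner = update_attention(acts, {"alice": {"speaking": 1}, "bob": {"waving": 1}})
print(winner)  # -> "alice" with these placeholder weights
```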