Recognizing Visual Focus of Attention
from Head Pose in Natural Meetings
Sileye Ba and Jean-Marc Odobez, Member, IEEE
IDIAP Research Institute, Rue du Simplon 4, CH-1920 Martigny, Switzerland
Telephone +41 27 721 77 11
Fax +41 27 721 77 12
Email <first name>.<last name>@idiap.ch
URL www.idiap.ch
Abstract
We address the problem of recognizing the visual focus of attention (VFOA) of meeting participants
based on their head pose. To this end, the head pose observations are modeled using a Gaussian Mixture
Model (GMM) or a Hidden Markov Model (HMM) whose hidden states correspond to the VFOA. The
novelties of this work are threefold. First, contrary to previous studies on the topic, in our set-up,
the potential VFOA of a person is not restricted to other participants only, but includes environmental
targets (a table and a projection screen), which increases the complexity of the task, with more VFOA
targets spread in the pan as well as tilt gaze space. Second, we propose a geometric model to set the
GMM or HMM parameters by exploiting results from cognitive science on saccadic eye motion, which
allows the prediction of the head pose given a gaze target. Third, an unsupervised parameter adaptation
step (not using any labeled data) is proposed which accounts for the specific gazing behaviour of each
participant. Another contribution of the paper is the development of a significant publicly available
corpus of 8 meetings which are on average 10 minutes in length featuring 4 persons, with head pose
and VFOA annotation. Using this corpus, we analyze the above methods by evaluating, through objective
performance measures, the recognition of the VFOA from head pose information obtained either using a
magnetic sensor device or a vision based tracking system. The results clearly show that in such complex
but realistic situations, the VFOA recognition performance is highly dependent on how well the visual
targets are separated for a given meeting participant. In addition, the results show that the use of a
geometric model with unsupervised adaptation achieves better results than the use of training data to set
the HMM parameters.
Corresponding author.
I. INTRODUCTION
Understanding human behaviour and human needs is a central issue in devising next-generation
human computing systems that can emulate more human-like functions. At the heart of this
issue lies, amongst others, the difficulty of sensing human behaviours in an accurate way, i.e.
the challenge of developing algorithms that can reliably extract subtle human characteristics (e.g.
body gestures, facial expressions, emotions) that allow a fine analysis of their behaviour. One
such characteristic of interest is the gaze, which indicates where and what a person is looking at,
or, in other words, what the visual focus of attention (VFOA) of the person is. However, while
the development of gaze tracking systems for Human Computer Interface (HCI) applications has
been the topic of many studies, less research has been conducted for estimating and analyzing a
person’s gaze and VFOA in more open spaces, despite the fact that in many contexts, identifying
the VFOA of a person conveys a wealth of information about that person: what is he interested
in, what is he doing, how does he explore a new environment or react to different visual stimuli.
Thus, tracking the VFOA of people could have important applications in the development of
ambient intelligent systems.
In terms of human computing applications, VFOA can be used for video compression by
assuming that the important information in a video exists in the neighborhood of the gaze path
of a person viewing the video. Estimating the focus of a viewer can be used to define areas of
visual focus that could be encoded in high resolution, while areas that are not focus centers
could be encoded at lower resolution [1]. Another possible application in a public space could
be to measure the degree of attraction of advertisements or shop displays based on the estimated
focus of people passing by as presented in [2]. Applications in meetings include digital assistants
that can analyze the social dynamic of the meeting based on people’s non-verbal behaviors in
order to improve the group cohesiveness and efficiency [3].
Needless to say, gaze plays an important role in face-to-face conversations and more generally
group interaction, as it has been shown in a large body of social psychology studies [4]. Human
interaction can be categorized as verbal (speech) or non-verbal (e.g. facial expressions). While
the usage of the former is tightly connected to the explicit rules of language (grammar, dialog
acts), the usage of non-verbal cues is usually more implicit, but this does not prevent it from
following rules and exhibiting specific patterns in conversations. For instance, in a meeting
context, a person raising a hand usually means that he is requesting the floor, and a listener's
head nod or shake can be interpreted as agreement or disagreement [5].

Fig. 1. Recognizing the VFOA of people. (a) the meeting room; (b) a sample image of the dataset; (c) the potential VFOA
targets for the right person; (d) the geometric configuration of the room.

Besides hand and head
gestures, the VFOA is another important non-verbal communication cue with functions such as
establishing relationships (through mutual gaze), regulating the course of interaction, expressing
intimacy, and exercising social control [6], [7].
A speaker’s gaze often correlates with his addressees, i.e. the intended recipients of the speech
[8]. Also, for a listener, monitoring his own gaze in concordance with the speaker’s gaze is a
way to find appropriate time windows for speaker turn requests [9], [10]. Thus, recognizing the
VFOA patterns of a group of people can reveal important knowledge about the participants’ role
and status [11], [7]. Following these studies in social psychology, computer vision researchers are
showing more interest in the study of automatic gaze and VFOA recognition systems [12], [13],
[2], as illustrated by some of the research tasks defined in several recent evaluation workshops
[14], [15]. Since meetings are places where the multi-modal nature of human communication
and interaction best occur, they are well suited to conduct such research studies.
In this context, the goal of this paper is to analyze the correspondence between the head pose
of people and their gaze in more general meeting scenarios than those previously considered [12],
[13]. In addition we propose methods to recognize the VFOA of people from their head pose (see
Fig. 1, and Fig. 9 for some results). In meeting rooms, the high-resolution close-up views
of the eyes typically required by HCI gaze estimation systems are not available in practice.
However, it has been shown in [12] that head orientation can reasonably be used as an
approximation of the gaze when the VFOA targets are the other meeting participants (in meetings
with 4 people). In this paper, we investigate the estimation of VFOA from head pose in complex
meeting situations. Firstly, unlike previous work ([12], [13]), the scenario we consider involves
people looking at slides or writing on a sheet of paper on the table. As a consequence, people
have more potential VFOA targets in our set-up (6 instead of 3 in the cited work), leading
to more possible ambiguities between VFOA. Secondly, due to the physical placement of the
VFOA targets, the identification of the VFOA can only be done using the complete head pose
representation (pan and tilt), instead of just the head pan, as done previously. Thus, our work
addresses general and challenging meeting room situations in which people do not just focus
their attention on other people, but also on other room targets.
To recognize the VFOA of people from their head pose, we investigated two generative models:
a Gaussian mixture model (GMM) that handles each frame separately, and its natural extension to
the temporal domain, namely a hidden Markov model (HMM), which segments pose observation
sequences into VFOA temporal segments. In both cases, for each VFOA target, the head pose
observations are represented as Gaussian distributions, whose means indicate the head pose
associated with each visual target. Alternative approaches were considered to set the model
parameters. In one approach, these were set using training data from other meetings. However,
as collecting training data can be tedious, we exploit the results of studies on saccadic eye motion
modeling [16], [17] and propose a novel approach (referred to as cognitive or geometric) that
models the head pose of a person given his upper body pose and his effective gaze target. In
this way, no training data is required to learn parameters, but some knowledge of the 3D room
geometry is necessary. In addition, to account for the fact that in practice we observed that
people have their own head pose preferences for looking at the same given target, we adopted
an unsupervised Maximum A Posteriori (MAP) scheme to adapt the parameters obtained from
either the learning model or the geometric model to unlabeled head pose data of individual
people in meetings.
To evaluate the different aspects of the VFOA modeling, we have conducted comparative and
thorough experiments on a large and publicly available database, comprising 8 meetings for
which both the head pose ground-truth and VFOA label ground truth are known. Therefore, we
were able to differentiate between the two main error sources in VFOA recognition: (1) the use
of head pose as a proxy for gaze, and (2) errors in the estimation of the head pose (e.g. using
our vision-based head pose tracker [18]).
In summary, the contributions of this paper are the following:
- the development of a public database and a framework to evaluate the recognition of the VFOA solely from head pose;
- a novel geometric model to derive a person's head pose given his gaze target, which alleviates the need for training data;
- the use of an unsupervised MAP framework to adapt the VFOA model parameters to individual people;
- a thorough experimental study and analysis of the influence of several key aspects on the recognition performance (e.g. participant position, ground truth vs estimated head pose, correlation with tracking errors).
The remainder of this paper is organized as follows. Section II discusses the related work.
Section III describes the task and the database that is used to evaluate the models we propose.
Section IV provides an overview of our approach. Section V describes our algorithm for joint
head tracking and pose estimation, along with its evaluation. Section VI describes the considered
models for recognizing the VFOA from head pose. Section VII gives the unsupervised MAP
framework used to adapt our VFOA model to unseen data. Section VIII describes our evaluation
setup. We give experimental results in Section IX, and conclusions in Section X.
II. RELATED WORK
We investigate the VFOA recognition from head pose in the context of meetings. Thus, we
will analyze the related work along the following lines: gaze and VFOA tracking technologies,
head pose estimation from vision sensors, and recognition of the VFOA from head pose.
The VFOA of a person is defined by his eye gaze, that is, the direction in which the eyes
are pointing in space. Much progress has been achieved in the design of gaze tracking technologies.
A review of such systems is presented in [19]. Gaze trackers are predominantly
developed for HCI applications, where they are used for two main purposes: as an interactive
tool, where the eyes are used as an input modality; or as a diagnostic tool, to provide evidence
of a user’s attention, such as in applications studying the visual exploration of images by people
[20]. For this reason, these systems, while being accurate, are not appropriate for analyzing
the VFOA of people in open spaces: they can be intrusive (user needs to wear special glasses)
and require specific equipment (infrared light sources are often used to ease signal processing).
More importantly, they are very constraining, as the head motion is limited to small position
and angular variations (no more than 25 cm and 20° [19]). In the worst cases, chin rests or bite bars
are required, but even eye-appearance vision-based gaze tracking systems restrict the mobility of
the subject since their need of high resolution close-up eye images requires cameras with very
narrow fields of view. To alleviate this constraint, some papers [19], [21] propose using head
pose tracking to localize eye corners and drive the acquisition of high resolution eye images
using a pan-tilt-zoom (PTZ) camera. These systems, however, require very good calibration, and
are still designed for near frontal head poses [21].
In spaces such as offices or meeting rooms, where the motion and head orientation of people are
unconstrained, high resolution images of people's eyes are not available. An alternative is to use
the head pose as a surrogate for gaze, as proposed in [22]. Broadly speaking, head pose tracking
algorithms can be divided into two groups: model based and appearance based approaches. In
model based approaches, a set of facial features such as the eyes, the nose and the mouth are
tracked. Then, knowing the relative positions of these features, the head pose can be inferred
using anthropometric information [23], [24]. The major drawback is that robust facial feature
tracking is difficult unless sufficiently high resolution images are used. By modelling the appearance
of the whole head, appearance based approaches exhibit more robustness for low resolution images: [12]
used a neural network to model head appearance, [25], [26] developed active appearance
models based on principal component analysis, and [27], [28] used multidimensional Gaussian
distributions to represent the head appearance likelihood.
From another perspective, head pose tracking algorithms differentiate themselves according to
whether or not the tracking and the pose estimation are conducted jointly. Often, a generic tracker
is used to locate the head, and then features extracted at this location are used to estimate the pose
[12], [26], [27], [28]. Decoupling the tracking and the pose estimation results in a computational
cost reduction. However, since head pose estimation is very sensitive to head localization [28],
head pose results are highly dependent on the tracking accuracy. To address this issue, [25],
[29], [18] perform the head tracking and the pose estimation jointly.
In contrast to head tracking algorithms, few works have investigated the recognition of the
VFOA directly from head pose. Pioneering work from [12] used a GMM, the parameters
of which were learned on the test data after initialization from the output of a K-means clustering
of the pose values. This approach was possible due to constraints on the physical set-up (four
people evenly spaced around a round table) and by limiting the allowed VFOA targets to the
other participants. These constraints allowed them to rely only on the pan angle to represent the
head pose, and limited the possibility of ambiguities in the head pose. In addition, [12] showed
that using other participants’ speaking status could further increase the VFOA recognition. More
recently, [13] used a dynamic Bayesian network to jointly recognize the VFOA of people, as well
as different conversational models in a 4-person conversation, based on head pan and speaking
status observations. Finally, in more recent work, [30] exploited the head pose extracted from an
overhead camera tracking retro-reflective markers mounted on headsets to look for occurrences of
shared mutual visual attention. This information was then exploited to derive the social geometry
of co-workers within an office, and infer their availability status for communication.
III. DATABASE AND TASK
In this section, we describe the VFOA recognition task, and the data that is used to evaluate
both our pose estimation and VFOA recognition algorithms.
A. The Task and VFOA Set
Our goal is to evaluate how well we can infer the VFOA state of a person using head pose
in common meeting situations. Let us first note that while the VFOA is given by the eye
gaze, psycho-visual studies have shown that people use other cues (e.g. head and body posture,
speaking status) to recognize the VFOA state of another person [6]. Thus, one general objective
of the current work is to see how well one can recognize the VFOA of people from these
other cues in the absence of direct gazing measurements, a situation likely to occur in many
applications of interest. An important issue is: what should be the definition of a person’s VFOA
state? At first thought, one can consider that each different gaze direction could correspond to
a potential VFOA. However, studies on the VFOA in natural conditions [31] have shown that
humans tend to look at targets, whether humans or objects, that are either relevant to the task they
are solving or of immediate interest to them. Additionally, one interprets another person’s gaze
not as continuous 3D spatial locations, but as a gaze towards objects that have been identified
as potential targets. This process is often called the shared-attentional mechanism [32], [6], and
suggests that in general VFOA states correspond to a finite set of targets of interest.
Thus, in our meeting context the set of potential VFOA targets, denoted F, has been defined
as: the other participants, the slide-screen, and the table. When none of the previous applies
(the person is distracted by some noise or visual stimuli and looks at another target) we use
an additional label called U (unfocused). As a result, for 'person left' in Fig. 1(c), we have:
F = {PR, O2, O1, SS, TB, U}, where PR stands for person right, O1 and O2 for organizers
1 and 2, SS for slide screen, TB for table, and U for unfocused. For the person right, F =
{PL, O2, O1, SS, TB, U}, where PL stands for person left. Note that in practice, the unfocused
label only represents a small percentage of our data (2%), while the other VFOA targets represent
55%, 26% and 17% for the other participants, the slide screen, and the table, respectively.
B. The Database
Our experiments rely on the IDIAP Head Pose Database (IHPD) 1. The video database was
collected along with a head pose ground truth and each participant’s discrete VFOA ground
truth, as explained below.
Content description: the database is comprised of 8 meetings involving 4 people each, recorded
in a meeting room (cf Fig. 1(a)). The meeting durations ranged from 7 to 14 minutes, which
was long enough to realistically represent a general meeting scenario. In shorter recordings (less
than 2-3 minutes), we found that participants tend to be more active, moving their heads more
to focus on other people or objects. In our meetings, and in longer situations, the attention
of participants sometimes drops and people are less focused on the other meeting participants.
Note, however, that the small group size encourages engagement of participants in the meeting,
in contrast to meetings with larger groups. Meeting participants were instructed to write down
their name on a sheet of paper, then discuss statements displayed on the projection screen. There
were no restrictions placed on head motion or head pose.
Head pose annotation: in each meeting, the head poses of two persons were continuously annotated
(person left and person right in Fig. 1(c)) using a magnetic field sensor called flock of birds (FOB)
rigidly attached to the head, resulting in a video database of 16 different people. The coordinate
frame of the magnetic sensors was calibrated with respect to the camera frame, allowing us
to generate the head pose ground truth with respect to the camera. The head pose is defined
by three Euler angles (α, β, γ) that parametrize the decomposition of the rotation matrix of the
head configuration with respect to the camera frame. To report our results, we have selected
among the possible Euler decompositions the one whose rotation axes are rigidly attached to the
head (see Fig. 4(a)): α denotes the pan angle, a left/right head rotation; β denotes the tilt, an
up/down head rotation; and finally γ, the roll, represents a left/right "head on shoulder" head
rotation. Because of our meeting scenario, people often have negative pan values, corresponding
to looking at the projection screen. Recorded pan values range from -70 to 60 degrees, tilt values
from -60 (when people are writing) to 15 degrees, and roll values from -30 to 30 degrees.
¹ Available at http://www.idiap.ch/HeadPoseDatabase/ (IHPD)
Fig. 2. Overview of the different recognition approaches and modules: (a) VFOA recognition without adaptation; (b) VFOA recognition with adaptation; (c) VFOA parameter setting: training approach; (d) VFOA parameter setting: geometric approach.
VFOA annotation: using the predefined discrete set of VFOA targets F, the VFOA of each
person (PL and PR) was manually annotated on the basis of their gaze direction by a single
annotator using a multimedia interface. The annotator had access to all data streams, including
the central camera view (Fig. 1(a)). Specific annotation guidance was defined in [33].
IV. OVERVIEW OF THE PROPOSED VFOA RECOGNITION METHODS
In this section, schematic representations of the components of the VFOA recognition methods
proposed in this paper are provided in Fig. 2 to give a global view of the methods.
Fig. 2(a) presents the VFOA recognition method when no adaptation is used. The frames of an
input video are sent to the head pose tracking algorithm (described in Section V) which outputs
people’s head poses. These poses are then processed by the VFOA recognizer module (described
in Section VI-A), whose parameters are provided by a parameter setting module (Section VI-B).
In Fig. 2(b), the use of unsupervised adaptation for VFOA recognition is sketched (described
in Section VII). In this case, we employ batch processing: the whole input video is processed
by the head tracker to obtain the head poses of people over the entire meeting. Then, the
adaptation module estimates in an unsupervised fashion (without using any annotated data) the
VFOA recognizer parameters by fitting the recognizer model to the head poses while taking
into account priors on these parameters. Some of the parameters of these priors are provided
by the parameter setting module. Finally, the VFOA recognition module applies the parameters
obtained through unsupervised adaptation to head poses to output the recognized VFOA.
Fig. 2(c) and 2(d) describe the two options that are used to define the parameter setting
module involved in Fig. 2(a) and 2(b). The first option relies on training data: training videos
are sent to the head pose tracking module whose output is used in conjunction with manual
annotations of people’s VFOA to learn the VFOA recognition parameters relating head pose to
VFOA targets. The second option relies on a cognitive model of how people gaze at targets,
and uses the locations of people and objects in the room as input. Section VI-B describes how
the parameters are set in the two options and used when no adaptation is performed, while
Section VII-C describes how the same parameters are used to define the hyper-parameters of
the adaptation module.
V. HEAD POSE TRACKING
Head pose can be obtained in two ways: first, from the magnetic sensor readings (cf Section III);
we will consider this virtually noise-free data as our ground truth, denoted GT in the remainder
of the paper. Second, by applying a head pose tracker to the video stream. In this Section, we
summarize the computer vision probabilistic head tracker that we employed. Then, the pose
estimates provided by the tracker are compared with the GT and analyzed in detail, ultimately
giving us better insight into the VFOA recognition results presented in Section IX.
A. Probabilistic Method for Head Pose Tracking
The Bayesian formulation of the tracking problem is well known. Denoting the hidden state
representing the object configuration at time t by X_t and the observation extracted from the
image by Y_t, the objective is to estimate the filtering distribution p(X_t|Y_{1:t}) of the state X_t given
the sequence of all the observations Y_{1:t} = (Y_1, ..., Y_t) up to the current time. Given standard
assumptions, Bayesian tracking amounts to solving the following recursive equation:

p(X_t|Y_{1:t}) \propto p(Y_t|X_t) \int_{X_{t-1}} p(X_t|X_{t-1}) \, p(X_{t-1}|Y_{1:t-1}) \, dX_{t-1}   (1)

In non-Gaussian and non-linear cases, this can be done recursively using sampling approaches,
also known as particle filters (PF). The idea behind PF consists in representing the filtering
distribution using a set of N_s weighted samples (particles) {X^n_t, w^n_t, n = 1, ..., N_s} and updating
this representation when new data arrives. Given the particle set of the previous time step,
configurations of the current step are drawn from the proposal distribution X_t ~ \sum_n w^n_{t-1} p(X|X^n_{t-1}).
The weights are then computed as w_t \propto p(Y_t|X_t).
Four elements are important in defining a PF: i) a state model defining the object we are
interested in; ii) a dynamical model p(X_t|X_{t-1}) governing the temporal evolution of the state;
iii) a likelihood model measuring the adequacy of the data given the proposed configuration of the
tracked object; and iv) a sampling mechanism which has to propose new configurations in high
likelihood regions of the state space. These elements are described in the next paragraphs.

Fig. 3. (a) training head pose appearance range: pan and tilt angles range respectively from -90° to 90° and from -60° to 60° in 15° steps. (b) and (c) tracking features: texture features from Gaussian and Gabor filters (b) and skin color binary mask (c).
State Space: The state space contains both continuous and discrete variables. More precisely, the
state is defined as X = (S, θ, l), where S represents the head location and size, and θ represents
the in-plane head rotation. The variable l labels an element of the discretized set of possible
out-of-plane head poses² (see Fig. 3a).
Dynamical Model: The dynamics governs the temporal evolution of the state, and is defined as

p(X_t|X_{1:t-1}) = p(θ_t|θ_{t-1}, l_t) \, p(l_t|l_{t-1}, S_t) \, p(S_t|S_{t-1}, S_{t-2}).   (2)

The dynamics of the in-plane head rotation θ_t and of the discrete head pose l_t variables are learned
using head pose GT training data. Head location and size dynamics are modeled as second order
auto-regressive processes.
Observation Model: The observation model p(Y|X) measures the likelihood of the observation
for a given state value. The observations Y = (Y^{text}, Y^{col}) are composed of texture and color
observations (see Fig. 3(b) and Fig. 3(c)). Texture features are represented by the output of
three filters (a Gaussian and two Gabor filters at different scales) applied at locations sampled
from image patches extracted from the image and preprocessed by histogram equalization to
reduce the effects of lighting variations. Color features are represented by a binary skin mask extracted
using a temporally adapted skin color model. Assuming that, given the state value, the texture and
color observations are independent, the observation likelihood is modeled as:

p(Y|X = (S, θ, l)) = p_{text}(Y^{text}(S, θ)|l) \, p_{col}(Y^{col}(S, θ)|l)   (3)
² Note that (θ, l) is another Euler decomposition (using different axes) of the head pose, which differs from the one described
in Subsection III-B (cf Fig. 3a). Its main computational advantage is that one of the angles corresponds to the in-plane rotation.
It is straightforward to transform from one decomposition to the other.
where p_{col}(·|l) and p_{text}(·|l) are pose dependent models. For a given hypothesized configuration
X, the parameters (S, θ) define an image patch on which the features are computed, while the
exemplar index l selects the appropriate appearance model.
Sampling Method: In this work, we use Rao-Blackwellization, a process in which we apply the
standard PF algorithm to the tracking variables S and θ while applying an exact filtering step
to the exemplar variable l. The method theoretically results in a reduced estimation variance, as
well as a reduction of the number of samples.
For more details about the models and algorithm, the reader is referred to [18]. Finally, in terms
of complexity, the head tracker (in matlab) can process around 1 frame per second.
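To make the filtering recursion of Eq. 1 concrete, the following minimal sketch implements a generic bootstrap particle filter on a toy one-dimensional state. It is not the Rao-Blackwellized tracker of [18]: the state, dynamics and likelihood below are placeholder assumptions chosen only to illustrate the sample/weight/estimate loop.

```python
# A minimal, generic bootstrap particle filter sketch illustrating the recursion in
# Eq. (1).  This is NOT the authors' Rao-Blackwellized head tracker: the 1-D state,
# Gaussian dynamics and Gaussian likelihood are placeholder assumptions used only to
# show the predict/weight/estimate loop.
import numpy as np

def particle_filter(observations, n_particles=500, sigma_dyn=0.1, sigma_obs=0.5):
    rng = np.random.default_rng(0)
    # Initialise particles and uniform weights.
    particles = rng.normal(0.0, 1.0, n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for y in observations:
        # Sample from the proposal: draw ancestors according to the previous weights,
        # then propagate them through the dynamical model p(X_t | X_{t-1}).
        ancestors = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[ancestors] + rng.normal(0.0, sigma_dyn, n_particles)
        # Re-weight with the observation likelihood p(Y_t | X_t).
        weights = np.exp(-0.5 * ((y - particles) / sigma_obs) ** 2)
        weights /= weights.sum()
        # Point estimate of the filtering distribution (posterior mean).
        estimates.append(np.sum(weights * particles))
    return np.array(estimates)

if __name__ == "__main__":
    true_states = np.cumsum(np.random.default_rng(1).normal(0, 0.1, 100))
    obs = true_states + np.random.default_rng(2).normal(0, 0.5, 100)
    print(particle_filter(obs)[:5])
```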
B. Head Pose Tracking Evaluation
Protocol: We used a two-fold evaluation protocol, where for each fold, we used half (8 people)
of our IHPD database (see Sec.III-B) as the training set to learn the pose dynamic model and the
remaining half as the test set. Initialization was done automatically using a simple background
subtraction technique, modeling the distribution of a pixel's background color with a single Gaussian,
under the assumption that a background image is available and that there is one face in the left
and one in the right half of the image (cf Fig. 1(c)).
It is important to note that the pose dependent appearance models were not learned using the
same people or head images gathered in the same meeting room environment. We used the
Prima-Pointing database [34], which contains 15 individuals recorded over 93 different poses
(see Fig. 3(a)). However, when learning appearance models over whole head patches, as done
in [18], we experienced tracking failures with 2 out of the 16 people of the IHPD database
(see Section III) whose hair appearance was not represented in the Prima-Pointing dataset (e.g.
one of those two people was bald). As a remedy, we trained the appearance models on patches
centered around the visible part of the face, not the head. With this modification, no failure was
observed, but performance was overall slightly worse than that reported in [18].
Performance measures: Three error measures are used: the average errors in pan, tilt and
roll angles, i.e. the average absolute difference between the pan, tilt and roll of the ground
truth (GT) and of the tracker estimates. We also report the median error value, which is
less affected by the very large errors due to erroneous tracking.
Results: The statistics of the errors are shown in Table I. Overall, given the small head size,
and the fact that the appearance training set is composed of faces recorded in an external setup
(different people, different viewing and illumination conditions), the results are quite good,
with a majority of head pan errors smaller than 12° (see Figure 4). However, these results hide a
large discrepancy between individuals. For instance, the average pan error ranges from 7° to 30°,
and depends mainly on whether the tracked person's appearance is well represented by the
appearances in the training set used to learn the appearance model. This was more often
the case for people seated on the right than on the left, as shown by Table I.

TABLE I
PAN/TILT/ROLL ERROR STATISTICS (IN DEGREES) FOR PERSON LEFT/RIGHT, AND DIFFERENT CONFIGURATIONS OF THE TRUE HEAD POSE.

condition | right persons | left persons | pan near frontal (|α| < 45°) | pan near profile (|α| > 45°) | tilt near frontal (|β| < 30°) | tilt far from frontal (|β| > 30°)
stat      | mean    med   | mean    med  | mean    med                  | mean    med                  | mean    med                   | mean    med
pan       | 11.4    8.9   | 14.9    11.3 | 11.6    9.5                  | 16.9    14.7                 | 12.7    10.0                  | 18.6    15.9
tilt      | 19.8    19.4  | 18.6    17.1 | 19.7    18.9                 | 17.5    17.5                 | 19.0    18.8                  | 22.1    21.4
roll      | 14.0    13.2  | 10.3    8.7  | 10.1    8.8                  | 18.3    18.1                 | 11.7    10.8                  | 18.1    16.8

Fig. 4. (a) head pose Euler rotation angles; note that the z axis indicates the head pointing direction. (b) and (c) pan, tilt
and roll tracking errors, with (b) average errors for each person (R for right and L for left person) and (c) distribution of tracking
errors over the whole dataset.
Table I also shows that overall the pan and roll tracking errors are smaller than the tilt errors.
The main reason is that tilt estimation is more sensitive to the quality of the face localization
than the pan, as pointed out by other researchers [28]. Indeed, even from a perceptual point of
view, visually determining head tilt is more difficult than determining head pan or head roll.
Table I further details the errors depending on whether the true pose is near frontal or not. We
can observe that, in near frontal poses (|α| ≤ 45° or |β| ≤ 30°), the head pose tracking
estimates are more accurate, in particular for the pan and roll values. This can be understood
since for near profile poses, a variation in pan introduces much less appearance change than the
same variation in a near frontal view. Similarly, for high tilt values, the face-image distortion
introduced by perspective shortening affects the quality of the observations.
Finally, these results are comparable to those obtained by others in similar conditions. For
instance, [27] achieved a pan estimation error of 16.9 degrees for poses near the frontal position,
and 19.2 degrees for poses near profile (|α| > 45°). In [12], a neural network is used to train a
head pose classifier from data recorded directly in two meeting rooms. When using 15 people
for training and 2 for testing, average errors of 5 degrees in pan and tilt are reported. However,
when training the models in one room and testing on data from the other meeting room, the
average errors rise to 10 degrees.
VI. VISUAL FOCUS OF ATTENTION MODELING
In this Section, we first describe the models used to recognize the VFOA from the head pose
measurements, then the two alternatives we adopted to set the model parameters.
A. VFOA recognizer models
Modeling VFOA with a Gaussian Mixture Model (GMM): Let s_t ∈ F denote the VFOA state,
and z_t the head pointing direction of a person at a given time instant t. The head pointing
direction is defined by the head pan (α) and tilt (β) angles, i.e. z_t = (α_t, β_t), since the head
roll (γ) has no effect on the head pointing direction by definition (see Fig. 3(a)). Estimating the visual
focus can be posed in a probabilistic framework as finding the VFOA state maximizing the a
posteriori probability:

\hat{s}_t = \arg\max_{s_t \in F} p(s_t|z_t), \quad \text{with} \quad p(s_t|z_t) = \frac{p(z_t|s_t) p(s_t)}{p(z_t)} \propto p(z_t|s_t) p(s_t)   (4)

For each VFOA f_i ∈ F which is not unfocused, p(z_t|s_t = f_i), which expresses the likelihood
of the pose observations for the VFOA state f_i, is modeled as a Gaussian distribution
N(z_t; µ_i, Σ_i) with mean µ_i and full covariance matrix Σ_i. The unfocused state is modeled as
a uniform distribution, p(z_t|s_t = unfocused) = u with u = 1/(180 × 180), as the head pan and
tilt angles can vary from -90° to 90°. In Eq. 4, p(s_t = f_i) = π_i denotes the prior information we
have on a VFOA target f_i. Thus, in this modeling, the total pose distribution is represented as
a GMM (plus one uniform mixture component), with the mixture index i denoting the focus target:

p(z_t|λ_G) = \sum_{s_t} p(z_t, s_t|λ_G) = \sum_{s_t} p(z_t|s_t, λ_G) p(s_t|λ_G) = \sum_{i=1}^{K-1} π_i N(z_t; µ_i, Σ_i) + π_K u,   (5)

where λ_G = {µ = (µ_i)_{i=1:K-1}, Σ = (Σ_i)_{i=1:K-1}, π = (π_i)_{i=1:K}} represents the parameter set of
the GMM model. Fig. 12 illustrates how the pan-tilt space is split into different VFOA regions
when applying the decision rule of Eq. 4 with the GMM modeling.
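As an illustration of the decision rule of Eq. 4 with the GMM of Eq. 5, the following sketch classifies a single (pan, tilt) observation into a VFOA target or the unfocused class. The target set, means, covariances and priors are hypothetical placeholders, not the values used in our experiments.

```python
# A minimal sketch of the GMM-based VFOA decision rule of Eq. (4)-(5): each VFOA
# target is a 2-D Gaussian over (pan, tilt), and the "unfocused" class is uniform
# over [-90, 90] x [-90, 90].  Means, covariances and priors are illustrative
# placeholders, not the values used in the paper.
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical target set and parameters (pan, tilt) in degrees.
TARGETS = ["PR", "O1", "O2", "SS", "TB"]
MEANS = {"PR": (55, 0), "O1": (0, 0), "O2": (-20, 0), "SS": (-45, 5), "TB": (0, -35)}
COVS = {f: np.diag([15.0**2, 12.0**2]) for f in TARGETS}
PRIORS = {f: 1.0 / 6 for f in TARGETS}          # uniform prior over the 6 classes
PRIOR_UNFOCUSED = 1.0 / 6
U = 1.0 / (180.0 * 180.0)                        # uniform density for "unfocused"

def classify_vfoa(pose):
    """Return arg max_f p(f) p(z|f) for a head pose z = (pan, tilt)."""
    scores = {f: PRIORS[f] * multivariate_normal.pdf(pose, MEANS[f], COVS[f])
              for f in TARGETS}
    scores["U"] = PRIOR_UNFOCUSED * U
    return max(scores, key=scores.get)

print(classify_vfoa((-42.0, 3.0)))   # expected to map to the slide screen (SS)
```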
Modeling VFOA with a Hidden Markov Model (HMM): The GMM approach does not account
for the temporal dependencies between the VFOA events. To introduce such dependencies,
we consider an HMM, which is a natural extension of the GMM approach for modeling
temporal dependencies between the VFOA events. Denoting the VFOA sequence by s_{0:T} and
the observation sequence by z_{1:T}, the joint probability density function of states and
observations can be written:

p(s_{0:T}, z_{1:T}) = p(s_0) \prod_{t=1}^{T} p(z_t|s_t) \, p(s_t|s_{t-1})   (6)

In this equation, the emission probabilities p(z_t|s_t = f_i) are modeled as in the previous case
(i.e. Gaussian distributions for the regular focus targets, a uniform distribution for the unfocused
case). However, in the HMM modeling, the static prior distribution on VFOA targets is replaced
by a discrete transition matrix A = (a_{i,j}), defined by a_{i,j} = p(s_t = f_j|s_{t-1} = f_i), which models
the probability of passing from the focus f_i to the focus f_j. Thus, the set of parameters of the
HMM model is λ_H = {µ, Σ, A = (a_{i,j})_{i,j=1:K}}. With this model, given the observation sequence,
the VFOA recognition is performed by estimating the optimal sequence of focus targets which
maximizes p(s_{0:T}|z_{1:T}). This optimization is efficiently conducted using the Viterbi algorithm
[35]³.
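The sketch below shows a compact Viterbi decoder for this HMM, with Gaussian emissions for the regular targets and a uniform emission for the unfocused state. It is a generic implementation under the assumptions stated in the comments, not the code used in our experiments; the default transition matrix A^u of Section VI-B can be passed directly as the `trans` argument.

```python
# A compact Viterbi decoding sketch for the VFOA HMM of Eq. (6): Gaussian emissions
# for the regular targets plus a uniform "unfocused" emission (assumed to be the
# last state), and a transition matrix A.  Parameter values are left to the caller.
import numpy as np
from scipy.stats import multivariate_normal

def viterbi(observations, means, covs, trans, init, uniform_density=1.0 / (180 * 180)):
    """observations: (T, 2) array of (pan, tilt); the last state is 'unfocused'."""
    K = len(init)                       # number of VFOA states (incl. unfocused)
    T = len(observations)
    log_emis = np.empty((T, K))
    for i in range(K - 1):
        log_emis[:, i] = multivariate_normal.logpdf(observations, means[i], covs[i])
    log_emis[:, K - 1] = np.log(uniform_density)   # unfocused state
    log_A = np.log(trans)
    delta = np.log(init) + log_emis[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A            # rows: previous state, cols: next
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                  # backtrack the optimal sequence
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```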
B. VFOA Recognizer Parameter Setting
Gaussian Parameter Setting using labeled Training Data: Since in many meeting settings, people
are mostly static and seated at the same physical positions, we could set the model parameters
using training data. Thus, given training data with VFOA annotations and head pose measurements,
we can readily estimate all the parameters of the GMM or HMM models. Parameters
learned with this training approach will be denoted with an l superscript. Note that µ^l_i and Σ^l_i are
learned by first computing the VFOA means and covariances per meeting and then averaging
the results over the meetings belonging to the training set.
Gaussian Parameter Setting using a Geometric Model: The training approach to parameter learning
is straightforward when annotated data is available. However, annotating the VFOA of people
in video recordings is tedious and time consuming, as training data needs to be gathered and
annotated for each meeting setup. In the case of moving people, this is impossible. As an
alternative, we propose a model that exploits the geometric and cognitive nature of the problem.
The parameters set with this model will be denoted with a superscript g (e.g. µ^g_i).

³ In principle, such a decoding procedure is performed in batch. However, efficient online approximations are available.

Fig. 5. Relationship between the gazing direction (α_G), the head orientation (α_H), and the eye contribution (α_E).
Assuming that we have a camera calibrated w.r.t. the room, given a head location and a VFOA
target location, it is possible to derive the Euler angles associated with the gaze direction. As
gazing at a target is usually accomplished by rotating both the eyes (’eye-in-head’ rotation)
and the head in the same direction, the head is only partially oriented towards the gaze. In
neurophysiology and cognitive sciences, researchers studying the dynamics of the head/eye
motions involved in saccadic gaze shifts have found that the relative contribution of the head
and eyes towards a given gaze shift follows simple rules [16], [31]. While the experimental
framework employed in these papers do not completely match the meeting room scenario, we
have exploited these findings to propose a model for predicting a person’s head pose given his
gaze target.
The proposed geometric model is presented in Fig. 5. Given a person P whose reference head
pose corresponds to looking straight ahead in the N direction, and given that he is gazing towards
D, the head points in the direction H according to:

α_H = κ_α α_G  if |α_G| > ξ_α,  and  0 otherwise,   (7)

where α_G and α_H denote the gaze pan and the actual head pan angle respectively, both w.r.t.
the reference direction N. The parameters of this model, κ_α and ξ_α, are constants independent
of the gaze target, but usually depend on individuals [16]. While there is a consensus about the
linearity aspect of the relation in Eq. 7, some researchers reported observing head movements
for all gaze shift amplitudes (i.e. ξ_α = 0), while others did not. In this paper, we will assume
ξ_α = 0. Besides, Eq. 7 is only valid if the contribution of the eyes to the gaze shift (given by
α_E = α_G - α_H) does not exceed a threshold, usually taken at 35°. Finally, in [16], it is shown
that the tilt angle β follows a similar linearity rule. However, in this case, the contribution of
the head to the gaze shift is usually lower than for the pan case. Typical values range from 0.2
to 0.5 for κ_β, and from 0.5 to 0.8 for κ_α.
We assume we know the approximate positions of the people's heads, VFOA targets, and
camera within the room⁴. The cognitive model can be used to predict the values of the mean
angles µ of the Gaussian distribution associated with each VFOA target. The reference direction N (Fig. 5)
will be assumed to grossly correspond to the mean of all the gaze target directions. For both
person left and right, it corresponds to looking at O1 (cf Fig. 1(c)). The covariances Σ of
the Gaussian distributions were assumed to be diagonal, and were set by taking into account
the physical target size, and the fact that VFOA targets corresponding to head poses in profile
are associated with larger pan tracking errors. The specific values were: σ_α(O1, O2) = 12°,
σ_α(PR, PL, SS) = 15°, and σ_α(TB) = 17° for the pan, and σ_β(O1, O2, PR, PL, SS) = 12°,
σ_β(TB) = 15° for the tilt.
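The following sketch illustrates how Eq. 7 (with ξ_α = 0) can be turned into predicted head pose means µ^g: gaze pan/tilt angles are computed from assumed 3-D head and target positions and then scaled by κ_α and κ_β. The coordinate convention, room layout and κ values are illustrative assumptions, not the calibrated quantities used in the paper.

```python
# A small sketch of the geometric parameter setting: given approximate 3-D positions
# of a participant's head and of a VFOA target, compute the gaze pan/tilt towards the
# target and apply the linear head-contribution rule of Eq. (7) (with xi = 0) to
# predict the mean head pose mu^g for that target.  Coordinates, kappa values and the
# reference direction below are illustrative assumptions.
import numpy as np

KAPPA_ALPHA, KAPPA_BETA = 0.65, 0.35     # head contribution to pan / tilt gaze shifts

def gaze_angles(head_pos, target_pos):
    """Pan/tilt (degrees) of the direction from the head to the target."""
    d = np.asarray(target_pos, float) - np.asarray(head_pos, float)
    pan = np.degrees(np.arctan2(d[0], d[1]))                    # x: right, y: forward
    tilt = np.degrees(np.arctan2(d[2], np.hypot(d[0], d[1])))   # z: up
    return pan, tilt

def predicted_head_pose(head_pos, target_pos, ref_pan, ref_tilt):
    """Mean head pose mu^g relative to the reference gaze direction N."""
    pan, tilt = gaze_angles(head_pos, target_pos)
    return (KAPPA_ALPHA * (pan - ref_pan), KAPPA_BETA * (tilt - ref_tilt))

# Hypothetical room layout (metres): for person right, looking at O1 defines N.
head = (1.0, 0.0, 1.2)
o1, screen = (0.0, 2.0, 1.2), (-1.5, 3.0, 1.8)
ref_pan, ref_tilt = gaze_angles(head, o1)
print(predicted_head_pose(head, screen, ref_pan, ref_tilt))
```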
Setting the VFOA Prior Distribution π and Transition Matrix A: When training data is available,
one could learn these parameters. If the training meetings exhibit a specific structure,
as is the case in our database, where the main and secondary organizers always occupy the
same seats, the learned prior will have a beneficial effect on the recognition performance for
similar unseen meetings. However, at the same time, this learned prior can considerably limit
the generalization to other data sets, since by simply exchanging seats between participants,
we obtain meeting sessions with different prior distributions. Thus, we investigated alternatives
that avoid favoring any meeting structure. In the GMM case, this was done by considering
a uniform distribution (denoted π^u) for the prior π. In the HMM case, transitions defining
the probability of keeping the same focus were favored and transitions to other focuses were
distributed uniformly according to: a_{i,i} = ε < 1 (we used ε = 0.75), and a_{i,j} = (1 - ε)/(K - 1) for i ≠ j,
where K is the number of VFOA targets. We denote by A^u the constructed transition matrix.
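A tiny sketch of this default setting, building the uniform prior π^u and the transition matrix A^u with self-transition probability ε:

```python
# Build the uniform prior pi^u over the K VFOA targets and the transition matrix A^u
# favouring self-transitions: epsilon on the diagonal, the remaining mass spread
# uniformly over the other targets (rows sum to 1).
import numpy as np

def default_prior_and_transitions(K, eps=0.75):
    pi_u = np.full(K, 1.0 / K)
    A_u = np.full((K, K), (1.0 - eps) / (K - 1))
    np.fill_diagonal(A_u, eps)
    return pi_u, A_u

pi_u, A_u = default_prior_and_transitions(K=6)   # 6 VFOA targets in our setup
print(A_u[0])
```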
VII. VFOA MODELS ADAPTATION
The VFOA recognizers described in the previous section are generic and can be applied
indifferently to any new person seated at the location corresponding to the defined model. In
practice, however, we observed that people have personal ways of looking at targets. For example,
some people use their eye-in-head rotation capabilities more and turn their head less towards
the focused target than others (see Fig. 6(a) and Fig. 6(b)).

⁴ The relation in Eq. 7 is valid in the person's head reference frame. The camera position is needed in order to transform the
obtained pose values into head poses w.r.t. the camera.

Fig. 6. Examples of gaze behaviours. (a) and (b): in both images, the person on the right looks at the target O1. In (b),
however, the head is rotated more toward O1 than in (a).

In addition, our head pose tracking
system is sensitive to the visual appearance of people, and can introduce a systematic bias in the
estimated head pose for a given person. As a consequence, the parameters of the generic models
might not be the best for a given person. As a remedy we propose to exploit the Maximum A
Posteriori (MAP) estimation principle to adapt, in an unsupervised fashion, the generic VFOA
models to the data of each new meeting, and thus produce models adapted to an individual’s
characteristics.
A. VFOA Maximum a Posteriori (MAP) Adaptation Principle
The MAP adaptation procedure we followed is a batch process, as explained in Section IV.
Its principle is the following: let z = z_1, ..., z_T denote the unlabeled sequence of head poses
of one person, to which we want to adapt our model, and λ ∈ Λ the parameters of the VFOA
recognizer to be estimated from the head pose data. The MAP estimate λ̂ of the parameters is
then defined as:

λ̂ = \arg\max_{λ ∈ Λ} p(λ|z) = \arg\max_{λ ∈ Λ} p(z|λ) \, p(λ)   (8)

where p(z|λ) is the data likelihood and p(λ) is the prior on the parameters. The goal is thus to
find the parameters that best fit the observed head pose distribution, while avoiding too large a
deviation from sensible values through the use of priors on the parameters. The choice of the prior
distribution is crucial for the MAP estimation. In [36] it is shown that, for GMMs and HMMs, by
selecting the prior probability density function (pdf) on λ as the product of appropriate conjugate
distributions of the likelihood of the data⁵, the MAP estimation can be solved using
the Expectation-Maximization (EM) algorithm, as detailed in the next two sub-sections.

⁵ A prior distribution g(λ) is the conjugate distribution of a likelihood function f(z|λ) if the posterior f(z|λ)g(λ) belongs to
the same distribution family as g.
B. VFOA GMM and HMM MAP Adaptation
GMM MAP Adaptation: In the case where the VFOA targets are modeled by a GMM, the data likelihood
is p(z|λ_G) = \prod_{t=1}^{T} p(z_t|λ_G), where p(z_t|λ_G) is the mixture model given in Eq. 5, and λ_G
are the parameters to be learnt.
For this model, it is possible to express the prior probability as a product of individual conjugate
priors [36]. Accordingly, the conjugate prior of the multinomial mixture weights is the Dirichlet
distribution D(νw_1, ..., νw_K), whose pdf is given by:

p^D_{νw_1,...,νw_K}(π_1, ..., π_K) \propto \prod_{i=1}^{K} π_i^{νw_i - 1}   (9)
Additionally, the conjugate prior for the Gaussian mean and the inverse covariance matrix of a
given mixture is the Normal-Wishart distribution W(τ, m_i, d, V_i) (i = 1, ..., K-1), with pdf

p^W_i(µ_i, Σ_i^{-1}) \propto |Σ_i^{-1}|^{\frac{d-p}{2}} \exp\left(-\frac{τ}{2}(µ_i - m_i)' Σ_i^{-1} (µ_i - m_i)\right) \exp\left(-\frac{1}{2} \mathrm{tr}(V_i Σ_i^{-1})\right), \quad d > p   (10)

where tr denotes the trace operator, (µ_i - m_i)' denotes the transpose of (µ_i - m_i), and p denotes
the observations' dimension. Thus the prior distribution on the set of all the parameters is defined
as

p(λ_G) = p^D_{νw_1,...,νw_K}(π_1, ..., π_K) \prod_{i=1}^{K-1} p^W_i(µ_i, Σ_i^{-1}).   (11)
The MAP estimate λ̂_G of the distribution p(z|λ_G)p(λ_G) can thus be computed using the EM
algorithm by recursively applying the following computations (see Fig. 7) [36]:

c_{it} = \frac{\hat{π}_i \, p(z_t|\hat{µ}_i, \hat{Σ}_i)}{\sum_{j=1}^{K} \hat{π}_j \, p(z_t|\hat{µ}_j, \hat{Σ}_j)} \quad \text{and} \quad c_i = \sum_{t=1}^{T} c_{it}   (12)

\bar{z}_i = \frac{1}{c_i} \sum_{t=1}^{T} c_{it} z_t \quad \text{and} \quad S_i = \frac{1}{c_i} \sum_{t=1}^{T} c_{it} (z_t - \bar{z}_i)(z_t - \bar{z}_i)'   (13)

where λ̂_G = (\hat{π}, \hat{µ}, \hat{Σ}) denotes the current parameter fit. Given these coefficients, the M step
re-estimation formulas are given by:

\hat{π}_i = \frac{νw_i - 1 + c_i}{ν - K + T}, \quad \hat{µ}_i = \frac{τ m_i + c_i \bar{z}_i}{τ + c_i} \quad \text{and} \quad \hat{Σ}_i = \frac{V_i + c_i S_i + \frac{c_i τ}{c_i + τ}(m_i - \bar{z}_i)(m_i - \bar{z}_i)'}{d - p + c_i}   (14)
The setting of the hyper-parameters of the prior distribution p(λ_G) in Eq. 11, which is discussed
at the end of this Section, is important: as the adaptation is unsupervised, only the prior
prevents the adaptation process from deviating from meaningful VFOA distributions.

Input: adaptation parameters (ν, {w_i}) for the Dirichlet prior, (τ, d, {m_i, V_i}) for the Wishart prior.
Output: estimated parameters λ̂_G of the recognizer model.
Initialization of λ̂_G: π̂_i = w_i, µ̂_i = m_i, Σ̂_i = V_i/(d - p).
EM: repeat until convergence:
1) Expectation: compute c_{it}, z̄_i and S_i (Eq. 12 and 13) using the current parameter set λ̂_G.
2) Maximization: update the parameter set λ̂_G using the re-estimation formulas (Eq. 14).

Fig. 7. GMM MAP adaptation procedure.
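The following sketch implements the adaptation loop of Fig. 7 (Eq. 12-14) for the Gaussian components; for simplicity the uniform unfocused component is kept fixed here, and the hyper-parameter values are left to the caller. It is a minimal illustration, not the code used for the experiments.

```python
# A compact sketch of the unsupervised MAP adaptation of Fig. 7 / Eq. (12)-(14) for
# the Gaussian components of the VFOA GMM.  The uniform "unfocused" component is
# kept fixed for simplicity; priors and hyper-parameters are supplied by the caller.
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_gmm(z, m, V, w, nu, tau, d, n_iter=20, u=1.0 / (180 * 180)):
    """z: (T, p) head poses; m: (K-1, p) prior means (float); V: (K-1, p, p) Wishart
    prior matrices; w: (K,) Dirichlet prior weights (last entry = unfocused)."""
    T, p = z.shape
    K = len(w)
    # Initialisation from the priors (Fig. 7).
    pi = w.astype(float).copy()
    mu = m.astype(float).copy()
    Sigma = V / (d - p)
    for _ in range(n_iter):
        # E step (Eq. 12-13): responsibilities of the K-1 Gaussians + uniform class.
        lik = np.stack([multivariate_normal.pdf(z, mu[i], Sigma[i])
                        for i in range(K - 1)] + [np.full(T, u)], axis=1)
        resp = pi * lik
        resp /= resp.sum(axis=1, keepdims=True)
        c = resp.sum(axis=0)                                   # c_i
        # M step (Eq. 14).
        pi = (nu * w - 1 + c) / (nu - K + T)
        for i in range(K - 1):
            zbar = resp[:, i] @ z / c[i]
            diff = z - zbar
            S = (resp[:, i, None] * diff).T @ diff / c[i]
            dm = (m[i] - zbar)[:, None]
            mu[i] = (tau * m[i] + c[i] * zbar) / (tau + c[i])
            Sigma[i] = (V[i] + c[i] * S + c[i] * tau / (c[i] + tau) * dm @ dm.T) \
                       / (d - p + c[i])
    return pi, mu, Sigma
```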
VFOA MAP HMM Adaptation: The VFOA HMM can also be adapted in an unsupervised way
to new test data using the MAP framework [36]. The parameters to adapt in this case are the
transition matrix and the parameters of the emission probabilities, λ_H = {A, (µ, Σ)}.
The adaptation of the HMM parameters leads to a procedure similar to the GMM adaptation
case. Indeed, the prior on the Gaussian parameters follows the same Normal-Wishart density
(Eq. 10), and the Dirichlet prior on the static VFOA prior π is replaced by a Dirichlet prior on
each row p(·|s = f_i) = a_{i,·} of the transition matrix. Accordingly, the full prior is:

p(λ_H) \propto \prod_{i=1}^{K} p^D_{νb_{i,1},...,νb_{i,K}}(a_{i,1}, ..., a_{i,K}) \prod_{i=1}^{K-1} p^W_i(µ_i, Σ_i^{-1})   (15)

Then the EM algorithm to compute the MAP estimate can be conducted in the following manner.
For a sequence of observations z = (z_1, ..., z_T), the hidden states are now composed of a
corresponding state sequence s_1, ..., s_T, which allows us to compute the joint state-observation
density (cf Eq. 6). Thus, in the E step, one needs to compute ξ_{i,j,t} = p(s_{t-1} = f_i, s_t = f_j|z, λ̂_H)
and c_{i,t} = p(s_t = f_i|z, λ̂_H), which respectively denote the joint probability of being in the states
f_i and f_j at times t-1 and t, and the probability of being in state f_i at time t, given the current
model λ̂_H and the observed sequence z. These values can be obtained using the Baum-Welch
forward-backward algorithm [35]. Given these values, the re-estimation formulas for the means
and covariance matrices are the same as those in Eq. 14, and the transition matrix parameters
are re-estimated as follows:

\hat{a}_{i,j} = \frac{νb_{i,j} - 1 + \sum_{t=1}^{T-1} ξ_{i,j,t}}{ν - K + \sum_{j=1}^{K} \sum_{t=1}^{T-1} ξ_{i,j,t}}.   (16)
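As a small illustration of Eq. 16, the sketch below applies the transition-matrix MAP update given the expected transition counts ξ, which are assumed to have already been computed by a Baum-Welch forward-backward pass (not shown here).

```python
# A minimal sketch of the transition-matrix MAP update of Eq. (16), assuming the
# expected transition counts xi[i, j] = sum_t xi_{i,j,t} are given.
import numpy as np

def map_update_transitions(xi, b, nu):
    """xi: (K, K) expected transition counts; b: (K, K) Dirichlet prior rows (rows sum to 1)."""
    K = xi.shape[0]
    num = nu * b - 1 + xi
    den = nu - K + xi.sum(axis=1, keepdims=True)   # row-wise normalisation of Eq. (16)
    return num / den                                # each row of the result sums to 1
```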
C. Choice of Prior Distribution Parameters
In this section we discuss the impact of the hyper-parameter settings on the MAP estimates,
through the analysis of the re-estimation formula (Eq. 14). Before going into details, recall that
T denotes the size of the data set available for adaptation, and K is the number of VFOA targets.
Parameter values for the Dirichlet distribution: The Dirichlet distribution is defined by two kinds
of parameters: a scale factor ν and the prior values of the mixture weights w_i (with \sum_i w_i = 1).
The scale factor ν controls the balance between the prior distribution on the mixture weights
w and the data. If ν is small (resp. large) with respect to T - K, the adaptation is dominated
by the data (resp. by the prior, i.e. almost no adaptation occurs). When ν = T - K, the data
and prior contribute equally to the adaptation process. In our experiments, the hyper-parameter
ν is selected through cross-validation among the values in C_ν = {ν_1 = T - K, ν_2 = 2(T - K), ν_3 = 3(T - K)}.
The prior weights w_i, on the other hand, are defined according to the prior knowledge we have
on the distribution of VFOA targets. Since, as explained before, we do not want to assume any
knowledge about the VFOA target distribution, the w_i are set uniformly equal to 1/K.
Parameter values for the Normal-Wishart distribution: This distribution defines the prior on the
mean µ_i and covariance Σ_i of one Gaussian. The adaptation of the mean is essentially controlled
by two parameters (see Eq. 14): the prior value for the mean, m_i, which will be set to the value
computed using either the learning approach (m_i = µ^l_i) or the geometric approach (m_i = µ^g_i), and a scalar
τ, which linearly controls the contribution of the prior m_i to the estimated mean. As the average
value of c_i is T/K, in the experiments we will select τ through cross-validation among the values
in C_τ = {τ_1 = T/(2K), τ_2 = T/K, τ_3 = 2T/K, τ_4 = 5T/K}. Thus, with the first value τ_1, the mean
adaptation is on average dominated by the data. With τ_2, the adaptation is balanced between the
data and the prior distribution on the means, and with the last two values, the adaptation is dominated
by the priors on the means.
The prior on the covariance is more difficult to set. It is defined by the Wishart distribution
parameters, namely the prior covariance matrix V_i and the number of degrees of freedom d.
From Eq. 14, we see that the data covariance and the deviation of the data mean from the
mean prior also influence the MAP covariance estimate. As the prior Wishart covariance, we will
take V_i = (d - p) Ṽ_i, where Ṽ_i is either Σ^l_i or Σ^g_i, the covariance of target f_i set either using
training data or the geometric model (Subsection VI-B), respectively. The weighting (d - p) is
important, as it allows V_i to be of the same order of magnitude as the data variance c_i S_i. In
the experiments, we will use d = 5T/K, which puts an emphasis on the prior, and prevents the
adaptation from deviating far from the covariance priors.
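For reference, a tiny helper sketching the candidate sets C_ν and C_τ discussed above, expressed in terms of the adaptation set size T and the number of targets K:

```python
# Candidate hyper-parameter grids used for cross-validation: the Dirichlet scale nu
# and the Normal prior scale tau, as functions of the adaptation set size T and the
# number of VFOA targets K (values taken from the text above).
def candidate_hyperparams(T, K):
    C_nu = [T - K, 2 * (T - K), 3 * (T - K)]
    C_tau = [T / (2 * K), T / K, 2 * T / K, 5 * T / K]
    return C_nu, C_tau

print(candidate_hyperparams(T=12000, K=6))
```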
VIII. EVALUATION SET UP
The evaluation of the VFOA models was conducted using the IHPD database (Section III).
Below, we describe our performance measures and give details about the experimental protocol.
A. Performance Measures
We propose two kinds of error measures for performance evaluation.
The first is the Frame-based Recognition Rate (FRR), which corresponds to the percentage of frames, or
equivalently, the proportion of time, during which the VFOA has been correctly recognized.
This rate, however, can be dominated by VFOA events of long duration (a VFOA event is
defined as a temporal segment with the same VFOA label). Since we are also interested in
the dynamics of the VFOA, which contains information related to interaction, we also need a
measure reflecting how well these events, short or long, are recognized.
The second is the event-based precision, recall, and F-measure. Let us consider two sequences of VFOA events:
the GT sequence G obtained from human annotation, and the recognized sequence R obtained
through VFOA estimation. The GT sequence is defined as G = (G_i = (l_i, I_i = [b_i, e_i]))_{i=1,...,N_G},
where N_G is the number of events in the ground truth G, l_i ∈ F is the i-th VFOA event label,
and b_i and e_i are the beginning and end time instants of the event G_i. The recognized sequence R
is defined similarly. To compute the performance measures, the two sequences are first aligned
using a string alignment procedure that takes into account the temporal extent of the events.
More precisely, the matching distance between two events G_i and R_j is defined as:

d(G_i, R_j) = \begin{cases} 1 - F_I & \text{if } l_i = l_j \text{ and } I = I_i \cap I_j \neq \emptyset \\ 2 & \text{otherwise (i.e. the events do not match)}, \end{cases}   (17)

\text{with} \quad F_I = \frac{2 ρ_I π_I}{ρ_I + π_I}, \quad ρ_I = \frac{|I|}{|I_i|}, \quad π_I = \frac{|I|}{|I_j|}   (18)
where |·| denotes the cardinality operator, and F_I measures the degree of overlap between two
events. Then, given the alignment, we can compute the recall ρ_E, the precision π_E, and the
F-measure F_E for each person, measuring the event recognition performance, defined as:

ρ_E = \frac{N_{matched}}{N_G}, \quad π_E = \frac{N_{matched}}{N_R} \quad \text{and} \quad F_E = \frac{2 ρ_E π_E}{ρ_E + π_E},   (19)

where N_matched represents the number of events in the recognized sequence that match the same
event in the GT after alignment. The recall measures the percentage of ground truth events that
are correctly recognized, while the precision measures the percentage of estimated events that
are correct.
Fig. 8. Distribution of overlap measures F_I between true and estimated matched events, for the conditions (GT, Left),
(GT, Right), (TR, Left) and (TR, Right). The estimated events were obtained using the HMM approach. GT and TR
respectively denote the use of GT head pose data and of tracking estimates. Left and Right denote person left and person
right respectively.
TABLE II
MODEL ACRONYMS: ACRONYM COMBINATIONS DESCRIBE WHICH EXPERIMENTAL CONDITIONS ARE USED. FOR EXAMPLE,
GT-HMM-GE INDICATES THAT THE HMM VFOA RECOGNIZER WITH PARAMETERS SET USING THE GEOMETRIC GAZE
MODEL WAS APPLIED TO GROUND TRUTH POSE DATA.

acronym | description
gt      | the head pose measurements are the ground truth data obtained with the magnetic sensor
tr      | the head pose measurements are those obtained with the head tracking algorithm
gmm     | the VFOA recognition model is a GMM
hmm     | the VFOA recognition model is an HMM
ML      | maximum likelihood approach: the meeting used for testing is also used to train the model parameters
ge      | the parameters of the Gaussians were set using the geometric gaze approach
ad      | the VFOA model parameters were adapted
Both precision and recall need to be high to characterize a good VFOA recognition
performance. The F-measure, defined as the harmonic mean of recall and precision, reflects this
requirement. We report the average of the precision, recall and F-measure FEof the 8 individuals
over the whole database (and for each seat position). Note that according to Eq. 17, events are
said to match whenever their common intersection is not empty (and labels match). One may
think that the counted matches could be generated by spurious accidental matches due to a very
small intersection. In practice, however, we observe that it is not the case: the vast majority of
matched events have a significant degree of overlap FI, as illustrated in Fig. 8, with 90% of the
matches exhibiting an overlap higher than 50%, even using noisier tracking data.
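To make the computation of these event-based measures concrete, the following sketch implements the overlap measure F_I of Eq. 18 and the measures of Eq. 19. It is only an illustration under simplifying assumptions: events are represented as hypothetical (label, begin, end) tuples with inclusive frame indices, and matching is done greedily by overlap rather than through the full string-alignment procedure described above.

```python
# Minimal sketch of the event-based measures (not the exact evaluation code).
# Events are (label, begin, end) tuples; matching is greedy by overlap instead of
# the string alignment described in the text.

def overlap_f(g, r):
    """F_I of Eq. 18: harmonic mean of the two overlap ratios, 0 if no match."""
    (lg, bg, eg), (lr, br, er) = g, r
    inter = max(0, min(eg, er) - max(bg, br) + 1)
    if lg != lr or inter == 0:
        return 0.0
    rho = inter / (eg - bg + 1)   # fraction of the GT event covered
    pi = inter / (er - br + 1)    # fraction of the recognized event covered
    return 2 * rho * pi / (rho + pi)

def event_measures(gt_events, rec_events):
    """Event recall rho_E, precision pi_E and F-measure F_E of Eq. 19."""
    matched, used = 0, set()
    for g in gt_events:
        candidates = [(overlap_f(g, r), j) for j, r in enumerate(rec_events) if j not in used]
        best_f, best_j = max(candidates, default=(0.0, None))
        if best_f > 0.0:
            matched += 1
            used.add(best_j)
    recall = matched / len(gt_events)
    precision = matched / len(rec_events)
    f_e = 2 * recall * precision / (recall + precision) if matched else 0.0
    return recall, precision, f_e
```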
B. Experimental Protocol
To study the different modeling aspects, several experimental conditions have been defined.
They are summarized in Table II along with the acronyms that identify them in the result tables.
First, there are two alternatives regarding the head pose measurements: the ground truth gt case,
VFOA recognition without adaptation
µ_i, Σ_i        Gaussian parameters - learned (µ_i^l, Σ_i^l) or given by geometric modeling (µ_i^g, Σ_i^g), cf. Subsection VI-B.
π, A            GMM and HMM model priors - set to the values π^u, A^u, as described in Subsection VI-B.
VFOA recognition with adaptation
µ_i, Σ_i, π, A  same as above - set as the result of the adaptation process.
ν               scale factor of the Dirichlet distribution - set through cross-validation.
w_i, b_i,j      Dirichlet prior values of π_i and a_i,j - set to π_i^u and a_i,j^u.
τ               scale factor of the Normal prior distribution on the mean - set through cross-validation.
m_i             VFOA mean prior value of the Normal prior distribution - set to either µ_i^l or µ_i^g.
d               scale factor of the Wishart prior distribution on the covariance matrix - set by hand (cf. Sec. VII-C).
V_i             VFOA covariance matrix prior values in the Wishart distribution - set to either (d-2)Σ_i^l or (d-2)Σ_i^g.
TABLE III
VFOA MODELING PARAMETERS: DESCRIPTION AND SETTING. THE GAZE FACTORS κα, κβ WERE SET BY HAND.
where the data is obtained using the FOB magnetic sensor, and the tr case, which relies on the
estimates obtained with the video tracking system. Secondly, there are the two VFOA recognizer
models, gmm and hmm, as described in Subsection VI-A. Regarding the approach relying on
training data, the default protocol is the leave-one-out approach: each meeting recording is
in turn left aside for testing, while the data of the 7 other recordings are used for parameter
learning, including hyper-parameter selection in the adaptation case (denoted ad). The maximum
likelihood case ML is an exception, in which the training data for a given meeting recording
is composed of that same single recording. The ge acronym denotes the case where the VFOA
Gaussian means and covariances were set according to the geometric model instead of being
learned from training data. Finally, the adaptation hyper-parameter pair (ν, τ) was selected (in
the Cartesian set C_ν × C_τ) by cross-validation over the training data, using F_E as the performance
measure to maximize. A summary of all the parameters involved in the modeling, and of the way they
were set depending on whether adaptation was used or not, is given in Table III.
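The protocol can be summarized by the following sketch. It is a minimal illustration, not the actual experiment code: train_fn, adapt_fn and score_fn are hypothetical callables standing for model learning, unsupervised adaptation and the F_E evaluation, and nu_grid and tau_grid stand for the candidate sets C_ν and C_τ.

```python
# Sketch of the protocol: leave-one-out over the meetings, with (nu, tau) selected
# by cross-validation on the training meetings using the event F-measure F_E.
from itertools import product

def leave_one_out(meetings, train_fn, adapt_fn, score_fn, nu_grid, tau_grid):
    scores = []
    for test in meetings:
        train = [m for m in meetings if m is not test]

        def cv_score(nu, tau):
            # inner cross-validation: hold out each training meeting in turn
            return sum(
                score_fn(adapt_fn(train_fn([m for m in train if m is not held]),
                                  held, nu, tau), held)
                for held in train) / len(train)

        nu, tau = max(product(nu_grid, tau_grid), key=lambda p: cv_score(*p))
        model = adapt_fn(train_fn(train), test, nu, tau)   # adaptation uses no labels
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)
```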
IX. EXPERIMENTAL RESULTS
This section provides results under the various experimental conditions. We first analyze the
results obtained on the GT head pose data, and then compare them with those obtained using
the tracking estimates instead. In both cases, we discuss the effectiveness of the modeling w.r.t.
different issues: (i) relevance of head pose to model VFOA gaze targets, (ii) predictability of
VFOA head pose parameters, (iii) impact of the person’s position in the room. Then, we comment
on the results of the adaptation scheme. Note that although these first sets of results are only
shown with the parameter setting using the training data, the conclusions that are made are also
Fig. 9. Example results and focus ambiguity (panels (a)-(h)). In green, tracking result and head pointing direction. In yellow,
recognized focus (hmm-ad condition). Images (g) and (h): despite the high visual similarity of the head pose, the true foci differ
(in (g): PL; in (h): SS). Resolving such cases can only be done by using context (speaking status, other people's gaze, slide
activity, etc.).
data                 ground truth (gt)           tracking estimates (tr)
modeling             ML      gmm     hmm         ML      gmm     hmm
FRR                  79.7    72.3    72.3        57.4    47.3    47.4
recall               79.6    72.6    65.5        66.4    49.1    38.4
precision            51.2    55.1    66.7        28.9    30      59.3
F-measure F_E        62      62.4    65.8        38.2    34.8    45.2
TABLE IV
VFOA RECOGNITION RESULTS FOR PERSON LEFT UNDER DIFFERENT EXPERIMENTAL CONDITIONS (SEE TABLE II).
valid for the geometric parameter setting. In Section IX-D, we compare in detail the results
obtained with the geometric parameter setting and those obtained with the training-based parameter
setting. In all cases, results are given separately for the left and right persons (see Fig. 1). Some
result illustrations are provided in Fig. 9.
A. Results on GT head pose data
VFOA and head pose correlation: Tables IV and V display the VFOA recognition results for
person left and person right, respectively. The first column of these two tables gives the results of
the ML estimation (see Tab. II) with a GMM. These results show, in an optimistic case, the
performance our model can achieve, and illustrate the correlation between a person's head
data                 ground truth (gt)           tracking estimates (tr)
modeling             ML      gmm     hmm         ML      gmm     hmm
FRR                  68.9    56.8    57.3        43.6    38.1    38
recall               72.9    66.6    58.4        65.6    55.9    37.3
precision            47.4    49.9    63.5        24.1    26.8    55.1
F-measure F_E        56.9    54.4    59.5        34.8    35.6    43.8
TABLE V
VFOA RECOGNITION RESULTS FOR PERSON RIGHT UNDER DIFFERENT EXPERIMENTAL CONDITIONS (SEE TABLE II).
Fig. 10. Empirical distribution of the GT head pose pan angle computed over the database for PL (left image) and PR (right
image). For PL, the people and slide screen VFOA targets can still be identified through the pan modes. For PR, the degree of
overlap is quite significant.
poses and his VFOA. As can be seen, this correlation is quite high for PL (almost 80% FRR),
showing the good concordance between head pose and VFOA. This correlation, however, drops
to near 69% for PR. This can be explained by the fact that for the person on the right (PR), there
is a strong ambiguity between looking at PL or SS, as illustrated by the empirical distributions
of the pan angle in Fig. 10. Indeed, the range of pan values within which the three other meeting
participants and the slide screen VFOA targets lie is half the pan range of the person sitting
on the left (PL). The average angular distance between these targets is around 20° for PR, a
distance which can easily be covered using only eye movements rather than rotating the head.
The values of the confusion matrices, displayed in Fig. 11, corroborate this analysis. The analysis
of Tables IV and V shows that this discrepancy between the results for PL and PR holds for
all experimental conditions and algorithms, with a performance decrease from PL to PR of
approximately 10-13% and 6% for the FRR and the event F-measure, respectively.
VFOA Prediction: In the ML condition, very good results were achieved, but they were biased
because the test data was used to set the Gaussian parameters. On the contrary, the GMM and
HMM results in Tables IV and V, for which the VFOA parameters were learned from other
persons' data, highlight the generalization property of the modeling. We can observe that the
Fig. 11. Frame-based recognition confusion matrices obtained with the HMM modeling (gt-hmm and tr-hmm conditions):
(a) (GT, Left), (b) (GT, Right), (c) (TR, Left), (d) (TR, Right). VFOA targets 1 to 4 have been ordered according to their pan
proximity: PR: person right - PL: person left - O1 and O2: organizers 1 and 2 - SS: slide screen - TB: table - U: unfocused.
Columns represent the recognized VFOA.
Fig. 12. Pan-tilt space VFOA decision maps for person right built from all meetings, in the GMM case (cf. Eq. 4), using
(a) GT head pose data or (b) tracking head pose data. Black = PL, yellow = SS, blue = O1, green = O2, red = TB, magenta = U.
GMM and HMM methods produce results close to the ML case. For both PL and PR, the
GMM approach achieves better frame recognition and event recall performance, while the HMM
gives better event precision and F_E results. This can be explained by the fact that the HMM
approach effectively denoises the event sequence. As a result, some events are missed (lower recall),
but the precision increases due to the elimination of short spurious detections.
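The difference between the two decoding strategies can be illustrated with the following sketch, which contrasts a frame-wise decision (GMM) with Viterbi decoding (HMM). It is a minimal sketch, not the implementation used in the experiments; the per-target means, covariances, prior and transition matrix are assumed to be available from the learned or geometric setting discussed above.

```python
# Sketch of the two decoding strategies: frame-wise MAP (GMM) vs Viterbi (HMM).
import numpy as np
from scipy.stats import multivariate_normal

def decode(poses, means, covs, priors, trans=None):
    """poses: (T, 2) pan/tilt observations. Returns one VFOA index per frame."""
    T, K = len(poses), len(means)
    loglik = np.stack([multivariate_normal.logpdf(poses, means[k], covs[k])
                       for k in range(K)], axis=1)          # (T, K)
    if trans is None:                                        # GMM: frame-wise decision
        return np.argmax(loglik + np.log(priors), axis=1)
    # HMM: Viterbi decoding; the transition prior penalizes short spurious switches
    delta = np.log(priors) + loglik[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(trans)              # (K_prev, K_next)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(K)] + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1][path[t + 1]]
    return path
```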
VFOA Confusions: Figures 11(a) and 11(b) display as images the confusion matrices for PL
and PR obtained with the VFOA FRR performance measure and an HMM. They clearly exhibit
confusion between VFOA targets which are proximate in the head pose space. For instance,
for PL, O2 is sometimes confused with PR or O1. For PR, the main source of confusion is
between PL and SS, as already mentioned. In addition, the table, TB, can be confused with
O1 and O2, as can be expected since these targets share more or less the same pan values as
TB. Thus, most of the confusion can be explained by the geometry of the room and the fact
that people can modify their gaze without adjusting their head pose, and therefore do not always
need to turn their heads to focus on a specific VFOA target.
B. Results on Head Pose Estimates data
Tables IV and V provide the results obtained using the head pose tracking estimates, under
the same experimental conditions as those used for the GT head pose data. As can be seen,
a substantial performance degradation is observed. In the ML case, the decrease in FRR and F-
measure ranges from 22% to 26% for both PL and PR. These degradations are mainly due
to small pose estimation errors and also, sometimes, to large errors occurring during short periods
when the tracker locks on a sub-part of the face. Fig. 12 illustrates the effect of pose estimation
errors on the VFOA distributions. The shape changes in the VFOA decision maps when moving from
GT pose data to pose estimates convey the increase of the pose variance measured for each VFOA
target. The increase is moderate for the pan angle, but quite important for the tilt angle.
A more detailed analysis of Tables IV and V shows that the performance decrease (from GT
to tracking data) in the GMM condition follows the ML case, while the deterioration in the
HMM case is smaller, in particular for F_E. This demonstrates that, in contrast with what was
observed with the clean GT pose data, in the presence of noisy data the HMM smoothing effect
is quite beneficial. Also, the HMM performance decrease is smaller for PR (19% and 15% for
FRR and F_E, respectively) than for PL (25% and 20%). This can be due to the better tracking
performance (in particular regarding the pan angle) achieved on people seated at position PR
(as reported in Table I). Fig. 13 plots the VFOA FRR versus the pan angle tracking
error for each meeting participant, when using GT head pose data (i.e. with no tracking error)
or pose estimates. It shows that for PL, there is a strong correlation between tracking errors and
VFOA performance, which can be due to the fact that higher tracking errors directly generate
larger overlaps between the VFOA class-conditional pose distributions (cf. Fig. 10, left). For PR,
this correlation is weaker, as the same good tracking performance results in very different VFOA
recognition results. In this case, the increase of ambiguities between several VFOA targets (e.g.
SS and PL) may play a larger role.
Finally, Fig. 11(c) and Fig. 11(d) display the confusion matrices when using the HMM and the
head pose estimates. In this case, the confusion matrices are very similar to those obtained with
the GT data. However, in the head pose estimates case, more confusion is observed due to the
tracking errors and the uncertainties in the tilt estimation (see Fig. 13).
Fig. 13. VFOA frame-based recognition rate vs. head pose tracking errors (for the pan angle), plotted per meeting, for PL
and PR using GT head pose data and tracking estimates (with lines fitted to the tracking points). The VFOA recognizer is the
HMM modeling after adaptation.
person   error measure    gt-gmm   gt-gmm-ad   gt-hmm   gt-hmm-ad   tr-gmm   tr-gmm-ad   tr-hmm   tr-hmm-ad
L        FRR              72.3     72.3        72.3     72.7        47.3     57.1        47.4     53.1
L        F-measure F_E    62.4     61.2        65.8     66.2        34.8     42.8        45.2     47.9
R        FRR              56.8     59.3        57.3     62          38.1     39.3        38       41.8
R        F-measure F_E    54.4     56.4        59.5     62.7        35.6     37.3        43.8     48.8
TABLE VI
VFOA RECOGNITION RESULTS FOR PERSON LEFT (L) AND RIGHT (R), BEFORE AND AFTER ADAPTATION.
C. Results with Model Adaptation
Table VI displays the recognition performance obtained with the adaptation framework described
in Section VII^6. For PL, one can observe no improvement when using GT data and a
large improvement when using the tracking estimates (e.g. around 10% and 8% for FRR
and F_E, respectively, with the GMM model). In this situation, the adaptation is able to cope with the tracking
errors and the variability in looking at a given target. For PR, we notice an improvement with
both the GT and the tracking head pose data. For instance, with the HMM model and tracking data,
the improvement is 3.8% and 5% for FRR and F_E. Again, in this situation adaptation can cope
with an individual way of looking at the targets, such as correcting the bias in the estimated
head tilt, as illustrated in Fig. 14.
When exploring the optimal adaptation parameters estimated through cross-validation, one obtains
the histograms of Fig. 15. As can be seen, regardless of the kind of input pose data (GT
or estimates), they correspond to configurations giving approximately equal balance to the data
and the prior w.r.t. the adaptation of the HMM transition matrices (ν_1 and ν_2), and to configurations
^6 In the tables, we recall the values without adaptation for ease of comparison.
Fig. 14. VFOA decision map example before adaptation (left) and after adaptation (right), in the pan-tilt space. After adaptation,
the VFOA of O1 and O2 correspond to lower tilt values. Black = PL, yellow = SS, blue = O1, green = O2, red = TB,
magenta = U. The blue stars represent the tracking head pose estimates used for adaptation.
Fig. 15. Histograms of the optimal scale adaptation factors of (a) the HMM prior and (b) the HMM VFOA mean, selected
through cross-validation on the training set, when working with GT head pose data.
for which the data drive the adaptation process of the mean pose values (τ_1 and τ_2).
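To illustrate how the scale factor τ trades off the prior against the data, the following sketch adapts only the per-target Gaussian means, in the spirit of the MAP adaptation of Section VII. It is a simplified illustration, not the full scheme (which also adapts the covariances and the transition probabilities); the soft assignments and prior means are assumed inputs.

```python
# Minimal sketch of MAP adaptation of the per-target means only.
# 'resp' are soft assignments (responsibilities) of each frame to each VFOA target,
# e.g. from a forward-backward pass; m_prior are the prior means (learned or
# geometric); tau balances the prior and the data.
import numpy as np

def adapt_means(poses, resp, m_prior, tau):
    """poses: (T, 2); resp: (T, K); m_prior: (K, 2). Returns adapted means (K, 2)."""
    n_k = resp.sum(axis=0)                                   # soft counts per target
    data_mean = (resp.T @ poses) / np.maximum(n_k, 1e-8)[:, None]
    # MAP estimate: convex combination of prior mean and data mean, with the prior
    # weighted by tau and the data by the soft count n_k
    w = n_k / (n_k + tau)
    return (1 - w)[:, None] * m_prior + w[:, None] * data_mean
```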
D. Results with the Geometrical VFOA Modeling
Here we report the results obtained when setting the model parameters by exploiting the
meeting room geometry, as described in Subsection VI-B. This possibility for setting parameters
is interesting because it removes the need for data annotation each time a new focus target is
considered (for instance, if a 5th person were introduced at the table).
Fig. 16 shows the geometric VFOA Gaussian parameters (mean and covariance) generated by
the model when using (κα, κβ) = (0.5, 0.5). As can be seen, the VFOA pose values predicted
by the model are consistent with the average pose values computed for individuals using the GT
pose data. This is demonstrated by Table VII, which provides the prediction error in pan, E_pan,
defined as:

$$
E_{pan} = \frac{1}{8 \times (K-1)} \sum_{m=1}^{8} \; \sum_{f_i \in \mathcal{F} \setminus \{U\}} \left| \bar{\alpha}_m(f_i) - \alpha^p_m(f_i) \right| \qquad (20)
$$

where $\bar{\alpha}_m(f_i)$ is the average pan value of the person in meeting $m$ for the VFOA $f_i$, and
$\alpha^p_m(f_i)$ is the predicted value according to the chosen model (i.e. the pan component of $\mu^g_{f_i}$
or $\mu^l_{f_i}$ in the geometric or learning approaches, respectively). The tilt prediction error E_tilt is
Method      learned VFOA       geometric VFOA               geometric VFOA
                               (with cross-validation)      (with κα = κβ = 0.5)
Error       E_pan   E_tilt     E_pan   E_tilt               E_pan   E_tilt
PL          6.4     5.1        5.5     6.4                  5.8     6.4
PR          5.9     6.1        5.6     7.6                  12.8    7.4
TABLE VII
PREDICTION ERRORS (IN DEGREES) FOR THE LEARNED VFOA AND GEOMETRIC VFOA MODELS (WITH GT POSE DATA). IN
THE GEOMETRIC CROSS-VALIDATED CASE, THE SAME METHODOLOGY AS IN THE LEARNING CASE IS USED: FOR EACH
MEETING, THE EMPLOYED κα (OR κβ) HAS BEEN LEARNED ON THE OTHER MEETINGS.
obtained by replacing pan angles by tilt angles in Eq. 20. As can be seen, using cross-validated
κα and κβ values provides better results than setting these parameters to the constant values
(κα, κβ) = (0.5, 0.5) used in all the recognition experiments reported below. Also, we noticed
that the κα values providing good prediction are usually lower when using tracking data than
when using the ground truth head pose data. A likely explanation is that the head tracker under-
estimates the pan angles; thus, to account for this, a smaller κα has to be used to obtain a better
prediction. Interestingly enough, however, in practice we did not find any particular relationship
between an optimal angular prediction (as measured by Eq. 20) and the VFOA recognition results,
showing that the selection of these values is not critical. We thus relied on (κα, κβ) = (0.5, 0.5)
for all our experiments.
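As an illustration, the prediction error of Eq. 20 can be computed as in the following sketch. The dictionaries avg_pan and pred_pan are hypothetical inputs holding, for each meeting and each target other than U, the average observed pan and the model prediction (geometric or learned).

```python
# Sketch of the pan prediction error of Eq. 20. avg_pan[m][f] is the average
# observed pan (from GT pose data) of the person in meeting m for target f, and
# pred_pan[m][f] is the pan predicted by the chosen model. The 'unfocused' label U
# is assumed to be excluded, so the average runs over the 8 meetings and the K-1
# remaining targets, as in Eq. 20. The same code with tilt values yields E_tilt.

def pan_prediction_error(avg_pan, pred_pan):
    errors = [abs(avg_pan[m][f] - pred_pan[m][f])
              for m in avg_pan for f in avg_pan[m]]
    return sum(errors) / len(errors)
```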
The recognition performance is presented in Table VIII. It shows that, when using
GT head pose data, the results are slightly worse than with the learning approach, which is
in line with the similarity of the prediction errors. However, when using the pose estimates, the
results are better. For instance, for PL, when comparing the geometric parameter setting to the
training-based parameter setting (both with adaptation), the FRR improvement is more than 6%.
This is interesting and encouraging given that the geometric modeling does not require any training
data. Also, we notice that the adaptation always improves the recognition, sometimes quite
significantly (see the GT data condition for PR, or the tracking data for PL).
Comparison with Stiefelhagen et al [12]: Our results seem quite far from the 73% reported by
Fig. 16. Geometric VFOA Gaussian distributions for PR (left image) and PL (right image): the figure displays the gaze target
directions, the corresponding head pose contributions according to the geometric model with values (κα, κβ) = (0.5, 0.5),
and the average head poses (from GT pose data) of individual people (+). Ellipses display the standard deviations used in the
geometric modeling. Black = PL or PR, cyan = SS, blue = O1, green = O2, red = TB.
person   Measure          gt      gt-ge   gt-ad   gt-ge-ad   tr      tr-ge   tr-ad   tr-ge-ad
L        FRR              72.3    69.3    72.7    70.8       47.4    55.2    53.1    59.5
L        F-measure F_E    65.8    65.2    66.2    65.3       45.2    48.2    47.9    50.1
R        FRR              57.3    51.8    62      58.5       38      41.1    41.8    42.7
R        F-measure F_E    59.5    53      62.7    59.2       43.8    49.1    48.8    50.1
TABLE VIII
VFOA RECOGNITION RESULTS FOR PL AND PR USING THE HMM MODEL WITH THE GEOMETRIC VFOA PARAMETER
SETTING ((κα, κβ) = (0.5, 0.5)), WITH/WITHOUT ADAPTATION. FOR EASE OF COMPARISON, WE RECALL THE RESULTS
WITH THE TRAINING PARAMETER SETTING.
Stiefelhagen et al. [12]^7. Several factors may explain the difference. First, in [12], meetings with 4
people were studied, and no target other than the other meeting participants was considered.
In addition, these participants were sitting at equally spaced positions around the table, optimizing
the discriminability between VFOA targets. People were recorded by a camera placed directly
in front of them. Hence, due to the table geometry, the majority of head pans lay between
[-45°, 45°], where the tracking errors are smaller (see Table I). Ultimately, our results are more
in accordance with the 52% FRR reported by the same authors [37] when using the same
framework as in [12] but applied to a 5-person meeting, resulting in 4 possible VFOA targets.
Nevertheless, as comparing algorithm results on different setups is quite difficult, we implemented
the methodology proposed in [12], [37] to recognize the VFOA solely from head pose. This
^7 Note that in [12], approaches to recognize the VFOA from audio, and from a combination of audio and head pose, are also
provided. However, in the remainder of this paper, we compare our method with their approach for recognizing the VFOA solely
from head pose, since this is the scope of our paper.
Method          Stiefelhagen et al [12]           Our model
measure         gt-L    tr-L    gt-R    tr-R      gt-ge-ad-L   tr-ge-ad-L   gt-ge-ad-R   tr-ge-ad-R
FRR             61.9    55.7    53.1    39.6      70.8         59.5         58.5         42.7
F-measure F_E   53.8    35.1    43.8    34.7      65.3         50.1         59.2         50.1
TABLE IX
COMPARISON OF OUR VFOA RECOGNITION APPROACH (HMM WITH GEOMETRIC MODEL AND ADAPTATION) AND [12]
(SEE FOOTNOTE 7).
methodology consists of first clustering the head pose measurements of an individual person
using the k-means algorithm, and then using the outcome to initialize the learning of a GMM
similar to the one we presented. Finally, each component of the GMM mixture is associated
with a target focus using a set of rules. This approach clearly has several issues, especially
when the number of targets is large: how to initialize the k-means algorithm, and how to define
the association rules. As no information was given in [12] w.r.t. the k-means initialization, we
experimented with different alternatives and report the best results, which were obtained using
the gaze values predicted by the geometric model (random initialization produced on average
much worse results than those presented, around 10% lower). Each component was associated
with a focus by taking the mixture with the lowest mean tilt value as the table, and the other
mixtures were associated with the other VFOA targets based on their respective pan values. The
comparative results are given in Table IX. They clearly show that our method leads to significant
improvements in all conditions. Interestingly enough, the improvement is higher when using
uncorrupted head pose measurements (i.e. the GT data). These improvements validate our use
of the MAP adaptation framework. Indeed, while in [12] full freedom is given to the data to
drive the adaptation process, our experiments show (cf. Figure 15) that the optimal adaptation
parameters, selected by cross-validation, give equal importance to the data and to the prior set on
the GMM parameters to obtain better models.
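The baseline methodology, as we re-implemented it from the description in [12], [37], can be sketched as follows. This is a hedged illustration using scikit-learn rather than the original code: the geometric predictions geo_means used to initialize k-means and the list target_names_by_pan of non-table targets ordered by expected pan are hypothetical inputs, and the association rule follows the description above (lowest mean tilt for the table, remaining components by pan order).

```python
# Sketch of the baseline of [12], [37] as re-implemented here: per-person k-means
# clustering of head poses (columns: pan, tilt), GMM refinement, then rule-based
# association of each mixture component with a VFOA target.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def baseline_vfoa(poses, geo_means, target_names_by_pan):
    k = len(geo_means)
    km = KMeans(n_clusters=k, init=np.asarray(geo_means), n_init=1).fit(poses)
    gmm = GaussianMixture(n_components=k, means_init=km.cluster_centers_).fit(poses)
    # association rules: lowest mean tilt -> table (TB); remaining components ->
    # the other targets in pan order (target_names_by_pan has k - 1 entries)
    table_comp = np.argsort(gmm.means_[:, 1])[0]
    rest = [c for c in np.argsort(gmm.means_[:, 0]) if c != table_comp]
    labels = {table_comp: "TB"}
    labels.update(dict(zip(rest, target_names_by_pan)))
    return [labels[c] for c in gmm.predict(poses)]
```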
X. CONCLUSION AND FUTURE WORK
In this paper, we addressed the VFOA recognition of meeting participants from their head pose
in complex meeting scenarios. Head pose measurements were obtained either through magnetic
field sensors or using a head pose tracking algorithm. Several alternative models were studied.
Thorough experiments on a large and challenging database, made publicly available, gave the
following outcomes:
• influence of the physical setup: when using head pose tracking estimates, average recognition
rates of 60% and 42% were obtained for the left and right seat, respectively. This shows
that good VFOA recognition can only be achieved if the visual targets of a person
are well separated in the head pose angular space, which mainly depends on the person's
position in the meeting room.
• head pose tracking: accurate pose estimation is essential for good results. Performance decreases
of around 11% and 16% were observed for the left and right seat, respectively, when using
the pose estimates instead of the ground truth. In addition, the experiments showed that there
exists some correlation between head pose tracking errors and VFOA recognition results.
• VFOA recognizer model: the HMM method performs better than the GMM.
While this cannot be observed with the standard Frame Recognition Rate measure, the
newly introduced event-based measure F_E shows that the temporal smoothing introduced
by the HMM removes spurious detections in the VFOA estimation.
• training data vs geometric model: to avoid the need for training data, we have proposed a
novel cognitive model which links the head pose measurements to the VFOA targets and
exploits the room geometry to set the recognizer parameters. Compared with the standard approach
based on training data, and with a state-of-the-art algorithm, the new approach was shown
to provide much better results when using the head pose tracking estimates as input.
• unsupervised adaptation: results show that in all conditions, automatically adapting the
VFOA recognition parameters using the unlabeled head pose measurements improves the
recognition.
From the above, there are several ways to increase performance. The first one is to increase
the separation between the visual targets. However, in practice, this is limited by the number
of people that we want to accommodate and by the activities that people are allowed to perform.
The second one is to improve the pose tracking algorithms. This can be achieved using multiple
cameras, higher resolution images, or adaptive appearance modeling techniques, preferably in a
supervised fashion, by setting up a training session to acquire people's appearance at the beginning
of a meeting.
A third way to improve VFOA recognition is to exploit the prior knowledge embedded in
the cognitive and interactive aspects of human-to-human communication. Ambiguous situations
such as the one illustrated in Fig. 9(g) and Fig. 9(h), where the same head pose can correspond
to two different VFOA targets, could be resolved by the joint modeling of the speaking status
and VFOA of all meeting participants. The relationship between speech and VFOA, used for
instance in [12], has been shown to exhibit specific patterns in the behavioral and cognitive
literature, and has already been exploited in [13] to derive conversation structures.
Finally, in the case of meetings in which people move to the slide screen or whiteboard
for presentations, the development of a more general approach that models the VFOA of these
moving people will be necessary.
REFERENCES
[1] J. Khan and O. Komogortsev, “A hybrid scheme for perceptual object window design with joint scene analysis and eye-gaze
tracking for media encoding based on perceptual attention,” Journal of Electronic Imaging, vol. 15, pp. 332–350, 2006.
[2] K. Smith, S. Ba, D. Gatica-Perez, and J.-M. Odobez, “Multi-person wandering focus of attention tracking,” in International
Conference on Multimodal Interfaces, Banff, Canada, Nov. 2006.
[3] O. Kulyk, J. Wang, and J. Terken, Machine Learning for Multimodal Interaction, ser. LNCS 3869. Springer Verlag, 2006,
ch. Real-Time Feedback on Nonverbal Behaviour to Enhance Social Dynamics in Small Group Meetings.
[4] J. McGrath, Groups: Interaction and Performance. Prentice-Hall, 1984.
[5] D. Heylen, “Challenges ahead: head movements and other social acts in conversation,” in The Joint Symposium on Virtual
Social Agent, 2005.
[6] S. Langton, R. Watt, and V. Bruce, “Do the eyes have it? cues to the direction of social attention,” Trends in Cognitive
Sciences, vol. 4(2), pp. 50–58, 2000.
[7] J. N. Bailenson, A. Beal, J. Loomis, J. Blascovitch, and M. Turk, “Transformed social interaction, augmented gaze, and
social influence in immersive virtual environments,” Human Comm. Research, vol. 31, no. 4, pp. 511–537, Oct. 2005.
[8] N. Jovanovic and H. Op den Akker, “Towards automatic addressee identification in multi-party dialogues,” in 5th SIGdial
Workshop on Discourse and Dialogue, 2004.
[9] S. Duncan Jr, “Some signals and rules for taking speaking turns in conversations,” Journal of Personality and Social
Psychology, vol. 23(2), pp. 283–292, 1972.
[10] D. Novick, B. Hansen, and K. Ward, “Coordinating turn taking with gaze,” in Int. Conf. on Spoken Lang. Processing,
1996.
[11] R. Rienks, D. Zhang, D. Gatica-Perez, and W. Post, “Detection and Application of Influence Rankings in Small Group
Meetings,” in ACM - Inter. Conf. on Multimodal Interfaces, Banff, Canada, Nov. 2006.
[12] R. Stiefelhagen, J. Yang, and A. Waibel, “Modeling focus of attention for meeting indexing based on multiple cues,” IEEE
Transactions on Neural Networks, vol. 13(4), pp. 928–938, 2002.
[13] K. Otsuka, Y. Takemae, J. Yamato, and H. Murase, “A probabilistic inference of multiparty-conversation structure based
on markov-switching models of gaze patterns, head directions, and utterances,” in Proc. of International Conference on
Multimodal Interface (ICMI’05), Trento, Italy, Oct. 2005, pp. 191–198.
[14] ICPR-POINTING, “Icpr: Pointing’04: Visual observation of deictic gestures workshop,” 2004.
[15] CLEAR, “CLEAR evaluation campaign and workshop,” 2006.
[16] E. G. Freedman and D. L. Sparks, “Eye-head coordination during head-unrestrained gaze shifts in rhesus monkeys,” Journal
of Neurophysiology, vol. 77, pp. 2328–2348, 1997.
[17] I. Malinov, J. Epelboim, A. Herst, and R. Steinman, “Characteristics of saccades and vergence in two kinds of sequential
looking tasks,” Vision Research, 2000.
[18] S. O. Ba and J. M. Odobez, “A rao-blackwellized mixed state particle filter for head pose tracking,” in ACM-ICMI Workshop
on Multi-modal Multi-party Meeting Processing (MMMP), Trento Italy, 2005, pp. 9–16.
[19] C. Morimoto and M. Mimica, “Eye gaze tracking techniques for interactive applications,” Computer Vision and Image
Understanding, vol. 98, pp. 4–24, 2005.
[20] R. Pieters, E. Rosbergen, and M. Hartog, “Visual attention to advertising: The impact of motivation and repetition,” in
Conference on Advances in Consumer Research, 1995.
[21] J.-G. Wang and E. Sung, “Study on eye gaze estimation,” IEEE Transactions on Systems, Man and Cybernetics, Part B,
vol. 32, pp. 332–350, 2002.
[22] R. Stiefelhagen and J. Zhu, “Head orientation and gaze direction in meetings,” in Conference on Human Factors in
Computing Systems, 2002.
[23] A. Gee and R. Cipolla, “Estimating gaze from a single view of a face,” in Int. Conf. on Pattern Recognition, 1994.
[24] T. Horprasert, Y. Yacoob, and L. Davis, “Computing 3d head orientation from a monocular image sequence,” in IEEE
International Conference on Automatic Face and Gesture Recognition, 1996.
[25] T. Cootes and P. Kittipanya-ngam, “Comparing variations on the active appearance model algorithm,” in British Mach.
Vis. Conf. (BMVC), 2002.
[26] S. Srinivasan and K. L. Boyer, “Head pose estimation using view based eigenspaces,” in Int. Conf. on Pat. Recognition,
2002.
[27] Y. Wu and K. Toyama, “Wide range illumination insensitive head orientation estimation,” in IEEE Conference on Automatic
Face and Gesture Recognition, 2001.
[28] L. Brown and Y. Tian, “A study of coarse head pose estimation,” in IEEE Work. on Motion and Video Computing, 2002.
[29] L. Lu, Z. Zhang, H. Shum, Z. Liu, and H. Chen, “Model and exemplar-based robust head pose tracking under occlusion
and varying expression,” in IEEE Workshop on Models versus Exemplars in Computer Vision (CVPR-MECV), Dec. 2001.
[30] M. Danninger, R. Vertegaal, D. Siewiorek, and A. Mamuji, “Using social geometry to manage interruptions and co-worker
attention in office environments,” in Proc. of the Conf. on Graphics Interfaces, Victoria, Canada, 2005, pp. 211–218.
[31] M. Hayhoe and D. Ballard, “Eye movements in natural behavior,” TRENDS in Cog. Sciences, vol. 9(4), pp. 188–194, 2005.
[32] S. Baron-Cohen, “How to build a baby that can read minds: cognitive mechanisms in mindreading,” Cahier de psychologies
Cognitive, vol. 13, pp. 513–552, 1994.
[33] J.-M. Odobez, “Focus of attention coding guidelines,” IDIAP Research Institute, Tech. Rep. IDIAP-COM-2, 2006.
[34] N. Gourier, D. Hall, and J. L. Crowley, “Estimating face orientation from robust detection of salient facial features,” in
Pointing 2004, ICPR International Workshop on Visual Observation of Deictic Gestures, 2004, pp. 183–191.
[35] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Readings in Speech
Recognition, vol. 53A(3), pp. 267–296, 1990.
[36] J. Gauvain and C. H. Lee, “Bayesian learning for hidden Markov model with Gaussian mixture state observation densities,”
Speech Communication, vol. 11, pp. 205–213, 1992.
[37] R. Stiefelhagen, “Tracking and modeling focus of attention,” Ph.D. dissertation, University of Karlsruhe, 2002.