
Recognizing Visual Focus of Attention

from Head Pose in Natural Meetings

Sileye Ba⋆, and Jean-Marc Odobez, Member, IEEE

IDIAP Research Institute, Rue du Simplon 4, CH-1920 Martigny, Switzerland

Telephone +41 27 721 77 11

Fax +41 27 721 77 12

Email <first name>.<last name>@idiap.ch

URL www.idiap.ch

Abstract

We address the problem of recognizing the visual focus of attention (VFOA) of meeting participants

based on their head pose. To this end, the head pose observations are modeled using a Gaussian Mixture

Model (GMM) or a Hidden Markov Model (HMM) whose hidden states correspond to the VFOA. The

novelties of this work are threefold. First, contrary to previous studies on the topic, in our set-up,

the potential VFOA of a person is not restricted to other participants only, but includes environmental

targets (a table and a projection screen), which increases the complexity of the task, with more VFOA targets spread in the pan as well as the tilt gaze space. Second, we propose a geometric model to set the

GMM or HMM parameters by exploiting results from cognitive science on saccadic eye motion, which

allows the prediction of the head pose given a gaze target. Third, an unsupervised parameter adaptation

step (not using any labeled data) is proposed which accounts for the speciﬁc gazing behaviour of each

participant. Another contribution of the paper is the development of a signiﬁcant publicly available

corpus of 8 meetings which are on average 10 minutes in length and feature 4 persons, with head pose

and VFOA annotation. Using this corpus, we analyze the above methods by evaluating, through objective

performance measures, the recognition of the VFOA from head pose information obtained either using a

magnetic sensor device or a vision based tracking system. The results clearly show that in such complex

but realistic situations, the VFOA recognition performance is highly dependent on how well the visual

targets are separated for a given meeting participant. In addition, the results show that the use of a

geometric model with unsupervised adaptation achieves better results than the use of training data to set

the HMM parameters.

⋆Corresponding author.


I. INTRODUCTION

Understanding human behaviour and human needs is a central issue in devising next-generation

human computing systems that can emulate more human-like functions. At the heart of this

issue lies, amongst others, the difﬁculty of sensing human behaviours in an accurate way, i.e.

the challenge of developing algorithms that can reliably extract subtle human characteristics (e.g. body gestures, facial expressions, emotion) that allow a fine analysis of their behaviour. One

such characteristic of interest is the gaze, which indicates where and what a person is looking at,

or, in other words, what the visual focus of attention (VFOA) of the person is. However, while

the development of gaze tracking systems for Human Computer Interface (HCI) applications has

been the topic of many studies, less research has been conducted for estimating and analyzing a

person’s gaze and VFOA in more open spaces, despite the fact that in many contexts, identifying

the VFOA of a person conveys a wealth of information about that person: what is he interested

in, what is he doing, how does he explore a new environment or react to different visual stimuli.

Thus, tracking the VFOA of people could have important applications in the development of

ambient intelligent systems.

In terms of human computing applications, VFOA can be used for video compression by

assuming that the important information in a video exists in the neighborhood of the gaze path

of a person viewing the video. Estimating the focus of a viewer can be used to deﬁne areas of

visual focus that could be encoded in high resolution, while the areas which are not focus centers

could be encoded at lower resolution [1]. Another possible application in a public space could

be to measure the degree of attraction of advertisements or shop displays based on the estimated

focus of people passing by as presented in [2]. Applications in meetings include digital assistants

that can analyze the social dynamic of the meeting based on people’s non-verbal behaviors in

order to improve the group cohesiveness and efﬁciency [3].

Needless to say, gaze plays an important role in face-to-face conversations and more generally

group interaction, as it has been shown in a large body of social psychology studies [4]. Human

interaction can be categorized as verbal (speech) or non-verbal (e.g. facial expressions). While

the usage of the former is tightly connected to the explicit rules of language (grammar, dialog

acts), the usage of non-verbal cues is usually more implicit, but this does not prevent it from

following rules and exhibiting speciﬁc patterns in conversations. For instance, in a meeting

context, a person raising a hand usually means that he is requesting the ﬂoor, and a listener’s


Fig. 1. Recognizing the VFOA of people. (a) the meeting room; (b) a sample image of the dataset; (c) the potential VFOA targets for the right person; (d) the geometric configuration of the room (table, left/right persons, organizers 1 and 2, slide screen, camera).

head nod or shake can be interpreted as agreement or disagreement [5]. Besides hand and head

gestures, the VFOA is another important non-verbal communication cue with functions such as

establishing relationships (through mutual gaze), regulating the course of interaction, expressing

intimacy, and exercising social control [6], [7].

A speaker’s gaze often correlates with his addressees, i.e. the intended recipients of the speech

[8]. Also, for a listener, monitoring his own gaze in concordance with the speaker’s gaze is a

way to ﬁnd appropriate time windows for speaker turn requests [9], [10]. Thus, recognizing the

VFOA patterns of a group of people can reveal important knowledge about the participants’ role

and status [11], [7]. Following these studies in social psychology, computer vision researchers are

showing more interest in the study of automatic gaze and VFOA recognition systems [12], [13],

[2], as illustrated by some of the research tasks deﬁned in several recent evaluation workshops

[14], [15]. Since meetings are places where the multi-modal nature of human communication

and interaction best occurs, they are well suited to conduct such research studies.

In this context, the goal of this paper is to analyze the correspondence between the head pose

of people and their gaze in more general meeting scenarios than those previously considered [12],

[13]. In addition we propose methods to recognize the VFOA of people from their head pose (see

Fig. 1, and Fig. 9 for some results). In meeting rooms, the high resolution close-up views of the eyes typically required by HCI gaze estimation systems are not available in practice; however, it has been shown in [12] that head orientation can be reasonably utilized as an approximation of the gaze when the VFOA targets are the other meeting participants (in meetings with 4 people). In this paper, we investigate the estimation of VFOA from head pose in complex

meeting situations. Firstly, unlike previous work ([12], [13]), the scenario we consider involves

people looking at slides or writing on a sheet of paper on the table. As a consequence, people

have more potential VFOA targets in our set-up (6 instead of 3 in the cited work), leading


to more possible ambiguities between VFOA targets. Secondly, due to the physical placement of the

VFOA targets, the identiﬁcation of the VFOA can only be done using the complete head pose

representation (pan and tilt), instead of just the head pan, as done previously. Thus, our work

addresses general and challenging meeting room situations in which people do not just focus

their attention on other people, but also on other room targets.

To recognize the VFOA of people from their head pose, we investigated two generative models:

a Gaussian mixture model (GMM) that handles each frame separately, and its natural extension to

the temporal domain, namely a hidden Markov model (HMM), which segments pose observation

sequences into VFOA temporal segments. In both cases, for each VFOA target, the head pose

observations are represented as Gaussian distributions, whose means indicate the head pose

associated with each visual target. Alternative approaches were considered to set the model

parameters. In one approach, these were set using training data from other meetings. However,

as collecting training data can be tedious, we used the results of studies on saccadic eye motion

modeling [16], [17] and propose a novel approach (referred to as cognitive or geometric) that

models the head pose of a person given his upper body pose and his effective gaze target. In

this way, no training data is required to learn parameters, but some knowledge of the 3D room

geometry is necessary. In addition, to account for the fact that in practice we observed that

people have their own head pose preferences for looking at the same given target, we adopted

an unsupervised Maximum A Posteriori (MAP) scheme to adapt the parameters obtained from

either the learning model or the geometric model to unlabeled head pose data of individual

people in meetings.

To evaluate the different aspects of the VFOA modeling, we have conducted comparative and

thorough experiments on a large and publicly available database, comprising 8 meetings for

which both the head pose ground-truth and VFOA label ground truth are known. Therefore, we

were able to differentiate between the two main error sources in VFOA recognition: (1) the use

of head pose as a proxy for gaze, and (2) errors in the estimation of the head pose (e.g. using

our vision-based head pose tracker [18]).

In summary, the contributions of this paper are the following:

•the development of a public database and a framework to evaluate the recognition of the

VFOA solely from head pose;

•a novel geometric model to derive a person’s head pose given his gaze target, which


alleviates the need for training data;

•the use of an unsupervised MAP framework to adapt the VFOA model parameters to

individual people;

•a thorough experimental study and analysis of the inﬂuence of several key aspects on the

recognition performance (e.g. participant position, ground truth vs estimated head pose,

correlation with tracking errors).

The remainder of this paper is organized as follows. Section II discusses the related work.

Section III describes the task and the database that is used to evaluate the models we propose.

Section IV provides an overview of our approach. Section V describes our algorithm for joint

head tracking and pose estimation, along with its evaluation. Section VI describes the considered

models for recognizing the VFOA from head pose. Section VII gives the unsupervised MAP

framework used to adapt our VFOA model to unseen data. Section VIII describes our evaluation

setup. We give experimental results in Section IX, and conclusions in Section X.

II. RELATED WORK

We investigate the VFOA recognition from head pose in the context of meetings. Thus, we

will analyze the related work along the following lines: gaze and VFOA tracking technologies,

head pose estimation from vision sensors, and recognition of the VFOA from head pose.

The VFOA of a person is deﬁned by his eye gaze, that is, the direction in which the eyes

are pointing in space. Much progress has been made in the design of gaze tracking technologies. A review of such systems is presented in [19]. Gaze trackers are predominantly

developed for HCI applications, where they are used for two main purposes: as an interactive

tool, where the eyes are used as an input modality; or as a diagnostic tool, to provide evidence

of a user’s attention, such as in applications studying the visual exploration of images by people

[20]. For this reason, these systems, while being accurate, are not appropriate for analyzing

the VFOA of people in open spaces: they can be intrusive (user needs to wear special glasses)

and require speciﬁc equipment (infrared light sources are often used to ease signal processing).

More importantly, they are very constraining, as the head motion is limited to small position

and angular variations (no more than 25cm and 20° [19]). In the worst cases, chin rests or bite bars are required, but even eye-appearance vision-based gaze tracking systems restrict the mobility of the subject, since their need for high resolution close-up eye images requires cameras with very narrow fields of view. To alleviate this constraint, some papers [19], [21] propose using head


pose tracking to localize eye corners and drive the acquisition of high resolution eye images

using a pan-tilt-zoom (PTZ) camera. These systems, however, require very good calibration, and

are still designed for near frontal head poses [21].

In spaces such as ofﬁces or meeting rooms, where the motion and head orientation of people are

unconstrained, high resolution images of people's eyes are not available. An alternative is to use

the head pose as a surrogate for gaze, as proposed in [22]. Broadly speaking, head pose tracking

algorithms can be divided into two groups: model based and appearance based approaches. In

model based approaches, a set of facial features such as the eyes, the nose and the mouth are

tracked. Then, knowing the relative positions of these features, the head pose can be inferred

using anthropometric information [23], [24]. The major drawback is that robust facial feature

tracking is difficult unless high enough resolution images are used. By modelling the appearance of the whole head, such approaches exhibit more robustness for low resolution images: [12] used a neural network to model head appearance, [25], [26] developed active appearance models based on principal component analysis, and [27], [28] used multidimensional Gaussian distributions to represent the head appearance likelihood.

From another perspective, head pose tracking algorithms differentiate themselves according to

whether or not the tracking and the pose estimation are conducted jointly. Often, a generic tracker

is used to locate the head, and then features extracted at this location are used to estimate the pose

[12], [26], [27], [28]. Decoupling the tracking and the pose estimation results in a computational

cost reduction. However, since head pose estimation is very sensitive to head localization [28],

head pose results are highly dependent on the tracking accuracy. To address this issue, [25],

[29], [18] perform the head tracking and the pose estimation jointly.

In contrast to head tracking algorithms, few works have investigated the recognition of the

VFOA directly from head pose. Pioneering work from [12] used a GMM model, the parameters

of which were learned on the test data after initialization from the output of a K-means clustering

of the pose values. This approach was possible due to constraints on the physical set-up (four

people evenly spaced around a round table) and by limiting the allowed VFOA targets to the

other participants. These constraints allowed them to rely only on the pan angle to represent the

head pose, and limited the possibility of ambiguities in the head pose. In addition, [12] showed

that using other participants' speaking status could further improve VFOA recognition. More

recently, [13] used a dynamic Bayesian network to jointly recognize the VFOA of people, as well


as different conversational models in a 4-person conversation, based on head pan and speaking

status observations. Finally, in more recent work, [30] exploited the head pose extracted from an

overhead camera tracking retro-reﬂective markers mounted on headsets to look for occurrences of

shared mutual visual attention. This information was then exploited to derive the social geometry

of co-workers within an ofﬁce, and infer their availability status for communication.

III. DATABASE AND TASK

In this section, we describe the VFOA recognition task, and the data that is used to evaluate

both our pose estimation and VFOA recognition algorithms.

A. The Task and VFOA Set

Our goal is to evaluate how well we can infer the VFOA state of a person using head pose

in common meeting situations. Let us ﬁrst note that while the VFOA is given by the eye

gaze, psycho-visual studies have shown that people use other cues (e.g. head and body posture, speaking status) to recognize the VFOA state of another person [6]. Thus, one general objective

of the current work is to see how well one can recognize the VFOA of people from these

other cues in the absence of direct gazing measurements, a situation likely to occur in many

applications of interest. An important issue is: what should be the deﬁnition of a person’s VFOA

state? At ﬁrst thought, one can consider that each different gaze direction could correspond to

a potential VFOA. However, studies on the VFOA in natural conditions [31] have shown that

humans tend to look at targets, whether humans or objects, that are either relevant to the task they

are solving or of immediate interest to them. Additionally, one interprets another person’s gaze

not as continuous 3D spatial locations, but as a gaze towards objects that have been identiﬁed

as potential targets. This process is often called the shared-attentional mechanism [32], [6], and

suggests that in general VFOA states correspond to a finite set of targets of interest.

Thus, in our meeting context the set of potential VFOA targets, denoted F, has been deﬁned

as: the other participants, the slide-screen, and the table. When none of the previous applies (the person is distracted by some noise or visual stimulus and looks at another target), we use an additional label called unfocused. As a result, for person left in Fig. 1(c), we have: F = {PR, O2, O1, SS, TB, U}, where PR stands for person right, O1 and O2 for organizers 1 and 2, SS for slide screen, TB for table, and U for unfocused. For person right, F = {PL, O2, O1, SS, TB, U}, where PL stands for person left. Note that in practice, the unfocused


label only represents a small percentage of our data (2%), while the other VFOA targets represent 55%, 26% and 17% for the other participants, the slide screen, and the table, respectively.
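For concreteness, the two target sets and the empirical proportions just mentioned can be written out explicitly; the snippet below is merely a restatement of the definitions above in Python form.

```python
# VFOA target sets F for the two annotated participants (paper's abbreviations):
# O1/O2 = organizers 1 and 2, SS = slide screen, TB = table, U = unfocused.
VFOA_TARGETS_PERSON_LEFT = ["PR", "O1", "O2", "SS", "TB", "U"]    # PR = person right
VFOA_TARGETS_PERSON_RIGHT = ["PL", "O1", "O2", "SS", "TB", "U"]   # PL = person left

# Approximate share of annotated frames per target group (statistics reported above).
VFOA_PROPORTIONS = {"other participants": 0.55, "slide screen": 0.26,
                    "table": 0.17, "unfocused": 0.02}
```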

B. The Database

Our experiments rely on the IDIAP Head Pose Database (IHPD) 1. The video database was

collected along with a head pose ground truth and each participant’s discrete VFOA ground

truth, as explained below.

Content description: the database is comprised of 8 meetings involving 4 people each, recorded

in a meeting room (cf Fig. 1(a)). The meeting durations ranged from 7 to 14 minutes, which

was long enough to realistically represent a general meeting scenario. In shorter recordings (less

than 2-3 minutes), we found that participants tend to be more active, moving their heads more to focus on other people and objects. In our meetings, or in longer ones, the attention of participants sometimes drops and people are less focused on the other meeting participants.

Note, however, that the small group size encourages engagement of participants in the meeting,

in contrast to meetings with larger groups. Meeting participants were instructed to write down

their name on a sheet of paper, then discuss statements displayed on the projection screen. There

were no restrictions placed on head motion or head pose.

Head pose annotation: in each meeting, the head poses of two persons (person left and right in Fig. 1(c)) were continuously annotated using a magnetic field sensor called flock of birds (FOB) rigidly attached to the head, resulting in a video database of 16 different people. The coordinate frame of the magnetic sensors was calibrated with respect to the camera frame, allowing us to generate the head pose ground truth with respect to the camera. The head pose is defined by three Euler angles (α, β, γ) that parametrize the decomposition of the rotation matrix of the head configuration with respect to the camera frame. To report our results, we have selected among the possible Euler decompositions the one whose rotation axes are rigidly attached to the head (see Fig. 4(a)): α denotes the pan angle, a left/right head rotation; β denotes the tilt, an up/down head rotation; and finally γ, the roll, represents a left/right "head on shoulder" rotation. Because of our meeting scenario, people often have negative pan values corresponding to looking at the projection screen. Recorded pan values range from -70 to 60 degrees, tilt values range from -60 (when people are writing) to 15 degrees, and roll values from -30 to 30 degrees.

1Available at http://www.idiap.ch/HeadPoseDatabase/ (IHPD)


Fig. 2. Overview of the different recognition approaches and modules: (a) VFOA recognition without adaptation; (b) VFOA recognition with adaptation; (c) VFOA parameter setting: training approach; (d) VFOA parameter setting: geometric approach.

VFOA annotation: using the predeﬁned discrete set of VFOA targets F, the VFOA of each

person (PL and PR) was manually annotated on the basis of their gaze direction by a single

annotator using a multimedia interface. The annotator had access to all data streams, including

the central camera view (Fig. 1(a)). Speciﬁc annotation guidance was deﬁned in [33].

IV. OVERVIEW OF THE PROPOSED VFOA RECOGNITION METHODS

In this section, schematic representations of the components of the VFOA recognition methods

proposed in this paper are provided in Fig. 2 to give a global view of the methods.

Fig. 2(a) presents the VFOA recognition method when no adaptation is used. The frames of an

input video are sent to the head pose tracking algorithm (described in Section V) which outputs

people’s head poses. These poses are then processed by the VFOA recognizer module (described

in Section VI-A), whose parameters are provided by a parameter setting module (Section VI-B).

In Fig. 2(b), the use of unsupervised adaptation for VFOA recognition is sketched (described

in Section VII). In this case, we employ a batch processing: the whole input video is processed

by the head tracker to obtain the head poses of people over the entire meeting. Then, the

adaptation module estimates in an unsupervised fashion (without using any annotated data) the

VFOA recognizer parameters by ﬁtting the recognizer model to the head poses while taking

into account priors on these parameters. Some of the parameters of these priors are provided

by the parameter setting module. Finally, the VFOA recognition module applies the parameters

obtained through unsupervised adaptation to head poses to output the recognized VFOA.

Fig. 2(c) and 2(d) describe the two options that are used to define the parameter setting module involved in Fig. 2(a) and 2(b). The first option relies on training data: training videos are sent to the head pose tracking module, whose output is used in conjunction with manual

annotations of people’s VFOA to learn the VFOA recognition parameters relating head pose to

VFOA targets. The second option relies on a cognitive model of how people gaze at targets,

and uses the locations of people and objects in the room as input. Section VI-B describes how

the parameters are set in the two options and used when no adaptation is performed, while

Section VII-C describes how the same parameters are used to deﬁne the hyper-parameters of

the adaptation module.

V. HEAD POSE TRACKING

Head pose can be obtained in two ways: ﬁrst, from the magnetic sensor readings (cf Sec-

tion III). We will consider this virtually noise-free data as our ground truth, denoted GT in the remainder of the paper. Secondly, by applying a head pose tracker on the video stream. In this Section, we

summarize the computer vision probabilistic head tracker that we employed. Then, the pose

estimates provided by the tracker are compared with the GT and analyzed in detail, ultimately

giving us better insight into the VFOA recognition results presented in Section IX.

A. Probabilistic Method for Head Pose Tracking

The Bayesian formulation of the tracking problem is well known. Denoting by X_t the hidden state representing the object configuration at time t and by Y_t the observation extracted from the image, the objective is to estimate the filtering distribution p(X_t|Y_{1:t}) of the state X_t given the sequence of all observations Y_{1:t} = (Y_1, ..., Y_t) up to the current time. Given standard assumptions, Bayesian tracking amounts to solving the following recursive equation:

p(X_t|Y_{1:t}) \propto p(Y_t|X_t) \int_{X_{t-1}} p(X_t|X_{t-1}) p(X_{t-1}|Y_{1:t-1}) dX_{t-1}    (1)

In non-Gaussian and non-linear cases, this can be done recursively using sampling approaches, also known as particle filters (PF). The idea behind the PF consists in representing the filtering distribution using a set of N_s weighted samples (particles) {X^n_t, w^n_t, n = 1, ..., N_s} and updating this representation when new data arrives. Given the particle set of the previous time step, configurations of the current step are drawn from the proposal distribution X_t ~ \sum_n w^n_{t-1} p(X|X^n_{t-1}). The weights are then computed as w_t \propto p(Y_t|X_t).
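To make this recursion concrete, the following is a minimal, generic bootstrap particle filter step in Python/NumPy. It is only an illustration of the sampling scheme: the toy one-dimensional state, random-walk dynamics and Gaussian likelihood are placeholders, not the head-pose models described next.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, propagate, likelihood, rng):
    """One bootstrap PF update: resample, propagate through the dynamics, reweight."""
    n = len(particles)
    # Resample particle indices proportionally to the current weights
    idx = rng.choice(n, size=n, p=weights)
    # Propagate each resampled particle through the dynamical model p(X_t | X_{t-1})
    particles = np.array([propagate(particles[i], rng) for i in idx])
    # Reweight with the observation likelihood p(Y_t | X_t), then normalize
    weights = np.array([likelihood(observation, x) for x in particles])
    return particles, weights / weights.sum()

# Toy 1-D example: random-walk dynamics and a Gaussian likelihood (placeholders).
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=200)
weights = np.full(200, 1.0 / 200)
propagate = lambda x, rng: x + rng.normal(0.0, 0.1)
likelihood = lambda y, x: np.exp(-0.5 * (y - x) ** 2)
particles, weights = particle_filter_step(particles, weights, 0.3, propagate, likelihood, rng)
print("posterior mean estimate:", np.dot(weights, particles))
```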

Four elements are important in deﬁning a PF: i) a state model deﬁning the object we are

interested in; ii) a dynamical model p(X_t|X_{t-1}) governing the temporal evolution of the state;

Fig. 3. (a) Training head pose appearance range: pan and tilt angles range from -90° to 90° and from -60° to 60° respectively, in 15° steps. (b) and (c) Tracking features: texture features from Gaussian and Gabor filters (b), and skin color binary mask (c).

iii) a likelihood model measuring the adequacy of data given the proposed conﬁguration of the

tracked object; and iv) a sampling mechanism which has to propose new conﬁgurations in high

likelihood regions of the state space. These elements are described in the next paragraphs.

State Space: The state space contains both continuous and discrete variables. More precisely, the

state is defined as X = (S, θ, l), where S represents the head location and size, and θ represents the in-plane head rotation. The variable l labels an element of the discretized set of possible out-of-plane head poses² (see Fig. 3a).

Dynamical Model: The dynamics governs the temporal evolution of the state, and is deﬁned as

p(X_t|X_{1:t-1}) = p(θ_t|θ_{t-1}, l_t) p(l_t|l_{t-1}, S_t) p(S_t|S_{t-1}, S_{t-2}).    (2)

The dynamics of the in-plane head rotation θ_t and discrete head pose l_t variables are learned

using head pose GT training data. Head location and size dynamics are modeled as second order

auto-regressive processes.

Observation Model: The observation model p(Y|X) measures the likelihood of the observation for a given state value. The observations Y = (Y^text, Y^col) are composed of texture and color observations (see Fig. 3(b) and Fig. 3(c)). Texture features are represented by the output of three filters (a Gaussian and two Gabor filters at different scales) applied at locations sampled from image patches extracted from the image and preprocessed by histogram equalization to reduce lighting variation effects. Color features are represented by a binary skin mask extracted using a temporally adapted skin color model. Assuming that, given the state value, the texture and color observations are independent, the observation likelihood is modeled as:

p(Y|X = (S, θ, l)) = p_text(Y^text(S, θ)|l) p_col(Y^col(S, θ)|l)    (3)

²Note that (θ, l) is another Euler decomposition (using different axes) of the head pose, which differs from the one described in Subsection III-B (cf Fig. 3a). Its main computational advantage is that one of the angles corresponds to the in-plane rotation. It is straightforward to transform from one decomposition to the other.


where p_col(·|l) and p_text(·|l) are pose dependent models. For a given hypothesized configuration X, the parameters (S, θ) define an image patch on which the features are computed, while the exemplar index l selects the appropriate appearance model.

Sampling Method: In this work, we use Rao-Blackwellization, a process in which we apply the

standard PF algorithm to the tracking variables S and θ while applying an exact filtering step to the exemplar variable l. The method theoretically results in a reduced estimation variance, as well as a reduction of the number of samples required. For more details about the models and algorithm, the reader is referred to [18]. Finally, in terms of complexity, the head tracker (implemented in Matlab) can process around 1 frame per second.

B. Head Pose Tracking Evaluation

Protocol: We used a two-fold evaluation protocol, where for each fold, we used half (8 people)

of our IHPD database (see Sec. III-B) as the training set to learn the pose dynamic model and the remaining half as the test set. Initialization was done automatically using a simple background subtraction technique, modeling the distribution of a pixel's background color with one Gaussian, under the assumptions that a background image is available and that there was one face in each of the left and right halves of the image (cf Fig. 1(c)).

It is important to note that the pose dependent appearance models were not learned using the

same people or head images gathered in the same meeting room environment. We used the

Prima-Pointing database [34], which contains 15 individuals recorded over 93 different poses

(see Fig. 3(a)). However, when learning appearance models over whole head patches, as done

in [18], we experienced tracking failures with 2 out of the 16 people of the IHPD database

(see Section III) which had hair appearances not represented in the Prima-Pointing dataset (e.g.

one of those two people was bald). As a remedy, we trained the appearance models on patches

centered around the visible part of the face, not the head. With this modiﬁcation, no failure was

observed, but performance was slightly worse overall than that reported in [18].

Performance measures: three error measures are used. They are the average errors in pan, tilt and

roll angles, i.e. the average of the absolute difference between the pan, tilt and roll of the ground

truth (GT) and the tracker estimation. We also report the error median value, which should be

less affected by very large errors due to erroneous tracking.

Results: The statistics of the errors are shown in Table I. Overall, given the small head size,

and the fact that the appearance training set is composed of faces recorded in an external set


TABLE I
PAN/TILT/ROLL ERROR STATISTICS FOR PERSON LEFT/RIGHT, AND DIFFERENT CONFIGURATIONS OF THE TRUE HEAD POSE.

condition     right persons   left persons   pan near frontal   pan near profile   tilt near frontal   tilt far from frontal
                                             (|α| < 45°)        (|α| > 45°)        (|β| < 30°)         (|β| > 30°)
stat          mean   med      mean   med     mean   med         mean   med         mean   med          mean   med
pan (°)       11.4   8.9      14.9   11.3    11.6   9.5         16.9   14.7        12.7   10           18.6   15.9
tilt (°)      19.8   19.4     18.6   17.1    19.7   18.9        17.5   17.5        19     18.8         22.1   21.4
roll (°)      14     13.2     10.3   8.7     10.1   8.8         18.3   18.1        11.7   10.8         18.1   16.8

Fig. 4. (a) Head pose Euler rotation angles; note that the z axis indicates the head pointing direction. (b) and (c) Pan, tilt and roll tracking errors: (b) average errors for each person (R for right and L for left person) and (c) distribution of tracking errors over the whole dataset.

up (different people, different viewing and illumination conditions), the results are quite good,

with a majority of head pan errors smaller than 12° (see Figure 4). However, these results hide a large discrepancy between individuals. For instance, the average pan error ranges from 7° to 30°, and depends mainly on whether the tracked person's appearance is well represented by the appearances in the training set used to learn the appearance model. This was more often the case for people seated on the right than on the left, as shown by Table I.

Table I also shows that overall the pan and roll tracking errors are smaller than the tilt errors.

The main reason is that tilt estimation is more sensitive to the quality of the face localization

than the pan, as pointed out by other researchers [28]. Indeed, even from a perceptual point of

view, visually determining head tilt is more difﬁcult than determining head pan or head roll.

Table I further details the errors depending on whether the true pose is near frontal or not. We

can observe that, in the near frontal poses (|α| ≤ 45◦or |β| ≤ 30◦), the head pose tracking

estimates are more accurate, in particular for the pan and roll values. This can be understood

since for near proﬁle poses, a variation in pan introduces much less appearance change than the

same variation in a near frontal view. Similarly, for high tilt values, the face-image distortion

introduced by perspective shortening affects the quality of the observations.

Finally, these results are comparable to those obtained by others in similar conditions. For


instance, [27] achieved a pan estimation error of 16.9 degrees for poses near the frontal position,

and 19.2 degrees for poses near proﬁle (|α|>45◦). In [12], a neural network is used to train a

head pose classiﬁer from data recorded directly in two meeting rooms. When using 15 people

for training and 2 for testing, average errors of 5 degrees in pan and tilt are reported. However,

when training the models in one room and testing on data from the other meeting room, the

average errors rise to 10 degrees.

VI. VISUAL FOCUS OF ATTENTION MODELING

In this Section, we ﬁrst describe the models used to recognize the VFOA from the head pose

measurements, then the two alternatives we adopted to set the model parameters.

A. VFOA recognizer models

Modeling VFOA with a Gaussian Mixture Model (GMM): Let s_t ∈ F denote the VFOA state, and z_t the head pointing direction of a person at a given time instant t. The head pointing direction is defined by the head pan (α) and tilt (β) angles, i.e. z_t = (α_t, β_t), since the head roll (γ) has no effect on the head pointing direction by definition (see Fig. 3(a)). Estimating the visual focus can be posed in a probabilistic framework as finding the VFOA state maximizing the a posteriori probability:

\hat{s}_t = \arg\max_{s_t \in F} p(s_t|z_t),  with  p(s_t|z_t) = \frac{p(z_t|s_t) p(s_t)}{p(z_t)} \propto p(z_t|s_t) p(s_t)    (4)

For each VFOA f_i ∈ F which is not unfocused, the likelihood p(z_t|s_t = f_i) of the pose observations for the VFOA state f_i is modeled as a Gaussian distribution N(z_t; µ_i, Σ_i) with mean µ_i and full covariance matrix Σ_i. The unfocused state is modeled as a uniform distribution, p(z_t|s_t = unfocused) = u with u = 1/(180 × 180), as the head pan and tilt angles can each vary from -90° to 90°. In Eq. 4, p(s_t = f_i) = π_i denotes the prior information we have on a VFOA target f_i. Thus, in this modeling, the total pose distribution is represented as a GMM (plus one uniform mixture), with the mixture index i denoting the focus target:

p(z_t|λ_G) = \sum_{s_t} p(z_t, s_t|λ_G) = \sum_{s_t} p(z_t|s_t, λ_G) p(s_t|λ_G) = \sum_{i=1}^{K-1} π_i N(z_t; µ_i, Σ_i) + π_K u,    (5)

where λ_G = {µ = (µ_i)_{i=1:K-1}, Σ = (Σ_i)_{i=1:K-1}, π = (π_i)_{i=1:K}} represents the parameter set of the GMM model. Fig. 12 illustrates how the pan-tilt space is split into different VFOA regions when applying the decision rule of Eq. 4 with the GMM modeling.
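The following is a minimal sketch of the frame-level decision rule of Eq. 4, assuming the Gaussian parameters and priors have already been set; the two-target numbers below are made up for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def recognize_vfoa_frame(z, means, covs, priors, uniform_density=1.0 / (180 * 180)):
    """Return the index of the VFOA target maximizing p(z|s)p(s); the last
    component is the 'unfocused' state, modeled with a uniform density."""
    scores = [priors[i] * multivariate_normal.pdf(z, means[i], covs[i])
              for i in range(len(means))]
    scores.append(priors[-1] * uniform_density)   # unfocused component
    return int(np.argmax(scores))

# Hypothetical parameters for two targets (pan, tilt in degrees) plus 'unfocused'.
means = [np.array([-45.0, 0.0]), np.array([20.0, -10.0])]
covs = [np.diag([15.0**2, 12.0**2]), np.diag([12.0**2, 12.0**2])]
priors = [0.45, 0.45, 0.10]
print(recognize_vfoa_frame(np.array([-40.0, 5.0]), means, covs, priors))  # -> 0
```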


Modeling VFOA with a Hidden Markov Model (HMM): The GMM approach does not account for the temporal dependencies between VFOA events. To introduce such dependencies, we consider an HMM, which is a natural extension of the GMM approach for modeling temporal dependencies between the VFOA events. Denoting the VFOA sequence by s_{0:T} and the observation sequence by z_{1:T}, the joint probability density function of states and observations can be written:

p(s_{0:T}, z_{1:T}) = p(s_0) \prod_{t=1}^{T} p(z_t|s_t) p(s_t|s_{t-1})    (6)

In this equation, the emission probabilities p(z_t|s_t = f_i) are modeled as in the previous case (i.e. Gaussian distributions for the regular focus targets, a uniform distribution for the unfocused case). However, in the HMM modeling, the static prior distribution on VFOA targets is replaced by a discrete transition matrix A = (a_{i,j}), defined by a_{i,j} = p(s_t = f_j|s_{t-1} = f_i), which models the probability of passing from focus f_i to focus f_j. Thus, the set of parameters of the HMM model is λ_H = {µ, Σ, A = (a_{i,j})_{i,j=1:K}}. With this model, given the observation sequence, VFOA recognition is performed by estimating the optimal sequence of focus targets which maximizes p(s_{0:T}|z_{1:T}). This optimization is efficiently conducted using the Viterbi algorithm [35]³.
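For illustration, a compact Viterbi decoding sketch over precomputed per-frame log-likelihoods; in the VFOA model the emission terms would come from the Gaussian and uniform densities above, and the transition matrix A from Section VI-B.

```python
import numpy as np

def viterbi(log_emissions, log_A, log_prior):
    """log_emissions: T x K array of log p(z_t | s_t = k); returns the MAP state path."""
    T, K = log_emissions.shape
    delta = log_prior + log_emissions[0]          # best log-score ending in each state
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_A            # K x K: previous state -> current state
        backptr[t] = np.argmax(trans, axis=0)
        delta = trans[backptr[t], np.arange(K)] + log_emissions[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy example: 3 frames, 2 states.
logE = np.log(np.array([[0.9, 0.1], [0.05, 0.95], [0.7, 0.3]]))
logA = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
print(viterbi(logE, logA, np.log(np.array([0.5, 0.5]))))   # -> [0, 1, 1]
```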

B. VFOA Recognizer Parameter Setting

Gaussian Parameter Setting using labeled Training Data: Since in many meeting settings, peo-

ple are mostly static and seated at the same physical positions, we could set the model parameters

using training data. Thus, given training data with VFOA annotations, and head pose measure-

ments, we can readily estimate all the parameters of the GMM or HMM models. Parameters

learned with this training approach will be denoted with a superscript l. Note that µ^l_i and Σ^l_i are learned by first computing the VFOA means and covariances per meeting and then averaging the results over the meetings belonging to the training set.
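A sketch of this training-based setting is given below, assuming annotated (pose, VFOA label) pairs are available per meeting; the data layout and variable names are hypothetical.

```python
import numpy as np

def learn_gaussian_params(meetings, target):
    """meetings: list of (poses, labels) pairs, where poses is an N x 2 NumPy array of
    (pan, tilt) and labels an N-vector (NumPy array) of VFOA target labels.
    Returns the per-meeting means/covariances for `target`, averaged over meetings."""
    means, covs = [], []
    for poses, labels in meetings:
        sel = poses[labels == target]
        if len(sel) > 1:
            means.append(sel.mean(axis=0))                 # per-meeting mean pose
            covs.append(np.cov(sel, rowvar=False))         # per-meeting covariance
    return np.mean(means, axis=0), np.mean(covs, axis=0)   # averaged over meetings
```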

Gaussian Parameter Setting using a Geometric Model: The training approach to parameter learn-

ing is straightforward when annotated data is available. However, annotating the VFOA of people

in video recordings is tedious and time consuming, as training data needs to be gathered and

annotated for each meeting setup. In the case of moving people, this is impossible. As an

³In principle, such a decoding procedure is performed in batch. However, efficient online approximations are available.


Fig. 5. Relationship between gazing direction and head orientation (person P, reference direction N, gaze target D, head direction H, and angles α_G, α_H, α_E).

alternative, we propose a model that exploits the geometric and cognitive nature of the problem.

The parameters set with this model will be denoted with a superscript g (e.g. µ^g_i).

Assuming that we have a camera calibrated w.r.t. the room, given a head location and a VFOA

target location, it is possible to derive the Euler angles associated with the gaze direction. As

gazing at a target is usually accomplished by rotating both the eyes (’eye-in-head’ rotation)

and the head in the same direction, the head is only partially oriented towards the gaze. In

neurophysiology and cognitive sciences, researchers studying the dynamics of the head/eye

motions involved in saccadic gaze shifts have found that the relative contribution of the head

and eyes towards a given gaze shift follows simple rules [16], [31]. While the experimental

framework employed in these papers does not completely match the meeting room scenario, we

have exploited these ﬁndings to propose a model for predicting a person’s head pose given his

gaze target.

The proposed geometric model is presented in Fig. 5. Given a person P whose reference head

pose corresponds to looking straight ahead in the N direction, and given that he is gazing towards

D, the head points in direction H according to:

α_H = κ_α α_G  if |α_G| > ξ_α,  and 0 otherwise    (7)

where α_G and α_H denote the gaze pan and the actual head pan angle respectively, both w.r.t. the reference direction N. The parameters of this model, κ_α and ξ_α, are constants independent of the gaze target, but usually depend on individuals [16]. While there is a consensus about the linearity of the relation in Eq. 7, some researchers reported observing head movements for all gaze shift amplitudes (i.e. ξ_α = 0), while others did not. In this paper, we will assume ξ_α = 0. Besides, Eq. 7 is only valid if the contribution of the eyes to the gaze shift (given by α_E = α_G - α_H) does not exceed a threshold, usually taken at ~35°. Finally, in [16], it is shown that the tilt angle β follows a similar linearity rule. However, in this case, the contribution of the head to the gaze shift is usually lower than for the pan case. Typical values range from 0.2


to 0.5 for κ_β, and 0.5 to 0.8 for κ_α.

We assume we know the approximate positions of the people's heads, the VFOA targets, and the camera within the room⁴. The cognitive model can then be used to predict the values of the mean angles µ of the Gaussian distribution associated with each VFOA target. The reference direction N (Fig. 5) is assumed to grossly correspond to the mean of all the gaze target directions. For both person left and person right, it corresponds to looking at O1 (cf Fig. 1(c)). The covariances Σ of the Gaussian distributions were assumed to be diagonal, and were set by taking into account the physical target size, and the fact that VFOA targets corresponding to head poses in profile are associated with larger pan tracking errors. The specific values were: σ_α(O1, O2) = 12°, σ_α(PR, PL, SS) = 15°, and σ_α(TB) = 17° for the pan, and σ_β(O1, O2, PR, PL, SS) = 12°, σ_β(TB) = 15° for the tilt.
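The sketch below illustrates how the geometric model can be used to predict the Gaussian mean for a target, under the assumption ξ_α = 0 and with gains chosen inside the ranges reported above. The coordinate convention (z along the reference direction N, x to the person's left, y up) and the positions are assumptions made for this example only; the paper works in the calibrated camera frame.

```python
import numpy as np

def predict_head_pose(head_pos, target_pos, kappa_pan=0.65, kappa_tilt=0.35):
    """Predict the head (pan, tilt) in degrees for a person gazing at target_pos,
    following alpha_H = kappa_alpha * alpha_G (Eq. 7 with xi_alpha = 0) and the
    analogous rule for tilt. Positions are in the person's reference frame:
    z along the reference direction N, x to the person's left, y up (assumption)."""
    d = np.asarray(target_pos, float) - np.asarray(head_pos, float)
    gaze_pan = np.degrees(np.arctan2(d[0], d[2]))                    # left/right gaze angle
    gaze_tilt = np.degrees(np.arctan2(d[1], np.hypot(d[0], d[2])))   # up/down gaze angle
    return kappa_pan * gaze_pan, kappa_tilt * gaze_tilt

# Hypothetical positions in meters: a seated person and a slide-screen target.
print(predict_head_pose(head_pos=[0.0, 1.2, 0.0], target_pos=[-1.5, 1.5, 2.0]))
```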

Setting the VFOA Prior Distribution π and Transition Matrix A: When training data is avail-

able, one could learn these parameters. If the training meetings exhibit a speciﬁc structure,

as is the case in our database, where the main and secondary organizers always occupy the

same seats, the learned prior will have a beneficial effect on the recognition performance for

similar unseen meetings. However, at the same time, this learned prior can considerably limit

the generalization to other data sets, since by simply exchanging seats between participants,

we obtain meeting sessions with different prior distributions. Thus, we investigated alternatives

that avoided favoring any meeting structures. In the GMM case, this was done by considering

a uniform distribution (denoted π^u) over the prior π. In the HMM case, transitions defining the probability of keeping the same focus were favored and transitions to other focuses were distributed uniformly, according to: a_{i,i} = ε < 1 (we used ε = 0.75), and a_{i,j} = (1 - ε)/(K - 1) for i ≠ j, where K is the number of VFOA targets. We denote by A^u the constructed transition matrix.
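The default transition matrix A^u can be constructed directly from this definition, for example:

```python
import numpy as np

def default_transition_matrix(K, eps=0.75):
    """A^u: probability eps of keeping the current focus, remaining mass spread
    uniformly over the other K-1 targets."""
    A = np.full((K, K), (1.0 - eps) / (K - 1))
    np.fill_diagonal(A, eps)
    return A

print(default_transition_matrix(6))
```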

VII. VFOA MODELS ADAPTATION

The VFOA recognizers described in the previous section are generic and can be applied

indifferently to any new person seated at the location corresponding to the deﬁned model. In

practice, however, we observed that people have personal ways of looking at targets. For example,

some people use their eye-in-head rotation capabilities more and turn their head less towards

⁴The relation in Eq. 7 is valid in the person's head reference frame. The camera position is needed in order to transform the obtained pose values into head poses w.r.t. the camera.


Fig. 6. Examples of gaze behaviours. (a) and (b): in both images, the person on the right looks at the target O1. In (b), however, the head is rotated more toward O1 than in (a).

the focused target than others (see Fig. 6(a) and Fig. 6(b)). In addition, our head pose tracking

system is sensitive to the visual appearance of people, and can introduce a systematic bias in the

estimated head pose for a given person. As a consequence, the parameters of the generic models

might not be the best for a given person. As a remedy we propose to exploit the Maximum A

Posteriori (MAP) estimation principle to adapt, in an unsupervised fashion, the generic VFOA

models to the data of each new meeting, and thus produce models adapted to an individual’s

characteristics.

A. VFOA Maximum a Posteriori (MAP) Adaptation Principle

The MAP adaptation procedure we followed is a batch process, as explained in Section IV.

Its principle is the following. Let z = z_1, ..., z_T denote the unlabeled sequence of head poses of one person, to which we want to adapt our model, and λ ∈ Λ the parameters of the VFOA recognizer to be estimated from the head pose data. The MAP estimate \hat{λ} of the parameters is then defined as:

\hat{λ} = \arg\max_{λ ∈ Λ} p(λ|z) = \arg\max_{λ ∈ Λ} p(z|λ) p(λ)    (8)

where p(z|λ) is the data likelihood and p(λ) is the prior on the parameters. The goal is thus to find the parameters that best fit the observed head pose distribution, while avoiding too large a deviation from sensible values through the use of priors on the parameters. The choice of the prior distribution is crucial for the MAP estimation. In [36] it is shown that for GMMs and HMMs, if the prior probability density function (pdf) on λ is selected as the product of appropriate conjugate distributions of the data likelihood⁵, then the MAP estimation can also be solved using the Expectation-Maximization (EM) algorithm, as detailed in the next two sub-sections.

⁵A prior distribution g(λ) is the conjugate distribution of a likelihood function f(z|λ) if the posterior f(z|λ)g(λ) belongs to the same distribution family as g.


B. VFOA GMM and HMM MAP Adaptation

GMM MAP Adaptation: In the case where the VFOA targets are modeled by a GMM, the data likelihood is p(z|λ_G) = \prod_{t=1}^{T} p(z_t|λ_G), where p(z_t|λ_G) is the mixture model given in Eq. 5, and λ_G are the parameters to be learnt. For this model, it is possible to express the prior probability as a product of individual conjugate priors [36]. Accordingly, the conjugate prior of the multinomial mixture weights is the Dirichlet distribution D(νw_1, ..., νw_K), whose pdf is given by:

p^D_{νw_1,...,νw_K}(π_1, ..., π_K) \propto \prod_{i=1}^{K} π_i^{νw_i - 1}    (9)

Additionally, the conjugate prior for the Gaussian mean and the inverse covariance matrix of a given mixture is the Normal-Wishart distribution W(τ, m_i, d, V_i) (i = 1, ..., K-1), with pdf

p^W_i(µ_i, Σ_i^{-1}) \propto |Σ_i^{-1}|^{(d-p)/2} \exp(-\frac{τ}{2}(µ_i - m_i)' Σ_i^{-1} (µ_i - m_i)) \exp(-\frac{1}{2} tr(V_i Σ_i^{-1})),  d > p    (10)

where tr denotes the trace operator, (µ_i - m_i)' denotes the transpose of (µ_i - m_i), and p denotes the observations' dimension. Thus the prior distribution on the set of all the parameters is defined as

p(λ_G) = p^D_{νw_1,...,νw_K}(π_1, ..., π_K) \prod_{i=1}^{K-1} p^W_i(µ_i, Σ_i^{-1}).    (11)

The MAP estimate \hat{λ}_G of the distribution p(z|λ_G)p(λ_G) can thus be computed using the EM algorithm by recursively applying the following computations (see Fig. 7) [36]:

c_{it} = \frac{\hat{π}_i p(z_t|\hat{µ}_i, \hat{Σ}_i)}{\sum_{j=1}^{K} \hat{π}_j p(z_t|\hat{µ}_j, \hat{Σ}_j)}  and  c_i = \sum_{t=1}^{T} c_{it}    (12)

\bar{z}_i = \frac{1}{c_i} \sum_{t=1}^{T} c_{it} z_t  and  S_i = \frac{1}{c_i} \sum_{t=1}^{T} c_{it} (z_t - \bar{z}_i)(z_t - \bar{z}_i)'    (13)

where \hat{λ}_G = (\hat{π}, (\hat{µ}, \hat{Σ})) denotes the current parameter fit. Given these coefficients, the M step re-estimation formulas are given by:

\hat{π}_i = \frac{νw_i - 1 + c_i}{ν - K + T},  \hat{µ}_i = \frac{τ m_i + c_i \bar{z}_i}{τ + c_i}  and  \hat{Σ}_i = \frac{V_i + c_i S_i + \frac{c_i τ}{c_i + τ}(m_i - \bar{z}_i)(m_i - \bar{z}_i)'}{d - p + c_i}    (14)

The setting of the hyper-parameters of the prior distribution p(λ_G) in Eq. 11, which is discussed at the end of this Section, is important: as the adaptation is unsupervised, only the prior prevents the adaptation process from deviating from meaningful VFOA distributions.

Input: adaptation parameters (ν, {w_i}) for the Dirichlet prior, (τ, d, {m_i, V_i}) for the Wishart prior.
Output: estimated parameters \hat{λ}_G of the recognizer model.
• Initialization of \hat{λ}_G: \hat{π}_i = w_i, \hat{µ}_i = m_i, \hat{Σ}_i = V_i/(d - p).
• EM: repeat until convergence:
  1) Expectation: compute c_{it}, \bar{z}_i and S_i (Eq. 12 and 13) using the current parameter set \hat{λ}_G.
  2) Maximization: update the parameter set \hat{λ}_G using the re-estimation formulas (Eq. 14).

Fig. 7. GMM MAP adaptation procedure.
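A condensed Python/NumPy sketch of the procedure of Fig. 7 (Eq. 12-14) is given below. It adapts the K-1 Gaussian components and all the mixture weights, treats the unfocused component as a fixed uniform density, and replaces the convergence test by a fixed number of iterations; it is an illustration of the update equations, not the exact implementation used in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_gmm(z, m, V, w, nu, tau, d, n_iter=20, uniform_density=1.0 / (180 * 180)):
    """Unsupervised MAP adaptation of the VFOA GMM (Eq. 12-14, Fig. 7).
    z: T x 2 head poses; m, V: lists of prior means/covariances (NumPy arrays) for the
    K-1 Gaussian targets; w: prior weights for all K components (last one = unfocused)."""
    T, p = z.shape
    K = len(w)
    mu = [mi.copy() for mi in m]
    Sigma = [Vi / (d - p) for Vi in V]          # initialization as in Fig. 7
    pi = np.array(w, float)
    for _ in range(n_iter):
        # E-step (Eq. 12-13): responsibilities, soft counts, weighted means and scatters
        lik = np.column_stack(
            [multivariate_normal.pdf(z, mu[i], Sigma[i]) for i in range(K - 1)]
            + [np.full(T, uniform_density)])
        r = pi * lik
        r /= r.sum(axis=1, keepdims=True)
        c = r.sum(axis=0)
        # M-step (Eq. 14): MAP re-estimation of weights, means and covariances
        pi = (nu * np.array(w) - 1 + c) / (nu - K + T)
        for i in range(K - 1):
            zbar = (r[:, i:i + 1] * z).sum(axis=0) / c[i]
            diff = z - zbar
            S = (r[:, i] * diff.T) @ diff / c[i]
            mu[i] = (tau * m[i] + c[i] * zbar) / (tau + c[i])
            dm = (m[i] - zbar)[:, None]
            Sigma[i] = (V[i] + c[i] * S
                        + (c[i] * tau / (c[i] + tau)) * (dm @ dm.T)) / (d - p + c[i])
    return pi, mu, Sigma
```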

VFOA HMM MAP Adaptation: The VFOA HMM can also be adapted in an unsupervised way to new test data using the MAP framework [36]. The parameters to adapt in this case are the transition matrix and the parameters of the emission probabilities, λ_H = {A, (µ, Σ)}. The adaptation of the HMM parameters leads to a procedure similar to the GMM adaptation case. Indeed, the prior on the Gaussian parameters follows the same Normal-Wishart density (Eq. 10), and the Dirichlet prior on the static VFOA prior π is replaced by a Dirichlet prior on each row p(·|s = f_i) = a_{i,·} of the transition matrix. Accordingly, the full prior is:

p(λ_H) \propto \prod_{i=1}^{K} p^D_{νb_{i,1},...,νb_{i,K}}(a_{i,1}, ..., a_{i,K}) \prod_{i=1}^{K-1} p^W_i(µ_i, Σ_i^{-1})    (15)

Then the EM algorithm to compute the MAP estimate can be conducted in the following manner. For a sequence of observations z = (z_1, ..., z_T), the hidden variables are now composed of a corresponding state sequence s_1, ..., s_T, which allows us to compute the joint state-observation density (cf Eq. 6). Thus, in the E step, one needs to compute ξ_{i,j,t} = p(s_{t-1} = f_i, s_t = f_j|z, \hat{λ}_H) and c_{i,t} = p(s_t = f_i|z, \hat{λ}_H), which respectively denote the joint probability of being in the states f_i and f_j at times t-1 and t, and the probability of being in state f_i at time t, given the current model \hat{λ}_H and the observed sequence z. These values can be obtained using the Baum-Welch forward-backward algorithm [35]. Given these values, the re-estimation formulas for the means and covariance matrices are the same as those in Eq. 14, and the transition matrix parameters are re-estimated as follows:

\hat{a}_{i,j} = \frac{νb_{i,j} - 1 + \sum_{t=1}^{T-1} ξ_{i,j,t}}{ν - K + \sum_{j=1}^{K} \sum_{t=1}^{T-1} ξ_{i,j,t}}.    (16)
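Assuming the joint state posteriors ξ have already been obtained from a Baum-Welch forward-backward pass, the transition-matrix update of Eq. 16 reduces to a few lines; the array names below are hypothetical.

```python
import numpy as np

def map_reestimate_transitions(xi, b, nu):
    """MAP re-estimation of the HMM transition matrix (Eq. 16).
    xi: (T-1) x K x K array with xi[t, i, j] = p(s_t = f_i, s_{t+1} = f_j | z, lambda),
    assumed provided by a forward-backward pass; b: K x K Dirichlet prior values."""
    K = xi.shape[1]
    counts = xi.sum(axis=0)                                   # expected transition counts
    return (nu * b - 1 + counts) / (nu - K + counts.sum(axis=1, keepdims=True))
```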

C. Choice of Prior Distribution Parameters

In this section we discuss the impact of the hyper-parameter settings on the MAP estimates, through an analysis of the re-estimation formulas (Eq. 14). Before going into details, recall that T denotes the size of the data set available for adaptation, and K is the number of VFOA targets.

Parameter values for the Dirichlet distribution: The Dirichlet distribution is defined by two kinds of parameters: a scale factor ν and the prior values on the mixture weights w_i (with \sum_i w_i = 1). The scale factor ν controls the balance between the prior distribution on the mixture weights w and the data. If ν is small (resp. large) with respect to T - K, the adaptation is dominated by the data (resp. by the prior, i.e. almost no adaptation occurs). When ν = T - K, the data and the prior contribute equally to the adaptation process. In our experiments, the hyper-parameter ν will be selected through cross-validation among the values in C_ν = {ν_1 = T - K, ν_2 = 2(T - K), ν_3 = 3(T - K)}. The prior weights w_i, on the other hand, are defined according to the prior knowledge we have on the distribution of VFOA targets. Since, as explained before, we want to enforce that no knowledge about the VFOA target distribution is used, the w_i are set uniformly to 1/K.

Parameter values for the Normal-Wishart distribution: This distribution defines the prior on the mean µ_i and covariance Σ_i of one Gaussian. The adaptation of the mean is essentially controlled by two parameters (see Eq. 14): the prior value for the mean, m_i, which will be set to the value computed using either the learning approach (m_i = µ^l_i) or the geometric approach (m_i = µ^g_i), and a scalar τ, which linearly controls the contribution of the prior m_i to the estimated mean. As the average value of c_i is T/K, in the experiments we will select τ through cross-validation among the values in C_τ = {τ_1 = T/(2K), τ_2 = T/K, τ_3 = 2T/K, τ_4 = 5T/K}. Thus, with the first value τ_1, the mean adaptation is on average dominated by the data. With τ_2, the adaptation is balanced between the data and the prior distribution on the means, and with the two last values, adaptation is dominated by the priors on the means.

The prior on the covariance is more difficult to set. It is defined by the Wishart distribution parameters, namely the prior covariance matrix V_i and the number of degrees of freedom d. From Eq. 14, we see that the data covariance and the deviation of the data mean from the mean prior also influence the MAP covariance estimate. As a prior Wishart covariance, we will take V_i = (d - p)Ṽ_i, where Ṽ_i is either Σ^l_i or Σ^g_i, the covariance of target f_i set using either training data or the geometric model (Subsection VI-B) respectively. The weighting (d - p) is important, as it allows V_i to be of the same order of magnitude as the data variance c_i S_i. In the experiments, we will use d = 5T/K, which puts an emphasis on the prior, and restricts the adaptation from deviating far from the covariance priors.


VIII. EVALUATION SET UP

The evaluation of the VFOA models was conducted using the IHPD database (Section III).

Below, we describe our performance measures and give details about the experimental protocol.

A. Performance Measures

We propose two kinds of error measures for performance evaluation.

The Frame based Recognition Rate (FRR) corresponds to the percentage of frames, or equivalently, the proportion of time, during which the VFOA has been correctly recognized.

This rate, however, can be dominated by VFOA events of long duration (a VFOA event is

deﬁned as a temporal segment with the same VFOA label). Since we are also interested in

the dynamics of the VFOA, which contains information related to interaction, we also need a

measure reﬂecting how well these events, short or long, are recognized.

Event based precision/recall, and F-measure. Let us consider two sequences of VFOA events:

the GT sequence G, obtained from human annotation, and the recognized sequence R, obtained through VFOA estimation. The GT sequence is defined as G = (G_i = (l_i, I_i = [b_i, e_i]))_{i=1,...,N_G}, where N_G is the number of events in the ground truth G, l_i ∈ F is the i-th VFOA event label, and b_i and e_i are the beginning and end time instants of the event G_i. The recognized sequence R is defined similarly. To compute the performance measures, the two sequences are first aligned using a string alignment procedure that takes into account the temporal extent of the events. More precisely, the matching distance between two events G_i and R_j is defined as:

d(G_i, R_j) = 1 - F_I  if l_i = l_j and I_∩ = I_i ∩ I_j ≠ ∅,  and 2 otherwise (i.e. the events do not match),    (17)

with F_I = \frac{2 ρ_I π_I}{ρ_I + π_I},  ρ_I = \frac{|I_∩|}{|I_i|},  π_I = \frac{|I_∩|}{|I_j|}    (18)

where |·| denotes the cardinality operator, and F_I measures the degree of overlap between two events. Then, given the alignment, we can compute for each person the recall ρ_E, the precision π_E, and the F-measure F_E measuring the event recognition performance, defined as:

ρ_E = \frac{N_{matched}}{N_G},  π_E = \frac{N_{matched}}{N_R}  and  F_E = \frac{2 ρ_E π_E}{ρ_E + π_E},    (19)

where N_{matched} represents the number of events in the recognized sequence that match the same event in the GT after alignment. The recall measures the percentage of ground truth events that


Fig. 8. Distribution of overlap measures F_I between true and estimated matched events (panels: GT Left, GT Right, TR Left, TR Right). The estimated events were obtained using the HMM approach. GT and TR respectively denote the use of GT head pose data and tracking estimates. Left and Right denote person left and person right respectively.

acronyms description

gt the head pose measurements are the ground truth data obtained with the magnetic sensor

tr the head pose measurements are those obtained with the head tracking algorithm

gmm the VFOA recognition model is a GMM

hmm the VFOA recognition model is an HMM

ML maximum likelihood approach: the meeting used for testing is used to train the model parameters

ge parameters of the Gaussian were set using the geometric gaze approach

ad VFOA model parameters were adapted

TABLE II
MODEL ACRONYMS: ACRONYM COMBINATIONS DESCRIBE WHICH EXPERIMENTAL CONDITIONS ARE USED. FOR EXAMPLE, GT-HMM-GE INDICATES THAT THE HMM VFOA RECOGNIZER WITH PARAMETERS SET USING THE GEOMETRIC GAZE MODEL WAS APPLIED TO GROUND TRUTH POSE DATA.

are correctly recognized, while the precision measures the percentage of estimated events that are correct. Both precision and recall need to be high to characterize a good VFOA recognition performance. The F-measure, defined as the harmonic mean of recall and precision, reflects this requirement. We report the average of the precision, recall and F-measure F_E of the 8 individuals over the whole database (and for each seat position). Note that according to Eq. 17, events are said to match whenever their common intersection is not empty (and the labels match). One may think that the counted matches could be generated by spurious accidental matches due to a very small intersection. In practice, however, we observe that this is not the case: the vast majority of matched events have a significant degree of overlap F_I, as illustrated in Fig. 8, with 90% of the matches exhibiting an overlap higher than 50%, even when using the noisier tracking data.
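A sketch of these event-based measures is given below, with events represented as (label, begin_frame, end_frame) tuples; the paper's string alignment is simplified here to a greedy one-to-one matching by overlap.

```python
def overlap_fmeasure(gt_event, rec_event):
    """F_I of Eq. 18 for two events given as (label, begin, end) with inclusive frames."""
    (lg, bg, eg), (lr, br, er) = gt_event, rec_event
    inter = max(0, min(eg, er) - max(bg, br) + 1)
    if lg != lr or inter == 0:
        return 0.0
    rho_i, pi_i = inter / (eg - bg + 1), inter / (er - br + 1)
    return 2 * rho_i * pi_i / (rho_i + pi_i)

def event_scores(gt, rec):
    """Event recall, precision and F-measure of Eq. 19, using a greedy one-to-one
    matching by decreasing overlap (a simplification of the string alignment)."""
    pairs = sorted(((overlap_fmeasure(g, r), i, j) for i, g in enumerate(gt)
                    for j, r in enumerate(rec)), reverse=True)
    used_g, used_r, matched = set(), set(), 0
    for f, i, j in pairs:
        if f > 0 and i not in used_g and j not in used_r:
            used_g.add(i); used_r.add(j); matched += 1
    rho_e, pi_e = matched / len(gt), matched / len(rec)
    f_e = 2 * rho_e * pi_e / (rho_e + pi_e) if matched else 0.0
    return rho_e, pi_e, f_e

# Example: two ground-truth events and two recognized events (one label mismatch).
gt = [("SS", 0, 99), ("PR", 100, 149)]
rec = [("SS", 10, 95), ("TB", 100, 149)]
print(event_scores(gt, rec))   # one match out of two events on each side
```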

B. Experimental Protocol

To study the different modeling aspects, several experimental conditions have been deﬁned.

They are summarized in Table II along with the acronyms that identify them in the result tables.

First, there are two alternatives regarding the head pose measurements: the ground truth gt case,


VFOA recognition without adaptation
µ_i, Σ_i        Gaussian parameters - learned (µ^l_i, Σ^l_i) or given by geometric modeling (µ^g_i, Σ^g_i), cf Subsection VI-B.
π, A            GMM and HMM model priors - set to the values π^u, A^u, as described in Subsection VI-B.

VFOA recognition with adaptation
µ_i, Σ_i, π, A  same as above - set as the result of the adaptation process.
ν               scale factor of the Dirichlet distribution - set through cross-validation.
w_i, b_{i,j}    Dirichlet prior values of π_i and a_{i,j} - set to π^u_i and a^u_{i,j}.
τ               scale factor of the Normal prior distribution on the mean - set through cross-validation.
m_i             VFOA mean prior value of the Normal prior distribution - set to either µ^l_i or µ^g_i.
d               scale factor of the Wishart prior distribution on the covariance matrix - set by hand (cf Sec. VII-C).
V_i             VFOA covariance matrix prior values in the Wishart distribution - set to either (d - 2)Σ^l_i or (d - 2)Σ^g_i.

TABLE III
VFOA MODELING PARAMETERS: DESCRIPTION AND SETTING. THE GAZE FACTORS κ_α, κ_β WERE SET BY HAND.

where the data is obtained using the FOB magnetic sensor, and the tr case, which relies on the

estimates obtained with the video tracking system. Secondly, there are the two VFOA recognizer

models, gmm and hmm, as described in Subsection VI-A. Regarding the approach relying on

training data, the default protocol is the leave-one-out approach: each meeting recording is

in turn left aside for testing, while the data of the 7 other recordings are used for parameter

learning, including hyper-parameter selection in the adaptation case (denoted ad). The maximum

likelihood case ML is an exception, in which the training data for a given meeting recording

is composed of the same single recording. The ge acronym denotes the case where the VFOA

Gaussian means and covariances were set according to the geometric model instead of being

learned from training data. Finally, the adaptation hyper-parameter pair $(\nu, \tau)$ was selected (in the cartesian set $C_\nu \times C_\tau$) by cross-validation over the training data, using $F_E$ as the performance measure to maximize. A summary of all parameters involved in the modeling and the way they

were set depending on whether there was adaptation or not is displayed in Table III.
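As an illustration of this hyper-parameter selection step, the following is a minimal sketch of the grid search over $C_\nu \times C_\tau$; the candidate grids and the `adapt_and_score` function (which would adapt the model with a given $(\nu, \tau)$ pair and return $F_E$ on a held-out recording) are hypothetical placeholders, not the values or code used in our experiments.

```python
import itertools

# Hypothetical candidate grids C_nu and C_tau; in the paper the score to
# maximize is the event F-measure F_E over the training recordings.
C_nu = [0.1, 1.0, 10.0, 100.0]
C_tau = [0.1, 1.0, 10.0, 100.0]

def select_hyperparameters(train_recordings, adapt_and_score):
    """Pick (nu, tau) maximizing the mean F_E over the training recordings."""
    best_pair, best_fe = None, -1.0
    for nu, tau in itertools.product(C_nu, C_tau):   # cartesian set C_nu x C_tau
        scores = [adapt_and_score(rec, nu, tau) for rec in train_recordings]
        mean_fe = sum(scores) / len(scores)
        if mean_fe > best_fe:
            best_pair, best_fe = (nu, tau), mean_fe
    return best_pair
```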

IX. EXPERIMENTAL RESULTS

This section provides results under the various experimental conditions. We ﬁrst analyze the

results obtained on the GT head pose data, and then compare them with those obtained using

the tracking estimates instead. In both cases, we discuss the effectiveness of the modeling w.r.t.

different issues: (i) relevance of head pose to model VFOA gaze targets, (ii) predictability of

VFOA head pose parameters, (iii) impact of the person’s position in the room. Then, we comment

on the results of the adaptation scheme. Note that although these ﬁrst sets of results are only

shown with the parameter setting using the training data, the conclusions that are made are also


Fig. 9. Example of results and focus ambiguity, shown in eight panels (a)-(h). In green, tracking result and head pointing direction. In yellow, recognized focus (hmm-ad condition). Images (g) and (h): despite the high visual similarity of the head poses, the true foci differ (in (g): PL; in (h): SS). Resolving such cases can only be done by using context (speaking status, other people's gaze, slide activity, etc.).

data              ground truth (gt)          tracking estimates (tr)
modeling          ML     gmm    hmm          ML     gmm    hmm
FRR               79.7   72.3   72.3         57.4   47.3   47.4
recall            79.6   72.6   65.5         66.4   49.1   38.4
precision         51.2   55.1   66.7         28.9   30.0   59.3
F-measure F_E     62.0   62.4   65.8         38.2   34.8   45.2

TABLE IV
VFOA RECOGNITION RESULTS FOR PERSON LEFT UNDER DIFFERENT EXPERIMENTAL CONDITIONS (SEE TABLE II).

valid for the geometric parameter setting. In Section IX-D, we compare in detail the results obtained with the geometric parameter setting and those obtained with the training parameter setting. In all cases, results are given separately for the left and right persons (see Fig. 1). Some result illustrations are provided in Fig. 9.

A. Results on GT head pose data

VFOA and head pose correlation: Tables IV and V display the VFOA recognition results for person left and right respectively. The first column of these two tables gives the results of the ML estimation (see Tab. II) with a GMM. These results show, in an optimistic case, the performance our model can achieve, and illustrate the correlation between a person's head


data              ground truth (gt)          tracking estimates (tr)
modeling          ML     gmm    hmm          ML     gmm    hmm
FRR               68.9   56.8   57.3         43.6   38.1   38.0
recall            72.9   66.6   58.4         65.6   55.9   37.3
precision         47.4   49.9   63.5         24.1   26.8   55.1
F-measure F_E     56.9   54.4   59.5         34.8   35.6   43.8

TABLE V
VFOA RECOGNITION RESULTS FOR PERSON RIGHT UNDER DIFFERENT EXPERIMENTAL CONDITIONS (SEE TABLE II).

Fig. 10. Empirical distribution of the GT head pose pan angle computed over the database for PL (left image) and PR (right image). For PL, the people and slide screen VFOA targets can still be identified through the pan modes. For PR, the degree of overlap is quite significant.

poses and his VFOA. As can be seen, this correlation is quite high for P L (almost 80% FRR),

showing the good concordance between head pose and VFOA. This correlation, however, drops

to near 69% for P R. This can be explained by the fact that for the person on the right (P R), there

is a strong ambiguity between looking at PL or SS, as illustrated by the empirical distributions

of the pan angle in Fig. 10. Indeed, the range of pan values within which the three other meeting participants and the slide screen VFOA targets lie is half the pan range of the person sitting to the left (PL). The average angular distance between these targets is around 20° for PR, a distance which can easily be covered using only eye movements rather than by rotating the head.

The values of the confusion matrices, displayed in Fig. 11, corroborate this analysis. The analysis

of Tables IV and V shows that this discrepancy between the results for P L and P R holds for

all experimental conditions and algorithms, with a performance decrease from P L to P R of

approximately 10-13% and 6%, for the FRR and event F-measure respectively.

VFOA Prediction: In the ML condition, very good results were achieved but they were biased

because the test data was used to set the Gaussian parameters. On the contrary, the GMM and

HMM results in Tables IV and V, for which the VFOA parameters were learned from other persons' data, highlight the generalization property of the modeling. We can observe that the


Fig. 11. Frame-based recognition confusion matrices obtained with the HMM modeling (gt-hmm and tr-hmm conditions), shown in four panels: (a) (GT, Left), (b) (GT, Right), (c) (TR, Left), (d) (TR, Right). VFOA targets 1 to 4 have been ordered according to their pan proximity: PR: person right - PL: person left - O1 and O2: organizer 1 and 2 - SS: slide screen - TB: table - U: unfocused. Columns represent the recognized VFOA.

Fig. 12. Pan-tilt space VFOA decision maps for person right built from all meetings, in the GMM case (cf Eq. 4), using GT (a) or tracking head pose data (b). Axes: pan (horizontal) and tilt (vertical), in degrees. Black=PL, yellow=SS, blue=O1, green=O2, red=TB, magenta=U.

GMM and HMM methods produce results close to the ML case. For both PL and PR, the GMM approach achieves better frame recognition and event recall performance, while the HMM gives better event precision and $F_E$ results. This can be explained by the fact that the HMM approach effectively denoises the event sequence. As a result, some events are missed (lower recall), but the precision increases due to the elimination of short spurious detections.

VFOA Confusions: Figures 11(a) and 11(b) display as images the confusion matrices for PL and PR obtained with the VFOA FRR performance measure and an HMM. They clearly exhibit confusion between VFOA targets which are proximate in the head pose space. For instance, for PL, O2 is sometimes confused with PR or O1. For PR, the main source of confusion is between PL and SS, as already mentioned. In addition, the table, TB, can be confused with O1 and O2, as can be expected since these targets share more or less the same pan values as TB. Thus, most of the confusion can be explained by the geometry of the room and the fact that people can modify their gaze without adjusting their head pose, and therefore do not always need to turn their heads to focus on a specific VFOA target.
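For reference, frame-based confusion matrices such as those of Fig. 11 can be computed from the true and recognized label sequences along the lines of the following sketch (a generic illustration with made-up sequences, not the exact evaluation code used here).

```python
import numpy as np

def confusion_matrix(true_labels, rec_labels, targets):
    """Row-normalized frame-based confusion matrix (rows: true VFOA, columns: recognized)."""
    idx = {t: i for i, t in enumerate(targets)}
    cm = np.zeros((len(targets), len(targets)))
    for t, r in zip(true_labels, rec_labels):
        cm[idx[t], idx[r]] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)   # avoid division by zero on unused rows

# Example with the target set of person right and made-up label sequences
targets = ["PL", "O1", "O2", "SS", "TB", "U"]
true_seq = ["SS", "SS", "PL", "PL", "TB", "O1"]
rec_seq  = ["SS", "PL", "PL", "PL", "O1", "O1"]
print(confusion_matrix(true_seq, rec_seq, targets))
```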


B. Results on Head Pose Estimates data

Tables IV and V provide the results obtained using the head pose tracking estimates, under the same experimental conditions as those used for the GT head pose data. As can be seen, substantial performance degradation is observed. In the ML case, the decrease in FRR and F-measure ranges from 22% to 26% for both PL and PR. These degradations are mainly due to small pose estimation errors and also, sometimes, to large errors arising during short periods when the tracker locks on a sub-part of the face. Fig. 12 illustrates the effect of pose estimation errors on the VFOA distributions. The shape changes in the VFOA decision maps when moving from GT pose data to pose estimates convey the increase in pose variance measured for each VFOA target. The increase is moderate for the pan angle, but quite substantial for the tilt angle.
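Decision maps in the spirit of Fig. 12 can be built by evaluating the prior-weighted Gaussian likelihood of each target on a pan-tilt grid and keeping the most likely target at every grid point; the sketch below uses illustrative means, covariances and priors rather than the values learned in our experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative VFOA targets with (pan, tilt) means in degrees, covariances and
# priors; these numbers are placeholders, not the parameters learned in the paper.
targets = ["PL", "O1", "O2", "SS", "TB", "U"]
means = {"PL": (-60, 0), "O1": (-20, -5), "O2": (10, -5),
         "SS": (-40, 5), "TB": (-10, -30), "U": (0, 0)}
covs = {t: np.diag([100.0, 60.0]) for t in targets}
priors = {t: 1.0 / len(targets) for t in targets}

pan = np.arange(-80, 41)                    # pan range shown in Fig. 12
tilt = np.arange(-50, 31)                   # tilt range shown in Fig. 12
grid = np.dstack(np.meshgrid(pan, tilt))    # shape (len(tilt), len(pan), 2)

# For each grid point, keep the target with the highest prior-weighted likelihood,
# in the spirit of the GMM decision rule referenced in Fig. 12.
scores = np.stack([priors[t] * multivariate_normal(means[t], covs[t]).pdf(grid)
                   for t in targets])
decision_map = np.array(targets)[scores.argmax(axis=0)]
```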

A more detailed analysis of Tables IV and V shows that the performance decrease (from GT to tracking data) in the GMM condition follows the ML case, while the deterioration in the HMM case is smaller, in particular for $F_E$. This demonstrates that, in contrast with what was observed with the clean GT pose data, in the presence of noisy data the HMM smoothing effect is quite beneficial. Also, the HMM performance decrease is smaller for PR (19% and 15% for FRR and $F_E$ respectively) than for PL (25% and 20%). This can be due to the better tracking performance (in particular regarding the pan angle) achieved on people seated at the position PR

(as reported in Table I). Fig. 13 presents the plot of the VFOA FRR versus the pan angle tracking

error for each meeting participant, when using GT head pose data (i.e. with no tracking error)

or pose estimates. It shows that for PL, there is a strong correlation between tracking errors and VFOA performance, which can be due to the fact that higher tracking errors directly generate larger overlaps between the VFOA class-conditional pose distributions (cf Fig. 10, left). For PR, this correlation is weaker, as the same good tracking performance results in very different VFOA recognition results. In this case, the increased ambiguity between several VFOA targets (e.g. SS and PL) may play a larger role.
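The per-meeting analysis of Fig. 13 simply amounts to fitting a line to the (pan error, FRR) points and inspecting their correlation; a minimal sketch with hypothetical per-meeting values is given below.

```python
import numpy as np

# Hypothetical per-meeting pan tracking errors (degrees) and VFOA FRR values (%),
# not the values measured in our experiments.
pan_err = np.array([8.0, 12.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0])
frr     = np.array([62.0, 58.0, 55.0, 50.0, 47.0, 42.0, 40.0, 35.0])

slope, intercept = np.polyfit(pan_err, frr, deg=1)   # fitted line as in Fig. 13
corr = np.corrcoef(pan_err, frr)[0, 1]               # strength of the correlation
print(f"FRR ~ {slope:.2f} * pan_error + {intercept:.1f}, correlation = {corr:.2f}")
```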

Finally, Fig. 11(c) and Fig. 11(d) display the confusion matrices when using the HMM and the

head pose estimates. In this case, the confusion matrices are very similar to the case using GT.

However, in the head pose estimates case, more confusion is observed due to the tracking errors and the uncertainties in the tilt estimation (see Fig. 13).


Fig. 13. VFOA frame-based recognition rate (FRR, vertical axis) vs head pose tracking errors for the pan angle (horizontal axis), plotted per meeting for PR and PL, using GT and tracking (tr) pose data, with lines fitted to the tracking points. The VFOA recognizer is the HMM modeling after adaptation.

person   measure          gt-gmm   gt-gmm-ad   gt-hmm   gt-hmm-ad   tr-gmm   tr-gmm-ad   tr-hmm   tr-hmm-ad
L        FRR              72.3     72.3        72.3     72.7        47.3     57.1        47.4     53.1
L        F-measure F_E    62.4     61.2        65.8     66.2        34.8     42.8        45.2     47.9
R        FRR              56.8     59.3        57.3     62.0        38.1     39.3        38.0     41.8
R        F-measure F_E    54.4     56.4        59.5     62.7        35.6     37.3        43.8     48.8

TABLE VI
VFOA RECOGNITION RESULTS FOR PERSON LEFT (L) AND RIGHT (R), BEFORE AND AFTER ADAPTATION.

C. Results with Model Adaptation

Table VI displays the recognition performance obtained with the adaptation framework described in Section VII⁶. For PL, one can observe no improvement when using GT data and a large improvement when using the tracking estimates (e.g. around 10% and 8% for FRR and $F_E$ respectively with the GMM model). In this situation, the adaptation is able to cope with the tracking errors and the variability in looking at a given target. For PR, we notice an improvement with both the GT and tracking head pose data. For instance, with the HMM model and tracking data, the improvement is 3.8% and 5% for FRR and $F_E$. Again, in this situation adaptation can cope with an individual's way of looking at the targets, such as correcting the bias in the estimated head tilt, as illustrated in Fig. 14.

When exploring the optimal adaptation parameters estimated through cross-validation, one ob-

tains the histograms of Fig. 15. As can be seen, regardless of the kind of input pose data (GT

or estimates), they correspond to configurations giving approximately equal balance to the data and prior w.r.t. the adaptation of the HMM transition matrices ($\nu_1$ and $\nu_2$), and configurations

⁶In the tables, we recall the values without adaptation for ease of comparison.


Fig. 14. VFOA decision map example in pan-tilt space (degrees), before adaptation (left) and after adaptation (right). After adaptation, the VFOA of O1 and O2 correspond to lower tilt values. Black=PL, yellow=SS, blue=O1, green=O2, red=TB, magenta=U. The blue stars represent the tracking head pose estimates used for adaptation.

Fig. 15. Histograms of the optimal scale adaptation factor of the HMM prior (a) and of the HMM VFOA mean (b), selected through cross-validation on the training set, when working with GT head pose data.

for which the data are driving the adaptation process of the mean pose values ($\tau_1$ and $\tau_2$).
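As a reminder of the role of $\tau$, the sketch below shows the standard conjugate-prior MAP update of a VFOA Gaussian mean, with the notation of Table III; it is a generic illustration of how $\tau$ balances the prior and the data, not the exact update equations of Section VII.

```python
import numpy as np

def map_adapt_mean(poses, m_i, tau):
    """MAP estimate of a VFOA Gaussian mean under a Normal prior.

    poses : (N, 2) head pose observations (pan, tilt) assigned to target i
    m_i   : prior mean (e.g. mu_i^l or mu_i^g, cf Table III)
    tau   : scale of the Normal prior; a large tau trusts the prior,
            a small tau lets the data drive the adaptation.
    """
    poses = np.asarray(poses, dtype=float)
    m_i = np.asarray(m_i, dtype=float)
    n = len(poses)
    sample_mean = poses.mean(axis=0) if n > 0 else m_i
    return (tau * m_i + n * sample_mean) / (tau + n)

# Example: with tau equal to the number of samples, prior and data weigh equally.
print(map_adapt_mean([[-35, -8], [-32, -6]], m_i=[-40, 0], tau=2.0))
```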

D. Results with the Geometrical VFOA Modeling

Here we report the results obtained when setting the model parameters by exploiting the

meeting room geometry, as described in Subsection VI-B. This possibility for setting parameters

is interesting because it removes the need for data annotation each time a new focus target is

considered (for instance, if a 5th person were introduced at the table).

Fig. 16 shows the geometric VFOA Gaussian parameters (mean and covariance) generated by

the model when using (κα, κβ) = (0.5,0.5). As can be seen, the VFOA pose values predicted

by the model are consistent with the average pose values computed for individuals using the GT

pose data. This is demonstrated by Table VII, which provides the prediction errors in pan Epan

deﬁned as:

Epan =1

8×(K−1)

8

X

m=1 X

fi∈F/{U}

|¯αm(fi)−αp

m(fi)|(20)

where ¯αm(fi)is the average pan value of the person in meeting mand for the VFOA fi, and

αp

m(fi)is the predicted value according to the chosen model (i.e. the pan component of µg

fi

or µl

fiin the geometric or learning approaches respectively). The tilt prediction error Etilt is


Method       learned VFOA         geometric VFOA               geometric VFOA
                                  (with cross-validation)      (with κ_α = κ_β = 0.5)
Error        E_pan    E_tilt      E_pan    E_tilt              E_pan    E_tilt
PL           6.4      5.1         5.5      6.4                 5.8      6.4
PR           5.9      6.1         5.6      7.6                 12.8     7.4

TABLE VII
PREDICTION ERRORS (IN DEGREES) FOR THE LEARNED VFOA AND GEOMETRIC VFOA MODELS (WITH GT POSE DATA). IN THE GEOMETRIC CROSS-VALIDATED CASE, THE SAME METHODOLOGY AS IN THE LEARNING CASE IS USED: FOR EACH MEETING, THE EMPLOYED κ_α (OR κ_β) HAS BEEN LEARNED ON THE OTHER MEETINGS.

obtained by replacing pan angles by tilt angles in Eq. 20. As can be seen, using cross-validated $\kappa_\alpha$ and $\kappa_\beta$ values provides better results than setting these parameters to the constant values $(\kappa_\alpha, \kappa_\beta) = (0.5, 0.5)$ used in all the recognition experiments reported below. Also, we noticed that usually the $\kappa_\alpha$ values providing good prediction are lower when using tracking data than when using the ground truth head pose data. A likely explanation is that the head tracker underestimates the pan angles. Thus, to account for this, a smaller $\kappa_\alpha$ has to be used to obtain better prediction. Interestingly enough, however, in practice we did not find any particular relationship between an optimal angular prediction (as measured by Eq. 20) and the VFOA recognition results, showing that the selection of these values is not critical. We thus relied on $(\kappa_\alpha, \kappa_\beta) = (0.5, 0.5)$ for all our experiments.
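A direct transcription of Eq. 20 is sketched below; `mean_pan` and `predicted_pan` are hypothetical dictionaries holding, for each of the 8 meetings, the average and predicted pan values of each VFOA target (the unfocused label U being excluded).

```python
def pan_prediction_error(mean_pan, predicted_pan, targets):
    """E_pan of Eq. 20: mean absolute pan prediction error over meetings and targets.

    mean_pan, predicted_pan : dict meeting -> dict target -> pan value (degrees)
    targets                 : the K-1 VFOA targets, i.e. F \ {U}
    """
    n_meetings = len(mean_pan)        # 8 in the paper
    k_minus_1 = len(targets)
    total = 0.0
    for m in mean_pan:                # sum over meetings
        for f in targets:             # sum over VFOA targets f_i in F \ {U}
            total += abs(mean_pan[m][f] - predicted_pan[m][f])
    return total / (n_meetings * k_minus_1)
```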

The recognition performance is presented in Table VIII. This table shows that, when using GT head pose data, the results are slightly worse than with the learning approach, which is in line with the similarity of the prediction errors. However, when using the pose estimates, the results are better. For instance, for PL, when comparing the geometric parameter setting to the training-based parameter setting (both with adaptation), the FRR improvement is more than 6%. This is interesting and encouraging given that the modeling does not require any training data. Also, we notice that the adaptation always improves the recognition, sometimes quite significantly (see the GT data condition for PR, or the tracking data for PL).

Comparison with Stiefelhagen et al [12]: Our results seem quite far from the 73% reported by


Fig. 16. Geometric VFOA Gaussian distributions for PR (left image) and PL (right image): the figure displays the gaze target directions, the corresponding head pose contributions according to the geometric model with values $(\kappa_\alpha, \kappa_\beta) = (0.5, 0.5)$ (△ symbols), and the average head pose (from GT pose data) of individual people (+). Ellipses display the standard deviations used in the geometric modeling. Black=PL or PR, cyan=SS, blue=O1, green=O2, red=TB.

person   measure          gt      gt-ge   gt-ad   gt-ge-ad   tr      tr-ge   tr-ad   tr-ge-ad
L        FRR              72.3    69.3    72.7    70.8       47.4    55.2    53.1    59.5
L        F-measure F_E    65.8    65.2    66.2    65.3       45.2    48.2    47.9    50.1
R        FRR              57.3    51.8    62.0    58.5       38.0    41.1    41.8    42.7
R        F-measure F_E    59.5    53.0    62.7    59.2       43.8    49.1    48.8    50.1

TABLE VIII
VFOA RECOGNITION RESULTS FOR PL AND PR USING THE HMM MODEL WITH THE GEOMETRIC VFOA PARAMETER SETTING ($(\kappa_\alpha, \kappa_\beta) = (0.5, 0.5)$), WITH/WITHOUT ADAPTATION. FOR EASE OF COMPARISON, WE RECALL THE RESULTS WITH THE TRAINING PARAMETER SETTING.

Stiefelhagen et al [12]⁷. Several factors may explain the difference. First, in [12], meetings with 4 people were studied, and no target apart from the other meeting participants was considered. In addition, these participants were sitting at equally spaced positions around the table, optimizing the discriminability between VFOA targets. People were recorded by a camera placed directly in front of them. Hence, due to the table geometry, the majority of head pan values lay between [−45°, 45°], where the tracking errors are smaller (see Table I). Ultimately, our results are more

in accordance with the 52% FRR reported by the same authors [37] when using the same

framework as in [12] but applied to a 5-person meeting, resulting in 4 possible VFOA targets.

Nevertheless, as comparing algorithm results on different setups is quite difﬁcult, we implemented

the methodology proposed in [12], [37] to recognize the VFOA solely from head pose. This

⁷Note that in [12], approaches to recognize the VFOA from audio and from a combination of audio and head pose are also provided. However, for the remainder of this paper, we compare our method with their approach for recognizing the VFOA solely from head pose, since this is the scope of our paper.


Method                 Stiefelhagen et al [12]                  Our model
measure         gt-L    tr-L    gt-R    tr-R         gt-ge-ad-L   tr-ge-ad-L   gt-ge-ad-R   tr-ge-ad-R
FRR             61.9    55.7    53.1    39.6         70.8         59.5         58.5         42.7
F-measure F_E   53.8    35.1    43.8    34.7         65.3         50.1         59.2         50.1

TABLE IX
COMPARISON OF OUR VFOA RECOGNITION APPROACH (HMM WITH GEOMETRIC MODEL AND ADAPTATION) AND [12] (SEE FOOTNOTE 7).

methodology consists of ﬁrst clustering the head pose measurements of an individual person

using the k-means algorithm, and then using the outcome to initialize the learning of a GMM

similar to the one we presented. Finally, each component of the GMM mixture is associated

with a target focus using a set of rules. This approach clearly has several issues, especially

when the number of targets is large: how to initialize the k-means algorithm, and how to deﬁne

the association rules. As no information was given in [12] w.r.t. k-means initialization, we

experimented with different alternatives and report the best results, which were obtained using

the gaze values predicted by the geometrical model (random initialization produced on average

much worse results than those presented, around 10% less). Each component was associated with a focus by taking the mixture with the lowest mean tilt value as the table, while the other mixtures were associated with the remaining VFOA targets based on their respective pan values (see the sketch at the end of this subsection). The

comparative results are given in Table IX. They clearly show that our method leads to signiﬁcant

improvements in all conditions. Interestingly enough, the improvement is higher when using

uncorrupted head pose measurements (i.e. the GT data). These improvements validate our use

of the MAP adaptation framework. Indeed, while in [12] full freedom is given to the data to

drive the adaptation process, our experiments show (cf Figure 15) that the optimal adaptation

parameters, selected by cross-validation, give equal importance to the data and the prior set on

the GMM parameters to obtain better models.
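To make the re-implemented baseline concrete, the following is a minimal sketch of its clustering and rule-based association steps, using scikit-learn's k-means as a stand-in and following the tilt/pan rules described above; the GMM refinement step and other details of [12], [37] are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def baseline_assign(poses, predicted_means, pan_ordered_targets):
    """Cluster head poses and associate each cluster with a VFOA target.

    poses              : (N, 2) array of (pan, tilt) observations of one person
    predicted_means    : (K, 2) initial centers, e.g. the geometric gaze predictions
    pan_ordered_targets: the non-table target names, ordered by expected pan
    """
    k = len(predicted_means)
    km = KMeans(n_clusters=k, init=np.asarray(predicted_means), n_init=1).fit(poses)
    centers = km.cluster_centers_

    # Rule 1: the cluster with the lowest mean tilt value is taken as the table (TB).
    order = np.argsort(centers[:, 1])
    assignment = {int(order[0]): "TB"}

    # Rule 2: the remaining clusters are associated with the other targets
    # according to their respective pan values.
    remaining = sorted((c for c in range(k) if c != order[0]),
                       key=lambda c: centers[c, 0])
    for cluster, target in zip(remaining, pan_ordered_targets):
        assignment[cluster] = target

    return [assignment[c] for c in km.labels_]   # per-frame VFOA labels
```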

X. CONCLUSION AND FUTURE WORK

In this paper, we addressed the VFOA recognition of meeting participants from their head pose

in complex meeting scenarios. Head pose measurements were obtained either through magnetic

ﬁeld sensors or using a head pose tracking algorithm. Several alternative models were studied.


Thorough experiments on a large and challenging database, which has been made publicly available, gave the following outcomes:

• influence of the physical setup: when using head pose tracking estimates, average recognition rates of 60% and 42% were obtained for the left and right seat respectively. This shows that good VFOA recognition can only be achieved if the visual targets of a person are well separated in the head pose angular space, which mainly depends on the person's position in the meeting room.

• head pose tracking: accurate pose estimation is essential for good results. Drops of around 11% and 16% in recognition rate were observed for the left and right seat respectively when using the pose estimates instead of the ground truth. In addition, experiments showed that there exists some correlation between head pose tracking errors and VFOA recognition results.

• VFOA recognizer model: the HMM method performs better than the GMM. While this cannot be observed with the standard Frame Recognition Rate measure, the newly introduced event-based measure $F_E$ shows that the temporal smoothing introduced by the HMM removes spurious detections in the VFOA estimation.

• training data vs geometric model: to avoid the need for training data, we have proposed a novel cognitive model which exploits the room geometry to link the head pose measures to the VFOA targets and is used to set the recognizer parameters. Compared with the standard approach based on training data, and with a state-of-the-art algorithm, the new approach was shown to provide much better results when using the head pose tracking estimates as input.

•unsupervised adaptation: results show that in all conditions, automatically adapting the

VFOA recognition parameters using the unlabeled head pose measurements improves the

recognition.

From the above, there are several ways to increase performance. The ﬁrst one is to increase

the separation between the visual targets. However, in practice, this is limited by the number

of people that we want to accommodate and the activities that people are allowed to perform.

The second one is to improve the pose tracking algorithms. This can be achieved using multiple

cameras, higher resolution images, or adaptive appearance modeling techniques, preferably in a

supervised fashion, by setting up a training session to acquire people's appearance at the beginning

of a meeting.

A third way to improve VFOA recognition can only come from the prior knowledge embedded in


the cognitive and interactive aspects of human-to-human communication. Ambiguous situations

such as the one illustrated in Fig. 9(g) and Fig. 9(h), where the same head pose can correspond

to two different VFOA targets, could be resolved by the joint modeling of the speaking status

and VFOA of all meeting participants. The relationship between speech and VFOA, used for

instance in [12], has been shown to exhibit speciﬁc patterns in the behavioral and cognitive

literature, as already exploited by [13] to derive conversation structures.

Finally, in the case of meetings in which people are moving to the slide screen or white board

for presentations, the development of a more general approach that models the VFOA of these

moving people will be necessary.

REFERENCES

[1] J. Khan and O. Komogortsev, “A hybrid scheme for perceptual object window design with joint scene analysis and eye-gaze

tracking for media encoding based on perceptual attention,” Journal of Electronic Imaging, vol. 15, pp. 332–350, 2006.

[2] K. Smith, S. Ba, D. Gatica-Perez, and J.-M. Odobez, “Multi-person wandering focus of attention tracking,” in International

Conference on Multimodal Interfaces, Banff, Canada, Nov. 2006.

[3] O. Kulyk, J. Wang, and J. Terken, Machine Learning for Multimodal Interaction, ser. LNCS 3869. Springer Verlag, 2006,

ch. Real-Time Feedback on Nonverbal Behaviour to Enhance Social Dynamics in Small Group Meetings.

[4] J. McGrath, Groups: Interaction and Performance. Prentice-Hall, 1984.

[5] D. Heylen, “Challenges ahead: head movements and other social acts in conversation,” in The Joint Symposium on Virtual

Social Agent, 2005.

[6] S. Langton, R. Watt, and V. Bruce, “Do the eyes have it? cues to the direction of social attention,” Trends in Cognitive

Sciences, vol. 4(2), pp. 50–58, 2000.

[7] J. N. Bailenson, A. Beal, J. Loomis, J. Blascovich, and M. Turk, “Transformed social interaction, augmented gaze, and

social inﬂuence in immersive virtual environments,” Human Comm. Research, vol. 31, no. 4, pp. 511–537, Oct. 2005.

[8] N. Jovanovic and H. Op den Akker, “Towards automatic addressee identiﬁcation in multi-party dialogues,” in 5th SIGdial

Workshop on Discourse and Dialogue, 2004.

[9] S. Duncan Jr, “Some signals and rules for taking speaking turns in conversations,” Journal of Personality and Social

Psychology, vol. 23(2), pp. 283–292, 1972.

[10] D. Novick, B. Hansen, and K. Ward, “Coordinating turn taking with gaze,” in Int. Conf. on Spoken Lang. Processing,

1996.

[11] R. Rienks, D. Zhang, D. Gatica-Perez, and W. Post, “Detection and Application of Inﬂuence Rankings in Small Group

Meetings,” in ACM - Inter. Conf. on Multimodal Interfaces, Banff, Canada, Nov. 2006.

[12] R. Stiefelhagen, J. Yang, and A. Waibel, “Modeling focus of attention for meeting indexing based on multiple cues,” IEEE

Transactions on Neural Networks, vol. 13(4), pp. 928–938, 2002.

[13] K. Otsuka, Y. Takemae, J. Yamato, and H. Murase, “A probabilistic inference of multiparty-conversation structure based

on markov-switching models of gaze patterns, head directions, and utterances,” in Proc. of International Conference on

Multimodal Interface (ICMI’05), Trento, Italy, Oct. 2005, pp. 191–198.

[14] ICPR-POINTING, “ICPR Pointing’04: Visual Observation of Deictic Gestures Workshop,” 2004.


[15] CLEAR, “CLEAR evaluation campaign and workshop,” 2006.

[16] E. G. Freedman and D. L. Sparks, “Eye-head coordination during head-unrestrained gaze shifts in rhesus monkeys,” Journal

of Neurophysiology, vol. 77, pp. 2328–2348, 1997.

[17] I. Malinov, J. Epelboim, A. Herst, and R. Steinman, “Characteristics of saccades and vergence in two kinds of sequential

looking tasks,” Vision Research, 2000.

[18] S. O. Ba and J. M. Odobez, “A rao-blackwellized mixed state particle ﬁlter for head pose tracking,” in ACM-ICMI Workshop

on Multi-modal Multi-party Meeting Processing (MMMP), Trento, Italy, 2005, pp. 9–16.

[19] C. Morimoto and M. Mimica, “Eye gaze tracking techniques for interactive applications,” Computer Vision and Image

Understanding, vol. 98, pp. 4–24, 2005.

[20] R. Pieters, E. Rosbergen, and M. Hartog, “Visual attention to advertising: The impact of motivation and repetition,” in

Conference on Advances in Consumer Research, 1995.

[21] J.-G. Wang and E. Sung, “Study on eye gaze estimation,” IEEE Transactions on Systems, Man and Cybernetics, Part B,

vol. 32, pp. 332–350, 2002.

[22] R. Stiefelhagen and J. Zhu, “Head orientation and gaze direction in meetings,” in Conference on Human Factors in

Computing Systems, 2002.

[23] A. Gee and R. Cipolla, “Estimating gaze from a single view of a face,” in Int. Conf. on Pattern Recognition, 1994.

[24] T. Horprasert, Y. Yacoob, and L. Davis, “Computing 3d head orientation from a monocular image sequence,” in IEEE

International Conference on Automatic Face and Gesture Recognition, 1996.

[25] T. Cootes and P. Kittipanya-ngam, “Comparing variations on the active appearance model algorithm,” in British Mach.

Vis. Conf. (BMVC), 2002.

[26] S. Srinivasan and K. L. Boyer, “Head pose estimation using view based eigenspaces,” in Int. Conf. on Pat. Recognition,

2002.

[27] Y. Wu and K. Toyama, “Wide range illumination insensitive head orientation estimation,” in IEEE Conference on Automatic

Face and Gesture Recognition, 2001.

[28] L. Brown and Y. Tian, “A study of coarse head pose estimation,” in IEEE Work. on Motion and Video Computing, 2002.

[29] L. Lu, Z. Zhang, H. Shum, Z. Liu, and H. Chen, “Model and exemplar-based robust head pose tracking under occlusion

and varying expression,” in IEEE Workshop on Models versus Exemplars in Computer Vision (CVPR-MECV), Dec. 2001.

[30] M. Danninger, R. Vertegaal, D. Siewiorek, and A. Mamuji, “Using social geometry to manage interruptions and co-worker

attention in ofﬁce environments,” in Proc. of the Conf. on Graphics Interfaces, Victoria, Canada, 2005, pp. 211–218.

[31] M. Hayhoe and D. Ballard, “Eye movements in natural behavior,” TRENDS in Cog. Sciences, vol. 9(4), pp. 188–194, 2005.

[32] S. Baron-Cohen, “How to build a baby that can read minds: cognitive mechanisms in mindreading,” Cahiers de Psychologie

Cognitive, vol. 13, pp. 513–552, 1994.

[33] J.-M. Odobez, “Focus of attention coding guidelines,” IDIAP Research Institute, Tech. Rep. IDIAP-COM-2, 2006.

[34] N. Gourier, D. Hall, and J. L. Crowley, “Estimating face orientation from robust detection of salient facial features,” in

Pointing 2004, ICPR International Workshop on Visual Observation of Deictic Gestures, 2004, pp. 183–191.

[35] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Readings in Speech

Recognition, vol. 53A(3), pp. 267–296, 1990.

[36] J. Gauvain and C. H. Lee, “Bayesian learning for hidden Markov model with Gaussian mixture state observation densities,”

Speech Communication, vol. 11, pp. 205–213, 1992.

[37] R. Stiefelhagen, “Tracking and modeling focus of attention,” Ph.D. dissertation, University of Karlsruhe, 2002.